2003-08-27
Extracting evolution of web communities from a series of web archives
http://www.ht03.org/papers/pdfs/4.pdf

* BACKGROUND

the web is tracking many social events
eg, people talking about 9-11: lots of different pages being linked to, other pages acting as link lists
these organise into densely clustered groups

a set of pages on a topic can be identified from link structure
- a set of good pages on a topic (Authorities), linked together by good link lists (Hubs)
communities represent certain topics

* APPROACH

- build 4 web archives, 119M docs, 1999 to 2002
- extract web communities: Web Community Chart (HT01)
- extract evolution from diffs of communities over time
- evolution metrics: growth rate, novelty, etc

so for applications:
- how many and what kinds of pages have been created?
- how have peace movements spread?
- how have reputations been formed?

* WEB COMMUNITY CHART

RPA (related page algorithm)
. build a subgraph around the seed page, extract authorities as related pages
  (a seed is a URL with >3 incoming links)

extract pages that are densely connected
weights of edges are the number of derivations between communities

[the demo is a words-on-sticks thing. looks like nodes are communities: "canon/epson/ipo", then clicking on that brings up the web pages that constitute that community. arcs are weighted.]

* EXTRACTING EVOLUTION OF WEB COMMUNITIES

four types of change with time:
. emerge
. dissolve
. split
. merge

[there are mechanisms to do this: the continuing community is the one that keeps the most urls of the original community. other bits are splitting off or joining.]

so now there are evolution metrics:
. growth rate
. novelty (new urls)
. disappearance rate
. split rate
. merge rate

some interesting points:
. the archive... in 2002/2: 45M pages, 1511K seeds, 170K communities
. size distribution of communities was roughly power law (tended to curve slightly) - that's interesting - and roughly stable over time [check the paper for this]
. emerged and dissolved communities: the size distributions of these follow a power law too. both exponents are greater than the exponent of the size distribution of all communities (smaller communities emerge and dissolve more easily)
. split and merged communities follow a power law and are symmetric
. growth rate of communities is symmetric too... there's no overall growth or shrinkage across all communities over time

[cool evolution-of-community browser! see the history of a community - list of urls - and how that merges/changes/etc. the communities can be sorted by the various evolution metrics.]

== Rumblings:

See also: How the text on wiki pages changes over time: http://researchweb.watson.ibm.com/history/
(and their other projects http://domino.watson.ibm.com/cambridge/research.nsf/pages/cue.html?Open )
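
[a rough sketch of how the evolution metrics in the notes above could be computed from two archive snapshots. this is not the paper's implementation: the function names, the dict keys and the exact definitions are assumptions made for illustration. a community is modelled as a set of urls, the continuing community is taken to be the later community sharing the most urls, and the metrics are raw url counts rather than rates (a rate would presumably divide by the time between archives).]

```python
from typing import Dict, List, Optional, Set

Community = Set[str]  # assumption: a community is just a set of URLs


def successor(c: Community, later: List[Community]) -> Optional[Community]:
    """Return the community in `later` that shares the most URLs with `c`.

    Returns None when no later community shares any URL (c has dissolved).
    """
    best = max(later, key=lambda d: len(c & d), default=None)
    if best is None or not (c & best):
        return None
    return best


def evolution_metrics(
    c: Community, earlier: List[Community], later: List[Community]
) -> Optional[Dict[str, int]]:
    """Count-based metrics for community `c` between two snapshots.

    `earlier` holds all communities at time t (including c), `later` all
    communities at time t+1. Definitions are illustrative assumptions.
    """
    nxt = successor(c, later)
    if nxt is None:
        return None  # dissolved: nothing continues from c

    urls_t = set().union(*earlier)   # every URL in some community at t
    urls_t1 = set().union(*later)    # every URL in some community at t+1

    return {
        "growth": len(nxt) - len(c),                 # net change in size
        "novelty": len(nxt - urls_t),                # URLs new to the chart
        "disappeared": len(c - urls_t1),             # URLs gone from the chart
        "split": len((c - nxt) & urls_t1),           # URLs moved to other communities
        "merged": len((nxt - c) & urls_t),           # URLs absorbed from other communities
    }


# toy snapshots (made-up URLs)
earlier = [{"a", "b", "c", "d"}, {"x", "y"}]
later = [{"a", "b", "c", "x", "n"}, {"d", "y"}]
metrics = evolution_metrics(earlier[0], earlier, later)
print(metrics)
```

[with these definitions growth = novelty + merged - disappeared - split, which gives a quick sanity check on any pair of snapshots.]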