2003-08-28 Untangling Compound Documents on the Web
http://www.ht03.org/papers/pdfs/12.pdf

Abstract from http://www.ht03.org/papers/

"Most text analysis is designed to deal with the concept of a 'document', namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of 'document' and 'web node' are not synonymous, and that authors often tend to deploy documents as collections of URLs, which we call 'compound documents'. In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than individual hypertext nodes."

hypertext [as the web] is very influential, nodes on everything. but... it's just a sea of nodes out there. the nature of the web: it's huge. we didn't realise it would become this.

[ooh, just got a shudder of what the web is. it's a new thing, spreading over the planet, a big new thing that's being built. not just infrastructure, not just natural. but the web!]

search engines are now crucial. they're necessary for you to establish your context on a topic [to see the neighbourhood around it, if you can't see where you are. nice!].

example of some of the problems:
- multiple URLs make up a single document (eg, a long story split over several pages)

what is a document? a cohesive presentation of thought on a single topic. singly or multiply authored; with multiple authors, each must be aware of the overall structure.

[this could be disputed. a document often requires context to be read, but is that the same document? it's nelson's "soft shell". a folded representation of the author's reading.]

originally urls were files from the filesystem. but compound documents are a common idea:

. KMS had "tree items"
. NoteCards had fileboxes
. the Dexter reference model had them too

compounds are common on the web too -- they tend to be high-quality content, but aren't easily recognisable. search engines need to be able to recognise them, so they can return them when your keywords are split across urls. some methods to recognise them:

. metadata
. heuristics (eg, "next" anchortext)
. linguistic similarity
. better algorithms

there's hierarchy in urls, and it works for a bit, but:

- humans process information by associating - bush
- hierarchical organisation is unnatural - nelson

so

* humans assimilate info by association

however

* humans still often organise textual information hierarchically - letter, word, phrase, sentence, paragraph, subsection, section, chapter, volume, collection

[i'm not sure this is even true. sequence is the most *visible* of the organising structures, but a good author will use refrains and oppositions, etc, across an entire work - deliberately! - in order to convey more information. of course that doesn't work if you want to break up text. or rather, if you assume that when you break up text you also break up meaning. you don't: text is a hologram of meaning. if you break it, the meaning gets lower resolution.]

[oh good discovery in hypertext:]

* existence of a graph structure does not imply existence of a compound document
- a fully connected structure is too restrictive
- reachability is not strong enough
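[to make that point concrete for myself, a tiny sketch in python. the pages and links are made up, not from the paper: a three-part story whose parts only link backwards and forwards fails a fully-connected test, while plain reachability from the homepage sweeps in a page that obviously isn't part of the document.]

```python
# made-up link graph: a three-part story, a homepage, and an unrelated page
links = {
    "/story/part1": {"/story/part2"},
    "/story/part2": {"/story/part1", "/story/part3"},
    "/story/part3": {"/story/part2"},
    "/home":        {"/story/part1", "/misc/unrelated"},
    "/misc/unrelated": set(),
}

story = {"/story/part1", "/story/part2", "/story/part3"}

def fully_connected(nodes):
    """every node links directly to every other node in the set."""
    return all(nodes - {n} <= links.get(n, set()) for n in nodes)

def reachable(start):
    """everything you can get to by following links from start."""
    seen, todo = set(), [start]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            todo.extend(links.get(n, set()))
    return seen

# the story *is* a compound document, but fails the fully-connected test...
print(fully_connected(story))                    # False: part1 never links to part3
# ...while reachability from the homepage also drags in the unrelated page
print(reachable("/home") >= story)               # True
print("/misc/unrelated" in reachable("/home"))   # True, so reachability is too weak
```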
so how else can we identify compounds?

. indegrees to a directory will often identify leaders
. indegrees can also be used to identify compound document directories
. leaders tend to link to many documents in the directory (eg a toc) [but not all, sidebars etc]

google try to do this too: they collapse results from a single site into a "leader" and parts.

two heuristics:

* rare links heuristic: if most external links from a directory are shared by most of the pages in that directory, then the directory is likely to be a compound document (this is because compound docs are often templated). a rough sketch of this one is at the end of these notes.
* anchor text heuristic: group pages with similar anchor texts together

these heuristics identify 9.5% of all directories as compound documents.

a closing thought of his: a frustration of the hypertext community with the www is that it's a collection of systems that don't work together very well. but there's space for high-level navigational tools, like google.
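[and the promised sketch of the rare links heuristic, in python. the thresholds and the page-to-outlinks structure are my own guesses, not the paper's numbers: a directory whose pages mostly share the same external outlinks (because they're stamped out of one template) gets flagged as a compound document.]

```python
# toy version of the "rare links" heuristic; thresholds are assumptions, not from the paper

def is_compound_directory(pages, share_threshold=0.7, coverage_threshold=0.7):
    """pages: dict mapping page url -> set of external outlink urls (links
    pointing outside the directory). returns True if most of the directory's
    external links are shared by most of its pages."""
    if len(pages) < 2:
        return False
    all_links = set().union(*pages.values())
    if not all_links:
        return False
    # a link counts as "shared" if it appears on at least share_threshold of the pages
    shared = {
        link for link in all_links
        if sum(link in outlinks for outlinks in pages.values()) >= share_threshold * len(pages)
    }
    # the directory looks templated (hence compound) if shared links dominate
    return len(shared) >= coverage_threshold * len(all_links)


if __name__ == "__main__":
    # three parts of a serialised story, all carrying the same template links
    story = {
        "/story/part1.html": {"/about", "/contact", "/archive", "http://example.org/ad"},
        "/story/part2.html": {"/about", "/contact", "/archive"},
        "/story/part3.html": {"/about", "/contact", "/archive"},
    }
    print(is_compound_directory(story))  # True: the template links dominate
```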