2003-08-28 Untangling Compound Documents on the Web
http://www.ht03.org/papers/pdfs/12.pdf

Abstract from http://www.ht03.org/papers/

"Most text analysis is designed to deal with the concept of a 'document', namely a cohesive presentation of thought on a unifying subject. By contrast, individual nodes on the World Wide Web tend to have a much smaller granularity than text documents. We claim that the notions of 'document' and 'web node' are not synonymous, and that authors often tend to deploy documents as collections of URLs, which we call 'compound documents'. In this paper we present new techniques for identifying and working with such compound documents, and the results of some large-scale studies on such web documents. The primary motivation for this work stems from the fact that information retrieval techniques are better suited to working on documents than individual hypertext nodes."

hypertext [as the web] is very influential, nodes on everything. but... it's just a sea of nodes out there. the nature of the web: it's huge. we didn't realise it would become this.

[ooh, just got a shudder of what the web is. it's a new thing, spreading over the planet, a big new thing that's being built. not just infrastructure, not just natural. but the web!]

search engines are now crucial. they're necessary for you to establish your context on a topic [to see the neighbourhood around it, if you can't see where you are. nice!].

example of some of the problems:
- multiple URLs make up a single document (eg, a long story split over several pages)

what is a document? a cohesive presentation of thought on a single topic. singly or multiply authored; with multiple authors, each must be aware of the overall structure.

[this could be disputed. a document often requires context to be read, but is that the same document? it's nelson's "soft shell". a folded representation of the author's reading.]

originally urls were files from the filesystem. but compound documents are a common idea:

. KMS had "tree items"
. NoteCards had fileboxes
. the Dexter reference model had them too

compounds are common on the web too -- they tend to be high-quality content, but aren't easily recognisable. search engines need to be able to recognise them, so they can return them when your keywords are split across urls. some methods to recognise them:

. metadata
. heuristics (eg, "next" anchortext)
. linguistic similarity
. better algorithms

there's hierarchy in urls, and it works for a bit, but:

- humans process information by associating - bush
- hierarchical organisation is unnatural - nelson

so

* humans assimilate info by association

however

* humans still often organise textual information hierarchically - letter, word, phrase, sentence, paragraph, subsection, section, chapter, volume, collection

[i'm not sure this is even true. sequence is the most *visible* of the organising structures, but a good author will use refrains and oppositions, etc, across an entire work - deliberately! - in order to convey more information. of course that doesn't work if you want to break up text. or rather, if you assume that when you break up text you also break up meaning. you don't: text is a hologram of meaning. if you break it, the meaning gets lower resolution.]

[oh good discovery in hypertext:]

* existence of a graph structure does not imply existence of a compound document
- a fully connected structure is too restrictive
- reachability is not strong enough
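[to make that point concrete for myself, a tiny sketch in python. the pages and links are made up, not from the paper: a three-part story whose parts only link backwards and forwards fails a fully-connected test, while plain reachability from the homepage sweeps in a page that obviously isn't part of the document.]

```python
# made-up link graph: a three-part story, a homepage, and an unrelated page
links = {
    "/story/part1": {"/story/part2"},
    "/story/part2": {"/story/part1", "/story/part3"},
    "/story/part3": {"/story/part2"},
    "/home":        {"/story/part1", "/misc/unrelated"},
    "/misc/unrelated": set(),
}

story = {"/story/part1", "/story/part2", "/story/part3"}

def fully_connected(nodes):
    """every node links directly to every other node in the set."""
    return all(nodes - {n} <= links.get(n, set()) for n in nodes)

def reachable(start):
    """everything you can get to by following links from start."""
    seen, todo = set(), [start]
    while todo:
        n = todo.pop()
        if n not in seen:
            seen.add(n)
            todo.extend(links.get(n, set()))
    return seen

# the story *is* a compound document, but fails the fully-connected test...
print(fully_connected(story))                    # False: part1 never links to part3
# ...while reachability from the homepage also drags in the unrelated page
print(reachable("/home") >= story)               # True
print("/misc/unrelated" in reachable("/home"))   # True, so reachability is too weak
```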
so how else can we identify compounds?

. indegrees to a directory will often identify leaders
. indegrees can also be used to identify compound document directories
. leaders tend to link to many documents in the directory (eg a toc) [but not all, sidebars etc]

google try to do this too: they collapse results from a single site into a "leader" and parts.

two heuristics:

* rare links heuristic: if most external links from a directory are shared by most of the pages in that directory, then the directory is likely to be a compound document (this is because compound docs are often templated). a rough sketch of this one is at the end of these notes.
* anchor text heuristic: group pages with similar anchor texts together

these heuristics identify 9.5% of all directories as compound documents.

a closing thought of his: a frustration of the hypertext community with the www is that it's a collection of systems that don't work together very well. but there's space for high-level navigational tools, like google.
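[and the promised sketch of the rare links heuristic, in python. the thresholds and the page-to-outlinks structure are my own guesses, not the paper's numbers: a directory whose pages mostly share the same external outlinks (because they're stamped out of one template) gets flagged as a compound document.]

```python
# toy version of the "rare links" heuristic; thresholds are assumptions, not from the paper

def is_compound_directory(pages, share_threshold=0.7, coverage_threshold=0.7):
    """pages: dict mapping page url -> set of external outlink urls (links
    pointing outside the directory). returns True if most of the directory's
    external links are shared by most of its pages."""
    if len(pages) < 2:
        return False
    all_links = set().union(*pages.values())
    if not all_links:
        return False
    # a link counts as "shared" if it appears on at least share_threshold of the pages
    shared = {
        link for link in all_links
        if sum(link in outlinks for outlinks in pages.values()) >= share_threshold * len(pages)
    }
    # the directory looks templated (hence compound) if shared links dominate
    return len(shared) >= coverage_threshold * len(all_links)


if __name__ == "__main__":
    # three parts of a serialised story, all carrying the same template links
    story = {
        "/story/part1.html": {"/about", "/contact", "/archive", "http://example.org/ad"},
        "/story/part2.html": {"/about", "/contact", "/archive"},
        "/story/part3.html": {"/about", "/contact", "/archive"},
    }
    print(is_compound_directory(story))  # True: the template links dominate
```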