2003-04-23 Peer-to-Peer Semantic Search Engines: Building a Memex
http://conferences.oreillynet.com/cs/et2003/view/e_sess/3681

Memex
. coping with a growing amount of data
. personal information repository
. random access, extensibility, ability to add new data
He also mentions trails [the fact that people can swap trails is pretty cool. Trails are just another document. An idea for the www in there?]

on metadata
. very expensive to create (humanities example: one of their team tried to put a lesson plan online; a programmer putting it online took over 10 times as long as gathering the material did)
. metadata standards can be brittle (xhtml 2.0)
. sometimes you can't trust the markup (html meta tags)
. good metadata is always a good thing
. bad metadata can be worse than none

semantic indexing
. no metadata? no problem
. infer semantic relationships from document content
. patterns of word use reflect high-level knowledge

steven johnson's notes
. 1146 paragraph-length clippings from 15 books (typed by an assistant)
. neurobiology, social insects, city planning
[i'd really like to have a selection of clippings like that]

they tested on these notes:
. searching for photosynthesis not only finds documents including that word, but documents on the same subject without that word -> closely related documents share words

"given any two words, what is the distance between them in this document universe"

they're demoing this in an application running in a browser on mac os x, indexing steven johnson's documents. (incidental coolness: vampire bats play tit-for-tat with food sharing, punishing full bats that don't share.)

[this is an interesting way of searching. iterate towards what you're after. could this work with a folder metaphor, or at least like on the desktop? couple it with some timeline stuff, it could be really good.]

this form of indexing is language agnostic, it just has to be parsed into words. [!!! f me] they got it working with non-word datasets; at the bioinformatics conferences they did it with genome data [oooh cool graphs]

latent semantic indexing
. reference: deerwester, 1990
. LSI squishes the document space down into a small set of semantic dimensions. sounds like Principal Component Analysis (his example was 50,000 dimensions down to 100, which is more than quasars, say, where it's 1000 down to 10). [a code sketch of the squashing is after these notes]
other things:
. bad for >10,000 docs
. patented for text search
. no way to represent inter-document links
. limited ability to update
. computationally expensive

contextual network graphs
. simpler, first described in 2003: these guys just published yesterday
. documents and terms are represented as a bipartite graph
. each document and each word is a node; each word is connected to the documents it appears in
. see what's related by the water-pipe model: pour water into a node and see where it spreads. hey, that's good. [also sketched after these notes]
. really easy to update, no recalculation, just keep on adding nodes
. no patents
. it doesn't handle hard queries as well as LSI
[ http://javelina.cet.middlebury.edu/cns/Contextual_Network_Graphs.pdf ]

so now it gets to the peer-to-peer bit
. you can sign up for search results as an RSS feed
. develop opencola-style clients: pull together results and add in similar ones

peer-to-peer search
. create your search domain from modular pieces
. the search engine behaves as if it were a single collection
[where can i get this??]

[this is the guy's weblog http://www.idlewords.com/ and here's the software http://www.nitle.org/semantic_search.php for download!]
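[to make the dimension-squashing concrete, here's a minimal sketch of LSI via truncated SVD. the toy documents, the vocabulary-building, and k=2 are my own assumptions for illustration, not anything from the talk:]

    # Minimal LSI sketch: build a term-document count matrix, then keep
    # only the k largest singular values of its SVD, squashing the big
    # term space down to k semantic dimensions.
    import numpy as np

    docs = [
        "photosynthesis converts light into chemical energy",
        "plants use light to make sugar",
        "ant colonies route food along pheromone trails",
    ]

    # Term-document matrix: one row per term, one column per document.
    vocab = sorted({w for d in docs for w in d.split()})
    A = np.array([[d.split().count(w) for d in docs] for w in vocab],
                 dtype=float)

    # Truncated SVD: this is the "50,000 dimensions down to 100" step,
    # here shrunk to k=2 for the toy data.
    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim row per document

    # In the squashed space, documents about the same subject end up close
    # even when they share few words -- hence the photosynthesis result.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos(doc_vectors[0], doc_vectors[1]))  # high: both about plants/light
    print(cos(doc_vectors[0], doc_vectors[2]))  # low: different subject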
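[and a rough sketch of the contextual-network-graph water-pipe idea: documents and terms as a bipartite graph, energy poured into a query node and allowed to spread. the decay constant and cutoff threshold are my guesses, not the paper's; see the PDF linked above for the real algorithm:]

    # Water-pipe model over a bipartite document/term graph: pour energy
    # into a node and watch where it flows.
    from collections import defaultdict

    docs = {
        "d1": ["photosynthesis", "light", "energy"],
        "d2": ["plants", "light", "sugar"],
        "d3": ["ants", "pheromone", "trails"],
    }

    # Bipartite graph: every document node links to its term nodes and back.
    graph = defaultdict(set)
    for doc, terms in docs.items():
        for t in terms:
            graph[doc].add(t)
            graph[t].add(doc)

    def spread(start, energy=1.0, decay=0.5, threshold=0.01):
        """Pour `energy` into one node; it splits evenly among neighbours
        and decays each hop, stopping once it drops below `threshold`."""
        activation = defaultdict(float)
        frontier = [(start, energy)]
        while frontier:
            node, e = frontier.pop()
            if e < threshold:
                continue
            activation[node] += e
            share = e * decay / len(graph[node])
            for nbr in graph[node]:
                frontier.append((nbr, share))
        return activation

    # d2 gets wet via the shared "light" node even though it never
    # mentions photosynthesis; d3 stays dry. Adding a document is just
    # adding nodes and edges -- no recalculation, as the talk says.
    for node, a in sorted(spread("photosynthesis").items(),
                          key=lambda x: -x[1]):
        print(node, round(a, 3))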
the software seems to use RSS quite a lot. [i'm downloading now, so i'll see later]

[this talk is at http://www.nitle.org/etcon/ ]

some person in the audience: raymond-parks algorithm. each document is a vector; to compute how similar two documents are, take the cosine of the two vectors [this was on perl.com a while ago]

some other bloke: wavepath.com has 770,000 weblog posts indexed.

[talk to hammersley to talk to bloke about getting the LSI dimension-squashing script. i'd really like that]
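[a bare-bones version of the cosine trick that audience member described, assuming plain word counts as the vector components:]

    # Each document becomes a term-count vector; similarity is the cosine
    # of the angle between two such vectors (1.0 = identical direction).
    import math
    from collections import Counter

    def cosine_similarity(doc_a: str, doc_b: str) -> float:
        a = Counter(doc_a.lower().split())
        b = Counter(doc_b.lower().split())
        dot = sum(a[t] * b[t] for t in a)  # only shared terms contribute
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    print(cosine_similarity("vampire bats share food",
                            "full bats punished for not sharing food"))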