2003-04-23 Peer-to-Peer Semantic Search Engines: Building a Memex
http://conferences.oreillynet.com/cs/et2003/view/e_sess/3681

Memex
. coping with a growing amount of data
. personal information repository
. random access, extensibility, ability to add new data
He also mentions trails [the fact that people can swap trails is pretty cool. Trails are just another document. An idea for the www in there?]

on metadata
. very expensive to create (humanities example: one of their team tried to put a lesson plan online; a programmer putting it online took over 10 times as long as gathering the material did)
. metadata standards can be brittle (xhtml 2.0)
. sometimes you can't trust the markup (html meta tags)
. good metadata is always a good thing
. bad metadata can be worse than none

semantic indexing
. no metadata? no problem
. infer semantic relationships from document content
. patterns of word use reflect high-level knowledge

steven johnson's notes
. 1146 paragraph-length clippings from 15 books (typed by an assistant)
. neurobiology, social insects, city planning
[i'd really like to have a selection of clippings like that]

they tested on these notes:
. searching for photosynthesis not only finds documents including that word, but documents on the same subject without that word -> closely related documents share words

"given any two words, what is the distance between them in this document universe"

they're demoing this in an application running in a browser on mac os x, indexing steven johnson's documents. (incidental coolness: vampire bats play tit-for-tat with food sharing, punishing full bats that don't share.)

[this is an interesting way of searching. iterate towards what you're after. could this work with a folder metaphor, or at least like on the desktop? couple it with some timeline stuff, it could be really good.]

this form of indexing is language agnostic, it just has to be parsed into words. [!!! f me] they got it working with non-word datasets; at the bioinformatics conferences they did it with genome data [oooh cool graphs]

latent semantic indexing
. reference: deerwester, 1990
. LSI squishes the document space down into a small set of semantic dimensions. sounds like Principal Component Analysis (his example was 50,000 dimensions down to 100, which is more than quasars, say, where it's 1000 down to 10). [a code sketch of the squashing is after these notes]
other things:
. bad for >10,000 docs
. patented for text search
. no way to represent inter-document links
. limited ability to update
. computationally expensive

contextual network graphs
. simpler, first described in 2003: these guys just published yesterday
. documents and terms are represented as a bipartite graph
. each document and each word is a node; each word is connected to the documents it appears in
. see what's related by the water-pipe model: pour water into a node and see where it spreads. hey, that's good. [also sketched after these notes]
. really easy to update, no recalculation, just keep on adding nodes
. no patents
. it doesn't handle hard queries as well as LSI
[ http://javelina.cet.middlebury.edu/cns/Contextual_Network_Graphs.pdf ]

so now it gets to the peer-to-peer bit
. you can sign up for search results as an RSS feed
. develop opencola-style clients: pull together results and add in similar ones

peer-to-peer search
. create your search domain from modular pieces
. the search engine behaves as if it were a single collection
[where can i get this??]

[this is the guy's weblog http://www.idlewords.com/ and here's the software http://www.nitle.org/semantic_search.php for download!]
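[to make the dimension-squashing concrete, here's a minimal sketch of LSI via truncated SVD. the toy documents, the vocabulary-building, and k=2 are my own assumptions for illustration, not anything from the talk:]

    # Minimal LSI sketch: build a term-document count matrix, then keep
    # only the k largest singular values of its SVD, squashing the big
    # term space down to k semantic dimensions.
    import numpy as np

    docs = [
        "photosynthesis converts light into chemical energy",
        "plants use light to make sugar",
        "ant colonies route food along pheromone trails",
    ]

    # Term-document matrix: one row per term, one column per document.
    vocab = sorted({w for d in docs for w in d.split()})
    A = np.array([[d.split().count(w) for d in docs] for w in vocab],
                 dtype=float)

    # Truncated SVD: this is the "50,000 dimensions down to 100" step,
    # here shrunk to k=2 for the toy data.
    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim row per document

    # In the squashed space, documents about the same subject end up close
    # even when they share few words -- hence the photosynthesis result.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos(doc_vectors[0], doc_vectors[1]))  # high: both about plants/light
    print(cos(doc_vectors[0], doc_vectors[2]))  # low: different subject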
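[and a rough sketch of the contextual-network-graph water-pipe idea: documents and terms as a bipartite graph, energy poured into a query node and allowed to spread. the decay constant and cutoff threshold are my guesses, not the paper's; see the PDF linked above for the real algorithm:]

    # Water-pipe model over a bipartite document/term graph: pour energy
    # into a node and watch where it flows.
    from collections import defaultdict

    docs = {
        "d1": ["photosynthesis", "light", "energy"],
        "d2": ["plants", "light", "sugar"],
        "d3": ["ants", "pheromone", "trails"],
    }

    # Bipartite graph: every document node links to its term nodes and back.
    graph = defaultdict(set)
    for doc, terms in docs.items():
        for t in terms:
            graph[doc].add(t)
            graph[t].add(doc)

    def spread(start, energy=1.0, decay=0.5, threshold=0.01):
        """Pour `energy` into one node; it splits evenly among neighbours
        and decays each hop, stopping once it drops below `threshold`."""
        activation = defaultdict(float)
        frontier = [(start, energy)]
        while frontier:
            node, e = frontier.pop()
            if e < threshold:
                continue
            activation[node] += e
            share = e * decay / len(graph[node])
            for nbr in graph[node]:
                frontier.append((nbr, share))
        return activation

    # d2 gets wet via the shared "light" node even though it never
    # mentions photosynthesis; d3 stays dry. Adding a document is just
    # adding nodes and edges -- no recalculation, as the talk says.
    for node, a in sorted(spread("photosynthesis").items(),
                          key=lambda x: -x[1]):
        print(node, round(a, 3))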
the software seems to use RSS quite a lot. [i'm downloading now, so i'll see later]

[this talk is at http://www.nitle.org/etcon/ ]

some person in the audience: raymond-parks algorithm. each document is a vector; to compute how similar two documents are, take the cosine of the two vectors [this was on perl.com a while ago]

some other bloke: wavepath.com has 770,000 weblog posts indexed.

[talk to hammersley to talk to bloke about getting the LSI dimension-squashing script. i'd really like that]
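[a bare-bones version of the cosine trick that audience member described, assuming plain word counts as the vector components:]

    # Each document becomes a term-count vector; similarity is the cosine
    # of the angle between two such vectors (1.0 = identical direction).
    import math
    from collections import Counter

    def cosine_similarity(doc_a: str, doc_b: str) -> float:
        a = Counter(doc_a.lower().split())
        b = Counter(doc_b.lower().split())
        dot = sum(a[t] * b[t] for t in a)  # only shared terms contribute
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    print(cosine_similarity("vampire bats share food",
                            "full bats punished for not sharing food"))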