2003-08-29

Refinement of TF-IDF Schemes for Web Pages using their Hyperlinked Neighboring Pages
http://www.ht03.org/papers/pdfs/33.pdf

Abstract from http://www.ht03.org/papers/

"In IR (Information Retrieval) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, we believe that a technique for representing the contents of Web pages more accurately is required by exploiting the contents of their hyperlinked neighboring pages. In this paper, we first propose three methods for refining the tf-idf scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare retrieval accuracy of our proposed methods. Experimental results show that, generally, more accurate feature vectors of a target Web page can be generated in the case of utilizing the contents of its hyperlinked neighboring pages at levels up to second in the backward direction from the target page."

[ the maths is far too hard for me. this is what i think is being said:

proposed methods for adjusting the tf-idf weights of a page: in web space, there are groups of pages that are defined by following links in the backwards or forwards direction from the target page.

so method 2 has layers:
1 link away forwards,
2 links away forwards,
n links away forwards,
and these define the groups.

method 3 has:
1 link away forwards
1 *and* 2 links away forwards
1 and 2 *and* 3 links away forwards
(similarly for backlinks)

each group is averaged into a centroid, which gets a weight moderated by the distance of that centroid vector from the main page's vector. the weighted groups are then summed, to get the final vector for the page.

i'm not sure how the transition from links in webspace to vectorspace goes. now, from the questions, i understand they're not just using the hypertext structure, and not just using the anchor text, but expanding to the content of the entire neighbouring document? not sure.
apparently this system weights up documents that *link forwards* to highly ranked documents. this is good, because it means documents that start a good browsing trail are recommended. ]

[the HITS algorithm is all about authorities and hubs again. I think it's worth trying hard to understand this paper, especially the 'related work' section at the beginning.]

one conclusion: the number of topics of the web pages that point to a target web page tends to be 3 [i'd love to know exactly what this means...]
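
A rough Python sketch of the layered-centroid idea as the note above reads it. This is not the paper's actual formulation: the function names (`tf_idf`, `centroid`, `distance`, `refine`), the `1 / (1 + distance)` weighting and the toy vectors are all assumptions made for illustration. It may help with the "links in webspace to vectorspace" question: each page's text becomes a tf-idf vector first, and the link structure is only used to decide which vectors get grouped together.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain tf-idf. docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def centroid(vectors):
    """Average a non-empty list of sparse vectors term by term."""
    total = Counter()
    for v in vectors:
        for term, weight in v.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def refine(target, levels, cumulative=False):
    """Add distance-weighted group centroids to the target page's vector.

    levels[0] holds the vectors of pages 1 link away, levels[1] those 2 links
    away, and so on (forwards or backwards, the caller decides).
    cumulative=False groups each level separately (the note's "method 2");
    cumulative=True groups levels 1, 1+2, 1+2+3, ... (the note's "method 3").
    The 1 / (1 + distance) weight is an assumption, not taken from the paper.
    """
    refined = dict(target)
    pool = []
    for level in levels:
        pool = pool + level if cumulative else level
        if not pool:
            continue
        c = centroid(pool)
        weight = 1.0 / (1.0 + distance(target, c))
        for term, w in c.items():
            refined[term] = refined.get(term, 0.0) + weight * w
    return refined

# Toy data: a target page plus one neighbour at level 1 and one at level 2.
target = {"a": 1.0}
levels = [[{"a": 1.0}], [{"b": 2.0}]]
layered = refine(target, levels)                  # separate layers
nested = refine(target, levels, cumulative=True)  # nested layers
```

In the layered version the far-away page about "b" contributes only through its own, distant centroid; in the nested version it is diluted by being averaged together with the closer page.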