2003-09-29

Enhanced Web Document Summarization Using Hyperlinks
http://www.ht03.org/papers/pdfs/34.pdf

Abstract from http://www.ht03.org/papers/

"This paper addresses the issue of Web document summarization. As textual content of Web documents is often scarce or irrelevant and existing summarization techniques are based on it, many Web pages and websites cannot be suitably summarized. We consider the context of a Web document by the textual content of all the documents linking to it. To summarize a target Web document, a context-based summarizer has to perform a preprocessing task, during which it will be decided which pieces of information in the source documents are relevant to the content of the target. Then a context-based summarizer faces two issues: first, the selected elements may partially deal with the topic of the target, second they may be related to the target and yet not contain any cues about the content of the target.

In this paper we put forward two new summarization by context algorithms. The first one uses both the content and the context of the document and the second one is based only on the elements of the context. It is shown that summaries taking into account the context are usually much more relevant than those made only from the content of the target document. Optimal conditions of the proposed algorithms with respect to the sizes of the content and the context of the document to summarize are studied."

summarisation is important for different devices, search engines, content adaptation. but web documents are often multimedia, have few bits of text, have many topics, and are in HTML. traditional systems don't cope well with this.

so: summarise documents using their context

advantages: the context documents often already contain text summarising the target

related work [to look into...]:
- Menczer's two hypotheses
- Davison's results
- InCommonSense system

main issues of doing it by context:
. contextualisation
. partiality
. topicality

[this is like Google News. there's an event, but the event doesn't have a URI, and it's a real thing anyway rather than text to be summarised. so you summarise the context, and have the same problems.]

contextualisation:
- find backlinks
- filter and keep only things which use paragraphs
- tokenise and only keep adjectives, nouns, verbs
- vector space model
- expand into synonyms (using WordNet)

partiality:
. decrease the size of the context without loss of information
[like the meaning being holographic within writing - sort of, as opposed to being contained, i mean - as i was talking about yesterday]

topicality:
. a context page might say "cars were stolen, CNN reports", but this doesn't say anything about CNN itself

a number of algorithms are presented to get around these difficulties.
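
[a rough sketch of the tokenise + vector space steps from the contextualisation list, to make the topicality problem concrete. this is not the paper's algorithm. assumptions: a tiny stopword list stands in for the adjective/noun/verb filter, the WordNet synonym expansion is left out, backlink fetching is skipped, and the names tokenise, vectorise, cosine and rank_context are made up for illustration. it scores each context paragraph by cosine similarity to the target's own text.]

```python
import math
import re
from collections import Counter

# Assumption: a small stopword list stands in for the part-of-speech filter
# (keeping only adjectives, nouns, verbs) mentioned in the notes.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "were", "it", "this", "that", "for", "on", "with", "by"}


def tokenise(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def vectorise(text):
    """Bag-of-words term-frequency vector (term -> count)."""
    return Counter(tokenise(text))


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_context(target_text, context_paragraphs):
    """Return (score, paragraph) pairs, most similar to the target first."""
    target_vec = vectorise(target_text)
    scored = [(cosine(vectorise(p), target_vec), p) for p in context_paragraphs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up example texts, echoing the CNN example from the notes.
target = "CNN is a television news network broadcasting world news"
context = [
    "cars were stolen, CNN reports",
    "CNN is a cable news network with world news coverage",
]
ranked = rank_context(target, context)
for score, paragraph in ranked:
    print(f"{score:.2f}  {paragraph}")
```

[the "cars were stolen" paragraph shares only the word "CNN" with the target, so it ranks below the paragraph that actually describes CNN. note this only works when the target has usable text of its own, which is exactly what the paper says is often missing.]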