David Hawking, Web Search Engines
Part 1:
- Modern search engines do more than was ever believed possible
- Article focus: go behind the scenes and explain how this data processing "miracle" is possible
- To be cost-effective, search engines must reject as much low-value automated content as possible
- Currently, the amount of Web data that search engines crawl and index is on the order of 400 terabytes
- The simple crawling algorithm must be extended to address the issues of speed, politeness, excluded content, duplicate content, continuous crawling, and spam rejection
- Engineering a Web-scale crawler is not for the unskilled or fainthearted (tag, I'm out)
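To make the crawling notes above a bit more concrete, here is a minimal sketch of a simple crawler extended with three of the listed concerns: politeness, excluded content, and duplicate content. Everything in it is an assumption for illustration and not taken from the article: the in-memory `FAKE_WEB`, the `EXCLUDED_PREFIXES` rule (a stand-in for robots.txt), the `crawl` function, and the simulated clock used in place of real waiting. Speed, continuous crawling, and spam rejection are left out.

```python
import hashlib
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web" so the sketch runs without network access:
# URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("home page", ["http://a.example/about", "http://b.example/"]),
    "http://a.example/about": ("about us", ["http://a.example/", "http://b.example/private/x"]),
    "http://b.example/": ("home page", []),  # same text as http://a.example/
    "http://b.example/private/x": ("secret", []),
}

# Stand-in for robots.txt style exclusion rules (assumed for illustration).
EXCLUDED_PREFIXES = ("http://b.example/private/",)


def crawl(seed, fetch, min_delay=1.0):
    """Breadth-first crawl; returns the URLs accepted for indexing, in order."""
    queue = deque([seed])
    seen_urls = {seed}      # never queue the same URL twice
    seen_digests = set()    # content hashes, for exact-duplicate detection
    next_allowed = {}       # host -> earliest simulated time of the next request
    clock = 0.0             # simulated seconds; a real crawler would actually wait
    accepted = []

    while queue:
        url = queue.popleft()
        # Excluded content: skip URLs the site has asked crawlers to avoid.
        if url.startswith(EXCLUDED_PREFIXES):
            continue
        # Politeness: space out requests to the same host by min_delay.
        host = urlparse(url).netloc
        clock = max(clock, next_allowed.get(host, 0.0))
        next_allowed[host] = clock + min_delay
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        # Duplicate content: drop pages whose text has already been seen.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        accepted.append(url)
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)
    return accepted
```

Calling `crawl("http://a.example/", FAKE_WEB.get)` accepts only the two `a.example` pages: `http://b.example/` is dropped as a duplicate of the home page, and the `/private/` URL is excluded before it is fetched.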
Part 2:
- Focus: “reviews the algorithms and data structures required to index 400 terabytes of Web page text and deliver high-quality results in response to hundreds of millions of queries each day.”
- Search engines use an inverted file to rapidly identify the documents that contain a particular indexing term
- Goes over the concepts of scaling up, term lookup, compression, phrases, anchor text (kind of interesting), link popularity scores, and query-independent scores
- A major problem with the simple query processor is that it returns poor results
- Techniques to speed things up: skipping, early termination, assignment of document numbers, caching (something I knew of before this article, yay)
- Now interested in the ideas of generating advertisements targeted to the search query and generating spelling suggestions from query logs
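As a rough illustration of the inverted file and the simple query processor mentioned in the notes above, here is a small sketch. It is not the article's implementation: the function names (`build_inverted_index`, `intersect`, `and_query`), the whitespace tokenizer, and the three sample documents are all assumptions made up for this example. It returns matching document numbers with no ranking at all, which hints at why a simple query processor gives poor results.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted postings list of document numbers."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def intersect(a, b):
    """Merge two sorted postings lists, keeping numbers present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, query):
    """Return documents containing every query term, unranked."""
    postings = [index.get(term, []) for term in query.lower().split()]
    if not postings:
        return []
    postings.sort(key=len)  # start from the rarest term to keep work small
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result


# Hypothetical sample collection, keyed by document number.
DOCS = {
    1: "web search engines",
    2: "search engine crawler",
    3: "web crawler politeness",
}
INDEX = build_inverted_index(DOCS)
```

With this sample collection, `and_query(INDEX, "web search")` matches only document 1, and `and_query(INDEX, "crawler")` matches documents 2 and 3.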
Current developments and future trends for the OAI protocol for metadata harvesting: 
- The article looks at developing trends in the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), originally initiated for the e-print archives community, with mention of the Mellon Foundation. The article and the protocol's development are interesting. Though I am more into paper-based documents, it seems more likely every day that I will have to know and work with these types of documents and databases as the archival community continues to shift and develop.
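For a sense of what harvesting looks like in practice, here is a small sketch that builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix`, `from`, `set`, and `resumptionToken` arguments come from the OAI-PMH specification; the helper name `list_records_url` and the repository address `http://archive.example.org/oai` are hypothetical and only for illustration. The sketch builds the URL and does not send any request.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None,
                     set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token is not None:
        # A resumptionToken is an exclusive argument: it is sent with the
        # verb only, to continue a previously started list.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date   # selective harvesting by datestamp
        if set_spec:
            params["set"] = set_spec     # selective harvesting by set
    return base_url + "?" + urlencode(params)


# Hypothetical repository base URL.
BASE = "http://archive.example.org/oai"
first_request = list_records_url(BASE, from_date="2008-01-01")
```

A harvester would fetch this URL, parse the XML response, and repeat the request with any `resumptionToken` the repository returns until the list is complete.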
The Deep Web: Surfacing Hidden Value: 
- This article was very enjoyable: I found it easy to read and enjoyed the metaphors, graphs, and charts (yay pictures). The internet is way too vast for the common person to conceptualize, and this article aided my understanding of a complex tool/resource I typically take for granted.