Foundations and Trends® in Information Retrieval > Vol 4 > Issue 3

Web Crawling

By Christopher Olston, Yahoo! Research, USA, olston@yahoo-inc.com | Marc Najork, Microsoft Research, USA, najork@microsoft.com

 
Suggested Citation
Christopher Olston and Marc Najork (2010), "Web Crawling", Foundations and Trends® in Information Retrieval: Vol. 4: No. 3, pp 175-246. http://dx.doi.org/10.1561/1500000017

Publication Date: 12 Feb 2010
© 2010 C. Olston and M. Najork
 
Subjects
Databases on the Web
 

In this article:
1 Introduction 
2 Crawler Architecture 
3 Crawl Ordering Problem 
4 Batch Crawl Ordering 
5 Incremental Crawl Ordering 
6 Avoiding Problematic and Undesirable Content 
7 Deep Web Crawling 
8 Future Directions 
References 

Abstract

This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.

DOI:10.1561/1500000017
ISBN: 978-1-60198-322-0
80 pp. $65.00 (paperback)
 
ISBN: 978-1-60198-323-7
80 pp. $100.00 (e-book, PDF)

Web Crawling

The magic of search engines starts with crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, it poses many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. Web Crawling outlines the key scientific and practical challenges, describes the state-of-the-art models and solutions, and highlights avenues for future work. It is intended for anyone who wishes to understand or develop crawler software, or to conduct research related to crawling.

 
INR-017