Web Spider

Venue: Calgary, Canada

Location: Calgary, Alberta, Canada

Event Date/Time: Apr 23, 2005

Description

ABSTRACT: A web spider is an automated tool that crawls through the links of the web and returns a list of all the links it has traversed. It crawls the web by starting from one page, following all the links on that page, and so on. Thus, if a web spider is pointed at a website, it will automatically download all the pages on that site, and possibly more if so specified. Our goal is to build a web spider that can at least download all pages under a given URL. Search engines and other web services rely primarily on web spiders to collect large amounts of data for analysis. Designing high-performance web spiders is a challenging task due to the large scale of the web. There are two important aspects in designing efficient web spiders: crawling strategy and crawling performance. Crawling strategy deals with how to prioritize documents for downloading, while crawling performance deals with how to optimize the spider itself. Significant features of a web spider are scalability, robustness, flexibility, and reconfigurability. A web spider can scale to download several hundred or thousand URLs per second without overwhelming any particular web server. It is substantially resilient against system crashes during downloading and can be customized for various other web applications.

Keywords: web spider, crawler, crawling strategy, crawling performance, URL extraction.

1 Introduction


The World Wide Web (WWW), or web, can be viewed as a huge distributed database spread across several million hosts on the Internet, where data entities are stored as web pages on web servers. Web pages are mostly unstructured or poorly structured documents, and their logical relationships are represented by hyperlinks. Due to the enormous size of the web, search engines play an increasingly important role as a primary tool for locating information. Every search engine relies on a massive collection mechanism called a web spider (also known as a crawler or robot). These spiders "crawl" across the web, following hyperlinks from site to site and storing the downloaded pages they visit in order to build a searchable index of web pages. Most search engines compete against each other on the number of indexed pages, the quality of returned pages, and response time. Search engines are one of the primary ways that Internet users find information.





2 Problem Formulation

There are two important aspects in designing efficient web spiders:
1. Crawling strategy and
2. Crawling performance.

2.1 Crawling strategy

It deals with how the spider decides which pages should be downloaded next. In general, a web spider cannot download all pages on the web because its resources are limited compared to the size of the web.

• Hyperlinked documents can be viewed as nodes in a graph
• Data gathering issues:
– How to visit each node once
– How to gather a representative sample of nodes.
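
Viewing pages as nodes of a graph, the "visit each node once" requirement amounts to keeping a set of already-seen URLs. The following minimal sketch (in Python; the get_links callback is a hypothetical stand-in for fetching and parsing a page) illustrates the idea:

from collections import deque

def crawl_graph(start_url, get_links):
    # get_links(url) is assumed to return the URLs hyperlinked from the page at url.
    visited = set([start_url])      # nodes already seen
    frontier = deque([start_url])   # nodes waiting to be visited
    while frontier:
        url = frontier.popleft()    # FIFO order gives a breadth-first traversal
        for link in get_links(url):
            if link not in visited: # ensures each node is visited only once
                visited.add(link)
                frontier.append(link)
    return visited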


3 Spider Architecture

1. Initialize a page queue with one or a few known sites.
– E.g., http://www.cs.umass.edu
2. Pop an address from the queue
3. Get the page
4. Parse the page to find other URLs
– E.g., a link whose anchor text is “Recent News”
5. Discard URLs that do not meet requirements
– E.g., images, executables, Postscript, PDF, zipped files, …
– E.g., pages that have been seen before
6. Add the URLs to the queue
7. If not time to stop, go to step 2
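
A minimal sketch of this seven-step loop, using only the Python standard library, is given below; the max_pages limit, the regular-expression link extraction, and the exact file-type filter are illustrative assumptions rather than part of the original design.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

SKIP = re.compile(r'\.(jpg|png|gif|exe|ps|pdf|zip)$', re.IGNORECASE)

def spider(seed, max_pages=100):
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:      # step 7: stop condition
        url = queue.popleft()                    # step 2: pop an address
        try:                                     # step 3: get the page
            html = urllib.request.urlopen(url, timeout=10).read().decode('utf-8', 'replace')
        except Exception:
            continue                             # skip pages that fail to download
        pages[url] = html
        for href in re.findall(r'href="([^"#]+)"', html):    # step 4: parse for URLs
            link = urljoin(url, href)
            if SKIP.search(link) or link in seen:             # step 5: discard unwanted or seen URLs
                continue
            seen.add(link)
            queue.append(link)                                # step 6: add the URL to the queue
    return pages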



Fig. 1: A Simple Spider Architecture

Main Process:
• Coordinates the behavior of multiple, multi-threaded crawlers
• Maintains a DB of unexamined URLs
• Maintains a DB of already examined URLs
• Distributes unexamined URLs to crawler processes
• Receives URLs from crawler processes

Crawler Process:
• Coordinates the behavior of multiple downloading threads
– The network is slow and unreliable
• Return URLs to main process
– Harvested URLs
– URLs of pages that couldn’t be downloaded

Downloading Threads:
• Download a page (or timeout)
• Extract URLs from downloaded document (harvested URLs)
• Store document in DB
• Return URLs to crawler process
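
As a single-machine approximation of this architecture, the sketch below (Python; class and parameter names are illustrative) lets one object play the role of the main process while worker threads play the role of downloading threads; the in-memory set, queue, and dictionary stand in for the URL and document databases that a real deployment would keep in persistent storage.

import queue
import re
import threading
import urllib.request
from urllib.parse import urljoin

class Spider:
    def __init__(self, seeds, n_threads=4, max_pages=50):
        self.unexamined = queue.Queue()   # "DB" of unexamined URLs
        self.examined = set(seeds)        # "DB" of already examined URLs
        self.documents = {}               # store of downloaded documents
        self.lock = threading.Lock()
        self.n_threads = n_threads
        self.max_pages = max_pages        # illustrative stop condition
        for url in seeds:
            self.unexamined.put(url)

    def _download_worker(self):
        while True:
            with self.lock:
                if len(self.documents) >= self.max_pages:
                    return
            try:
                url = self.unexamined.get(timeout=5)   # URL distributed by the main queue
            except queue.Empty:
                return                                 # nothing left to examine
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode('utf-8', 'replace')
            except Exception:
                continue                               # page couldn't be downloaded
            harvested = [urljoin(url, h) for h in re.findall(r'href="([^"#]+)"', html)]
            with self.lock:
                self.documents[url] = html             # store the document
                for link in harvested:                 # return harvested URLs
                    if link not in self.examined:
                        self.examined.add(link)
                        self.unexamined.put(link)

    def run(self):
        threads = [threading.Thread(target=self._download_worker) for _ in range(self.n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.documents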








3.1 Search Strategies

Depth-first vs breadth-first
• Depth-first: LIFO queuing strategy
– Produces a “narrow” crawl; results are unpredictable
• Breadth-first: FIFO queuing strategy
– Produces a “broad” crawl
– Common choice, because of its simplicity
• Site-based breadth-first: FIFO domain name queuing strategy
– Example: *.umass.edu, *.ibm.com
– Produces a “broader” crawl, because it quickly visits more sites
– Spreads downloading activity among more sites
– Common choice, because of its behavior
– Requires a more sophisticated URL queue
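
The difference between these strategies comes down to the discipline of the URL queue. The sketch below (Python; class names are illustrative, and the site-based frontier assumes pop() is only called while URLs remain) shows one way the three disciplines could be implemented behind the same push/pop interface.

from collections import defaultdict, deque
from urllib.parse import urlparse

class DepthFirstFrontier:
    # LIFO queue: produces a "narrow" depth-first crawl.
    def __init__(self):
        self._stack = []
    def push(self, url):
        self._stack.append(url)
    def pop(self):
        return self._stack.pop()

class BreadthFirstFrontier:
    # FIFO queue: produces a "broad" breadth-first crawl.
    def __init__(self):
        self._fifo = deque()
    def push(self, url):
        self._fifo.append(url)
    def pop(self):
        return self._fifo.popleft()

class SiteBreadthFirstFrontier:
    # Round-robin over domain names, FIFO within each domain:
    # visits many sites quickly and spreads load across servers.
    def __init__(self):
        self._per_site = defaultdict(deque)   # one FIFO queue per domain
        self._sites = deque()                 # round-robin order of domains
    def push(self, url):
        site = urlparse(url).netloc
        if site not in self._per_site:
            self._sites.append(site)
        self._per_site[site].append(url)
    def pop(self):
        while True:
            site = self._sites[0]
            self._sites.rotate(-1)            # start with a different site next time
            if self._per_site[site]:
                return self._per_site[site].popleft()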

4 Conclusion

The web spider performs precise link searching, including JavaScript parsing, and can download multiple files simultaneously.
It allows the user to specify an account name and password to access secure web sites.
The spider retrieves all the web pages from a site and then, depending on search parameters or specific tags, collects the data; this helps in saving the pages automatically.

There are many ways the web spider can be further improved. Detecting infinitely generated pages, the same pages served under different host names (aliases), and mirror pages could greatly reduce wasted bandwidth, disk space, and processing time. The current bandwidth usage depends on the number of data-collector threads and the delay time; a better bandwidth control mechanism could ensure that the bandwidth used by the spider does not exceed a predefined limit. Incremental updating of the web data could reduce the time and bandwidth required to keep the collection current.
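
One possible building block for the alias and mirror detection mentioned above is URL canonicalization, applied before the "seen before" check. The sketch below is only an illustration: the alias table and the normalization rules are assumptions, not part of the spider described here.

from urllib.parse import urlparse, urlunparse

# Hypothetical table of host names known to serve identical content.
HOST_ALIASES = {'www.example.com': 'example.com'}

def canonicalize(url):
    # Normalize a URL so that trivially different spellings of the
    # same page compare equal before the duplicate check.
    parts = urlparse(url)
    host = parts.netloc.lower().rstrip('.')
    host = HOST_ALIASES.get(host, host)    # map known aliases to one canonical name
    path = parts.path or '/'
    return urlunparse((parts.scheme.lower(), host, path, '', parts.query, ''))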


