ABSTRACT: A web spider is an automated tool that crawls through the links of the web and reports back a list of all the links it has traversed.
A web spider crawls the web, i.e., it starts from one page, follows all the links on that page, and so on.
So, if you point a web spider at a website, it will automatically download all the pages on that site, and possibly more if so configured. We are therefore trying to build a web spider that can at least download all pages under a certain URL.
Search engines and other web services rely primarily on web spiders to collect large amounts of data for analysis.
Designing high-performance web spiders is a challenging task due to the large scale of the web.
There are two important aspects in designing efficient web spiders, i.e., crawling strategy and crawling performance. Crawling strategy deals with how to prioritize documents for downloading, while crawling performance deals with how to optimize the spider's throughput. Significant features of a web spider are scalability, robustness, flexibility, and reconfigurability. A web spider can scale to download several hundred or thousand URLs per second without overwhelming any particular web server. It is substantially resilient against system crashes during downloading and can be customized for various other web applications.
Keywords: Crawler Process, Return URLs to main process, Download a page, Extract URLs from downloaded document, Return URLs to crawler process.
1 Introduction
The World Wide Web (WWW), or web, can be viewed as a huge database distributed across several million hosts on the Internet, where data entities are stored as web pages on web servers. Web pages are mostly unstructured or poorly structured documents, and their logical relationships are represented by hyperlinks. Due to the enormous size of the web, search engines play an increasingly important role as a primary tool for locating information. Every search engine relies on a massive collection mechanism called web spiders, crawlers, or robots. These spiders "crawl" across the web, following hyperlinks from site to site and storing the pages they download to build a searchable index of web pages. Most search engines compete with each other on the number of indexed pages, the quality of returned pages, and response time. Search engines are one of the primary ways that Internet users find information.
2 Problem Formulation
There are two important aspects in designing efficient web spiders, i.e.
1. Crawling strategy and
2. Crawling performance.
2.1 Crawling strategy
It deals with the way the spider decides which pages should be downloaded next. Generally, a web spider cannot download all pages on the web because its resources are limited compared to the size of the web.
Hyperlinked documents can be viewed as nodes in a graph
Data gathering issues:
How to visit each node once
How to gather a representative sample of nodes.
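The "visit each node once" issue is commonly approached by reducing every link to one canonical form and remembering which forms have already been offered. The following is a minimal sketch using Python's standard `urllib.parse`; the helper names `canonicalize` and `first_visit` are hypothetical, and the normalization rules shown (drop the fragment, lowercase the host, treat an empty path as `/`) are assumptions rather than a complete treatment of aliases or mirrors.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def canonicalize(base_url: str, href: str) -> str:
    """Resolve href against base_url and normalize it so equivalent URLs compare equal."""
    absolute = urljoin(base_url, href)
    absolute, _fragment = urldefrag(absolute)  # "#section" does not name a new page
    parts = urlsplit(absolute)
    path = parts.path or "/"
    # Scheme and host are case-insensitive; path and query are kept as-is.
    # (Simplification: a netloc carrying user:password would be lowercased too.)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


seen = set()  # canonical URLs that have already been queued or visited


def first_visit(url: str) -> bool:
    """Return True only the first time a canonical URL is offered."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

A spider would call `first_visit(canonicalize(page_url, href))` for every extracted link and queue only those for which it returns `True`.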
3 Spider Architecture
1. Initialize a page queue with one or a few known sites.
E.g., http://www.cs.umass.edu
2. Pop an address from the queue
3. Get the page
4. Parse the page to find other URLs
E.g., <a href="/csinfo/announce/">Recent News</a>
5. Discard URLs that do not meet requirements
E.g., images, executables, Postscript, PDF, zipped files,
E.g., pages that have been seen before
6. Add the URLs to the queue
7. If not time to stop, go to step 2
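Steps 1 to 7 above can be sketched as a single loop. This is an illustrative sketch, not the authors' implementation: the names `crawl`, `fetch_page`, `LinkParser`, and `SKIPPED_SUFFIXES`, the suffix list, the 10-second timeout, and the `max_pages` stopping rule are all assumptions. The `fetch` parameter is injectable so the loop can be exercised without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Assumed list of file types the spider does not want (step 5).
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".exe", ".ps", ".pdf", ".zip")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_page(url):
    """Download one page as text; raises OSError on network errors or timeout."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, max_pages=50, fetch=fetch_page):
    queue = deque(seeds)                      # step 1: initialize the page queue
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:   # step 7: stop condition
        url = queue.popleft()                 # step 2: pop an address
        try:
            html = fetch(url)                 # step 3: get the page
        except OSError:
            continue                          # skip pages that cannot be downloaded
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)                     # step 4: parse the page for URLs
        for href in parser.hrefs:
            link, _fragment = urldefrag(urljoin(url, href))
            if not link.startswith(("http://", "https://")):
                continue                      # step 5: e.g. mailto: links
            if link.lower().endswith(SKIPPED_SUFFIXES) or link in seen:
                continue                      # step 5: unwanted types, seen pages
            seen.add(link)
            queue.append(link)                # step 6: add the URL to the queue
    return pages
```

For example, `crawl(["http://www.cs.umass.edu"], max_pages=10)` would return a dictionary mapping each downloaded URL to its HTML text.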
Fig. 1: A Simple Spider Architecture
Main Process:
Coordinates the behavior of multiple, multi-threaded crawlers
Maintains a DB of unexamined URLs
Maintains a DB of already examined URLs
Distributes unexamined URLs to crawler processes
Receives URLs from crawler processes
Crawler Process:
Coordinates the behavior of multiple downloading threads
The network is very slow & unreliable
Return URLs to main process
Harvested URLs
URLs of pages that couldn't be downloaded
Downloading Threads:
Download a page (or timeout)
Extract URLs from downloaded document (harvested URLs)
Store document in DB
Return URLs to crawler process
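The division of work among the main process, crawler processes, and downloading threads can be sketched as follows. As a simplification, everything runs inside one Python process with a thread pool standing in for the downloading threads, and the two URL "DBs" are an in-memory list and set. The names `main_process`, `crawler_process`, and `Downloader`, as well as the batch size and thread count, are hypothetical. The `download` callable is assumed to fetch a page (or time out by raising `OSError`), store the document, and return the harvested URLs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# A downloader returns the URLs harvested from one page, or raises OSError.
Downloader = Callable[[str], Iterable[str]]


def crawler_process(urls, download: Downloader, threads: int = 4):
    """Run one batch on downloading threads; return (harvested URLs, failed URLs)."""
    harvested, failed = set(), []

    def worker(url):
        try:
            return url, list(download(url)), None
        except OSError as err:            # slow or unreliable network
            return url, [], err

    with ThreadPoolExecutor(max_workers=threads) as pool:
        for url, links, err in pool.map(worker, urls):
            if err is not None:
                failed.append(url)
            harvested.update(links)
    return harvested, failed


def main_process(seeds, download: Downloader, batch_size: int = 8, max_pages: int = 100):
    """Hand out unexamined URLs in batches and collect what the crawler returns."""
    unexamined = list(dict.fromkeys(seeds))   # "DB" of unexamined URLs
    examined = set()                          # "DB" of already examined URLs
    failed_urls = []
    while unexamined and len(examined) < max_pages:
        n = min(batch_size, max_pages - len(examined))
        batch, unexamined = unexamined[:n], unexamined[n:]
        examined.update(batch)
        harvested, failed = crawler_process(batch, download)
        failed_urls.extend(failed)
        for url in sorted(harvested):         # sorted only to make runs repeatable
            if url not in examined and url not in unexamined:
                unexamined.append(url)
    return examined, failed_urls
```

In a larger deployment the call to `crawler_process` would presumably become a message to a separate process or machine; the bookkeeping in `main_process` stays the same.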
Search Strategies
Depth-first vs breadth-first
Depth-first: LIFO queuing strategy
Produces a narrow crawl, results unpredictable
Breadth-first: FIFO queuing strategy
Produces a broad crawl
Common choice, because of its simplicity
Site-based breadth-first: FIFO domain name queuing strategy
Example: *.umass.edu, *.ibm.com
Produces a broader crawl, because it quickly visits more sites
Spreads downloading activity among more sites
Common choice, because of its behavior
Requires a more sophisticated URL queue
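The three queuing strategies differ only in which URL leaves the queue next. The sketch below illustrates this on a small in-memory link graph instead of the real web. `crawl_order` and `SiteQueue` are hypothetical names, and the round-robin policy across hosts is one assumed way to realize a "FIFO domain name" queue.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


def crawl_order(graph, seed, strategy="breadth"):
    """Return the visit order over graph (dict: URL -> list of linked URLs)."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier:
        # FIFO gives breadth-first, LIFO gives depth-first.
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


class SiteQueue:
    """FIFO of hosts, each holding its own FIFO of URLs (round-robin across sites)."""

    def __init__(self):
        self.by_host = OrderedDict()

    def push(self, url):
        host = urlsplit(url).hostname or ""
        self.by_host.setdefault(host, deque()).append(url)

    def pop(self):
        host, urls = self.by_host.popitem(last=False)  # oldest host first
        url = urls.popleft()
        if urls:
            self.by_host[host] = urls                  # host goes to the back
        return url

    def __bool__(self):
        return bool(self.by_host)
```

With `graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}`, the breadth-first order from `"A"` is A, B, C, D, E, while the LIFO order is A, C, E, B, D. `SiteQueue` shows why the site-based variant needs a more sophisticated queue: consecutive pops alternate between hosts, so no single site receives a burst of requests.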
4 Conclusion
The web spider performs precise link searching, including JavaScript parsing, and can download multiple files simultaneously.
It will allow you to specify an account name and password to access secure Web sites.
The spider retrieves all the web pages from a site and then collects data from them according to search parameters or tags, which makes it possible to save the pages automatically.
There are many ways the web spider can be further improved. Detecting and avoiding infinitely generated pages, the same pages under different host names (aliases), and mirror pages can greatly reduce wasted bandwidth, disk space, and processing time. The current bandwidth usage depends on the number of data collector threads and the delay time; a better bandwidth control mechanism could ensure that the bandwidth used by the spider does not exceed a predefined limit. Incremental updating of web data could reduce the time and bandwidth required to refresh the collected data.
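One common way to cap bandwidth at a predefined limit is a token bucket, sketched below as an illustration of the kind of mechanism meant here, not as part of the current spider. The class name `BandwidthLimiter` and its parameters are hypothetical; the clock is injectable so the behavior can be checked without waiting.

```python
import time


class BandwidthLimiter:
    """Token bucket: about `rate` bytes per second on average, bursts up to `burst` bytes."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst        # start with a full bucket
        self.last = clock()

    def delay_for(self, nbytes):
        """Charge nbytes and return the number of seconds the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        # A negative balance is a debt that takes debt / rate seconds to repay.
        return max(0.0, -self.tokens / self.rate)
```

Each downloading thread would call `time.sleep(limiter.delay_for(len(chunk)))` after reading a chunk. A limiter shared between threads would additionally need a lock around `delay_for`.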