Spiders, Webcrawlers, etc.

A spider is a computer application whose sole purpose in life is to look at and catalogue online data. It will prowl the Internet, probing web sites and examining their contents, storing key information about them in its catalogue.
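At its core, that prowling is just a fetch-parse-queue loop. The sketch below is a minimal illustration of the idea, assuming a single seed URL and an in-memory catalogue; the URL and the fields stored are made up for illustration, and a real spider would add politeness delays, robots.txt handling, and persistent storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    catalogue = {}                 # url -> key information about the page
    queue = deque([seed_url])
    while queue and len(catalogue) < max_pages:
        url = queue.popleft()
        if url in catalogue:
            continue
        try:
            with urlopen(url, timeout=5) as response:
                body = response.read().decode("utf-8", errors="replace")
                last_modified = response.headers.get("Last-Modified")
        except OSError:
            continue               # skip pages that cannot be fetched
        parser = LinkParser()
        parser.feed(body)
        catalogue[url] = {"last_modified": last_modified,
                          "outgoing_links": len(parser.links)}
        for link in parser.links:
            queue.append(urljoin(url, link))   # resolve relative links
    return catalogue


if __name__ == "__main__":
    print(crawl("http://example.com/"))        # hypothetical seed URL
```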

Clearly, just randomly crawling across the web won't produce a comprehensive and up-to-date catalogue, so these spiders are intelligent enough to revisit pages systematically, examining various things about a page (such as its last-modified date) to decide when a revisit is necessary. They also keep track of how many pages link to a given page (its backlinks) to create a popularity score; more popular pages are visited more often.
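One simple way to combine those two signals is a priority score per page. The sketch below is purely illustrative, assuming two hypothetical inputs per page (a last-modified timestamp and a backlink count) and arbitrary weights; real crawlers use far more elaborate scheduling.

```python
import time


def revisit_priority(last_modified_ts, backlink_count, now=None):
    """Higher scores mean the page should be revisited sooner."""
    now = now if now is not None else time.time()
    staleness_days = (now - last_modified_ts) / 86_400
    # Popular and long-unvisited pages float to the top of the revisit queue.
    # The weights here are illustrative, not taken from any real crawler.
    return backlink_count * 1.0 + staleness_days * 0.1


# Hypothetical pages: (last-modified timestamp, backlink count).
pages = {
    "http://example.com/news":  (time.time() - 1 * 86_400, 250),
    "http://example.com/about": (time.time() - 90 * 86_400, 3),
}
for url, (ts, backlinks) in sorted(pages.items(),
                                   key=lambda kv: revisit_priority(*kv[1]),
                                   reverse=True):
    print(url, round(revisit_priority(ts, backlinks), 2))
```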

Many of the algorithms involved in creating the catalogue, and in providing effective indexes to such a huge body of data, are closely guarded secrets. We have some inkling of how Google's algorithms work from articles describing its PageRank algorithm, but the most recent developments are still kept under wraps.
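The published PageRank idea itself is straightforward: a page's score is the probability that a "random surfer" lands on it, computed iteratively with a damping factor. Below is a compact sketch of that basic iteration on a made-up three-page link graph; it omits refinements such as handling dangling pages, and says nothing about Google's current ranking machinery.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1 - d) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue           # simplification: drop dangling-page mass
            share = d * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank


# Tiny illustrative graph: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```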