Spiders, Webcrawlers, etc.

A spider is a computer application whose sole purpose in life is to look at and catalogue online data. It will prowl the Internet, probing web sites and examining their contents, storing key information about them in its catalogue.
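At its core, that prowling is just a fetch-parse-queue loop. The sketch below is a minimal illustration of the idea, assuming a single seed URL and an in-memory catalogue; the URL and the fields stored are made up for illustration, and a real spider would add politeness delays, robots.txt handling, and persistent storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    catalogue = {}                 # url -> key information about the page
    queue = deque([seed_url])
    while queue and len(catalogue) < max_pages:
        url = queue.popleft()
        if url in catalogue:
            continue
        try:
            with urlopen(url, timeout=5) as response:
                body = response.read().decode("utf-8", errors="replace")
                last_modified = response.headers.get("Last-Modified")
        except OSError:
            continue               # skip pages that cannot be fetched
        parser = LinkParser()
        parser.feed(body)
        catalogue[url] = {"last_modified": last_modified,
                          "outgoing_links": len(parser.links)}
        for link in parser.links:
            queue.append(urljoin(url, link))   # resolve relative links
    return catalogue


if __name__ == "__main__":
    print(crawl("http://example.com/"))        # hypothetical seed URL
```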

Clearly, just randomly crawling across the web won't produce a comprehensive and up-to-date catalogue, so these spiders are intelligent enough to revisit pages systematically, examining various things about a page (such as its last-modified date) to decide when a revisit is necessary. They also keep track of how many pages link to a given page (its backlinks) to create a popularity score; more popular pages are visited more often.
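One simple way to combine those two signals is a priority score per page. The sketch below is purely illustrative, assuming two hypothetical inputs per page (a last-modified timestamp and a backlink count) and arbitrary weights; real crawlers use far more elaborate scheduling.

```python
import time


def revisit_priority(last_modified_ts, backlink_count, now=None):
    """Higher scores mean the page should be revisited sooner."""
    now = now if now is not None else time.time()
    staleness_days = (now - last_modified_ts) / 86_400
    # Popular and long-unvisited pages float to the top of the revisit queue.
    # The weights here are illustrative, not taken from any real crawler.
    return backlink_count * 1.0 + staleness_days * 0.1


# Hypothetical pages: (last-modified timestamp, backlink count).
pages = {
    "http://example.com/news":  (time.time() - 1 * 86_400, 250),
    "http://example.com/about": (time.time() - 90 * 86_400, 3),
}
for url, (ts, backlinks) in sorted(pages.items(),
                                   key=lambda kv: revisit_priority(*kv[1]),
                                   reverse=True):
    print(url, round(revisit_priority(ts, backlinks), 2))
```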

Many of the algorithms involved in creating the catalogue, and in providing effective indexes to such a huge body of data, are closely guarded secrets. We have some inkling of how Google's algorithms work from articles describing its PageRank algorithm, but the most recent developments are still kept under wraps.
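The published PageRank idea itself is straightforward: a page's score is the probability that a "random surfer" lands on it, computed iteratively with a damping factor. Below is a compact sketch of that basic iteration on a made-up three-page link graph; it omits refinements such as handling dangling pages, and says nothing about Google's current ranking machinery.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1 - d) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue           # simplification: drop dangling-page mass
            share = d * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank
    return rank


# Tiny illustrative graph: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```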