04/03/10
Spiders - blocking & encouraging them
Lots of different clients
- regular ones such as browsers
- spiders/crawlers/robots
- web accelerator: prefetches linked-to pages
How to block some of these clients
- from parts of the site
- from the whole site
It is, in general, impossible to distinguish spiders etc.
from regular clients
We mostly rely on spiders to behave well
We will tell them which parts of the site not to visit
(e.g. very dynamic content). Well-behaved spiders will
honour this agreement.
How to tell them?
1) Page-level: add this to the head of the pages you don't
want them to index:
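The page-level directive is presumably the standard robots meta
element; a typical form (noindex,nofollow asks robots neither to
index the page nor to follow its links):

```html
<head>
  <!-- Ask well-behaved robots not to index this page
       or follow any of its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```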
2) Site-level: create a robots.txt file in your document
root
The very first file that a well-behaved robot will request
from your site is robots.txt
Examples
A)
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
B)
User-agent: *
Disallow: /
C)
User-agent: *
Disallow: /
User-agent: FriendlyRobot
Disallow: /cgi-bin/
Disallow: /images/
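A sketch of how a well-behaved crawler applies these rules, using
Python's standard urllib.robotparser and rule set C above (everyone
is barred except FriendlyRobot, which must still avoid /cgi-bin/
and /images/):

```python
from urllib.robotparser import RobotFileParser

# Example C from the notes, as the crawler would fetch it
robots_txt = """\
User-agent: *
Disallow: /

User-agent: FriendlyRobot
Disallow: /cgi-bin/
Disallow: /images/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Any other robot is shut out entirely
print(rp.can_fetch("OtherBot", "/index.html"))       # False
# FriendlyRobot may fetch ordinary pages...
print(rp.can_fetch("FriendlyRobot", "/index.html"))  # True
# ...but must keep out of the disallowed directories
print(rp.can_fetch("FriendlyRobot", "/cgi-bin/x"))   # False
```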
You'll probably want to add an expiry date to the file:
see later lecture
How to block badly-behaved robots from your site
How to identify them (kinda): find a list or inspect
your log files
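One quick way to inspect the logs: tally requests per User-Agent in
an Apache "combined"-format access log (splitting on double quotes,
field 6 is the User-Agent; the sample lines below are invented
stand-ins for a real /var/log/apache2/access.log):

```shell
# Count requests per User-Agent string, busiest first
printf '%s\n' \
  '1.2.3.4 - - [04/Mar/2010:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"' \
  '5.6.7.8 - - [04/Mar/2010:10:00:01 +0000] "GET /a HTTP/1.1" 200 99 "-" "ExtractorPro"' \
  '5.6.7.8 - - [04/Mar/2010:10:00:02 +0000] "GET /b HTTP/1.1" 200 99 "-" "ExtractorPro"' |
awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

An agent with an implausibly high request count (or one hammering
pages no browser user would visit) is a candidate for blocking.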
How to block them once you know their hostname/IP address
Order Allow,Deny
Allow from all
Deny from www.extractorpro.com
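Order/Allow/Deny is Apache 2.2 syntax; in Apache 2.4 the same
host-based block is written with Require directives (a sketch,
keeping the same example hostname):

```
<RequireAll>
    # Admit everyone...
    Require all granted
    # ...except this host
    Require not host www.extractorpro.com
</RequireAll>
```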
or, if you don't know the hostname/IP address but do know
the user-agent name
BrowserMatchNoCase .*extractorpro.* badrobot
Order Allow,Deny
Allow from all
Deny from env=badrobot
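In Apache 2.4 syntax, the environment-variable version of this
block becomes (a sketch, reusing the same badrobot marker):

```
BrowserMatchNoCase extractorpro badrobot
<RequireAll>
    # Admit everyone...
    Require all granted
    # ...except requests tagged by BrowserMatchNoCase
    Require not env badrobot
</RequireAll>
```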
How to encourage spiders
a) submit your site to search engines
b) SEO
- black hat SEO (dirty tricks)
- white hat SEO
E.g. this probably no longer works
Instead:
i) Good titles, h1s, h2s, h3s
ii) Persuade legitimate sites to link to you
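Point (i) can be illustrated with markup whose title and heading
hierarchy carry the page's key terms (the page content here is
invented for illustration):

```html
<head>
  <title>Hillwalking in Scotland - routes and safety</title>
</head>
<body>
  <h1>Hillwalking in Scotland</h1>
  <h2>Classic routes</h2>
  <h3>Ben Nevis via the Mountain Track</h3>
</body>
```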