04/03/10
Spiders - blocking & encouraging them
Lots of different clients
- regular ones such as browsers
- spiders/crawlers/robots
- web accelerator: prefetches linked-to pages
How to block some of these clients
- from parts of the site
- from the whole site
It is, in general, impossible to distinguish spiders etc.
from regular clients
We mostly rely on spiders to behave well
We will tell them which parts of the site not to visit
(e.g. very dynamic content). Well-behaved spiders will
honour this agreement.
How to tell them?
1) Page-level: add this to the head of the pages you don't
want them to index:
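The page-level directive is presumably the standard robots meta
element; a typical form (noindex,nofollow asks robots neither to
index the page nor to follow its links):

```html
<head>
  <!-- Ask well-behaved robots not to index this page
       or follow any of its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```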
2) Site-level: create a robots.txt file in your document
root
The very first file that a well-behaved robot will request
from your site is robots.txt
Examples
A)
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
B)
User-agent: *
Disallow: /
C)
User-agent: *
Disallow: /
User-agent: FriendlyRobot
Disallow: /cgi-bin/
Disallow: /images/
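A sketch of how a well-behaved crawler applies these rules, using
Python's standard urllib.robotparser and rule set C above (everyone
is barred except FriendlyRobot, which must still avoid /cgi-bin/
and /images/):

```python
from urllib.robotparser import RobotFileParser

# Example C from the notes, as the crawler would fetch it
robots_txt = """\
User-agent: *
Disallow: /

User-agent: FriendlyRobot
Disallow: /cgi-bin/
Disallow: /images/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Any other robot is shut out entirely
print(rp.can_fetch("OtherBot", "/index.html"))       # False
# FriendlyRobot may fetch ordinary pages...
print(rp.can_fetch("FriendlyRobot", "/index.html"))  # True
# ...but must keep out of the disallowed directories
print(rp.can_fetch("FriendlyRobot", "/cgi-bin/x"))   # False
```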
You'll probably want to add an expiry date to the file:
see later lecture
How to block badly-behaved robots from your site
How to identify them (kinda): find a list or inspect
your log files
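One quick way to inspect the logs: tally requests per User-Agent in
an Apache "combined"-format access log (splitting on double quotes,
field 6 is the User-Agent; the sample lines below are invented
stand-ins for a real /var/log/apache2/access.log):

```shell
# Count requests per User-Agent string, busiest first
printf '%s\n' \
  '1.2.3.4 - - [04/Mar/2010:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"' \
  '5.6.7.8 - - [04/Mar/2010:10:00:01 +0000] "GET /a HTTP/1.1" 200 99 "-" "ExtractorPro"' \
  '5.6.7.8 - - [04/Mar/2010:10:00:02 +0000] "GET /b HTTP/1.1" 200 99 "-" "ExtractorPro"' |
awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

An agent with an implausibly high request count (or one hammering
pages no browser user would visit) is a candidate for blocking.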
How to block them once you know their hostname/IP address
Order Allow,Deny
Allow from all
Deny from www.extractorpro.com
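Order/Allow/Deny is Apache 2.2 syntax; in Apache 2.4 the same
host-based block is written with Require directives (a sketch,
keeping the same example hostname):

```
<RequireAll>
    # Admit everyone...
    Require all granted
    # ...except this host
    Require not host www.extractorpro.com
</RequireAll>
```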
or, if you don't know the hostname/IP address but do know
the user-agent name
BrowserMatchNoCase .*extractorpro.* badrobot
Order Allow,Deny
Allow from all
Deny from env=badrobot
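In Apache 2.4 syntax, the environment-variable version of this
block becomes (a sketch, reusing the same badrobot marker):

```
BrowserMatchNoCase extractorpro badrobot
<RequireAll>
    # Admit everyone...
    Require all granted
    # ...except requests tagged by BrowserMatchNoCase
    Require not env badrobot
</RequireAll>
```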
How to encourage spiders
a) submit your site to search engines
b) SEO
- black hat SEO (dirty tricks)
- white hat SEO
E.g. this probably no longer works
Instead:
i) Good titles, h1s, h2s, h3s
ii) Persuade legitimate sites to link to you
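Point (i) can be illustrated with markup whose title and heading
hierarchy carry the page's key terms (the page content here is
invented for illustration):

```html
<head>
  <title>Hillwalking in Scotland - routes and safety</title>
</head>
<body>
  <h1>Hillwalking in Scotland</h1>
  <h2>Classic routes</h2>
  <h3>Ben Nevis via the Mountain Track</h3>
</body>
```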