04/03/10  Spiders - blocking & encouraging them

Lots of different kinds of client:
- regular ones, such as browsers
- spiders/crawlers/robots
- web accelerators, which prefetch linked-to pages

How to block some of these clients:
- from parts of the site
- from the whole site

It is, in general, impossible to distinguish spiders etc. from regular
clients, so we mostly rely on spiders to behave well. We tell them which
parts of the site not to visit (e.g. very dynamic content), and
well-behaved spiders will honour this agreement.

How to tell them?

1) Page-level: add this to the head of each page you don't want them to
   index:

      <meta name="robots" content="noindex">

2) Site-level: create a robots.txt file in your document root. The very
   first file that a well-behaved robot will request from your site is
   robots.txt.

   Examples (a worked check of example C appears at the end of these
   notes):

   A) Keep all robots out of /cgi-bin/ and /images/:

         User-agent: *
         Disallow: /cgi-bin/
         Disallow: /images/

   B) Keep all robots out of the whole site:

         User-agent: *
         Disallow: /

   C) Keep all robots out, except FriendlyRobot, which may visit
      everything except /cgi-bin/ and /images/:

         User-agent: *
         Disallow: /

         User-agent: FriendlyRobot
         Disallow: /cgi-bin/
         Disallow: /images/

   You'll probably want to add an expiry date to the file: see later
   lecture.

How to block badly-behaved robots from your site:

- How to identify them (kinda): find a published list, or inspect your
  log files (see the log-scanning sketch at the end of these notes).

- How to block them once you know their hostname/IP address:

      Order Allow,Deny
      Allow from all
      Deny from www.extractorpro.com

- Or, if you don't know their hostname/IP address but you do know their
  user-agent name:

      BrowserMatchNoCase .*extractorpro.* badrobot
      Order Allow,Deny
      Allow from all
      Deny from env=badrobot

How to encourage spiders:

a) Submit your site to search engines.

b) SEO:
   - black hat SEO (dirty tricks): these probably no longer work
   - white hat SEO: instead,
     i)  write good titles, h1s, h2s and h3s
     ii) persuade legitimate sites to link to you
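
The worked check of robots.txt example C promised above: a minimal
sketch using Python's standard urllib.robotparser module, which applies
the same matching rules a well-behaved robot uses. The robot names come
from the example; the paths tested are illustrative.

   # Check what robots.txt example C permits, using the standard library.
   import urllib.robotparser

   rules = [
       "User-agent: *",
       "Disallow: /",
       "",
       "User-agent: FriendlyRobot",
       "Disallow: /cgi-bin/",
       "Disallow: /images/",
   ]

   rp = urllib.robotparser.RobotFileParser()
   rp.parse(rules)

   # An unknown robot matches the "User-agent: *" record and is shut out:
   print(rp.can_fetch("SomeOtherRobot", "/index.html"))     # False

   # FriendlyRobot matches its own record, so only the listed paths are barred:
   print(rp.can_fetch("FriendlyRobot", "/index.html"))      # True
   print(rp.can_fetch("FriendlyRobot", "/cgi-bin/search"))  # False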
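
And the log-scanning sketch for identifying badly-behaved robots. It
assumes Apache's "combined" log format; the log path is hypothetical,
and the user-agent pattern is the one from the blocking example. It
flags hosts whose User-Agent matches, and ranks clients by request
volume; the busiest ones are worth a closer look.

   # Scan an Apache access log (combined format) for likely bad robots.
   import re
   from collections import Counter

   LOG_PATH = "/var/log/apache2/access.log"       # hypothetical location
   BAD_AGENT = re.compile(r"extractorpro", re.I)  # robot from the example above

   # combined format: host ident user [date] "request" status bytes "referer" "agent"
   LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

   requests = Counter()
   flagged = set()

   with open(LOG_PATH) as log:
       for line in log:
           m = LINE.match(line)
           if not m:
               continue
           host, agent = m.groups()
           requests[host] += 1
           if BAD_AGENT.search(agent):
               flagged.add(host)

   print("hosts sending a bad user-agent:", sorted(flagged))
   print("busiest clients:", requests.most_common(10))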