Metadata

Web page designers often add extra information, called metadata, to their pages so that search engines can more easily find and sort their contents. The metadata is usually stored in <meta> tags in the head of the HTML file. When the search engine's data-gathering spider visits the page, it uses the metadata to determine how to index the page and which keywords will retrieve it from the index.

Some of the possible combinations of meta tags are shown in Example 5.1, “Meta tags in an HTML document”.

Example 5.1. Meta tags in an HTML document

  <head>
    <title>Meta Tags in HTML</title>
    <meta name="author" content="Gary Stringer"/> 1
    <meta name="copyright" content="1999, University of Exeter"/> 2
    <meta name="keywords" content="metadata,spider"/> 3
    <meta name="description" 
             content="describes how metadata is used..."/> 4
    <meta name="robots" content="noindex,nofollow"/> 5
    <meta name="rating" content="mature"/> 6
        <!-- or "general", "restricted", "14 years", etc. -->
  </head>
                    

1. Generally used to identify the author or creator of the document.

2. The copyright notice for the document, typically giving the holder and date.

3. Keywords that the search engine can use to index the page and match it against searches.

4. A brief textual description of the page contents and purpose.

5. Indicates how web-searching spiders should treat this page: noindex means don't catalogue this page; nofollow means don't use the links in this page to find other pages.

6. Though not widely used, the rating can give an indication of recommended viewing suitability.


For more on how these meta tags can be used to describe documents in even greater detail, there's an article on a metadata scheme called Dublin Core.
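As a rough sketch only, Dublin Core elements are conventionally embedded in ordinary meta tags by prefixing the element names with DC. (the exact conventions are defined by the Dublin Core initiative); the values below simply re-use the details from Example 5.1:

  <head>
    <title>Meta Tags in HTML</title>
    <!-- Dublin Core elements expressed as meta tags; values taken from Example 5.1 -->
    <meta name="DC.title" content="Meta Tags in HTML"/>
    <meta name="DC.creator" content="Gary Stringer"/>
    <meta name="DC.date" content="1999"/>
    <meta name="DC.subject" content="metadata; spider"/>
    <meta name="DC.description" content="describes how metadata is used..."/>
  </head>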

There is also a wider movement, known as the Semantic Web, which attempts to create meaningful machine-readable summaries of documents and the information/knowledge they contain. Using standards such as RDF and OWL, a web author can create a map of a document that is searchable not just by keywords, but by the facts within the document and the way they relate to each other.

It's possible to control how a spider searches whole sites: there's often a file called robots.txt in the root directory of a web server, which contains directions on what should and shouldn't be indexed. This can be especially useful if the site contains dynamically produced pages, where the data changes so rapidly that indexing it would be pointless. You can view Exeter's robot directions by reading http://www.ex.ac.uk/robots.txt.
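As a minimal sketch (the directory names here are made up for illustration), a robots.txt file simply lists which parts of the site each spider may visit:

  # Rules for all spiders: keep them out of script and search result pages
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /search/

  # A particular spider can be excluded from the whole site
  User-agent: BadBot
  Disallow: /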

The Resource Description Framework (RDF) is a W3C standard for describing Web resources, such as the title, author, modification date, content, and copyright information of a Web page.
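A minimal RDF/XML sketch of such a description might look like the following; the page address is a placeholder, and the dc: properties are borrowed from the Dublin Core vocabulary:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- One resource (the page), described by a set of property/value pairs -->
    <rdf:Description rdf:about="http://www.example.org/metadata.html">
      <dc:title>Meta Tags in HTML</dc:title>
      <dc:creator>Gary Stringer</dc:creator>
      <dc:date>1999</dc:date>
    </rdf:Description>
  </rdf:RDF>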

OWL (the Web Ontology Language) is a companion W3C standard for defining ontologies: formal vocabularies that describe the classes of things a document talks about and the relationships between them, so that software can reason about web information rather than just retrieve it.
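For illustration only, an OWL ontology (again written in RDF/XML, with made-up example.org terms) might declare kinds of page and a relationship between pages and their topics:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:owl="http://www.w3.org/2002/07/owl#">
    <!-- A tutorial page is declared to be a kind of web page -->
    <owl:Class rdf:about="http://www.example.org/terms#WebPage"/>
    <owl:Class rdf:about="http://www.example.org/terms#TutorialPage">
      <rdfs:subClassOf rdf:resource="http://www.example.org/terms#WebPage"/>
    </owl:Class>
    <!-- Pages can be linked to the topics they describe -->
    <owl:ObjectProperty rdf:about="http://www.example.org/terms#describes"/>
  </rdf:RDF>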