Whitehouse.gov Robots.txt

Here's a less technical discussion of the Whitehouse.gov robots.txt file.

Search engines (like Google) find out what is on web pages by having robots (or spiders, or crawlers) run all over the Internet reading and indexing pages. Often, webmasters do not want some parts of their site indexed by Google, and that's where the robots.txt file comes in. Two common reasons for not wanting parts of the site crawled and indexed by robots are:

1) There are pages that contain gibberish or that would somehow impair search results. For example, there are directories that mainly hold programs (not web pages): a common one is the cgi-bin directory, or the directories that hold a site's search software. These are often excluded from search.

2) They have many pages that duplicate other pages (like the "printable" versions so many sites offer), and they don't want the duplicates indexed.

Webmasters can exclude whole directories from search by using a file called "robots.txt", which lives at the very base URL of the domain.
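
For example, the file for whitehouse.gov lives at:

http://www.whitehouse.gov/robots.txt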

For more information about the robots.txt file, see The Web Robots Page.

Here is the White House robots.txt file. (The analysis on these pages is based on the file as it was on October 24, 2003; there's an archive of that date's file here.)

The first line of the file is a comment identifying it:

# robots.txt for http://www.whitehouse.gov/

The next non-blank line:

User-agent: *

simply means that the long section below applies to all external robots (or "user-agents"). A short section at the bottom applies only to the search robot of the whitehouse.gov site itself.
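
To picture the structure, here is a generic sketch (not the actual whitehouse.gov file; the robot name "staff-search-bot" is invented):

User-agent: *
Disallow: /cgi-bin

User-agent: staff-search-bot
Disallow:

The first section applies to every robot; the second applies only to the robot named "staff-search-bot", and its empty Disallow line means that robot may crawl everything.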

What follows the "User-agent" line is a long list of "Disallow" lines, each naming a directory. Each one tells the robot: "Do not look at, record, or index the pages in this directory."

Many of the directory names are not unusual. For reasons outlined above, many sites would not want robots to index their cgi-bin, search, or help directories and would have the following lines:

Disallow: /cgi-bin
Disallow: /search
Disallow: /query.html
Disallow: /help

Also, the White House has many pages that duplicate the text of other pages but are simply easier to print. The many disallowed directories that end in "text" are examples of such duplicate pages, which are very commonly disallowed.

However, the exclusions above don't account for all 1,618 directories that whitehouse.gov disallows.

The most glaring exclusions are the 783 directories that have "iraq" in them, almost all of them ending in /iraq.

Judging from the directory structure of whitehouse.gov, this apparently blocks indexing of every directory that is exclusively about Iraq. (Of course, information about Iraq appears in other directories that are not disallowed, but excluding these explicit Iraq directories must remove a great deal of whitehouse.gov's Iraq material from search engine indexes.)
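
These counts are straightforward to reproduce. Here is a minimal Python sketch, assuming a copy of the file has been saved locally as robots.txt (the filename is an assumption):

disallows = []
with open("robots.txt") as f:
    for line in f:
        line = line.strip()
        # Collect the path after each "Disallow:" directive.
        if line.lower().startswith("disallow:"):
            disallows.append(line.split(":", 1)[1].strip())

print(len(disallows), "disallowed paths")
print(sum("iraq" in path.lower() for path in disallows), "of them mention iraq")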

Some of the excluded directories do not exist; for instance:

Disallow: /holiday/2002/barney/iraq

But the exclusions that refer to real directories actually exclude more than the directories explicitly named.

Excluding a directory also excludes all the directories below it (its "children"), because robots match each Disallow entry as a path prefix; a short sketch after the list below demonstrates this. For instance, excluding:

/infocus/iraq

also disallows

/infocus/iraq/100days
/infocus/iraq/photoessay
/infocus/iraq/news
/infocus/iraq/disarmament
/infocus/iraq/decade
/infocus/iraq/swf
/infocus/iraq/100days/text
/infocus/iraq/photoessay/essay6
/infocus/iraq/photoessay/essay1
/infocus/iraq/photoessay/essay5
/infocus/iraq/photoessay/essay4
/infocus/iraq/photoessay/essay3
/infocus/iraq/photoessay/essay2
/infocus/iraq/disarmament/text
/infocus/iraq/decade/text
/infocus/iraq/decade/text/text
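
Python's standard robotparser module implements this prefix matching, so the behavior is easy to check. A minimal sketch (the two-line excerpt below stands in for the full whitehouse.gov file):

from urllib.robotparser import RobotFileParser

# A one-rule excerpt of the whitehouse.gov file.
rules = [
    "User-agent: *",
    "Disallow: /infocus/iraq",
]

rp = RobotFileParser()
rp.parse(rules)

# Each Disallow entry is matched as a path prefix, so the
# directory itself and everything below it are blocked.
print(rp.can_fetch("*", "http://www.whitehouse.gov/infocus/iraq"))          # False
print(rp.can_fetch("*", "http://www.whitehouse.gov/infocus/iraq/100days"))  # False
print(rp.can_fetch("*", "http://www.whitehouse.gov/infocus/"))              # True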

The most recent old robots.txt file archived at the Internet Archive is from April 2003. It has only 10 mentions of "iraq", nothing like the spasm of Iraq exclusions in the current file: http://web.archive.org/web/20030416065022/http://www.whitehouse.gov/robots.txt
