The White House Responds And a Good Resolution
October 28, 2003
2600 Magazine contacted the White House in the process of writing a story about the robots.txt file. The story also notes that the robots.txt file changed in the past day, so that the current robots.txt file is different from the file archived Friday, Oct 24.
They received the following response from the White House:
As 2600 Magazine notes, there were some disallow statements for which this is true -- duplicate documents were excluded. However, other disallowed directories did hold documents that apparently weren't elsewhere on the site. From 2600 Magazine:
According to the robots.txt of October 24, though, the In Focus: iraq section of the site was blocked from search engines. Some of the information there does not appear to be available anywhere else on the White House site.
Much speculation about the robots.txt file (including on Slashdot) has centered on the possibility that it was a kludge to handle some technical issue, such as fending off "shady robots" or other robot-related technical problems. Mr. Orr's statement above indicates that the issue was content-related (more precisely, according to Mr. Orr, design-related) and that there was no purely technical reason for disallowing robots from crawling the site.
Comparing the October 28 robots.txt with the robots.txt that was on the site on October 24, it appears that they have removed restrictions on most of the existing iraq directories that had been prohibited -- the directories that held the overwhelming bulk of the iraq-related documents.
As noted when I first wrote about this, many of the disallow statements still point to nonexistent directories. Those "garbage" disallows remain, but they have no effect on public access. The 16 directories with "iraq" in the pathname that still exist are either empty or hold what appear, in a quick sampling, to be duplicates of documents elsewhere on whitehouse.gov.
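The mechanics here are easy to verify: under the robots exclusion convention, a Disallow line only affects compliant crawlers requesting a matching path, and a line pointing at a directory that doesn't exist simply never matches anything -- which is why the leftover "garbage" disallows are harmless. A minimal sketch using Python's standard-library robots.txt parser (the directory names below are hypothetical illustrations, not the actual whitehouse.gov entries):

```python
# Sketch of how a well-behaved crawler interprets Disallow rules.
# The robots.txt contents here are hypothetical examples, not the
# actual whitehouse.gov file.
import urllib.robotparser

rules = """
User-agent: *
Disallow: /example/iraq-duplicates/
Disallow: /example/nonexistent-dir/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A path under a disallowed directory is off-limits to compliant robots...
print(parser.can_fetch("*", "/example/iraq-duplicates/page.html"))  # False

# ...while any path not matched by a Disallow line may be crawled freely.
print(parser.can_fetch("*", "/news/page.html"))  # True
```

Note that the rule for the nonexistent directory parses without complaint; it just never applies to a real URL, so removing or keeping such lines makes no practical difference to search engines.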
So whitehouse.gov has restored full public access to the site for search engine robots. Kudos.
Ryan at The Dead Parrot Society has a good post about this. One thing to note in reference to his post: the removal of these disallow statements will not change search results immediately -- the changes should take effect the next time search engines crawl and index the site.
Ryan at Dead Parrot Society also points to this post by an employee of the Internet Archive who writes that whitehouse.gov has recently encouraged them to crawl the site without paying attention to robots.txt, which would indicate they do want full archives of the site.
Now that the most heavily used search engines can get to this information, too, this is a good resolution.
When I first wrote about this issue, I asked, "Why is whitehouse.gov (the official White House website) disallowing 'iraq' directories from search engine crawling?" I didn't answer this question, but privately figured that human error was the most likely explanation. That appears to be the case.
In general, I'm sure there are many lessons here for webmasters of governmental websites; I'd like to focus on three:
1) Don't unnecessarily scatter hundreds of instances of charged terms on the site.
2) Pay attention to search engines and their spiders.
3) Don't state on your contact page that "The Web Team does not answer or forward e-mail, but all messages pertaining to the technical operation and usability of the White House web site are read." I'd have emailed whitehouse.gov if they hadn't preemptively told me they would not respond. I'm sure email is a burden for them, but there are ways to reduce garbage submissions while still remaining open to responding to relevant comments.