The Inquirer (www.theinquirer.net) reported on October 29 that while George Bush's campaign site has been blocked outside the US, his official White House website, www.whitehouse.gov, has apparently been configured to prevent Internet search engines from capturing historical snapshots of what is posted on the site.
The technical details are in a file called robots.txt, which websites often keep in their top-level directory. It contains directives that Internet search engines, such as Google and Yahoo, read to determine which parts of the site the owner does or does not want crawled and indexed.
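As a rough illustration of how a search engine robot might interpret such a file, the sketch below feeds a made-up robots.txt to Python's standard urllib.robotparser module. The sample directives, the "ExampleBot" name and the www.example.com addresses are hypothetical and are not taken from the White House file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# A page not covered by any Disallow line may be fetched.
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))

# A page under a disallowed path may not be fetched.
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/notes.html"))
```

Compliance is voluntary: the file only states the operator's wishes, and it is up to each robot to honor them.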
According to Internet consultant Dave Bender, most website operators, especially those that want their information to reach a large audience, want the search engines to visit most, if not all, pages on their site, and consequently want most or all of them included in the search engines' indexes.
Websites use a "robots exclusion" file so that the operator can tell the Internet search engines to stay away from certain files. Most websites that use a robots exclusion file have only a handful of files they want to keep away from the search engine robots; six, eight or even 10 are not unusual. However, the White House site has 1975 disallow directives. Bender said that this is highly unusual and that he knows of no other website with such a large robots exclusion file. He added that the current robots exclusion file can be retrieved in an ordinary Web browser by going to www.whitehouse.gov/robots.txt.
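A count like the one Bender cites could be checked with a short script along these lines. This is only a sketch: the helper names count_disallow_directives and fetch_robots_txt are hypothetical, the sample text is invented, and the http:// scheme on the address is an assumption, since the article gives only www.whitehouse.gov/robots.txt.

```python
from urllib.request import urlopen


def count_disallow_directives(robots_text: str) -> int:
    """Count lines that are Disallow directives, ignoring trailing comments."""
    count = 0
    for line in robots_text.splitlines():
        # Drop anything after '#', then surrounding whitespace.
        directive = line.split("#", 1)[0].strip()
        if directive.lower().startswith("disallow:"):
            count += 1
    return count


def fetch_robots_txt(url: str) -> str:
    """Download a robots.txt file and return it as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Invented sample, not the White House file.
sample = """\
User-agent: *
Disallow: /private/   # keep crawlers out
Disallow: /drafts/
Allow: /public/
"""
print(count_disallow_directives(sample))  # prints 2
```

To count the directives in a live file, one would call count_disallow_directives(fetch_robots_txt("http://www.whitehouse.gov/robots.txt")); the result depends on whatever the site is serving at the time and may not match the figure reported here.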