A large scale, 24/7 online operation, the White House website offers citizens up-to-date information about the President's activities regarding a range of international issues including defence, the environment and global diplomacy. It's a veritable treasure trove of information for anyone looking for the latest news from the US government, playing a significant role in the American democratic process.
Unfortunately for the Presidential staff, the White House website is currently under threat from a very unlikely source ... itself.
When we say the White House website is in danger, we do not mean it is going to be shut down, removed or hacked. Here at bigmouthmedia, when we talk about a threat to a website, we mean a threat to its search engine rankings.
The White House website currently has two versions of every page: a normal version and a text-only version. This is a common sight on a website that takes accessibility and the W3C guidelines into consideration. The White House clearly understands the implications of having duplicate content within a website and is using the powerful robots exclusion protocol to block all text versions.
This is currently being achieved by adding the directory of the text version to the robots.txt file held in the root of the White House website like so:
Disallow: /example-directory/text
The above line tells search engine spiders that follow the robots exclusion protocol, such as Google, MSN, Ask and Yahoo!, not to crawl and access the directory within the website. Now consider the following:
/example-directory/text
This means that any file or folder whose path begins with /example-directory/text will not be crawled or included in a search engine's index.
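The prefix behaviour described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration: the robots.txt content reuses the article's example directory, and the "User-agent: *" line and the test paths are assumptions added so that the parser has a complete record to work with.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt built around the article's example directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /example-directory/text
"""

def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    # Illustrative paths only; they are not taken from the real site.
    for path in (
        "/example-directory/text/index.html",
        "/example-directory/textfile.html",
        "/example-directory/index.html",
    ):
        print(path, "allowed" if is_allowed(ROBOTS_TXT, path) else "blocked")
```

Note that the rule is a plain prefix match, so in this sketch a path such as /example-directory/textfile.html is blocked as well as everything inside the /text directory.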
This is exactly what the White House wants, as they know that having two versions of every page could well have a detrimental effect on their search engine rankings. This is because many search engines prefer to only index and rank the most unique and up to date content.
So what is the problem with the website? The robots.txt file on the White House website is large. At the time of this article - January 31 2008 - the file is 83,326 bytes (about 81kb). On August 29 2007, the file size was 82,574 bytes. The file is clearly growing, and it will keep growing as new text directories are added. But what implications will this have?
In the search engine marketing world, it's well-known that certain search engines (like Google) will only read the first 100kb of a robots.txt file and ignore everything after that cut-off. (Source: http://groups.google.com/group/Google_Webmaster_Help-Tools/browse_thread/thread/c497c7371822c403/5282673bb0bef886)
This means that if the website keeps on growing, and the need for new lines to be added to the robots.txt file keeps expanding, then eventually search engines like Google will stop reading the additional content after 100kb.
What's more, if the file keeps growing and gets truncated, a rule could be cut short, and search engine spiders could end up excluded from more of the site than was intended. (Source: http://www.robotstxt.org/norobots-rfc.txt)
This could potentially open up a whole can of worms for the US President's website.
For example:
If the robots.txt file exceeds the 100kb limit, then search engines will only read up to this size, regardless of where the lines in the file begin and end. This means that if the limit is reached half way through a line in the robots.txt, it could have catastrophic effects.
If the cut-off happens on the following line, for instance:
Disallow: /news/ - maximum file size exceeded - then a rule that was meant to block a single directory beneath /news/ is cut short, and the entire directory of /news/ could be blocked from search engine spiders. This could result in the news section dropping out of a search engine's index. That means that these pages would not be returned within search results. This could have a huge impact on the traffic to the White House website, not to mention their brand.
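The truncation scenario can be simulated in a few lines of Python. This is a sketch under stated assumptions: the two-line robots.txt, the /news/text rule and the test URL are hypothetical, the byte limit is shrunk to land mid-line (standing in for the 100kb cut-off), and a real crawler may treat an incomplete final line differently.

```python
from urllib.robotparser import RobotFileParser

def truncate(robots_txt: str, limit_bytes: int) -> str:
    """Keep only the first `limit_bytes` bytes, as a size-capped reader might."""
    return robots_txt.encode("utf-8")[:limit_bytes].decode("utf-8", errors="ignore")

def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# Hypothetical file: the rule is meant to block only the text version of /news/.
FULL = "User-agent: *\nDisallow: /news/text\n"

# Assumed cut-off point: immediately after "/news/", half way through the rule.
CUT = FULL.index("/news/") + len("/news/")
TRUNCATED = truncate(FULL, CUT)

if __name__ == "__main__":
    page = "/news/2008/01/story.html"  # illustrative URL
    print("last line read:", TRUNCATED.splitlines()[-1])
    print("full file allows page:     ", is_allowed(FULL, page))
    print("truncated file allows page:", is_allowed(TRUNCATED, page))
```

In this sketch the complete file leaves the normal news pages crawlable, while the truncated copy ends in "Disallow: /news/" and blocks the whole section.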
George W. has little time left in office, so fixing this issue could be out of his hands. And it's still a little early to tell which of his potential successors - whether it's Hillary Clinton, Barack Obama, John McCain or any other candidate - has a strong enough policy on web issues to alter the White House website's stance towards search engines.
But how exactly can the White House save itself from a potential rankings crisis?
Fortunately, bigmouthmedia is on hand to help!
With 10 years of experience in the field of search engine marketing, we know how to make the maximum use of such tools as the robots exclusion protocol. And it just so happens that a few years ago, a handful of search engines announced that they were going to start to allow 'wildcards' within the robots.txt file.
A wildcard stands in for any sequence of characters, so any URL that matches the wildcard pattern will be blocked from search engine spiders.
The simple, yet somewhat advanced, solution to the White House's robots.txt issue is to use a wildcard block on all text versions of pages. This can be achieved by adding the following line to the robots.txt file:
Disallow: /*text*
This one small line would achieve the same result as the exhausting method the White House currently has in place, provided no other URL on the site contains "text" in its path, since the wildcard would block those pages too.
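To see what the wildcard line would and would not catch, here is a minimal Python sketch of wildcard matching. It assumes the commonly documented semantics, where '*' matches any run of characters and a trailing '$' anchors the end of the URL; the helper names and the test paths are hypothetical, and individual search engines may differ in the details.

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$')
    into a compiled regular expression. Assumed semantics, not a crawler's code."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    """Return True if `path` matches the Disallow `pattern` from its first character."""
    return pattern_to_regex(pattern).match(path) is not None

if __name__ == "__main__":
    # Illustrative paths only; they are not taken from the real site.
    for path in ("/news/text/index.html", "/news/index.html", "/context/page.html"):
        print(path, "blocked" if is_blocked("/*text*", path) else "allowed")
```

In this sketch /*text* blocks the text version and leaves the normal page alone, but it also blocks a path such as /context/page.html, because "text" appears inside "context". A narrower pattern such as /*/text/ (hypothetical, and dependent on how the text directories are actually named) would avoid that.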
But that's not all. It's important to test and analyse any recommendations to ensure they are correct and will not cause any issues with a website's many components.
Combine bigmouthmedia's savvy knowledge and responsible attitude with Google's greatest offering to webmasters, Google Webmaster Tools, and the change can be checked quickly: we can log into Google Webmaster Tools and use its robots.txt analysis feature to ensure everything is correct and valid.
In conclusion, if the White House website's robots.txt file eventually gets too large, this could seriously damage the rankings of the website. We've been around the block and know all the tips and tools that can be used to aid search engine rankings - this merely highlights one of the many.
The White House's robots.txt problem was first noticed by bigmouth MT while examining their accessibility policy. Other accessibility articles written by MT include "Mobile phone search engine optimisation", "Accessible advertising for the visually impaired" and "Kill three birds with one stone using XHTML and CSS".











