08 January 2010 | Author: M. Thomson, SEO & Affiliate Consultant

Googlebot crawling fictional URLs in JavaScript
It's no secret that Googlebot is capable of crawling JavaScript files. In fact, Googlebot will take it that little bit further by trying to understand what the contents of JavaScript files are, in order to discover new webpages.
With Google evidently trying to increase the amount of content stored in its index, the situation sounds ideal for many webmasters - however, others are not so enthusiastic. Recently, it has been reported that Googlebot's intrusive crawling of JavaScript has resulted in the search engine creating fictional URLs - web addresses that do not exist. Analytics packages have been the single biggest contributors to this problem, including Google Analytics itself.
Example

    function Tracking(){
        return("/services/submit/form/node/12345");
    }

Google will see what it thinks is a new route (URL) to content, extract it and then apply it to the host it's currently crawling - if these webpages don't exist, we have a problem.
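To make the failure mode concrete, here is a minimal sketch of how a crawler might turn a string inside a script into a URL to fetch. It is not Googlebot's actual logic, which is not public; the regular expression, the function name extractCandidateUrls and the host www.example.com are all assumptions made for illustration.

```javascript
// Hypothetical sketch: NOT Googlebot's real algorithm, which is not public.
// It pulls quoted strings that look like root-relative paths out of a script
// and resolves them against the host being crawled.
function extractCandidateUrls(scriptSource, pageUrl) {
  // Match single- or double-quoted strings that start with "/" and contain
  // at least two path segments, e.g. "/services/submit".
  const pathLike = /(["'])(\/[\w-]+(?:\/[\w-]+)+)\1/g;
  const found = [];
  let match;
  while ((match = pathLike.exec(scriptSource)) !== null) {
    // new URL(relative, base) resolves the path against the current host.
    found.push(new URL(match[2], pageUrl).href);
  }
  return found;
}

const inlineScript =
  'function Tracking(){ return("/services/submit/form/node/12345"); }';

// www.example.com is a placeholder host.
console.log(extractCandidateUrls(inlineScript, 'http://www.example.com/contact'));
```

If no page exists at the resolved address, a request for it ends in a 404 Not Found, which is the kind of entry that then shows up in Webmaster Tools.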
If you run a tight ship and try to keep your Webmaster Tools reports clean, you will likely have spotted these fictional URLs being reported in the 404 Not Found section. Well, bigmouthmedia and its technical search boffins have devised a number of solutions to this problem, each suited to different circumstances.
JavaScript join()

If you present a URL - or what looks like a URL - within inline JavaScript, Googlebot will likely try to crawl it or to make assumptions surrounding its existence.
Using the array join() method to assemble your URL from separate pieces means the full URL never appears as a single string in the page source, so Google is unlikely to crawl it. This can be achieved using the code below (code sequence shortened):
    Old: /Services/Submit...
    New: '/'+['Services','Submit'....].join('/');

This is likely to be the best method of deterring Googlebot, as users will be left with the same functionality as the original URL.
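As a runnable illustration of the join() approach, the sketch below wraps it in a small helper. The name buildPath is hypothetical, and only the two segments visible in the shortened example are used; the rest of the original path is not reproduced here.

```javascript
// Build a root-relative path from its segments so that the complete path
// never appears as a single string literal in the page source.
// buildPath is a hypothetical helper name, not part of any library.
function buildPath(segments) {
  return '/' + segments.join('/');
}

// Only the two segments shown in the shortened example are used here.
const target = buildPath(['Services', 'Submit']);
console.log(target); // "/Services/Submit"
```

The browser ends up with the same string as before, so links and form actions built this way behave identically for users.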
Robots.txt

Depending on the pattern of 404 Not Found URLs being generated, you may be able to define a disallow rule within your robots.txt that will prevent Googlebot from crawling such pages. For example, if all URLs featured _node, you could wildcard disallow access to any URL that contains that string by using:

    Disallow: *_node*

Be cautious when using robots.txt, however, as you could end up blocking access to URLs that you would prefer were being promoted. Use the robots.txt testing feature in Webmaster Tools to ensure that this is not the case.
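Before deploying a wildcard rule, it can help to run it against a list of your real URLs. The sketch below is a simplified matcher, assuming the commonly documented wildcard behaviour ("*" matches any run of characters, "$" anchors the end, everything else matches as a prefix). It is not Google's parser, the function name isDisallowed is hypothetical, and the sample paths are made up.

```javascript
// Simplified sketch of Disallow pattern matching; NOT Google's implementation.
// Assumes "*" matches any run of characters, a trailing "$" anchors the end
// of the path, and a pattern otherwise matches as a prefix.
function isDisallowed(pattern, path) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters, then turn each escaped "*" into ".*".
  const escaped = body
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\\\*/g, '.*');
  return new RegExp('^' + escaped + (anchored ? '$' : '')).test(path);
}

// Made-up sample paths: the second is a real page caught by accident.
const samplePaths = [
  '/services/submit_node/12345',
  '/products/blue_node_cable',
  '/about',
];
for (const path of samplePaths) {
  console.log(path, isDisallowed('*_node*', path));
}
```

In this run the product page is blocked alongside the fictional URL, which is exactly the kind of collateral damage worth catching before the rule goes live.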
Dynamically Generate a File

You could also take the inline JavaScript out of your markup and dynamically generate a unique file with its contents. Googlebot could then be restricted from accessing such files using robots.txt - however, users are once again advised to tread with caution. It could be argued that dynamically generating files will increase the number of components required to load the page, thus increasing page load times, which Google is introducing as a ranking factor, but we'll let you be the judge.
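One way to picture the dynamically generated file is sketched below, as a plain function rather than a full web server. Everything specific in it is an assumption: the /scripts/tracking.js location, the respond function, and the shape of the returned object are hypothetical and would need adapting to whatever server framework you use.

```javascript
// Hypothetical sketch: the tracking script is generated on request at its
// own URL instead of being written inline into the page markup.
const TRACKING_SCRIPT_PATH = '/scripts/tracking.js'; // assumed location

function respond(requestPath, nodeId) {
  if (requestPath === TRACKING_SCRIPT_PATH) {
    // Generate the script body on the fly for this node.
    const path = '/services/submit/form/node/' + nodeId;
    return {
      status: 200,
      contentType: 'application/javascript',
      body: 'function Tracking(){ return ' + JSON.stringify(path) + '; }',
    };
  }
  if (requestPath === '/robots.txt') {
    // Keep Googlebot away from the directory holding generated scripts.
    return {
      status: 200,
      contentType: 'text/plain',
      body: 'User-agent: Googlebot\nDisallow: /scripts/\n',
    };
  }
  return { status: 404, contentType: 'text/plain', body: 'Not Found' };
}

console.log(respond(TRACKING_SCRIPT_PATH, 12345).body);
console.log(respond('/robots.txt').body);
```

The page markup would then reference the external script with an ordinary script tag, which is where the extra request mentioned above comes from.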
There you have it: three simple and not so simple ways to prevent Googlebot from crawling fictional URLs within inline JavaScript. Of the three, the JavaScript join() function appears to be the most effective.
Googlebot's intrusiveness is down to its creators, the Google engineers. Bear in mind that Googlebot and its crawling capabilities follow rules like any other machine-based program - meaning that this may be a case where Google has to go back to the drawing board to combat the problem. It should also be pointed out that Googlebot does not have infinite resources to crawl the web, meaning that this intrusive, undesired crawling of fictional URLs could be taking away crawling resources that could be better applied elsewhere.