SEO Guide: The Use And Structure Of Robots.txt Files
Many people who are just getting started in the SEO world have a difficult time wrapping their heads around what a Robots.txt file is and what it does for a website. So what does a Robots.txt file do?
Robots.txt
Search engines visit websites to crawl and index them. These robots are essentially trying to categorize and archive websites, which is generally a good thing. However, there are occasionally certain parts of your website, or even your entire site, that you do not want crawled. There are various reasons for this.
Examples Of Why You Wouldn’t Want A Crawl Performed
– Two versions of the same page exist, and you don’t want to be penalized for duplicate content.
– Information on the site may be sensitive or private.
– A website or specific pages are under construction and not ready to be indexed.
– You are doing a site transfer, and two identical versions of your site will be on the web.
– Demo Sites
– Admin Areas
How Robots.txt Files Work
You use the Robots.txt file to tell search engine robots exactly which pages you would like them to visit and which ones to avoid. The location of the Robots.txt file is crucial if it is to work correctly. It must always be placed in the root directory; if it is not there, the robots will not find it, as the root directory is the only place they look. For example, it would live at http://YourSite.com/robots.txt.
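In other words, a crawler derives the robots.txt location from the host alone, no matter which page it started from. A minimal Python sketch (the helper name robots_url is my own, not a standard API):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Crawlers only ever look for robots.txt at the root of the host,
    # so we discard the path, query, and fragment of the page URL.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://YourSite.com/blog/some-post.html"))
# http://YourSite.com/robots.txt
```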
There are also many types of crawlers to be aware of, one or more for each search engine and other services, and using their correct names is critical if your Robots.txt file is to work appropriately. Here is a short list of some of the more commonly used ones to consider for Google.
– * (The sign for all crawlers, meaning any and all.)
– Googlebot (Google.com)
– Googlebot-Image (Google Image crawler)
– Googlebot-News (Google News Crawler)
– Googlebot-Video (Google Video Crawler)
– Googlebot-Mobile (Google Mobile Site Crawler)
– AdsBot-Google (Google Ads Crawler)
Structuring A Robots.txt File
Structuring a Robots.txt file is fairly easy and straightforward; however, it must be done precisely. In essence, the file is a list of user agents and the files and directories they are disallowed from. The user agents are the search engine crawlers; the disallow lines name the files and directories that are to be kept from being indexed. You can also add comment lines, denoted by the # sign, which give a little extra information for future reference.
Example Of Proper Robots.txt Structuring
# To disallow the crawl of the /temp directory
User-agent: *
Disallow: /temp/
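One way to sanity-check a file like the example above before deploying it is Python’s standard-library robots.txt parser; this sketch feeds it the same two lines and asks which paths are blocked.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# The example rules from above, supplied directly as lines
rp.parse([
    "User-agent: *",
    "Disallow: /temp/",
])

print(rp.can_fetch("*", "/temp/draft.html"))  # False: inside the disallowed directory
print(rp.can_fetch("*", "/index.html"))       # True: everything else is allowed
```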
Mistakes To Be Aware Of When Dealing With A Robots.txt File
The example above is relatively easy to understand; however, there are often much more complicated situations to deal with, such as allowing certain user agents access while disallowing others. The most common mistakes to be on the lookout for are typos and wrong directory names.
Example Of Potential Mistakes
User agent: Googlebot (Problem: Missing the - between User and agent)
User-agent Googlebot (Problem: Missing the : after User-agent)
Disallow: /tem/ (Problem: Missing the p in temp)
Disallow/temp (Problem: Missing the : after Disallow and the / around temp)
Examples For Using Robots.txt Correctly
What follows are some good examples of how to use Robots.txt files correctly.
All Robots Visit All Files
User-agent: *
Disallow:
All Robots Disallowed From Entire Site
User-agent: *
Disallow: /
All Robots Disallowed To Three Directories
User-agent: *
Disallow: /images/
Disallow: /private/
Disallow: /tmp/
Specific Robot Disallowed To Enter Site
User-agent: Googlebot
Disallow: /
Two Specific Robots Disallowed To Enter Two Different Directories
User-agent: Googlebot-Image
User-agent: Googlebot-Video
Disallow: /Images/
Disallow: /Video/
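Grouped user-agent lines like these can also be checked with Python’s standard-library parser. Note that in a shared group, both Disallow lines apply to both crawlers, and crawlers not named in the group are unaffected (at least under this parser’s matching rules):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot-Image",
    "User-agent: Googlebot-Video",
    "Disallow: /Images/",
    "Disallow: /Video/",
])

print(rp.can_fetch("Googlebot-Image", "/Images/photo.jpg"))  # False: named in the group
print(rp.can_fetch("Googlebot-Video", "/Images/photo.jpg"))  # False: the group shares both rules
print(rp.can_fetch("Googlebot-Image", "/blog/post.html"))    # True: path not disallowed
print(rp.can_fetch("Googlebot", "/Images/photo.jpg"))        # True: plain Googlebot is not in the group
```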
All Robots Disallowed To Enter A Specific File
User-agent: *
Disallow: /Directory/file.html
Allow A Single File Within An Otherwise Disallowed Directory
User-agent: *
Allow: /File/Phile.html
Disallow: /File/
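This allow-plus-disallow pattern can be verified the same way. One caution on ordering: Google resolves Allow/Disallow conflicts by the most specific (longest) matching path, while Python’s standard-library parser applies the first matching rule, so listing the Allow line first keeps the two interpretations in agreement. A quick check:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /File/Phile.html",
    "Disallow: /File/",
])

print(rp.can_fetch("*", "/File/Phile.html"))  # True: the single allowed file
print(rp.can_fetch("*", "/File/other.html"))  # False: the rest of /File/ stays blocked
```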
Hopefully this article helps demystify the confusing nature of Robots.txt files. Remember, the format is quite simple to use, but it’s also quite simple to mistype. Meanwhile, the only way you’ll know you’ve messed up is if you notice your disallowed files have been crawled and indexed, at which point it’s a little late. You can disallow them for the next run, but the version that was already crawled may be cached and saved. So be sure you watch what you type, and have fun making the Internet work. But more importantly… Stay frosty, folks!