SEO Guide: The Use And Structure Of Robots.txt Files

Many people who are just getting started in the SEO world have a difficult time wrapping their heads around what a Robots.txt file is and what it does for a website. So what does a Robots.txt file do?

Robots.txt

Search engines regularly send robots to crawl websites so that they can index them. These robots are essentially trying to categorize and archive websites, which is generally a good thing. However, on occasion there are certain parts of your website, or even the entire site, that you do not want crawled. There are various reasons for this.

Examples Of Why You Wouldn’t Want A Crawl Performed

–       Two versions of the same page exist and you don’t want to get penalized for duplicate content.

–       Information on-site that may be sensitive or private.

–       The website or specific pages are under construction and not yet ready to be indexed.

–       You are doing a site transfer, during which there will be two identical versions of your site on the web.

–       Demo Sites

–       Admin Areas

How Robots.txt Files Work

You use the Robots.txt file to tell search engine robots exactly which pages you would like them to visit, and which ones to avoid. The location of the Robots.txt file is crucial if it is to work correctly. It must always be placed in the main (root) directory; if it is not there, the robots will not find it, because the root directory is the only place they look for it. As an example, it would live at http://YourSite.com/robots.txt.

There are also many types of crawlers to be aware of for each search engine and other sites, and using their correct names is critical if your Robots.txt file is to work appropriately. Here is a short list of some of the more commonly used ones for Google, followed by an example of using one of them.

–       * (The sign for all crawlers, meaning any and all.)

–       Googlebot (Google.com)

–       Googlebot-Image (Google Image crawler)

–       Googlebot-News (Google News Crawler)

–       Googlebot-Video (Google Video Crawler)

–       Googlebot-Mobile (Google Mobile Site Crawler)

–       AdsBot-Google (Google Ads Crawler)
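As a quick illustration of how these names are used (the directory name here is only a placeholder, and the syntax itself is covered in the next section), you could keep Google’s image crawler out of an images folder while every other robot remains free to crawl the whole site:

User-agent: Googlebot-Image

Disallow: /images/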

Structuring A Robots.txt File

Structuring a Robots.txt file is fairly easy and straightforward; however, it must be done in a precise manner. In essence, the file is a list of user agents and the files and directories they are disallowed from. The user agents are search engine crawlers, and the disallows are the files and directories that are to be kept from being crawled and indexed. You can also add comment lines, denoted by the # sign, to give a little extra information for future use.

Example Of Proper Robots.txt Structuring

# To disallow the crawl of the /temp/ directory

User-agent: *

Disallow: /temp/

Mistakes To Be Aware Of When Dealing With A Robots.txt File

The example above is relatively easy to understand; however, there are often much more complicated situations to deal with, such as allowing certain user agents access while disallowing others, as shown in the sketch below. The most common mistakes to look out for are typos and wrong directory names.
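For instance, to give every crawler full access while keeping one specific crawler out of the entire site (Googlebot-News is chosen here purely for illustration), you would list two separate groups:

User-agent: Googlebot-News

Disallow: /

User-agent: *

Disallow: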

Example Of Potential Mistakes

User agent: Googlebot (Problem: Missing the - between User and agent)

User-agent Googlebot (Problem: Missing the :  after User-agent)

Disallow: /tem/ (Problem: Missing the p in temp)

Disallow: /temp (Problem: Missing the trailing / after /temp)

Examples For Using Robots.txt Correctly

What follows here are some good examples of how to use Robots.txt files correctly.

All Robots Visit All Files

User-agent: *

Disallow:

All Robots Disallowed From Entire Site

User-agent: *

Disallow: /

All Robots Disallowed From Three Directories

User-agent: *

Disallow: /images/

Disallow: /private/

Disallow: /tmp/

Specific Robot Disallowed To Enter Site

User-agent: Googlebot

Disallow: /

Two Specific Robots Disallowed To Enter Two Different Directories

User-agent: Googlebot-Image

Disallow: /Images/

User-agent: Googlebot-Video

Disallow: /Video/

All Robots Disallowed To Enter A Specific File

User-agent: *

Disallow: /Directory/file.html

Allow A Single File Within An Otherwise Disallowed Directory

User-agent: *

Allow: /File/Phile.html

Disallow: /File/
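Before you upload your finished file, it is worth sanity-checking it. One rough way, sketched below with placeholder rules and URLs, is Python’s built-in urllib.robotparser module; the standard-library parser does not mirror every quirk of Google’s own matching, but it will catch obvious mistakes:

from urllib.robotparser import RobotFileParser

# Placeholder rules; paste in your own Robots.txt content to test it
rules = """
User-agent: *
Disallow: /temp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Expected: True, the page sits outside the disallowed directory
print(rp.can_fetch("Googlebot", "http://YourSite.com/page.html"))

# Expected: False, /temp/ is disallowed for every user agent
print(rp.can_fetch("Googlebot", "http://YourSite.com/temp/page.html"))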

Hopefully this article helps demystify the confusing nature of Robots.txt files. Remember, the file is quite simple to use, but it’s also quite simple to mistype. Often the only way you’ll know you’ve messed up is when you notice your disallowed files have been crawled and indexed, at which point it’s a little late. You can disallow them for the next crawl, but the version that was already crawled will be cached and saved. So be sure you watch what you type, have fun making the Internet work, and more importantly… Stay frosty folks!

 
