Today, robots don’t just serve you in daily life; they also have an important job to do on the internet. However, many people have no clue what this is all about.
Robots.txt is a plain text file, and nearly every website has one in the root of its domain, which you can check by simply typing www.domain.com/robots.txt.
SEO success is not only about linking and writing; robots.txt also plays a critical role in your online business. There are some pages you don’t want search bots to find, a bit like a messy closet you don’t want anybody looking into.
What is a Robots.txt File?
Robots.txt is simply a normal text file and is always located in the root directory of your website. Once you know the Robots Exclusion Standard, creating a robots.txt file is easy.
The most important thing before you start creating robots.txt: do NOT create it in HTML editors like Dreamweaver or FrontPage. Write it only in a basic text editor like Notepad or TextEdit. Robots.txt is not an HTML file. It uses its own syntax, which is completely different from other web-related languages. Unlike those languages, however, robots.txt is extremely easy.
Two Main Sections of Robots.txt
All of the rules in the robots.txt file are grouped under a ‘User-agent’ line, which specifies which bot the commands that follow apply to:
Here are some major bots:
Googlebot (Google’s web search robot)
Mediapartners-Google (Google AdSense robot)
Slurp (Yahoo’s search robot)
Xenu Link Sleuth (a desktop link-checking tool)
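For instance, a rule block aimed only at Google’s main crawler might look like this (the /private/ path is just a placeholder):

```
# These rules apply only to Google's web search crawler
User-agent: Googlebot
Disallow: /private/
```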
The second main section of robots.txt is ‘Disallow’, which tells the specified bots not to crawl and index a page. One thing you should remember: adding a ‘Disallow’ line does not by itself keep the specified bots out of your entire website. You have to choose which pages or files should be disallowed for search bots.
Disallowing bots from crawling your privacy policy page:
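Assuming the page lives at /privacy-policy.html (adjust the path to match your site), the rule would look like:

```
# Keep all bots away from the privacy policy page
User-agent: *
Disallow: /privacy-policy.html
```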
Specifying entire directories:
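For example, to block a whole directory (here a hypothetical /images/ folder) for every bot:

```
# Block every bot from the entire /images/ directory
User-agent: *
Disallow: /images/
```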
The above command disallows all bots from crawling the entire directory. If you want to target specific bots, put the bot’s name in place of the *.
Security is a huge issue online, and it is the most crucial part of every online business. Naturally, some webmasters, especially new ones, are nervous about listing the directories they want to keep private, thinking they’ll be handing hackers and black-hat evildoers a roadmap to their most secret stuff.
Here is what you should do: if you want to block a private directory without naming it in full, abbreviate it and add an asterisk to the end.
Make sure that the abbreviation is unique. You could name the private directory you want protected ‘/secrethotcakemystery/’ and then disallow the abbreviated prefix in robots.txt.
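Following that advice, the file would contain something like:

```
# Blocks every path starting with "sec", including /secrethotcakemystery/
User-agent: *
Disallow: /sec*
```

Note that the trailing * wildcard is not part of the original Robots Exclusion Standard, though major crawlers such as Googlebot honor it.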
The above robots.txt will now disallow search engine spiders from indexing and crawling directories whose names begin with “sec”. However, make sure you are not disallowing other directories that you do want crawled but that share the prefix, such as “secondary”, because the /sec* command will stop every directory starting with ‘sec’ from being crawled by spiders.
Tips for Robot Domination
- Make sure to put robots.txt ONLY in the root directory of your website, like this: www.mywebsite.com/robots.txt
- If you leave the Disallow line blank, it indicates that ALL files may be retrieved. For example:
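```
# The empty Disallow means nothing is blocked
User-agent: *
Disallow:
```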
- You may add as many Disallow directives as you like to a single user-agent. Just remember that every User-agent line must be followed by at least one Disallow directive.
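A single user-agent block with several Disallow lines might look like this (all paths are illustrative):

```
# Multiple rules for one bot: each path gets its own Disallow line
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /drafts/
```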
- Although there is no strict rule, safe and best SEO practice is to use at least one Disallow line for every User-agent directive. If you put in too many, it can become confusing for you as well as for the search bots, and you seriously don’t want to confuse the search bots. If search bots find an incorrectly formatted robots.txt, they simply ignore it, which is not a good sign for healthy SEO practice.
- To check whether your robots.txt is working, test it in your Google Webmaster Tools account.
- If you are planning to upload an empty robots.txt, at least use the basic directive that allows search bots access to the entire site.
- You can add comments to robots.txt by putting a # at the front of a line; the entire line will then be ignored. However, do not put comments at the end of a directive line, because some search bots do not handle them properly.
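For example (the path is illustrative):

```
# This whole line is a comment and is ignored by bots
User-agent: *
Disallow: /images/
```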
- What kind of content would you want to disallow in your robots.txt?
- Folders you don’t want visitors to find and pages that are not password protected.
- Printer-friendly versions of pages, to avoid duplicate-content issues.
- Image directories.
- The cgi-bin directory, which contains program code.
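Putting those together, a file covering all four cases might read like this (the directory names are examples only):

```
# Block folders we'd rather keep out of search results
User-agent: *
Disallow: /hidden-folder/
Disallow: /print/
Disallow: /images/
Disallow: /cgi-bin/
```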
Some useful robots.txt commands
Allowing all bots: the following configuration lets bots crawl and index everything on your site, so use it carefully.
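A minimal allow-everything file looks like this:

```
# * matches every bot; the empty Disallow blocks nothing
User-agent: *
Disallow:
```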
Here, * covers ALL robots, and the empty Disallow gives an open door to ANY bot.
The following command prevents ALL bots from indexing and crawling your site.
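```
# The lone / blocks the entire site for every bot
User-agent: *
Disallow: /
```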
• To deny Ask’s bot, Teoma:
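```
# Teoma is Ask.com's crawler; the / blocks it from the whole site
User-agent: Teoma
Disallow: /
```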
• To keep bots out of your cgi-bin and your image directory:
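Assuming the standard /cgi-bin/ directory and an /images/ folder:

```
# Keep every bot out of both directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
```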
• To disallow Google from indexing your images while other bots can still access them:
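One way to do this is to target Googlebot-Image, Google’s dedicated image crawler (the /images/ path is an example):

```
# Only Google's image crawler is blocked; other bots are unaffected
User-agent: Googlebot-Image
Disallow: /images/
```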
• To disallow Google from pages intended for Yahoo:
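Assuming those pages live in a hypothetical /yahoo-pages/ directory, you could block Googlebot there while leaving Yahoo’s crawler (Slurp) unrestricted:

```
# Googlebot may not enter /yahoo-pages/ (an example path)
User-agent: Googlebot
Disallow: /yahoo-pages/

# Yahoo's crawler, Slurp, is free to crawl everything
User-agent: Slurp
Disallow:
```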
# NEVER use user-agent rules or robots.txt for cloaking.
A well-written robots.txt file helps your site get indexed more deeply — for most sites, up to 15% deeper. It also lets you control your content so that your site’s SEO footprint is clean, indexable, and literal fodder for search engines. That is worth the effort.