An Introduction to the Robots.txt File

What is a Robots.txt file?

A robots.txt file is a text file that is stored on your web server. Its purpose is to tell search engine crawlers which parts of your website they may access. Your robots.txt file must be stored in the root directory of your site, because that is the only place crawlers look for it.

For example, it should be found here: https://myawesomewebsite.com/robots.txt

A search engine (specifically the search engine ‘spider’ or ‘bot’) will request your robots.txt file before crawling your site. It reads the rules in the file to determine which URLs on the site it is permitted to access.
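
As a rough illustration of the "root directory" rule, the sketch below works out where a crawler would look for robots.txt given any page URL on a site. The function name robots_url and the sample page URL are made up for this example; only the Python standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL on the example site used in this article.
print(robots_url("https://myawesomewebsite.com/blog/my-article?page=2"))
# prints: https://myawesomewebsite.com/robots.txt
```

Whatever page you start from, the result is always the same file at the top level of the site.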

 

Do I need a Robots.txt file?

You may not need a robots.txt file. If there are no files on your website that you want to keep search engines away from, there is no need to have one. Without one, search engines will simply be able to access all of your site.

If there is any content that you wish to block from search engines, you will need a robots.txt file. For example, you may wish to prevent access to a site that is under construction, or to deny access to certain search engines.

Another use of robots.txt is to make the most of your crawl budget. Crawl budget refers to the number of pages that Google will crawl on your site in a given day. It is not a fixed figure and depends on a variety of factors. However, blocking certain URLs in your robots.txt file can free up crawl budget for your important pages.

 

How do I write a robots.txt file?

All you need to create your file is a simple text editor, such as Notepad or any other plain text editor. A robots.txt file will either allow full access, allow no access or allow conditional access.

The first line of every group of directives names the user-agent. The user agent is a specific bot or crawler. For example, Googlebot is the general user agent for Google, and Bingbot is the one for Bing. I have used an asterisk in the examples below, as this refers to all robots rather than a specific user agent. If you want to target a particular user agent, simply replace the asterisk with its name. Each search engine lists its user agents in its own documentation.
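
To see how the user-agent line changes who a rule applies to, here is a minimal sketch using Python's standard urllib.robotparser module. The rules and the /drafts/ folder are invented for this example: Googlebot is kept out of one folder while every other robot is unrestricted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Googlebot is kept out of /drafts/,
# every other robot (the asterisk group) may crawl everything.
RULES = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://myawesomewebsite.com/drafts/post"
googlebot_allowed = parser.can_fetch("Googlebot", url)
bingbot_allowed = parser.can_fetch("Bingbot", url)

print("Googlebot:", googlebot_allowed)  # Googlebot: False
print("Bingbot:", bingbot_allowed)      # Bingbot: True
```

Bingbot has no group of its own here, so it falls back to the asterisk group.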

 

Some common examples:

 

Allow full access:

User-agent: *
Disallow:

You can also write the rule like this to allow full access:

User-agent: *
Allow: /

 

Block all access:

User-agent: *
Disallow: /

 

Block access to one folder:

User-agent: *
Disallow: /folder-name/

 

Block access to everything in a folder apart from one file:

In this example, everything in the images folder will be blocked except the file named mypic.jpg:

User-agent: *
Disallow: /images/
Allow: /images/mypic.jpg
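
The example above relies on the crawler letting the more specific rule win, which is how the major search engines document their behaviour. The sketch below is a simplified illustration of that "longest match" idea, not any search engine's actual code. The function name is_allowed is made up, and wildcards (* and $) are left out on purpose.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest-match evaluation: the most specific matching rule wins.

    rules holds (directive, path_prefix) pairs for one user-agent group.
    Wildcards (* and $) are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # if no rule matches, the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing; skip non-matching rules
        is_allow = directive.lower() == "allow"
        # A longer prefix beats a shorter one; on a tie, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/images/"), ("Allow", "/images/mypic.jpg")]
print(is_allowed(rules, "/images/mypic.jpg"))  # True
print(is_allowed(rules, "/images/other.jpg"))  # False
print(is_allowed(rules, "/about.html"))        # True
```

Some simpler parsers apply the first rule that matches in file order instead, so it is worth testing the file against the crawlers you care about.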

 

Block access to specific web pages:

User-agent: *
Disallow: /legal-info.html
Disallow: /disclaimer.html
Disallow: /blog/my-article
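
One thing to keep in mind is that each Disallow value is matched against the start of the URL path, so a rule can block more than the single page you had in mind. The sketch below checks the rules above with Python's standard urllib.robotparser module. The extra paths /blog/my-article-2 and /blog/other-post are invented for the test.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /legal-info.html
Disallow: /disclaimer.html
Disallow: /blog/my-article
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

SITE = "https://myawesomewebsite.com"
# The last two paths are hypothetical pages used only for this check.
paths = ["/legal-info.html", "/blog/my-article", "/blog/my-article-2", "/blog/other-post"]

results = {path: parser.can_fetch("*", SITE + path) for path in paths}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Here /blog/my-article-2 is blocked as well, because its path begins with /blog/my-article, while /blog/other-post stays crawlable.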

 

Anything that is disallowed in the robots.txt file may still be displayed in the search results. This is because a search engine can still list a URL that it has not crawled. This may happen if there are backlinks pointing to the page or if the page is the only page that answers a particular search query. For this reason, robots.txt shouldn’t be used to block sensitive information or to prevent pages from displaying in the SERPs.

If you want to make sure that a page does not display in the search results, you should NOINDEX the page. It is important to be aware that the search engine must be allowed to crawl the page in order for it to read the noindex command! If the page is disallowed in the robots.txt file, the crawler never fetches it and so never sees the noindex.

 

Noindex vs Robots.txt

The robots.txt file should be used to prevent crawlers from accessing web pages. It is particularly useful for restricting access to entire directories or specific folders.

If you want to prevent a web page from being indexed, you should use the meta robots tag instead. A meta robots tag set to noindex tells search engines not to index the page, which means that it will not be displayed in the search results.

The noindex tag looks like this: <meta name="robots" content="noindex">
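
If you want to confirm that a page actually carries the tag, a small script can check the HTML for you. The sketch below uses Python's standard html.parser module. The class name RobotsMetaParser, the function has_noindex and the sample page are all made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None, so fall back to an empty string.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(part.strip().lower() for part in content.split(","))


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page source used only for this check.
page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(page))  # True
```

A crawler can only read this tag if it is allowed to fetch the page, so the page must not also be disallowed in robots.txt.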

Mistakes are easily made when using robots.txt, and a simple error could prevent search engines from accessing your entire site, so proceed with caution. Used correctly, however, it can be a very useful tool.