General Information About robots.txt
The robots.txt file is located in the root directory of a website and tells search robots which files and pages they should or should not access.
In general, website owners want search bots to discover their website, but there are cases where this is undesirable: for example, when sensitive information is stored on the site, or when bandwidth needs to be conserved by keeping pages with large amounts of data or high-resolution images out of the crawl.
When a search robot encounters a webpage, the first thing it looks for is the robots.txt file. Once found, the robot checks the indexing instructions contained within the file.
Important to know: Each website can have only one robots.txt file. For an addon domain, a separate robots.txt file must be created in that domain's own root directory.
A robots.txt file consists of records with two fields: a line naming the user agent (the search robot the rules apply to) and one or more lines beginning with the following directive:
Disallow:
The robots.txt file must be created in UNIX format.
Robots.txt Syntax Basics
A typical robots.txt file contains something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
In this example, the indexing of three folders (‘/cgi-bin/’, ‘/tmp/’, and ‘/~different/’) is disabled.
Important to note: Each command must be written on a separate line.
An asterisk (*) in the User-agent field means “any search robot”. Wildcard patterns such as “Disallow: *.gif” or “User-agent: Mozilla*” are not supported by the original standard and will not work as intended. Logical errors of this kind should be avoided, as they are among the most common mistakes.
Other common errors include misspelled directories, incorrect software identifiers, missing colons after User-agent and Disallow, etc. As the robots.txt file becomes more complex, it becomes easier to make these kinds of mistakes.
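Mistakes like these are easy to catch programmatically. Python's standard-library urllib.robotparser module parses a robots.txt file and answers the question “may this user agent fetch this URL?”; here it is applied to the three-folder example above (the domain example.uk is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, as a string
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Paths under the disallowed folders are blocked for every bot
print(rp.can_fetch("AnyBot", "http://example.uk/cgi-bin/script"))  # False
print(rp.can_fetch("AnyBot", "http://example.uk/tmp/cache.dat"))   # False

# Everything else remains crawlable
print(rp.can_fetch("AnyBot", "http://example.uk/index.html"))      # True
```

Running a draft robots.txt through a parser like this before uploading it is a cheap way to spot misspelled directories or missing colons.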
Examples of Its Use
Disable indexing of the entire site for all search bots:
User-agent: *
Disallow: /
Allow all search robots to index the entire site:
User-agent: *
Disallow:
Only disallow certain directories from being indexed:
User-agent: *
Disallow: /cgi-bin/
Disallow indexing of the site for a specific search robot:
User-agent: Bot1
Disallow: /
Allow indexing for a specific search bot while disallowing others:
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
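The effect of per-bot rules like these can be verified with Python's urllib.robotparser; this sketch uses Googlebot as the example crawler (any real bot token would behave the same way):

```python
from urllib.robotparser import RobotFileParser

# One bot is allowed everything, all others are locked out
rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://example.uk/page.html"))  # True
print(rp.can_fetch("OtherBot", "http://example.uk/page.html"))   # False
```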
Disable indexing of all files except one:
This can be a bit cumbersome, as the “Allow” directive was not part of the original robots.txt standard (although major search engines such as Google and Bing now support it). The portable approach is to place all the files that you do not want indexed in a subfolder, and keep the one file that should remain accessible outside it:
User-agent: *
Disallow: /docs/
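For crawlers that do honour Allow, there is an alternative to the subfolder trick. The sketch below tests such a rule set with Python's urllib.robotparser; note that this parser applies the first matching rule, so the Allow line must come before the broader Disallow (Google, by contrast, applies the most specific rule regardless of order):

```python
from urllib.robotparser import RobotFileParser

# Variant using Allow (supported by major engines, not the original standard).
# urllib.robotparser is order-sensitive: Allow must precede the Disallow here.
rules = """\
User-agent: *
Allow: /docs/public.html
Disallow: /docs/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("AnyBot", "http://example.uk/docs/public.html"))  # True
print(rp.can_fetch("AnyBot", "http://example.uk/docs/secret.html"))  # False
```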
Robots.txt and SEO
Remove the image-indexing Disallow rule:
For some content management systems (CMS), the robots.txt file may inadvertently block the images folder from being indexed.
This issue does not occur with newer CMS versions, but older versions should be checked.
Blocking image indexing means that your images will not appear in Google Image Search, which can negatively impact SEO.
To allow image indexing, you need to remove the following line from robots.txt:
Disallow: /images/
Specify a path to the sitemap.xml file:
If you have a sitemap.xml file (and you should), it is useful to include the following line in your robots.txt file:
Sitemap: http://www.domain.uk/sitemap.xml
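Since Python 3.8, urllib.robotparser can also report the Sitemap entries it finds, which is a quick way to check that the line was picked up:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow:

Sitemap: http://www.domain.uk/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Returns a list of sitemap URLs, or None if the file declares none
print(rp.site_maps())  # ['http://www.domain.uk/sitemap.xml']
```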
Other Information
- Do not block CSS, JavaScript, or similar scripts by default. This can prevent Googlebot from properly rendering the page and recognising that it is mobile-optimised.
- The robots.txt file can be used to prevent certain pages from being indexed, such as login pages or 404 error pages, but this is better managed using the robots meta tag.
- Adding a Disallow directive in robots.txt does not remove data from search engines; it only prevents search robots from indexing the specified pages. If you want to remove content from search results, it is better to use a meta noindex tag.
- As a general rule, you should not use robots.txt to handle duplicate content. There are more effective solutions, such as the rel=canonical tag, which should be placed in the HTML head section.
- Always keep in mind that robots.txt is a crucial file, but it is not the only option: Bing Webmaster Tools and Google Search Console often provide more powerful tools to manage indexing and crawling effectively.
Robots.txt for WordPress
When you create content in WordPress for the first time, a robots.txt file is automatically generated. However, if a real (non-virtual) robots.txt file already exists on the server, this will not happen. A virtual robots.txt file does not physically exist on the server; it can only be accessed via the following link: http://www.yourpage.uk/robots.txt
By default, Google's Mediapartners bot (the AdSense crawler) is allowed, while many spambots and certain essential WordPress folders and files are blocked.
If you have not yet created a real robots.txt file, you can do so using any text editor and then upload it to the server’s root directory via FTP.
Block Main WordPress Directories
For all WordPress installations, there are three standard directories (wp-content, wp-admin, wp-includes) that do not need to be indexed.
However, the entire wp-content folder should not be blocked, as it contains an uploads folder where website media files are stored, which should remain accessible. Therefore, the following approach should be used:
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
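A quick way to confirm that these rules hide the core directories while leaving uploaded media reachable is to run them through Python's urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Core directories are hidden from crawlers ...
print(rp.can_fetch("AnyBot", "http://example.uk/wp-admin/index.php"))          # False
print(rp.can_fetch("AnyBot", "http://example.uk/wp-content/plugins/x.php"))    # False

# ... but media files in wp-content/uploads stay indexable
print(rp.can_fetch("AnyBot", "http://example.uk/wp-content/uploads/img.jpg"))  # True
```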
Blocking Based on Website Structure
Each blog can be restricted in several ways:
a) By category
b) By tags
c) Based on both, or neither
d) By date-based archives
I. If the website is category-structured, indexing of tag archives is unnecessary.
The tag base can be found by clicking the Options tab and then the Permalinks tab. If the field is empty, the base is simply “tag”:
Disallow: /tag/
II. If the website is tag-structured, block the category archive. Find the category section and apply the following command:
Disallow: /category/
III. If the website uses both categories and tags, no specific instructions are required. If neither is used, both should be disabled:
Disallow: /tags/
Disallow: /category/
IV. If the website uses date-based archives, they can be blocked as follows:
Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/
Important to know: The shortcut “Disallow: /20” cannot be used, as it would block every post or page whose URL starts with “20”.
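The reason is that robots.txt rules are simple prefix matches. Python's urllib.robotparser makes the pitfall easy to demonstrate: a rule like Disallow: /20 blocks not only the yearly archives but any URL whose path begins with /20:

```python
from urllib.robotparser import RobotFileParser

# A careless "shortcut" rule
rules = """\
User-agent: *
Disallow: /20
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# It does block the date archives ...
print(rp.can_fetch("AnyBot", "http://example.uk/2013/some-post/"))  # False

# ... but it also blocks any unrelated page starting with "20"
print(rp.can_fetch("AnyBot", "http://example.uk/20-seo-tips/"))     # False
```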