Robots.txt and page indexing: how to disable indexing for the pages that need it

The robots.txt file is a plain text file in .txt format that restricts search robots' access to content on an HTTP server. By definition, robots.txt is the robots exclusion standard, which was adopted by the W3C on January 30, 1994, and is voluntarily followed by most search engines. The robots.txt file consists of a set of instructions for crawlers that prevent certain files, pages, or directories on a site from being indexed. Let's start with the case where the site does not restrict robot access at all.

A simple robots.txt example:

User-agent: *
Allow: /

Here, robots.txt allows all robots to index the entire site.

The robots.txt file must be uploaded to the root directory of your website so that it is available at:

Your_site.ru/robots.txt

Placing a robots.txt file at the root of a site usually requires FTP access. However, some content management systems (CMS) allow you to create robots.txt directly from the site's control panel or through a built-in FTP manager.

If the file is available, then you will see the contents of robots.txt in the browser.

What is robots.txt for?

Robots.txt is an important aspect of any site. Why is robots.txt needed? In SEO, for example, robots.txt is used to exclude from indexing pages that do not contain useful content, and for much more. How, what and why gets excluded has already been covered in a separate article, so we will not dwell on it here. Does every site need a robots.txt file? Yes and no. If robots.txt is used only to exclude pages from search, then for small sites with a simple structure and static pages such exclusions may be unnecessary. However, even a small site can benefit from some robots.txt directives, such as Host or Sitemap, but more on that below.
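For example, even a tiny site can use a robots.txt that restricts nothing but still points robots to the sitemap (a sketch; the URL here is a placeholder):

User-agent: *
Disallow:

Sitemap: http://your_site.ru/sitemap.xml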

How to create robots.txt

Since robots.txt is a plain text file, you can create it in any text editor, for example Notepad. As soon as you open a new text document, you have already started creating robots.txt; all that remains is to compose its content according to your requirements and save it as a text file named robots in .txt format. It's simple, and creating a robots.txt file should not cause problems even for beginners. Below I will show you how to write robots.txt and what to put in it.

Create robots.txt online

An option for the lazy: create robots.txt online and download the ready-made file. Many services offer online robots.txt generation; the choice is yours. The main thing is to understand clearly what will be prohibited and what will be allowed, otherwise creating a robots.txt file online can turn into a tragedy that may be difficult to correct later, especially if something that should have been closed gets into the search. Be careful - check your robots file before uploading it to the site. Still, a hand-written robots.txt reflects the structure of your restrictions more accurately than one automatically generated and downloaded from another site. Read on to learn what to pay special attention to when editing robots.txt.

Editing robots.txt

Once you have created a robots.txt file, online or by hand, you can edit it. You can change its content as you like; the main thing is to follow certain rules and the robots.txt syntax. While working on the site, the robots file may change, and if you edit robots.txt, do not forget to upload the updated, current version of the file with all the changes to the site. Next, let's consider the rules for setting up the file, so you know how to change robots.txt without making a mess of things.

Proper setting of robots.txt

Proper setting of robots.txt helps keep private information out of the search results of major search engines. However, do not forget that robots.txt commands are nothing more than a guide to action, not a defense. Reliable search engine robots like Yandex or Google follow robots.txt instructions, but other robots can easily ignore them. Proper understanding and use of robots.txt is the key to getting results.

To understand how to make correct robots txt, first you need to understand the general rules, syntax and directives of the robots.txt file.

A correct robots.txt starts with the User-agent directive, which indicates which robot the directives that follow are addressed to.

User-agent examples in robots.txt:

# Specifies directives for all robots simultaneously
User-agent: *

# Specifies directives for all Yandex robots
User-agent: Yandex

# Specifies directives for only the main Yandex robot
User-agent: YandexBot

# Specifies directives for all Google robots
User-agent: Googlebot

Please note that such a setup tells each robot to use only the directives in the block whose User-agent matches its name.

Robots.txt example with multiple User-agent entries:

# Will be used by all Yandex robots
User-agent: Yandex
Disallow: /*utm_

# Will be used by all Google robots
User-agent: Googlebot
Disallow: /*utm_

# Will be used by all robots except Yandex robots and Google
User-agent: *
Allow: /*utm_

The User-agent directive only indicates which robot is being addressed; immediately after the User-agent directive there should be a command or commands stating the actual condition for the selected robot. The example above uses the prohibiting directive "Disallow" with the value "/*utm_". Thus, we close all URLs that contain utm_ tags. A properly composed robots.txt has no empty lines between the "User-agent" directive, the "Disallow" directive, and the directives following "Disallow" within the current "User-agent" block.

An example of an incorrect line feed in robots.txt (empty lines break the block apart):

User-agent: Yandex

Disallow: /*utm_

Allow: /*id=

User-agent: *

Disallow: /*utm_

Allow: /*id=

An example of a correct line feed in robots.txt:

User-agent: Yandex
Disallow: /*utm_
Allow: /*id=

User-agent: *
Disallow: /*utm_
Allow: /*id=

As you can see from the example, instructions in robots.txt come in blocks, each of which contains instructions either for a specific robot or for all robots "*".

In addition, it is important to follow the correct order of commands in robots.txt when the "Disallow" and "Allow" directives are used together. The "Allow" directive is permissive and is the opposite of the robots.txt "Disallow" command, which is a prohibiting directive.

An example of using these directives together in robots.txt:

User-agent: *
Allow: /blog/page
Disallow: /blog

This example prevents all robots from indexing all pages starting with "/blog", but allows indexing pages starting with "/blog/page".

The previous robots.txt example with the correct sorting:

User-agent: *
Disallow: /blog
Allow: /blog/page

First we disallow the entire section, then we allow some of its parts.

One more correct robots.txt example with combined directives:

User-agent: *
Allow: /
Disallow: /blog
Allow: /blog/page

Pay attention to the correct sequence of directives in this robots.txt.

The "Allow" and "Disallow" directives can also be specified without parameters, in which case the value will be interpreted inversely to the "/" parameter.

An example of a "Disallow/Allow" directive without parameters:

User-agent: *
Disallow: # is equivalent to Allow: /
Disallow: /blog
Allow: /blog/page

How you compose your robots.txt and which interpretation of the directives you use is up to you. Both options are correct. The main thing is not to get confused.

To compose robots.txt correctly, you need to specify precisely, in the directive parameters, what the robots will be prohibited from crawling and in what priority. We will look at the use of the "Disallow" and "Allow" directives more fully below; for now, let's look at the robots.txt syntax. Knowing the syntax will bring you closer to creating the perfect robots.txt with your own hands.

Robots.txt Syntax

Search engine robots voluntarily follow robots.txt commands - the robots exclusion standard - but not all search engines treat the robots.txt syntax in the same way. The robots.txt file has a strictly defined syntax, yet writing robots.txt is not difficult, since its structure is very simple and easy to understand.

Here is a specific list of simple rules, following which you will exclude common robots.txt errors:

  1. Each directive starts on a new line;
  2. Do not include more than one directive on a single line;
  3. Don't put a space at the beginning of a line;
  4. The directive parameter must be on one line;
  5. You don't need to enclose directive parameters in quotation marks;
  6. Directive parameters do not require closing semicolons;
  7. The command in robots.txt is specified in the format - [directive_name]:[optional space][value][optional space];
  8. Comments are allowed in robots.txt after the pound sign #;
  9. An empty newline can be interpreted as the end of a User-agent directive;
  10. The directive "Disallow:" (with an empty value) is equivalent to "Allow: /" - allow everything;
  11. The "Allow", "Disallow" directives specify no more than one parameter;
  12. The robots.txt file name must not contain capital letters; spellings such as Robots.txt or ROBOTS.TXT are incorrect;
  13. Writing directive names and parameters in capital letters is considered bad form, and although the standard treats robots.txt directives as case-insensitive, file and directory names are often case-sensitive;
  14. If the directive parameter is a directory, then the directory name is always preceded by a slash "/", for example: Disallow: /category
  15. Too large robots.txt (more than 32 KB) are considered fully permissive, equivalent to "Disallow: ";
  16. Robots.txt that is inaccessible for some reason may be treated as completely permissive;
  17. If robots.txt is empty, then it will be treated as completely permissive;
  18. If several "User-agent" directives are listed without an empty line between them, all of them except the first one may be ignored;
  19. The use of any symbols of national alphabets in robots.txt is not allowed.

Since different search engines may interpret the robots.txt syntax differently, some of these points can be relaxed. For example, if you specify several "User-agent" directives without an empty line break between them, Yandex will still accept all of them correctly, since Yandex identifies records by the presence of "User-agent" in the line.
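For example, Yandex would read the following back-to-back User-agent lines as one block (the path here is hypothetical):

User-agent: Yandex
User-agent: YandexBot
Disallow: /private/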

Robots.txt should state only what is needed, and nothing more. Don't try to cram everything possible into it. The perfect robots.txt is the one with fewer lines but more meaning. "Brevity is the soul of wit" - this expression applies here very well.

How to check robots.txt

In order to check robots.txt for the correct syntax and structure of the file, you can use one of the online services. For example, Yandex and Google offer their own services for webmasters, which include robots.txt parsing:

Checking the robots.txt file in Yandex.Webmaster: http://webmaster.yandex.ru/robots.xml

To check robots.txt online, you first need to upload it to the root directory of your site. Otherwise, the service may report that it failed to load robots.txt. It is recommended to first check that robots.txt is reachable at the address where the file should be located, for example: your_site.ru/robots.txt.

In addition to the verification services from Yandex and Google, there are many other online robots.txt validators.

Robots.txt vs Yandex and Google

There is a subjective opinion that Yandex reacts more positively to a dedicated "User-agent: Yandex" block of directives in robots.txt than to the general block under "User-agent: *". The situation is similar with robots.txt and Google. Specifying separate directives for Yandex and Google lets you manage site indexing through robots.txt. Perhaps they are flattered by a personal appeal, especially since for most sites the contents of the robots.txt blocks for Yandex, Google and other search engines will be the same. With rare exceptions, all "User-agent" blocks will contain the same default set of directives. Also, using different "User-agent" blocks, you can set a ban on indexing in robots.txt for Yandex, but not, for example, for Google.
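As a sketch (the path and domain are hypothetical), such separate but almost identical blocks might look like this, with the Yandex block carrying the Yandex-specific Host directive discussed below:

# block for Yandex robots
User-agent: Yandex
Disallow: /admin/
Host: mysite.ru

# block for all other robots
User-agent: *
Disallow: /admin/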

Separately, it is worth noting that Yandex takes into account such an important directive as "Host", and the correct robots.txt for Yandex should include this directive to indicate the main site mirror. The "Host" directive will be discussed in more detail below.

Disable indexing: robots.txt Disallow

Disallow is a prohibiting directive and the one most often used in robots.txt. It prohibits indexing of the site or a part of it, depending on the path specified in the Disallow directive's parameter.

An example of how to disable site indexing in robots.txt:

User-agent: *
Disallow: /

This example closes the entire site from indexing for all robots.

The special characters * and $ can be used in the parameter of the Disallow directive:

* - any number of any characters; for example, the parameter /page* matches /page, /page1, /page-be-cool, /page/kak-skazat, etc. However, there is no need to put * at the end of every parameter, since, for example, the following directives are interpreted in the same way:

User-agent: Yandex
Disallow: /page

User-agent: Yandex
Disallow: /page*

$ - indicates the exact match of the exception to the parameter value:

User-agent: Googlebot
Disallow: /page$

In this case, the Disallow directive will disallow /page, but will not disallow /page1, /page-be-cool, or /page/kak-skazat from being indexed.

If you close the site from indexing in robots.txt, search engines may respond with messages like "Blocked in robots.txt file" or "URL restricted by robots.txt" (URL prohibited by the robots.txt file). If you need to disable indexing of a page, you can use not only robots.txt but also the corresponding HTML meta tags:

  • <meta name="robots" content="noindex"/> - do not index the content of the page;
  • <meta name="robots" content="nofollow"/> - do not follow links on the page;
  • <meta name="robots" content="none"/> - it is forbidden to index content and follow links on the page;
  • <meta name="robots" content="noindex, nofollow"/> - similar to content="none".
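For example, a minimal sketch of a page closed from both indexing and link following via the robots meta tag:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="robots" content="noindex, nofollow">
    <title>A page closed from indexing</title>
</head>
<body>
    ...
</body>
</html>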

Allow indexing: robots.txt Allow

Allow is a permissive directive and the opposite of the Disallow directive. Its syntax is similar to that of Disallow.

An example of how to disable site indexing in robots.txt except for some pages:

User-agent: *
Disallow: /
Allow: /page

It is forbidden to index the entire site, except for pages starting with /page.

Disallow and Allow with an empty parameter value

An empty Disallow directive:

User-agent: *
Disallow:

It prohibits nothing, that is, it allows indexing of the entire site, and is equivalent to:

User-agent: *
Allow: /

Empty directive Allow:

User-agent: *
Allow:

It allows nothing, that is, it completely prohibits site indexing, and is equivalent to:

User-agent: *
Disallow: /

Main site mirror: robots.txt Host

The Host directive is used to indicate to the Yandex robot the main mirror of your site. Of all the popular search engines, the Host directive is recognized only by Yandex robots. The Host directive is useful if your site is available at several addresses, for example:

mysite.ru
mysite.com

Or to prioritize between:

mysite.ru
www.mysite.ru

You can tell the Yandex robot which mirror is the main one. The Host directive is specified in the "User-agent: Yandex" directive block and as a parameter, the preferred site address without "http://" is indicated.

An example of robots.txt indicating the main mirror:

User-agent: Yandex
Disallow: /page
Host: mysite.ru

The domain name mysite.ru without www is indicated as the main mirror. Thus, this type of address will be indicated in the search results.

User-agent: Yandex
Disallow: /page
Host: www.mysite.ru

The domain name www.mysite.ru is indicated as the main mirror.

The Host directive can be used in the robots.txt file only once; if Host is specified more than once, only the first occurrence is taken into account and the other Host directives are ignored.

If you want to specify the main mirror for Googlebot, use the Google Webmaster Tools service.

Sitemap: robots.txt sitemap

Using the Sitemap directive, you can specify the location of the sitemap.xml file in robots.txt.

Robots.txt example with sitemap address:

User-agent: *
Disallow: /page
Sitemap: http://www.mysite.ru/sitemap.xml

Specifying the address of the site map through sitemap directive in robots.txt allows the search robot to find out about the presence of a sitemap and start indexing it.

Clean-param Directive

The Clean-param directive allows you to exclude pages with dynamic parameters from indexing. Such pages can serve the same content under different URLs - simply put, the page is available at several addresses. Our task is to get rid of all the unnecessary dynamic addresses, of which there may be millions. To do this, we exclude all dynamic parameters using the Clean-param directive in robots.txt.

Syntax of the Clean-param directive:

Clean-param: parm1[&parm2&parm3&parm4&..&parmn] [Path]

Consider the example of a page with the following URL:

www.mysite.ru/page.html?&parm1=1&parm2=2&parm3=3

Example robots.txt Clean-param:

Clean-param: parm1&parm2&parm3 /page.html # page.html only

Clean-param: parm1&parm2&parm3 / # for all

Crawl-delay directive

This instruction allows you to reduce the load on the server if robots visit your site too often. This directive is relevant mainly for sites with a large volume of pages.

Example robots.txt Crawl-delay:

User-agent: Yandex
Disallow: /page
Crawl-delay: 3

In this case, we "ask" Yandex robots to download the pages of our site no more than once every three seconds. Some search engines support decimal format as a parameter Crawl-delay robots.txt directives.

Sometimes it is necessary that the pages of the site or the links placed on them do not appear in the search results. You can hide site content from indexing using the robots.txt file, HTML markup, or authorization on the site.

Prohibition of indexing a site, section or page

If some pages or sections of the site should not be indexed (for example, with proprietary or confidential information), restrict access to them in the following ways:

    Prohibit indexing of the pages with a Disallow directive in the robots.txt file.

    Add the robots meta tag with noindex to the HTML code of the pages.

    Use authorization on the site. We recommend this method to hide the main page of the site from indexing. If the home page is disallowed in the robots.txt file or via the noindex meta tag but other pages link to it, it may still appear in search results.

Prohibition of page content indexing

Hide part of the page text from indexing

In the HTML code of the page, add the noindex element. For example:

<noindex>text that should not be indexed</noindex>

The element is not sensitive to nesting - it can be located anywhere in the HTML code of the page. If you need to make the site code valid, you can use the tag in the following format:

<!--noindex-->text that should not be indexed<!--/noindex-->

Hide a link on a page from indexing

In the HTML code of the page, add the rel="nofollow" attribute to the a element. For example:

<a href="http://example.com/page" rel="nofollow">link text</a>

The attribute works similar to the nofollow directive in the robots meta tag, but only applies to the link for which it is specified.

Robots.txt for WordPress is one of the main tools for setting up indexing. Earlier we talked about speeding up and improving the process of indexing articles. Moreover, we approached that issue as if the search robot knew nothing and could do nothing on its own, so we had to tell it everything. For that we used a sitemap file.

Perhaps you still do not know how the search robot indexes your site? By default, everything is allowed to be indexed. But the robot does not do it right away. Having received a signal that a site needs to be visited, the robot puts it in a queue, so indexing does not happen instantly at our request, but after some time. Once it is your site's turn, the spider robot is right there. First of all, it looks for the robots.txt file.

If robots.txt is found, the robot reads all the directives and at the end sees the address of the sitemap file. Then, following the sitemap, it goes through all the materials submitted for indexing. It does this within a limited period of time. That is why, if you have created a site with several thousand pages and published it in its entirety, the robot simply will not have time to go through all the pages in one visit, and only those it managed to view will get into the index. The robot wanders all over the site and spends its time on it, and it is far from certain that it will first view exactly those pages you are waiting to see in the search results.

If the robot does not find the robots.txt file, it assumes that everything is allowed to be indexed and starts rummaging through every back street. Having made a complete copy of everything it could find, it leaves your site until the next time. As you understand, after such a crawl, everything that is needed and everything that is not needed gets into the search engine's index. What needs to be indexed is your articles, pages, pictures, videos, etc. And what should not be indexed?

For WordPress, this turns out to be a very important question. The answer affects both how quickly your site's content gets indexed and its security. The fact is that none of the service information needs to be indexed, and it is generally desirable to hide WordPress system files from prying eyes. This reduces the chance of your site being hacked.

WordPress creates a lot of copies of your articles with different URLs but the same content. It looks like this:

//site_name/article_name,

//site_name/category_name/article_name,

//site_name/heading_name/subheading_name/article_name,

//site_name/tag_name/article_name,

//site_name/archive_creation_date/article_name

With tags and archives it is a real disaster. As many tags as an article is attached to, that many copies are created. When an article is edited, archives for different dates appear, and with them new addresses with almost identical content. And there are also copies of articles with an address for each comment. It is just plain awful.

Search engines treat a huge number of duplicates as a sign of a bad site. If all these copies are indexed and served in search, the weight of the main article is spread across all the copies, which is very bad. And it is not at all certain that the article at the main address will be the one shown in the search results. Hence, indexing of all the copies must be prohibited.

WordPress also turns images into separate attachment pages without text. In this form, without text or description, they look like completely broken articles. Therefore, you need to take measures to prevent these addresses from being indexed by search engines.

Why shouldn't it be indexed?

Five reasons to ban indexing!

  1. Full indexing puts extra load on your server.
  2. It takes precious time of the robot itself.
  3. Perhaps this is the most important thing, incorrect information can be misinterpreted by search engines. This will lead to incorrect ranking of articles and pages, and subsequently to incorrect results in the search results.
  4. Folders with templates and plugins contain a huge number of links to the sites of creators and advertisers. This is very bad for a young site, when there are no or very few links to your site from outside yet.
  5. By indexing all the copies of your articles in archives and comments, the search engine forms a bad opinion of your site: lots of duplicates, lots of outbound links. The search engine will downgrade your site in the search results, up to and including filtering it out. And images formatted as separate articles with a title and no text terrify the robot. If there are a lot of them, the site may fall under the Yandex AGS filter. My site was there. Checked!

Now, after all that has been said, a reasonable question arises: "Is it possible to somehow prohibit indexing of what is not needed?" It turns out you can - if not by order, then at least by recommendation. The situation where indexing of some objects is not completely prohibited arises because of the sitemap.xml file, which is processed after robots.txt: it turns out that robots.txt prohibits, while sitemap.xml allows. And yet we can solve this problem. Let's look at how to do it right now.

The WordPress robots.txt file is dynamic by default and does not physically exist in WordPress. It is generated only at the moment someone requests it, be it a robot or just a visitor. That is, if you connect to the site via FTP, you simply will not find a robots.txt file in the root folder of a WordPress site. But if you type its address http://your_site_name/robots.txt in the browser, its contents will be displayed on the screen as if the file existed. The content of this generated WordPress robots.txt file will be:
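(The exact output depends on the WordPress version and its settings; in recent versions the virtual file looks roughly like this.)

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php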

By the rules for compiling a robots.txt file, everything is allowed to be indexed by default. The User-agent: * directive indicates that all subsequent commands apply to all search agents (*). But beyond that, essentially nothing is restricted. And as you know, this is not enough. We have already discussed folders and records with restricted access quite a lot.

In order to be able to make changes to the robots.txt file and save them there, you need to create it in a static, permanent form.

How to create robots.txt for wordpress

In any text editor (just never use MS Word and the like, with their automatic text formatting) create a text file with approximately the content shown below in the section "Correct robots.txt for wordpress" and upload it to the root folder of your site. Changes can be made to it as needed.

You just need to take into account the features of compiling the file:

The line numbers shown in the listing in this article must not appear at the beginning of the lines in the actual file; they are given only for convenience when discussing its contents. There must be no extra characters at the end of any line, including spaces or tabs. Between blocks there must be an empty line without any characters, including spaces. A single stray space can do you great harm - BE CAREFUL.

How to check robots.txt for wordpress

You can check robots.txt for extra spaces in the following way: in a text editor, select all the text by pressing Ctrl+A. If there are no spaces at the ends of lines or on empty lines, you will see it at once. If some selected whitespace shows up, remove the spaces and everything will be OK.

You can check if the prescribed rules are working correctly at the following links:

  • Robots.txt parsing in Yandex.Webmaster
  • Robots.txt parsing in Google Search Console
  • Service for creating a robots.txt file: http://pr-cy.ru/robots/
  • Service for creating and checking robots.txt: https://seolib.ru/tools/generate/robots/
  • Documentation from Yandex
  • Documentation from Google (in English)

There is another way to check the robots.txt file of a WordPress site: paste its content into Yandex.Webmaster or specify the address where it is located. If there are any errors, you will know immediately.

Correct robots.txt for wordpress

Now let's get right to the content of the robots.txt file for a WordPress site and what directives must be present in it. The approximate content of robots.txt for WordPress, given its peculiarities, is shown below:

1 User-agent: *
2 Disallow: /wp-login.php
3 Disallow: /wp-admin
4 Disallow: /wp-includes
5 Disallow: /wp-content/plugins
6 Disallow: /wp-content/themes
7 Disallow: */*comments
8 Disallow: */*category
9 Disallow: */*tag
10 Disallow: */trackback
11 Disallow: */*feed
12 Disallow: /*?*
13 Disallow: /?s=
14 Allow: /wp-admin/admin-ajax.php
15 Allow: /wp-content/uploads/
16 Allow: /*?replytocom
17
18 User-agent: Yandex
19 Disallow: /wp-login.php
20 Disallow: /wp-admin
21 Disallow: /wp-includes
22 Disallow: /wp-content/plugins
23 Disallow: /wp-content/themes
24 Disallow: */comments
25 Disallow: */*category
26 Disallow: */*tag
27 Disallow: */trackback
28 Disallow: */*feed
29 Disallow: /*?*
30 Disallow: /*?s=
31 Allow: /wp-admin/admin-ajax.php
32 Allow: /wp-content/uploads/
33 Allow: /*?replytocom
34 Crawl-delay: 2.0
35
36 Host: site.ru
37
38 Sitemap: http://site.ru/sitemap.xml

Wordpress robots.txt directives

Now let's take a closer look:

Lines 1 - 16: the settings block for all robots.

User-agent: - This is a required directive that defines the search agent. The asterisk says that the directive is for robots of all search engines. If the block is intended for a specific robot, then you must specify its name, for example, Yandex, as in line 18.

By default, everything is allowed for indexing. This is equivalent to the Allow: / directive.

Therefore, to prohibit indexing of specific folders or files, a special Disallow: directive is used.

In our example, folder names and file name masks are used to ban all WordPress service folders, such as admin, themes, plugins, comments, category, tag... If you specify the directive as Disallow: /, the entire site will be banned from indexing.

Allow: - as I said, the directive allows indexing folders or files. It should be used when there are files deep in the forbidden folders that still need to be indexed.

In my example, line 3 Disallow: /wp-admin - prohibits indexing of the /wp-admin folder, and line 14 Allow: /wp-admin/admin-ajax.php - allows indexing of the /admin-ajax.php file located in the forbidden indexing folder /wp-admin/.

17 - Empty line (just pressing the Enter button without spaces)

Lines 18 - 33: a settings block specifically for the Yandex agent (User-agent: Yandex). As you have noticed, this block repeats almost all the commands of the previous block. So a reasonable question arises: why bother with such duplication? It is done solely because of a few directives that we will consider next.

34 - Crawl-delay - Optional directive for Yandex only. It is used when the server is heavily loaded and does not have time to process robot requests. It allows you to set the search robot the minimum delay (in seconds and tenths of a second) between the end of loading one page and the start of loading the next. The maximum allowed value is 2.0 seconds. It is added directly after the Disallow and Allow directives.

35 - Empty string

36 - Host: site.ru - domain name of your site (MANDATORY directive for the Yandex block). If our site uses the HTTPS protocol, then the address must be specified in full as shown below:

Host: https://site.ru

37 - An empty string (just pressing the Enter button without spaces) must be present.

38 - Sitemap: http://site.ru/sitemap.xml - sitemap.xml file(s) location address (MANDATORY directive), located at the end of the file after an empty line and applies to all blocks.

Masks for robots.txt file directives for wordpress

Now a little how to create masks:

  1. Disallow: /wp-register.php - Disable indexing of the wp-register.php file located in the root folder.
  2. Disallow: /wp-admin - prohibits indexing the contents of the wp-admin folder located in the root folder.
  3. Disallow: /trackback - disables indexing of notifications.
  4. Disallow: /wp-content/plugins - prohibits indexing the contents of the plugins folder located in a subfolder (second level folder) of wp-content.
  5. Disallow: /feed - prohibits the indexing of the feed i.e. closes the site's RSS feed.
  6. * - means any sequence of characters, therefore it can replace both one character and part of the name or the entire name of a file or folder. The absence of a specific name at the end is tantamount to writing *.
  7. Disallow: */*comments - prohibits indexing the contents of folders and files whose names contain "comments", in any folder. In this case, it prevents comments from being indexed.
  8. Disallow: *?s= - prohibits indexing of search pages.

The listing above can be used as a working robots.txt file for WordPress. You only need to enter your site's address in lines 36 and 38 and BE SURE TO REMOVE the line numbers. The result is a working robots.txt file for WordPress, adapted to any search engine.

The only restriction is that the size of a working robots.txt file for a WordPress site must not exceed 32 KB.

If you are absolutely not interested in Yandex, then you will not need lines 18-35 at all. That's probably all. I hope that the article was useful. If you have any questions write in the comments.

ROBOTS.TXT - the robots exclusion standard - is a text file in .txt format that restricts robots' access to the content of a site. The file must be located in the site root (at /robots.txt). Use of the standard is optional, but search engines follow the rules contained in robots.txt. The file itself consists of a set of records of the form

field: value

where field is the name of the rule (User-Agent, Disallow, Allow, etc.) and value is its parameter.

Records are separated by one or more empty lines (line terminator: characters CR, CR+LF, LF)

How to set up ROBOTS.TXT correctly?

This section covers the basic requirements for setting up the file, specific configuration recommendations, and examples for popular CMSs.

  • The file size must not exceed 32 KB.
  • The encoding must be ASCII or UTF-8.
  • A valid robots.txt file must contain at least one rule consisting of several directives. Each rule must contain the following directives:
    • which robot this rule is for (User-agent directive)
    • which resources this agent has access to (Allow directive), or which resources it does not have access to (Disallow).
  • Each rule and directive must start on a new line.
  • The value of the Disallow/Allow rule must begin with either a / or *.
  • All lines starting with the # symbol, or parts of lines starting with this symbol, are considered comments and are not taken into account by agents.

Thus, the minimum content of a properly configured robots.txt file looks like this:

User-agent: * # for all agents
Disallow: # nothing is disallowed = access to all files is allowed

How to create/modify ROBOTS.TXT?

You can create a file using any text editor (for example, notepad++). To create or modify a robots.txt file, access to the server via FTP/SSH is usually required, however, many CMS/CMFs have a built-in file content management interface through the administration panel (“admin panel”), for example: Bitrix, ShopScript and others.

What is the ROBOTS.TXT file for on the site?

As you can see from the definition, robots.txt allows you to control the behavior of robots when they visit the site, i.e. to set up how search engines index the site, which makes this file an important part of your site's SEO. The most important feature of robots.txt is the ability to ban indexing of pages/files that contain no useful information - or even of the entire site, which may be necessary, for example, for test versions of the site.

The main examples of what needs to be closed from indexing will be discussed below.

What needs to be closed from indexing?

Firstly, you should always disable site indexing during development, so that pages that will not exist on the finished site at all, and pages with missing/duplicate/test content, do not get into the index before they are filled in.

Secondly, copies of the site created as test sites for development should be hidden from indexing.

Thirdly, we will analyze what content directly on the site should be prohibited from indexing.

  1. Administrative part of the site, service files.
  2. User authorization / registration pages, in most cases - personal sections of users (if public access to personal pages is not provided).
  3. Cart and checkout pages, order review.
  4. Product comparison pages, it is possible to selectively open such pages for indexing, provided they are unique. In general, comparison tables are countless pages with duplicate content.
  5. Search and filter pages can be left open for indexing only if they are set up correctly: individual URLs with unique, filled-in titles and meta tags. In most cases such pages should be closed.
  6. Pages with product/record sorting, if they have different addresses.
  7. Pages with utm- and openstat- tags in the URL (as well as any other tracking tags); see the sketch after this list.
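For example, a sketch that closes such tagged URLs for all robots (Yandex additionally offers the Clean-param directive for this, covered below):

User-agent: *
Disallow: *utm=
Disallow: *openstat=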

ROBOTS.TXT syntax

Now let's dwell on the syntax of robots.txt in more detail.

General provisions:

  • each directive must start on a new line;
  • the string must not start with a space;
  • the value of the directive must be on one line;
  • no need to enclose directive values in quotation marks;
  • by default, * is written at the end of all directive values. Example:
    User-agent: Yandex
    Disallow: /cgi-bin* # blocks access to pages
    Disallow: /cgi-bin # the same
  • an empty newline is treated as the end of the User-agent rule;
  • only one value is specified in the "Allow", "Disallow" directives;
  • the name of the robots.txt file does not allow uppercase letters;
  • robots.txt larger than 32 KB is not allowed, robots will not download such a file and will consider the site to be completely allowed;
  • inaccessible robots.txt may be treated as fully permissive;
  • an empty robots.txt is considered fully permissive;
  • to specify Cyrillic values in the rules, use Punycode (see the example after this list);
  • only UTF-8 and ASCII encodings are allowed: the use of any national alphabets and other characters in robots.txt is not allowed.
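For instance (a hypothetical Cyrillic domain such as пример.рф and a Cyrillic path such as /корзина), the domain is written in Punycode and the path in URL-encoded form:

Sitemap: http://xn--e1afmkfd.xn--p1ai/sitemap.xml
Disallow: /%D0%BA%D0%BE%D1%80%D0%B7%D0%B8%D0%BD%D0%B0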

Special symbols:

  • #

    The comment start character, all text after # and before a line feed is considered a comment and is not used by robots.

  • *

    A wildcard value denoting the prefix, suffix, or full value of the directive - any set of characters (including the empty one).

  • $

    Indicates the end of the line; it prohibits appending * to the value. Example:

    User-agent: * # for all
    Allow: /$ # allow indexing of the main page
    Disallow: * # prohibit indexing of all pages except the allowed one

List of directives

  1. User-agent

    Mandatory directive. Determines which robot the rule applies to; a rule may contain one or more such directives. You can use the * character to denote a prefix, suffix, or the full robot name. Example:

    #site closed for Google.News and Google.Images
    User-agent: Googlebot-Image
    User-agent: Googlebot-News
    Disallow: /

    #for all robots whose name starts with Yandex, close the "News" section
    User-agent: Yandex*
    Disallow: /news

    #open to everyone else
    User-agent: *
    Disallow:

  2. Disallow

    The directive specifies which files or directories should not be indexed. The value of the directive must begin with the character / or *. By default, * is appended to the end of the value, unless it is prohibited by the $ symbol.

  3. Allow

    Each rule must have at least one Disallow: or Allow: directive.

    The directive specifies which files or directories should be indexed. The value of the directive must begin with the character / or *. By default, * is appended to the end of the value, unless it is prohibited by the $ symbol.

    The use of the directive is relevant only in conjunction with Disallow to allow indexing of some subset of pages prohibited from indexing by the Disallow directive.

  4. Clean-param

    Optional, cross-sectional directive. Use the Clean-param directive if site page addresses contain GET parameters (shown after the ? sign in the URL) that do not affect their content (for example, UTM). With the help of this rule, all addresses will be brought to a single form - the original one, without parameters.

    Directive syntax:

    Clean-param: p0[&p1&p2&..&pn] [path]

    p0… - names of parameters that do not need to be taken into account
    path - path prefix of pages for which the rule applies


    Example.

    The site has pages like

    www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
    www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
    www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

    When specifying a rule

    User-agent: Yandex
    Disallow:
    Clean-param: ref /some_dir/get_book.pl

    the robot will reduce all page addresses to one:

    www.example.com/some_dir/get_book.pl?book_id=123

  5. Sitemap

    Optional directive, it is possible to place several such directives in one file, cross-sectional (it is enough to specify it once in the file, without duplicating for each agent).

    Example:

    Sitemap: https://example.com/sitemap.xml

  6. Crawl-delay

    The directive allows you to set for the search robot the minimum period of time (in seconds) between the end of loading one page and the start of loading the next. Fractional values are supported.

    The maximum value taken into account by Yandex robots is 2.0.

    Google robots do not respect this directive.

    Example:

    User-agent: Yandex
    Crawl-delay: 2.0 # sets the timeout to 2 seconds

    User-agent: *
    Crawl-delay: 1.5 # sets the timeout to 1.5 seconds

  7. Host

    The directive specifies the main mirror of the site. At the moment, of the popular search engines it is supported only by Mail.ru.

    Example:

    User-agent: Mail.Ru
    Host: www.site.ru # main mirror with www

Examples of robots.txt for popular CMS

ROBOTS.TXT for 1C:Bitrix

Bitrix CMS provides the ability to manage the contents of the robots.txt file. To do this, in the administrative interface, you need to go to the “Robots.txt settings” tool using the search, or along the path Marketing->Search engine optimization->Robots.txt settings. You can also change the contents of robots.txt through the built-in Bitrix file editor, or via FTP.

The example below can be used as a starter set of robots.txt for sites on Bitrix, but is not universal and requires adaptation depending on the site.

Explanations:

  1. the split into rules for different agents is due to the fact that Google does not support the Clean-param directive.
User-Agent: Yandex Disallow: */index.php Disallow: /bitrix/ Disallow: /*filter Disallow: /*order Disallow: /*show_include_exec_time= Disallow: /*show_page_exec_time= Disallow: /*show_sql_stat= Disallow: /*bitrix_include_areas = Disallow: /*clear_cache= Disallow: /*clear_cache_session= Disallow: /*ADD_TO_COMPARE_LIST Disallow: /*ORDER_BY Disallow: /*?print= Disallow: /*&print= Disallow: /*print_course= Disallow: /*?action= Disallow : /*&action= Disallow: /*register= Disallow: /*forgot_password= Disallow: /*change_password= Disallow: /*login= Disallow: /*logout= Disallow: /*auth= Disallow: /*backurl= Disallow: / *back_url= Disallow: /*BACKURL= Disallow: /*BACK_URL= Disallow: /*back_url_admin= Disallow: /*?utm_source= Disallow: /*?bxajaxid= Disallow: /*&bxajaxid= Disallow: /*?view_result= Disallow: /*&view_result= Disallow: /*?PAGEN*& Disallow: /*&PAGEN Allow: */?PAGEN* Allow: /bitrix/components/*/ Allow: /bitrix/cache/*/ Allow: /bitrix/js/* / Allow: /bitrix/templates/*/ Allow: /bitrix/panel/ */ Allow: /bitrix/components/*/*/ Allow: /bitrix/cache/*/*/ Allow: /bitrix/js/*/*/ Allow: /bitrix/templates/*/*/ Allow: /bitrix /panel/*/*/ Allow: /bitrix/components/ Allow: /bitrix/cache/ Allow: /bitrix/js/ Allow: /bitrix/templates/ Allow: /bitrix/panel/ Clean-Param: PAGEN_1 / Clean- Param: PAGEN_2 / #if there are more paginated components on the site, then duplicate the rule for all variants, changing the number Clean-Param: sort Clean-Param: utm_source&utm_medium&utm_campaign Clean-Param: openstat User-Agent: * Disallow: */index.php Disallow : /bitrix/ Disallow: /*filter Disallow: /*sort Disallow: /*order Disallow: /*show_include_exec_time= Disallow: /*show_page_exec_time= Disallow: /*show_sql_stat= Disallow: /*bitrix_include_areas= Disallow: /*clear_cache= Disallow : /*clear_cache_session= Disallow: /*ADD_TO_COMPARE_LIST Disallow: /*ORDER_BY Disallow: /*?print= Disallow: /*&print= Disallow: /*print_course= Disallow: /*?action= Disallow: /*&action= Disallow: / *register= Disallow: /*forgot_password= Disallow: /*change_password= Disallow: /*login= Disallow: /*logout= Disallow: /*auth= Disallow: /*backurl= Disallow: /*back_url= Disallow: /*BACKURL= Disallow: /*BACK_URL= Disallow: /*back_url_admin= Disallow: /*?utm_source= Disallow: /*?bxajaxid= Disallow: /*&bxajaxid= Disallow: /*?view_result= Disallow: /*&view_result= Disallow: /*utm_ Disallow: /*openstat= Disallow: /*?PAGEN*& Disallow: /*&PAGEN Allow: */?PAGEN* Allow: /bitrix/components/*/ Allow: /bitrix/cache/*/ Allow: /bitrix/js/*/ Allow: /bitrix/ templates/*/ Allow: /bitrix/panel/*/ Allow: /bitrix/components/*/*/ Allow: /bitrix/cache/*/*/ Allow: /bitrix/js/*/*/ Allow: /bitrix /templates/*/*/ Allow: /bitrix/panel/*/*/ Allow: /bitrix/components/ Allow: /bitrix/cache/ Allow: /bitrix/js/ Allow: /bitrix/templates/ Allow: /bitrix /panel/ Sitemap: http://site.com/sitemap.xml #replace with the address of your sitemap

ROBOTS.TXT for WordPress

There is no built-in tool for setting up robots.txt in the WordPress admin panel, so access to the file is possible only via FTP, or after installing a special plugin (for example, DL Robots.txt).

The example below can be used as a robots.txt starter kit for Wordpress sites, but is not universal and needs to be adapted depending on the site.


Explanations:

  1. the Allow directives contain the paths to the files of styles, scripts, pictures: for the correct indexing of the site, it is necessary that they be available to robots;
  2. for most sites, the author and tag archive pages only create duplicate content and do not create useful content, so in this example they are closed for indexing. If in your project such pages are necessary, useful and unique, then you should remove the Disallow: /tag/ and Disallow: /author/ directives.

An example of a correct ROBOTS.TXT for a site on WordPress:

User-agent: Yandex # For Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Disallow: /readme.html
Disallow: *?replytocom
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

User-agent: *
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *?utm
Disallow: *openstat=
Disallow: /tag/
Disallow: /readme.html
Disallow: *?replytocom
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif

Sitemap: http://site.com/sitemap.xml # replace with the address of your sitemap

ROBOTS.TXT for OpenCart

There is no built-in tool for configuring robots.txt in the “admin panel” of OpenCart, so the file can only be accessed using FTP.

The example below can be used as a robots.txt starter for OpenCart sites, but is not universal and needs to be adapted depending on the site.


Explanations:

  1. the Allow directives contain the paths to the files of styles, scripts, pictures: for the correct indexing of the site, it is necessary that they be available to robots;
  2. splitting into rules for different agents is due to the fact that Google does not support the Clean-param directive;
User-agent: * Disallow: /*route=account/ Disallow: /*route=affiliate/ Disallow: /*route=checkout/ Disallow: /*route=product/search Disallow: /index.php?route=product/product *&manufacturer_id= Disallow: /admin Disallow: /catalog Disallow: /system Disallow: /*?sort= Disallow: /*&sort= Disallow: /*?order= Disallow: /*&order= Disallow: /*?limit= Disallow: /*&limit= Disallow: /*?filter_name= Disallow: /*&filter_name= Disallow: /*?filter_sub_category= Disallow: /*&filter_sub_category= Disallow: /*?filter_description= Disallow: /*&filter_description= Disallow: /*?tracking= Disallow: /*&tracking= Disallow: /*compare-products Disallow: /*search Disallow: /*cart Disallow: /*checkout Disallow: /*login Disallow: /*logout Disallow: /*vouchers Disallow: /*wishlist Disallow: /*my-account Disallow: /*order-history Disallow: /*newsletter Disallow: /*return-add Disallow: /*forgot-password Disallow: /*downloads Disallow: /*returns Disallow: /*transactions Disallow: /* create-account Disallow: /*recurring Disallow: /*address-book Disallow: /*reward-points Disallow: /*affiliate-forgot-password Disallow: /*create-affiliate-account Disallow: /*affiliate-login Disallow: /*affiliates Disallow: /*?filter_tag = Disallow: /*brands Disallow: /*specials Disallow: /*simpleregister Disallow: /*simplecheckout Disallow: *utm= Disallow: /*&page Disallow: /*?page*& Allow: /*?page Allow: /catalog/ view/javascript/ Allow: /catalog/view/theme/*/ User-agent: Yandex Disallow: /*route=account/ Disallow: /*route=affiliate/ Disallow: /*route=checkout/ Disallow: /*route= product/search Disallow: /index.php?route=product/product*&manufacturer_id= Disallow: /admin Disallow: /catalog Disallow: /system Disallow: /*?sort= Disallow: /*&sort= Disallow: /*?order= Disallow: /*&order= Disallow: /*?limit= Disallow: /*&limit= Disallow: /*?filter_name= Disallow: /*&filter_name= Disallow: /*?filter_sub_category= Disallow: /*&filter_sub_category= Disallow: /*? filter_description= Disallow: /*&filter_description= Disallow: /*compa re-products Disallow: /*search Disallow: /*cart Disallow: /*checkout Disallow: /*login Disallow: /*logout Disallow: /*vouchers Disallow: /*wishlist Disallow: /*my-account Disallow: /*order -history Disallow: /*newsletter Disallow: /*return-add Disallow: /*forgot-password Disallow: /*downloads Disallow: /*returns Disallow: /*transactions Disallow: /*create-account Disallow: /*recurring Disallow: /*address-book Disallow: /*reward-points Disallow: /*affiliate-forgot-password Disallow: /*create-affiliate-account Disallow: /*affiliate-login Disallow: /*affiliates Disallow: /*?filter_tag= Disallow : /*brands Disallow: /*specials Disallow: /*simpleregister Disallow: /*simplecheckout Disallow: /*&page Disallow: /*?page*& Allow: /*?page Allow: /catalog/view/javascript/ Allow: / catalog/view/theme/*/ Clean-Param: page / Clean-Param: utm_source&utm_medium&utm_campaign / Sitemap: http://site.com/sitemap.xml #replace with your sitemap address

ROBOTS.TXT for Joomla!

There is no built-in tool for setting up robots.txt in the Joomla admin panel, so the file can only be accessed using FTP.

The example below can be used as a robots.txt starter for Joomla sites with SEF enabled, but is not universal and needs to be adapted depending on the site.


Explanations:

  1. the Allow directives contain the paths to the files of styles, scripts, pictures: for the correct indexing of the site, it is necessary that they be available to robots;
  2. splitting into rules for different agents is due to the fact that Google does not support the Clean-param directive;
User-agent: Yandex Disallow: /*% Disallow: /administrator/ Disallow: /bin/ Disallow: /cache/ Disallow: /cli/ Disallow: /components/ Disallow: /includes/ Disallow: /installation/ Disallow: /language/ Disallow: /layouts/ Disallow: /libraries/ Disallow: /logs/ Disallow: /log/ Disallow: /tmp/ Disallow: /xmlrpc/ Disallow: /plugins/ Disallow: /modules/ Disallow: /component/ Disallow: /search* Disallow: /*mailto/ Allow: /*.css?*$ Allow: /*.less?*$ Allow: /*.js?*$ Allow: /*.jpg?*$ Allow: /*.png?* $ Allow: /*.gif?*$ Allow: /templates/*.css Allow: /templates/*.less Allow: /templates/*.js Allow: /components/*.css Allow: /components/*.less Allow: /media/*.js Allow: /media/*.css Allow: /media/*.less Allow: /index.php?*view=sitemap* #open sitemap Clean-param: searchword / Clean-param: limit&limitstart / Clean-param: keyword / User-agent: * Disallow: /*% Disallow: /administrator/ Disallow: /bin/ Disallow: /cache/ Disallow: /cli/ Disallow: /components/ Disallow: /includes/ Disallow: /installat ion/ Disallow: /language/ Disallow: /layouts/ Disallow: /libraries/ Disallow: /logs/ Disallow: /log/ Disallow: /tmp/ Disallow: /xmlrpc/ Disallow: /plugins/ Disallow: /modules/ Disallow: / component/ Disallow: /search* Disallow: /*mailto/ Disallow: /*searchword Disallow: /*keyword Allow: /*.css?*$ Allow: /*.less?*$ Allow: /*.js?*$ Allow: /*.jpg?*$ Allow: /*.png?*$ Allow: /*.gif?*$ Allow: /templates/*.css Allow: /templates/*.less Allow: /templates/*. js Allow: /components/*.css Allow: /components/*.less Allow: /media/*.js Allow: /media/*.css Allow: /media/*.less Allow: /index.php?*view =sitemap* #open sitemap Sitemap: http://your_sitemap_address

List of main agents

Bot - Function
Googlebot - Google's main indexing robot
Googlebot-News - Google News
Googlebot-Image - Google Images
Googlebot-Video - Google Video
Mediapartners-Google - Google AdSense, Google Mobile AdSense
AdsBot-Google - landing page quality check
AdsBot-Google-Mobile-Apps - Google robot for apps
YandexBot - Yandex's main indexing robot
YandexImages - Yandex.Images
YandexVideo - Yandex.Video
YandexMedia - multimedia data
YandexBlogs - blog search robot
YandexAddurl - robot that accesses a page when it is added via the "Add URL" form
YandexFavicons - robot that indexes site icons (favicons)
YandexDirect - Yandex.Direct
YandexMetrika - Yandex.Metrica
YandexCatalog - Yandex.Catalog
YandexNews - Yandex.News
YandexImageResizer - mobile services robot
bingbot - Bing's main indexing robot
Slurp - Yahoo!'s main indexing robot
Mail.Ru - Mail.Ru's main indexing robot

FAQ

The robots.txt text file is public, so be aware that this file should not be used as a means of hiding confidential information.

Are there any differences between robots.txt for Yandex and Google?

There are no fundamental differences in the processing of robots.txt by Yandex and Google search engines, but a number of points should still be highlighted:

  • as mentioned earlier, the rules in robots.txt are advisory in nature, which is actively used by Google.

    In its robots.txt documentation, Google states that the file "is not intended to prevent web pages from appearing in Google search results" and that a page blocked by robots.txt for Googlebot can still end up in Google's results. To exclude pages from Google search, you need to use robots meta tags.

    Yandex also excludes pages from the search, guided by the rules of robots.txt.

  • Yandex, unlike Google, supports the Clean-param and Crawl-delay directives.
  • Google AdsBots do not follow the rules for User-agent: *; they need rules addressed to them separately (see the sketch after this list).
  • Many sources indicate that script and style files (.js, .css) should only be opened for indexing by Google robots. In fact, this is not true and these files should also be opened for Yandex: on November 9, 2015, Yandex began using js and css when indexing sites (official blog post).
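A sketch of such a separate AdsBot rule (the path is hypothetical):

User-agent: AdsBot-Google
Disallow: /checkout/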

How to block a site from indexing in robots.txt?

To close a site in Robots.txt, one of the following rules must be used:

User-agent: *
Disallow: /

User-agent: *
Disallow: *

It is possible to close the site to only one search engine (or several), while leaving the others able to index it. To do this, change the User-agent directive in the rule: replace * with the name of the agent whose access should be denied (see the list of main agents above).
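For example, a sketch that closes the site for Yandex only while leaving it open for everyone else:

User-agent: Yandex
Disallow: /

User-agent: *
Disallow: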

How to open a site for indexing in robots.txt?

In the usual case, to open a site for indexing in robots.txt, you do not need to take any action, you just need to make sure that all necessary directories are open in robots.txt. For example, if your site was previously hidden from indexing, then the following rules should be removed from robots.txt (depending on the one used):

  • Disallow: /
  • Disallow: *

Please note that indexing can be disabled not only using the robots.txt file, but also using the robots meta tag.

It should also be noted that the absence of a robots.txt file in the site root means that site indexing is allowed.

How to specify the main site mirror in robots.txt?

At the moment, specifying the main mirror using robots.txt is not possible. Previously, Yandex used the Host directive, which indicated the main mirror, but since March 20, 2018, Yandex has completely abandoned it. Now the main mirror can only be specified with a page-by-page 301 redirect.


Issues discussed in the material:

  • What role does the robots.txt file play in site indexing
  • How to disable the indexing of the site and its individual pages using robots.txt
  • What robots.txt directives are used for site indexing settings
  • What are the most common mistakes made when creating a robots.txt file

The web resource is ready to go: it is filled with high-quality unique texts and original images, navigation through the sections is convenient, and the design is pleasing to the eye. All that remains is to present your brainchild to Internet users. But search engines should be the first to get acquainted with the portal. This process of getting acquainted is called indexing, and one of the main roles in it is played by the robots text file. For the site to be indexed successfully, robots.txt must meet a number of specific requirements.



The web resource's engine (CMS) is one of the factors that significantly affect the speed of indexing by search spiders. Why is it important to direct crawlers only to the important pages that should appear in the SERPs?

  1. The search engine robot looks through a limited number of files on a particular resource, and then goes to the next site. In the absence of specified restrictions, the search spider can start by indexing engine files, the number of which is sometimes in the thousands - the robot will simply not have time for the main content.
  2. Or it will index completely different pages on which you plan to advance. Even worse, if search engines see the duplication of content they hate so much, when different links lead to the same (or almost identical) text or image.

Therefore, to forbid the search engine spiders to see too much is a necessity. This is what robots.txt is intended for - a plain text file, the name of which is written in lowercase letters without using capital letters. It is created in any text editor (Notepad++, SciTE, VEdit, etc.) and edited here. The file allows you to influence the indexing of the site by Yandex and Google.

For a webmaster who does not yet have sufficient experience, it is better to first study examples of correctly filled files. Select web resources of interest and type site.ru/robots.txt in the browser address bar (where the part before the "/" is the domain of the portal).

It is important to look only at sites running on the engine you are interested in, since the CMS folders that need to be blocked from indexing are named differently in different content management systems. Therefore, the engine becomes the starting point: if your site is powered by WordPress, look for blogs running on the same engine; Joomla! will have its own ideal robots, and so on. It is also advisable to take as samples the files of portals that attract significant traffic from search.

What is site indexing with robots.txt



Search indexing is the most important indicator, on which the success of promotion largely depends. It may seem that the site was created perfectly: user queries are taken into account, the content is on top, navigation is convenient, and yet the site cannot make friends with search engines. The reasons must be sought on the technical side, specifically in the tools with which you can influence indexing.

There are two of them, Sitemap.xml and robots.txt: important files that complement each other and at the same time solve opposite problems. The sitemap invites search spiders ("Welcome, please index all of these sections"), giving the bots the URL of each page to be indexed and the time of its latest update. The robots.txt file, on the other hand, serves as a "stop" sign, preventing spiders from roaming parts of the site without permission.

This file and the similarly named robots meta tag, which allows for finer settings, contain clear instructions for search engine crawlers, indicating prohibitions on indexing certain pages or entire sections.

Correctly set limits will have the best effect on the indexing of the site. Although there are still amateurs who believe that bots can be allowed to study absolutely all files, the number of pages entered into the search engine database does not by itself mean high-quality indexing. Why, for example, would robots need the administrative and technical parts of the site, or print versions of pages (they are convenient for the user, but to search engines they look like duplicate content)? There are a lot of pages and files that bots spend time on essentially for nothing.

When a spider visits your site, it immediately looks for the robots.txt file intended for it. If it does not find the document, or finds it in an incorrect form, the bot starts acting independently, indexing literally everything in a row according to an algorithm known only to it. It will not necessarily start with the new content that you would like to show users first. At best, indexing will simply drag on; at worst, it can also result in penalties for duplicates.

Having a proper robots text file will help you avoid many problems.



There are three ways to prevent indexing of sections or pages of a web resource, from the most granular to the most general:

  • The noindex tag and the nofollow attribute are completely different code elements that serve different purposes, but both are valuable SEO helpers. The question of how search engines process them has become almost philosophical, but the fact remains: noindex allows you to hide part of the text from robots (it is not in the HTML standards, but it definitely works for Yandex), and nofollow prohibits following a link and passing its weight (it is part of the standard and is valid for all search engines).
  • The robots meta tag on a particular page affects that particular page. Below we will take a closer look at how to use it to prohibit indexing and following the links located in the document. The meta tag is completely valid, and systems take into account (or try to take into account) the specified data. Moreover, Google, choosing between robots.txt in the root directory of the site and the page meta tag, gives priority to the latter.
  • robots.txt is completely valid and supported by all search engines and other bots living on the Web. Nevertheless, its directives are not always treated as an order to be executed (see the note above about Google). The indexing rules specified in the file apply to the site as a whole: individual pages, directories, sections.

Using examples, consider a ban on indexing the portal and its parts.



There are many reasons to stop spiders from indexing a website: it is still under development, it is being redesigned or upgraded, or the resource is an experimental platform not intended for users.

A site can be blocked from indexing by robots.txt for all search engines, for an individual robot, or it can be banned for all but one.

2. How to disable indexing of individual pages using robots.txt

If the resource is small, you are unlikely to need to hide pages (what is there to hide on a business-card site), but large portals containing a substantial amount of service information cannot do without prohibitions. You need to close from robots (a sketch follows right after this list):

  • administrative panel;
  • service directories;
  • site search;
  • personal account;
  • registration forms;
  • order forms;
  • comparison of goods;
  • favorites;
  • basket;
  • captcha;
  • pop-ups and banners;
  • session IDs.
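A minimal sketch of such prohibitions (the paths are hypothetical and depend on your CMS; adjust them to your own structure):

User-agent: *
Disallow: /admin/ # administrative panel
Disallow: /search/ # site search
Disallow: /personal/ # personal account
Disallow: /register/ # registration forms
Disallow: /cart/ # basket
Disallow: /compare/ # comparison of goods
Disallow: /*?session_id= # session IDs in URLs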

Irrelevant news and events, calendar entries, expired promotions and special offers are the so-called garbage pages that are best hidden. It is also better to close outdated content on information sites in order to avoid negative ratings from search engines. Try to keep updates regular, and then you won't have to play hide-and-seek with the search engines.

Prohibition of robots for indexing:



In robots.txt, you can specify complete or selective prohibitions on the indexing of folders, files, scripts and utm-tags; the prohibition can be addressed to individual search spiders or to the robots of all systems.
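For instance, a sketch (the folder name and parameters are hypothetical) that closes a folder and utm-tagged links for all robots, and print versions only for Yandex:

User-agent: *
Disallow: /old-promo/ # a whole folder, for all robots
Disallow: /*utm_ # links with utm tags

User-agent: Yandex
Disallow: /*?print= # print versions, only for the Yandex robot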

Prohibition of indexing:

The robots meta tag serves as an alternative to the text file of the same name. It is written in the source code of the web resource and placed in the <head> container of the page. You need to specify who is not allowed to index the page: if the ban is general, the value robots is used; if access is denied to only one crawler, its name is specified (Google - Googlebot, "Yandex" - Yandex).

There are two options for writing a meta tag.
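For example, two common ways to write it (a sketch; both variants here forbid indexing of content and following of links):

<meta name="robots" content="noindex, nofollow"/>
<meta name="robots" content="none"/>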

The "content" attribute can have the following values:

  • none - prohibition of indexing content and links (equivalent to noindex and nofollow);
  • noindex - prohibition of content indexing;
  • nofollow - prohibition on following the links on the page;
  • follow - permission to follow the links;
  • index - permission to index content;
  • all - permission to index content and links.

For different cases, you need to use combinations of values. For example, to disable content indexing while still allowing bots to follow the links: content="noindex, follow".


By closing the website from search engines through meta tags, the owner does not need to create robots.txt at the root.

It must be remembered that in matters of indexing a lot depends on the "politeness" of the spider. If it is "well brought up", the rules prescribed by the webmaster will be respected. But in general, the validity of the robots directives (both the file and the meta tag) does not mean one hundred percent compliance. Even for search engines, not every ban is ironclad, and there is no point even discussing the various kinds of content thieves: they are configured from the start to circumvent all prohibitions.

In addition, not all crawlers are interested in content. For some, only links matter; for others, micro-markup; still others check mirror copies of sites, and so on. Search engine spiders do not crawl around the site like viruses: they remotely request the pages they need, so most often they do not create any problems for resource owners. But if mistakes were made in the design of the crawler, or some non-standard external situation arises, the crawler can put a significant load on the indexed portal.



Commands used:

1. "User-agent:"

The main directive of the robots.txt file. It is used for addressing: the name of the bot for which the following instructions are intended is entered here. For example:

  • User-agent: Googlebot - a directive in this form means that all the following commands concern only the Google indexing robot;
  • User-agent: Yandex - the prescribed permissions and prohibitions are intended for the Yandex robot.

The entry User-agent: * means addressing all other search engines (the special character "*" stands for "any text"). If we take the example above, the asterisk will designate all search engines except Yandex. Google is quite content with the general "any text" designation and does not require a personal address.
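A sketch of how such a file might look, with personal rules for Yandex and everything else (including Google) covered by the asterisk (the blocked paths are hypothetical):

User-agent: Yandex
Disallow: /private/

User-agent: *
Disallow: /tmp/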


2. "Disallow:"

The most common command for disabling indexing. After addressing the robot in "User-agent:", the webmaster indicates that the bot is not allowed to index part of the site or the whole site (in the latter case, the path from the root is specified). The search spider understands this from the way the command is expanded. Let's figure it out too.

User-agent: Yandex

Disallow: /

If there is such an entry in robots.txt, the Yandex search bot understands that it cannot index the web resource at all: there is nothing after the forbidding sign "/" to narrow the ban down.

User-agent: Yandex

Disallow: /wp-admin

In this example there is a clarification: the ban on indexing applies only to the system folder wp-admin (the site is powered by WordPress). The Yandex robot sees the command and does not index the specified folder.

User-agent: Yandex

Disallow: /wp-content/themes

This directive tells the crawler that it may index all the content of "wp-content" except for "themes", and that is exactly what the robot will do.

User-agent: Yandex

Disallow: /index$

Here another important symbol, "$", appears, which allows for flexibility in prohibitions. In this case, the robot understands that it is forbidden to index only addresses that end exactly at "index". A separate file with a similar name, "index.php", can still be indexed, and the robot clearly understands this.

You can also ban the indexing of individual pages of the resource whose URLs contain certain characters. For example:

User-agent: Yandex

Disallow: /*&*

The Yandex robot reads the command this way: do not index any pages whose URLs contain "&" between any other characters.

User-agent: Yandex

Disallow: /*&$

In this case, the robot understands that pages cannot be indexed only if their addresses end with "&".

Why it is impossible to index system files, archives and users' personal data is, we think, clear: this is not a topic for discussion, and there is absolutely no need for a search bot to waste time checking data that no one needs. But when it comes to bans on page indexing, many people ask: what makes prohibitive directives expedient? Experienced developers can give a dozen different reasons for tabooing indexing, but the main one is the need to get rid of duplicate pages in the search. If there are any, it dramatically harms ranking, relevance and other important aspects. Therefore, internal SEO optimization is unthinkable without robots.txt, in which dealing with duplicates is quite simple: you just need to use the "Disallow:" directive and special characters correctly.
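For instance, a sketch of rules that close typical sources of duplicates (the parameter names are hypothetical; check which ones your site actually generates):

User-agent: *
Disallow: /*?sort= # sorting pages duplicate category content
Disallow: /*?print= # print versions duplicate regular pages
Disallow: /*utm_ # links with utm tags lead to the same pages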

3. "Allow:"



The magic robots file allows you not only to hide unnecessary things from search engines, but also to explicitly open parts of the site for indexing. A robots.txt containing the "Allow:" command tells search engine spiders which elements of the web resource must be added to the database. The same clarifications as in the previous command come to the rescue, only now they expand the range of permissions for crawlers.

Let's take one of the examples given in the previous paragraph and see how the situation changes:

User-agent: Yandex

Allow: /wp-admin

If "Disallow:" meant a ban, then now the contents of the wp-admin system folder are open to Yandex quite legitimately and may appear in the search results.

In practice, however, this command is rarely used, and there is a perfectly logical explanation: the absence of a ban set by "Disallow:" lets search spiders treat the entire site as allowed for indexing, so a separate directive is not required for that. If there are prohibitions, the content that does not fall under them is indexed by robots by default.
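Allow becomes genuinely useful when you need to carve an exception out of a broader ban. A sketch (folder names are hypothetical) that closes a folder but keeps its uploads subfolder open:

User-agent: *
Disallow: /wp-content/ # close the folder as a whole
Allow: /wp-content/uploads/ # but keep the uploaded files open

Yandex and Google give priority to the more specific (longer) rule, so the Allow entry wins for everything inside /wp-content/uploads/.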



Two more important commands for search spiders. "Host:" is a directive aimed at the domestic search engine: Yandex is guided by it when determining the main mirror of a web resource, that is, which address (with or without www) will participate in the search.

Consider the example of PR-CY.ru:

User-agent: Yandex

Host: pr-cy.ru

The directive is used to avoid duplication of resource content.

The "Sitemap:" command helps robots find their way to the site map: a special file that describes the hierarchical structure of pages, content types, information about the frequency of updates, and so on. The file, usually called sitemap.xml (on the WordPress engine it may also be compressed as sitemap.xml.gz), serves as a navigator for search spiders, and they need to reach it as quickly as possible. Then indexing speeds up not only for the site map itself, but also for all the other pages, which will soon start appearing in the search results.

Hypothetical example:
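A minimal sketch with a made-up address (point the directive at your actual sitemap file):

Sitemap: http://your_site.ru/sitemap.xml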

Commands that are indicated in the robots text file and are accepted by Yandex:

Directive and what it does:

  • User-agent* - names the search spider for which the rules listed in the file are written.
  • Disallow - indicates a ban for robots on indexing the site, its sections or individual pages.
  • Sitemap - specifies the path to the sitemap hosted on the web resource.
  • Clean-param - tells the search spider that the page URL includes parameters (such as UTM tags) that should not be indexed.
  • Allow - gives permission to index sections and pages of the web resource.
  • Crawl-delay - allows you to delay scanning: it indicates the minimum time (in seconds) for the crawler between page loads; after checking one page, the spider waits the specified amount of time before requesting the next one from the list.

* Required directive.

The Disallow, Sitemap, and Clean-param commands are the most commonly requested. Let's look at an example:

  • User-agent: * # indicates the robots to which the following commands are addressed.
  • Disallow: /bin/ # prevents crawlers from indexing links from the shopping cart.
  • Disallow: /search/ # disallow indexing of search pages on the site.
  • Disallow: /admin/ # disallow indexing of the admin panel.
  • Sitemap: http://example.com/sitemap # indicates the path to the sitemap for the crawler.
  • Clean-param: ref /some_dir/get_book.pl

Recall that the above interpretations of directives are relevant for Yandex - spiders of other search engines can read commands differently.
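To illustrate the Clean-param entry above: with a rule like Clean-param: ref /some_dir/get_book.pl, the Yandex robot treats addresses that differ only in the ref parameter as a single page, roughly like this (the URLs are hypothetical):

http://example.com/some_dir/get_book.pl?ref=site_1&book_id=123
http://example.com/some_dir/get_book.pl?ref=site_2&book_id=123

Both are reduced to one address:

http://example.com/some_dir/get_book.pl?book_id=123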



The theoretical base has been created, and it is time to write an ideal (or very close to ideal) robots text file. If the site is running on an engine (Joomla!, WordPress, etc.), it comes with a mass of files without which normal operation is impossible, but there is no informative content in them. In most CMSs the content is stored in a database, which the robots cannot reach, so they keep looking for content in the engine files, and the time allocated for indexing is wasted.

It is very important to strive for unique content on your web resource and to carefully monitor the occurrence of duplicates. Even a partial repetition of the site's information content does not have the best effect on its evaluation by search engines. If the same content is available at different URLs, this is also considered duplication.

The two main search engines, Yandex and Google, will inevitably reveal duplication during crawling and artificially lower the position of the web resource in the search results.

Don't forget a great tool to help you deal with duplication: the canonical meta tag. By writing a different URL in it, the webmaster points the search spider to the preferred page for indexing, which will be considered the canonical one.

For example, a page with pagination, https://ktonanovenkogo.ru/page/2, contains a canonical tag pointing to https://ktonanovenkogo.ru, which eliminates problems with duplicate headers.
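In HTML, this is usually written as a link element in the page's <head>; a sketch for the example above:

<link rel="canonical" href="https://ktonanovenkogo.ru"/>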

So, we put together all the received theoretical knowledge and proceed to their practical implementation in robots.txt for your web resource, the specifics of which must be taken into account. What is required for this important file:

  • text editor (Notepad or any other) for writing and editing robots;
  • a tester that will help find errors in the created document and check whether the indexing bans are correct (for example, Yandex.Webmaster);
  • an FTP client that simplifies uploading the finished and verified file to the root of the web resource (on WordPress hosting, the root is usually the public_html folder).

The first thing a search crawler does is request a file created specifically for it and located at the URL "/robots.txt".

A web resource can contain only one "/robots.txt" file. Do not put it in user subdirectories: spiders will not look for the document there anyway. If you want to keep rules for subdirectories, remember that they still need to be collected into a single file in the root folder; otherwise it is more appropriate to use the "Robots" meta tag.

URLs are case sensitive - remember that "/robots.txt" is not capitalized.

Now you need to be patient and wait for the search spiders, who will first study your properly created, correct robots.txt and start crawling your web portal.

Correct setting of robots.txt for indexing sites on different engines

If you have a commercial resource, the creation of the robots file should be entrusted to an experienced SEO specialist. This is especially important if the project is complex. For those who are not ready to take this on faith, let us explain: this important text file has a serious impact on the indexing of the resource by search engines, the speed at which bots process the site depends on its correctness, and the content of robots has its own specifics. The developer needs to take into account the type of site (blog, online store, etc.), the engine, the structural features and other important aspects that a novice may not yet be able to handle.

At the same time, you need to make the most important decisions: what to hide from crawling, what to leave visible to crawlers so that the pages appear in the search. It will be very problematic for an inexperienced SEO to cope with such a volume of work.


1. Robots.txt, an example for WordPress

User-agent: * # general rules for robots, except for Yandex and Google, for which separate sections follow

Disallow: /cgi-bin # hosting folder
Disallow: /? # all query parameters on the main page
Disallow: /wp- # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
Disallow: /wp/ # if there is a /wp/ subdirectory where the CMS is installed (if not, # the rule can be removed)
Disallow: *?s= # search
Disallow: *&s= # search
Disallow: /search/ # search
Disallow: /author/ # author archives
Disallow: /users/ # user archives
Disallow: */trackback # trackbacks, notifications in comments about an open # link to an article
Disallow: */feed # all feeds
Disallow: */rss # rssfeed
Disallow: */embed # all embeds
Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file (can be removed if not used)
Disallow: /xmlrpc.php # WordPress API file
Disallow: *utm*= # links with utm tags
Disallow: *openstat= # links with openstat tags
Allow: */uploads # open folder with uploads files
Sitemap: http://site.ru/sitemap.xml # sitemap address

User-agent: GoogleBot # rules for Google

Disallow: /cgi-bin

Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Disallow: *utm*=
Disallow: *openstat=
Allow: */uploads # open folder with uploads files
Allow: /*/*.js # open js scripts inside /wp- (/*/ - for priority)
Allow: /*/*.css # open css files inside /wp- (/*/ - for priority)
Allow: /wp-*.png # images in plugins, cache folder, etc.
Allow: /wp-*.jpg # images in plugins, cache folder, etc.
Allow: /wp-*.jpeg # pictures in plugins, cache folder, etc.
Allow: /wp-*.gif # pictures in plugins, cache folder, etc.
Allow: /wp-admin/admin-ajax.php # used by plugins to not block JS and CSS

User-agent: Yandex # rules for Yandex

Disallow: /cgi-bin

Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not closing # from indexing, but deleting tag parameters, # Google does not support such rules
Clean-Param: openstat # similar



2. Robots.txt, an example for Joomla

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
Sitemap: http://path of your XML sitemap



3. Robots.txt, an example for Bitrix

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*?action=
Disallow: /*action=ADD_TO_COMPARE_LIST
Disallow: /*action=DELETE_FROM_COMPARE_LIST
Disallow: /*action=ADD2BASKET
Disallow: /*action=BUY
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*back_url_admin=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*?COURSE_ID=
Disallow: /*?PAGEN
Disallow: /*PAGEN_1=
Disallow: /*PAGEN_2=
Disallow: /*PAGEN_3=
Disallow: /*PAGEN_4=
Disallow: /*PAGEN_5=
Disallow: /*PAGEN_6=
Disallow: /*PAGEN_7=


Disallow: /*PAGE_NAME=search
Disallow: /*PAGE_NAME=user_post
Disallow: /*PAGE_NAME=detail_slide_show
Disallow: /*SHOWALL
Disallow: /*show_all=
Sitemap: http://path of your XML sitemap



4. Robots.txt, an example for MODX

User-agent: *
Disallow: /assets/cache/
Disallow: /assets/docs/
Disallow: /assets/export/
Disallow: /assets/import/
Disallow: /assets/modules/
Disallow: /assets/plugins/
Disallow: /assets/snippets/
Disallow: /install/
Disallow: /manager/
Sitemap: http://site.ru/sitemap.xml

5. Robots.txt, an example for Drupal

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /profile
Disallow: /profile/*
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /index.php
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: *register*
Disallow: *login*
Disallow: /top-rated-
Disallow: /messages/
Disallow: /book/export/
Disallow: /user2userpoints/
Disallow: /myuserpoints/
Disallow: /tagadelic/
Disallow: /referral/
Disallow: /aggregator/
Disallow: /files/pin/
Disallow: /your-votes
Disallow: /comments/recent
Disallow: /*/edit/
Disallow: /*/delete/
Disallow: /*/export/html/
Disallow: /taxonomy/term/*/0$
Disallow: /*/edit$
Disallow: /*/outline$
Disallow: /*/revisions$
Disallow: /*/contact$
Disallow: /*downloadpipe
Disallow: /node$
Disallow: /node/*/track$

Disallow: /*?page=0
Disallow: /*section
Disallow: /*order
Disallow: /*?sort*
Disallow: /*&sort*
Disallow: /*votesupdown
Disallow: /*calendar
Disallow: /*index.php
Allow: /*?page=

Sitemap: http://path to your XML sitemap

ATTENTION! Site content management systems are constantly updated, so the robots file may also change: additional pages or groups of files may need to be closed or, conversely, opened for indexing. It all depends on the goals of the web resource and the current changes in the engine.

7 common mistakes when indexing a site using robots.txt



Errors made during file creation cause robots.txt to work incorrectly or even prevent it from working at all.

What errors are possible:

  • Logical (the specified rules conflict with each other). You can identify this type of error during testing in Yandex.Webmaster and Google's robots.txt testing tool.
  • Syntactic (directives are written with errors).

More common than others are:

  • the entry is written without regard to case;
  • capital letters are used;
  • all rules are listed on one line;
  • rules are not separated by an empty line;
  • the crawler is specified inside a directive rather than in User-agent;
  • each file of a folder that needs to be closed is listed separately instead of closing the whole folder;
  • the mandatory Disallow directive is missing.

Consider common mistakes, their consequences and, most importantly, measures to prevent them on your web resource.

  1. File location. The URL of the file should be of the following form: http://site.ru/robots.txt (instead of site.ru, the address of your site is listed). The robots.txt file is based exclusively in the root folder of the resource - otherwise, search spiders will not see it. Without getting banned, they will crawl the entire site and even those files and folders that you would like to hide from search results.
  2. Case sensitive. No capital letters. http://site.ru/Robots.txt is wrong. In this case, the search engine robot will receive a 404 (error page) or 301 (redirect) as a server response. Crawling will take place without taking into account the directives indicated in robots. If everything is done correctly, the server response is code 200, in which the owner of the resource will be able to control the search crawler. The only correct option is "robots.txt".
  3. Opening in the browser. Search spiders will only be able to correctly read and use the directives of the robots.txt file if it opens in the browser as a plain page. It is important to pay close attention to the server settings of the engine: sometimes a file of this type is offered for download instead. In that case, you should fix how it is served, otherwise the robots will crawl the site as they please.
  4. Prohibition and permission errors. "Disallow" is the directive that prohibits scanning of the site or its sections. For example, if you need to prevent robots from indexing the site's search result pages, robots.txt should contain the line "Disallow: /search/": the crawler then understands that all pages whose address contains "search" are off limits. With a total ban on indexing, "Disallow: /" is written, and the allowing "Allow" directive is not necessary in this case. Nevertheless, entries like "Allow:" with an empty value are not uncommon, as if granting permission to index "nothing"; the whole site can be opened for indexing with the "Allow: /" directive. Do not confuse the commands: this leads to crawling errors, and spiders end up adding pages that are not at all the ones that should be promoted.
  5. Directive conflict. Both Disallow: and Allow: for the same page are sometimes found in robots, and crawlers give priority to the allowing directive. For example, a section was initially opened for crawling by spiders, then for some reason it was decided to hide it from the index. Naturally, a ban is added to robots.txt, but the webmaster forgets to remove the permission. For search engines the ban then carries little weight: they prefer to index the page when directives contradict each other.
  6. The Host: directive. It is recognized only by Yandex spiders and is used to determine the main mirror. A useful command, but, alas, it is treated as erroneous or unknown by all other search engines. When including it in your robots, it is best to address the general section to User-agent: * and write a separate section for the Yandex robot, in which the Host: command is specified personally:

    User-agent: Yandex
    Host: site.ru

    The directive prescribed for all crawlers will be perceived by them as erroneous.

  7. The Sitemap: directive. With the help of a sitemap, bots find out which pages are on the web resource. A very common mistake is not paying attention to the location of the sitemap.xml file, although it determines the list of URLs included in the map. By placing the file outside the root folder, developers put the site at risk: crawlers incorrectly determine the number of pages, and as a result important parts of the web resource do not make it into the search results.

For example, a Sitemap file placed at the URL http://primer.ru/catalog/sitemap.xml can include any URLs starting with http://primer.ru/catalog/... but must not include URLs like, say, http://primer.ru/images/...

To summarize: if the site owner wants to influence how search bots index the web resource, the robots.txt file is of particular importance. The created document must be carefully checked for logical and syntactic errors, so that in the end the directives work for the overall success of your site, ensuring high-quality and fast indexing.

How to avoid errors by creating the correct robots.txt structure for site indexing



The structure of robots.txt is clear and simple, and it is quite possible to write the file yourself. You just need to carefully follow the syntax, which is extremely important for robots. Search bots follow the directives of the document voluntarily, and different search engines may interpret the syntax in their own way.

A list of the following mandatory rules will help eliminate the most common mistakes when creating robots.txt. To write the right document, you should remember that:

  • each directive starts on a new line;
  • in one line - no more than one command;
  • a space cannot be placed at the beginning of a line;
  • the command parameter must be on one line;
  • directive parameters do not need to be quoted;
  • command parameters do not require a semicolon at the end;
  • the directive in robots.txt is specified in the format: [command_name]:[optional space][value][optional space];
  • after the pound sign # comments are allowed in robots.txt;
  • an empty string can be interpreted as the end of the User-agent command;
  • the prohibiting directive with an empty value - "Disallow:" is similar to the directive "Allow: /" that allows scanning the entire site;
  • "Allow", "Disallow" directives can contain no more than one parameter. Each new parameter is written on a new line;
  • only lowercase letters are used in the name of the robots.txt file. Robots.txt or ROBOTS.TXT - erroneous spellings;
  • The robots.txt standard does not regulate case sensitivity, but files and folders are often sensitive in this matter. Therefore, although it is acceptable to use capital letters in the names of commands and parameters, this is considered bad form. It is better not to get carried away with the upper case;
  • when the command parameter is a folder, a slash "/" is required before the name, for example: Disallow: /category;
  • if the robots.txt file weighs more than 32 KB, search bots perceive it as equivalent to "Disallow:" and consider it completely allowing indexing;
  • the inaccessibility of robots.txt (for various reasons) may be perceived by crawlers as the absence of crawl bans;
  • empty robots.txt is regarded as allowing indexing of the site as a whole;
  • if multiple "User-agent" commands are listed without a blank line between them, search spiders may treat the first directive as the only one, ignoring all subsequent "User-agent" directives;
  • robots.txt does not allow the use of any symbols of national alphabets.

The above rules are not equally relevant for all search engines, because they interpret the robots.txt syntax differently. For example, Yandex selects entries by the presence of "User-agent" in the line, so for it the presence of an empty line between different "User-agent" directives does not matter.

In general, robots should contain only what is really needed for proper indexing. There is no need to try to embrace the immensity and cram the maximum amount of data into the document. The best robots.txt is a meaningful file; the number of lines does not matter.

The robots text document needs to be checked for correct structure and correct syntax, and online services can help with this. To use them, you need to upload robots.txt to the root folder of your site first, otherwise the service may report that it was unable to load the required document. It is recommended to check beforehand that robots.txt is available at its address (your_site.ru/robots.txt).

The largest search engines, Yandex and Google, offer webmasters their own site analysis services. One of the aspects of this analysis is the robots.txt check.

There are a lot of online robots.txt validators on the Internet, you can choose any one you like.
