Working With the Limitations of Your Robots.txt File

If you know a bit about SEO, you’ve probably heard of the robots.txt file and know that it’s the bit of text that tells search engine bots how to crawl your website. Search engines employ bots to help them build up their databases by indexing web pages and recording links going in and out of a site.

Your website’s robots.txt file is a text file that gives direction to these bots, helping them to see a website the way you want them to and understand what to pay attention to and what to ignore. There are some important limitations to these text files and knowing what those are will help you use them to your advantage.

Robots.txt user-agents

The main function of the robots.txt file is to manage crawler traffic and to keep some pages or files hidden from, or visible to, certain search engines. Each group of rules in the file applies only to the search engine bots, or user-agents, named at the top of that group.

A bot pays attention only to the most specific group of rules that matches its user-agent name. If a particular user-agent doesn’t have its own group, it falls back to the general rules addressed to all bots (User-agent: *). You can use this to your advantage by learning as much as possible about how different search engines rank pages. You can then allow and disallow those engines’ specific bots to crawl the pages you want.
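As a minimal sketch of how per-bot groups behave, the snippet below feeds a small robots.txt to Python's standard urllib.robotparser module. The bot names ExampleBot and OtherBot and the paths are hypothetical, and this parser is only one interpretation of the file; real crawlers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one general group and one group for "ExampleBot".
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/

User-agent: ExampleBot
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ExampleBot has its own group, so only that group's rules apply to it.
example_print = parser.can_fetch("ExampleBot", "/print/page.html")
example_drafts = parser.can_fetch("ExampleBot", "/drafts/post.html")

# OtherBot has no group of its own, so it falls back to the "*" group.
other_print = parser.can_fetch("OtherBot", "/print/page.html")
other_drafts = parser.can_fetch("OtherBot", "/drafts/post.html")

print(example_print, example_drafts, other_print, other_drafts)
```

Note that ExampleBot is not bound by the /drafts/ rule here: once a bot has its own group, anything you still want it to avoid has to be repeated in that group.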

This can help you keep duplicate content from popping up in searches. It can also tell certain engines to ignore pages on your site that you want to remain static, such as a login page, and not give you an SEO ding for not updating content on that page.

Writing rules per user-agent also lets you request a crawl delay for each bot, which can keep your site from being overloaded with traffic, although not every crawler honors that request. It also lets you tell specific search engines to ignore website maintenance pages that you don’t want showing up in public searches.
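A rough illustration of per-bot delays, again using Python's standard urllib.robotparser: the bot names and delay values are made up for the example, and Crawl-delay is a request that a crawler may or may not honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a longer delay for "ExampleBot", a shorter one for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10

User-agent: *
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds requested for the given user-agent.
example_delay = parser.crawl_delay("ExampleBot")
default_delay = parser.crawl_delay("SomeOtherBot")

print(example_delay, default_delay)
```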

Robots.txt instructions

Another important limitation to bear in mind is that this file can only issue instructions: it cannot enforce them. You can expect web crawlers from reputable organizations like Google or Bing to respect the instructions in your file, but malicious web crawlers are not bound to obey.

To malicious bots, your robots.txt file may actually be pointing out precisely what they want to look at: the very files you don’t want them crawling. Knowing this empowers you to take more specific action to block those files you want to keep out of malicious hands.
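To see why, consider how little effort it takes to read the file the other way around. The sketch below is a hypothetical helper (disallowed_paths is not part of any library) that pulls every Disallow path out of a robots.txt; the paths shown are invented for illustration.

```python
from typing import List


def disallowed_paths(robots_txt: str) -> List[str]:
    """Collect every non-empty Disallow path listed in a robots.txt body."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file contents; anyone can request /robots.txt and read them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private-reports/  # hypothetical path
"""

exposed = disallowed_paths(ROBOTS_TXT)
print(exposed)
```

Anything that appears in that list is effectively advertised, which is why sensitive content needs real access control rather than a Disallow line.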

One of the best ways to do this is to identify the URL of any content you want to stay private and then store it in a password-protected directory. Then there is no need to specify anything about these files, as bots cannot access them even if they run across them. Better yet, once your content is in such a directory, it doesn’t need a specific line in your robots.txt file, so you’re no longer involuntarily alerting malicious bots to the content’s existence in the first place.
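In practice, password protection is usually switched on in your web server or hosting control panel rather than written by hand. Purely as an illustration of what that protection does, here is a small sketch of an HTTP Basic authentication check; the function name is_authorized and the credentials are assumptions made up for this example.

```python
import base64
import hmac


def is_authorized(auth_header, expected_user, expected_password):
    """Return True if a Basic Authorization header carries the expected credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        encoded = auth_header[len("Basic "):]
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok


# Hypothetical credentials, for illustration only.
header = "Basic " + base64.b64encode(b"editor:s3cret").decode("ascii")
print(is_authorized(header, "editor", "s3cret"))
```

A bot that ignores robots.txt sends no valid credentials, so the server refuses the request whether or not the bot knows the URL.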

Robots.txt interpretation

Different web crawlers will interpret your file contents and instructions in different ways. Knowing some of the differences in how search engine web crawlers interpret the syntax lets you control how those bots crawl the site.

As an example, many sites put in a crawl delay to ask web crawlers to slow down. This keeps the site from being overloaded with traffic. Google does not obey this instruction, so any limit on Google’s crawling has to be set through Google’s own tools, such as Search Console.

Since Google is one of the crawlers most likely to load your network with traffic, you can set a longer delay for Google’s bot on Search Console and then safely put in a general instruction to all other bots to delay by a shorter period.

Your robots.txt file is a crucial part of your website and how it interacts with the web in general. It’s worth taking time to really get to know this file so you can build a strategy to incorporate it into your successful SEO plan.
