What Is Robots.txt?

Robots.txt is a simple text file that sits in the root directory of your site (e.g. https://www.example.com/robots.txt) and tells crawlers which parts of the site they may and may not crawl.
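If you want to check programmatically how a crawler will interpret your file, Python's standard-library urllib.robotparser can fetch and query it. A minimal sketch (the example.com URLs are placeholders); note that urllib.robotparser implements simple prefix rules, so its answers for wildcard patterns like /*.pdf$ may differ from how Google evaluates them:

    from urllib import robotparser

    # Fetch and parse the robots.txt at the site root
    # (https://www.example.com is a placeholder domain).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()

    # Ask whether a crawler ("*" = any user agent) may fetch a given URL.
    print(rp.can_fetch("*", "https://www.example.com/downloads/report.html"))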

The list below provides a quick reference to the key robots.txt directives:

  • User-agent: Specifies which crawler the rules apply to (see user agent tokens). Using * targets all crawlers.
  • Disallow: Prevents the specified URLs from being crawled.
  • Allow: Allows specific URLs to be crawled, even if a parent directory is disallowed.
  • Sitemap: Indicates the location of your XML sitemap, helping search engines discover it.

Put together, a sample robots.txt might look like this:

    User-agent: *
    Disallow: /downloads/
    Allow: /downloads/free/
    Disallow: /*.pdf$
    Sitemap: https://www.example.com/sitemap

  1. By using /*, the rule matches any path on the site, so any URL ending in .pdf is blocked from crawling (see the sketch after this list for how such wildcard patterns match).
  2. Always specify relative paths for the Disallow and Allow directives, never absolute URLs like “https://www.example.com/form/”. The one exception is the Sitemap directive, which requires a full, absolute URL to indicate the location of the sitemap.
  3. Be cautious to avoid malformed rules. For example, Disallow: /form without a trailing slash also matches /form-design-examples/, which may be a page on your blog that you want indexed.
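To make note 1 (and the pitfall in note 3) concrete, here is a minimal sketch of wildcard matching, assuming Google-style semantics in which * matches any sequence of characters and a trailing $ anchors the end of the URL. The function name and test paths are ours, for illustration:

    import re

    def robots_pattern_to_regex(pattern: str) -> re.Pattern:
        # '*' matches any run of characters; a trailing '$' anchors the
        # end of the URL. Everything else is matched literally.
        anchored = pattern.endswith("$")
        body = pattern[:-1] if anchored else pattern
        regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
        return re.compile("^" + regex + ("$" if anchored else ""))

    print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdf")))    # True: blocked
    print(bool(robots_pattern_to_regex("/*.pdf$").match("/files/report.pdfx")))   # False: '$' anchors the end
    print(bool(robots_pattern_to_regex("/form").match("/form-design-examples/"))) # True: note 3's pitfall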

Let’s walk through what each line of the example above does.

  • User-agent: * Specifies which crawler these rules apply to; * means "all crawlers".
  • Disallow: /downloads/ Do not crawl the downloads directory.
  • Allow: /downloads/free/ Exception: the downloads/free/ directory may be crawled, in spite of the rule above (see the sketch below for how this precedence is resolved).
  • Disallow: /*.pdf$ Do not crawl any PDF file.
  • Sitemap: https://www.example.com/sitemap Gives the location of your sitemap. This must be a full, absolute URL. You can list multiple sitemaps, but each must be on its own line.
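How does a crawler decide between Disallow: /downloads/ and the more specific Allow: /downloads/free/? Under RFC 9309, the longest matching rule wins, and an Allow beats an equally long Disallow. Here is a minimal sketch for plain path prefixes (the rule list and test paths are illustrative, not a definitive implementation):

    RULES = [("disallow", "/downloads/"), ("allow", "/downloads/free/")]

    def is_allowed(path: str) -> bool:
        matches = [(kind, pattern) for kind, pattern in RULES
                   if path.startswith(pattern)]
        if not matches:
            return True  # no rule applies, so crawling is allowed
        # Longest pattern wins; an Allow beats an equally long Disallow.
        kind, _ = max(matches, key=lambda m: (len(m[1]), m[0] == "allow"))
        return kind == "allow"

    print(is_allowed("/downloads/setup.zip"))      # False: /downloads/ is disallowed
    print(is_allowed("/downloads/free/tool.zip"))  # True: the more specific Allow wins
    print(is_allowed("/about/"))                   # True: no rule matches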

We hope this helps! As always, if you have any questions, feel free to contact us.

