The robots.txt file is a simple text file that sits in the root directory of your site and tells crawlers which parts of the site they may and may not crawl.
The table below provides a quick reference to the key robots.txt directives.

| Directive | Purpose |
| --- | --- |
| User-agent | Specifies which crawler the rules apply to (see user agent tokens). Using * targets all crawlers. |
| Disallow | Prevents the specified URLs from being crawled. |
| Allow | Allows specific URLs to be crawled, even if a parent directory is disallowed. |
| Sitemap | Indicates the location of your XML sitemap, helping search engines discover it. |
```
User-agent: *
Disallow: /downloads/
Allow: /downloads/free/
Disallow: /*.pdf$
Sitemap: https://www.example.com/sitemap
```
- The /* wildcard matches any path on the site, and the $ anchors the pattern to the end of the URL, so any URL ending in .pdf is blocked from crawling.
- Always specify relative paths, never absolute URLs like “https://www.example.com/form/”, in Disallow and Allow directives. The one exception is the Sitemap directive, which requires a full, absolute URL to indicate the location of the sitemap.
- Be cautious to avoid malformed rules. For example, Disallow: /form without a trailing slash also matches /form-design-examples/, which may be a page on your blog that you want indexed. The sketch below shows why.
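Because Disallow and Allow paths are matched as URL-path prefixes, you can see the trailing-slash pitfall by simulating the match yourself. The snippet below is a minimal sketch, not a full robots.txt parser (it ignores wildcards and percent-encoding), and the page paths are made up for illustration.

```python
# Minimal illustration of robots.txt prefix matching; real crawlers also
# handle wildcards and percent-encoding, which this sketch ignores.
def rule_matches(rule_path: str, url_path: str) -> bool:
    """A plain Disallow/Allow path matches any URL path that starts with it."""
    return url_path.startswith(rule_path)

# Hypothetical page paths used only for this example.
pages = ["/form/", "/form/contact/", "/form-design-examples/"]

for page in pages:
    print(f"{page:28} matched by '/form':  {rule_matches('/form', page)}")
    print(f"{page:28} matched by '/form/': {rule_matches('/form/', page)}")
```

With the trailing slash, only URLs inside the /form/ directory are matched, while /form-design-examples/ stays crawlable.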
Let’s dive into what each directive in the example does.
- User-agent: * Specifies which search engines these rules apply to. * translates to "all".
- Disallow: /downloads/ Do not crawl the downloads directory.
- Allow: /downloads/free/ EXCEPTION: the downloads/free/ directory may be crawled, in spite of the rule above.
- Disallow: /*.pdf$ Do not crawl any PDF file.
- Sitemap: https://www.example.com/sitemap Gives the location of your sitemap. This must be a full URL. You can list multiple sitemaps; each must be on its own line.
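To see how these rules interact, here is a small Python sketch that applies the longest-match precedence modern crawlers use (the most specific matching rule wins, and Allow wins ties), with basic support for the * and $ wildcards. It is a simplified illustration under those assumptions, not a complete robots.txt parser, and the test URLs are hypothetical.

```python
import re

# The example rules from above, in (directive, pattern) form.
RULES = [
    ("Disallow", "/downloads/"),
    ("Allow", "/downloads/free/"),
    ("Disallow", "/*.pdf$"),
]

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt pattern into a regex: * matches anything,
    a trailing $ anchors the match to the end of the URL path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path: str) -> bool:
    """Longest (most specific) matching rule wins; no match means allowed."""
    best = None  # (pattern length, is Allow)
    for directive, pattern in RULES:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "Allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Hypothetical URLs used only to illustrate the rules above.
for path in ["/downloads/report.zip", "/downloads/free/guide.zip",
             "/whitepaper.pdf", "/blog/robots-txt-guide/"]:
    print(f"{path:35} allowed: {is_allowed(path)}")
```

Running it shows that /downloads/free/guide.zip is allowed even though its parent directory is disallowed, because the longer Allow rule is more specific, while any URL ending in .pdf is blocked.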
We hope this helps! As always, if you have any questions, feel free to contact us.