
Robots.txt Checker

Test any URL against any site's robots.txt. Detects allow/disallow conflicts and wildcard-pattern mistakes, and shows which crawlers can fetch which paths, including AI crawlers (GPTBot, ClaudeBot, PerplexityBot).

What it does

See exactly which crawlers can fetch which URLs on your site

A single wrong line in robots.txt can wipe out more organic traffic than an algorithm update. If one Disallow rule goes live during a staging deploy, your entire site becomes invisible to Google until someone notices.

This checker fetches your live robots.txt, parses it the way Google's crawler does (path-prefix matching, longest rule wins, wildcard expansion), and lets you test any URL against any user-agent. That includes the AI-related user-agents (GPTBot, ClaudeBot, PerplexityBot, Google-Extended), which can have their own rules and matter for GEO/AEO.
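To make the matching rules concrete, here is a minimal Python sketch of prefix matching with wildcards and longest-rule-wins. It is an illustration under assumptions, not the checker's actual code: the names pattern_to_regex, is_allowed and RULES are hypothetical, rule length is measured as the character count of the pattern, and ties go to Allow.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the URL. Everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """Decide whether `path` may be fetched.

    `rules` is a list of (directive, pattern) pairs for ONE user-agent
    group, where directive is "allow" or "disallow". The longest matching
    pattern wins; on equal length, Allow wins. No match means allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Hypothetical rule set, for illustration only.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/press/"),
    ("disallow", "/*.pdf$"),
]

for path in ["/private/report.html", "/private/press/launch.html",
             "/docs/guide.pdf", "/docs/guide.pdf?download=1"]:
    print(path, "->", "allowed" if is_allowed(RULES, path) else "blocked")
```

Note that the last path is allowed: the `$` anchor means the PDF rule only matches URLs that end in .pdf, so a query string slips past it.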

The 6 robots.txt mistakes I see most often

Six robots.txt mistakes that quietly destroy SEO traffic

1. Disallow: / left over from staging

Someone deploys staging's robots.txt to production and the site goes uncrawlable. It happens more often than you'd think: about once a quarter, on average, among new clients.

2. Blocking JS, CSS, or image folders

Old advice from 2015 was to block crawlers from JS/CSS to save "crawl budget". Modern Google needs to render your pages, and blocking JS or CSS makes them look broken to Google. Always allow these folders.

3. Conflicting Allow and Disallow paths

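Take Disallow: /products/ alongside Allow: /products/widgets/. Google resolves the conflict with "longest matching path wins", whatever order the rules appear in, so /products/widgets/ stays crawlable. Not every parser works that way: some apply the first matching rule, so the order in which CMSs and security plugins write rules can still change the outcome for other crawlers. The checker shows the resolved decision per URL.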
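Here is a small Python sketch of why rule order changes the answer for some parsers but not for a longest-match one. Both functions are simplified illustrations with hypothetical names (plain prefixes only, no wildcards), not any crawler's real code. Python's own urllib.robotparser is one example of a parser that checks rules in file order.

```python
def first_match(rules, path):
    """First-match behaviour: the first rule whose prefix matches decides."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # no rule matched, so the path is allowed

def longest_match(rules, path):
    """Longest-match behaviour: the longest matching prefix decides.

    On equal length, Allow wins, because (n, True) sorts above (n, False).
    """
    matches = [(len(prefix), directive == "allow")
               for directive, prefix in rules
               if path.startswith(prefix)]
    return max(matches)[1] if matches else True

# The same two rules, written in both possible orders.
broad_first = [("disallow", "/products/"), ("allow", "/products/widgets/")]
narrow_first = list(reversed(broad_first))
url_path = "/products/widgets/blue"

print("first-match, broad rule first:  ", first_match(broad_first, url_path))
print("first-match, narrow rule first: ", first_match(narrow_first, url_path))
print("longest-match, broad rule first:", longest_match(broad_first, url_path))
print("longest-match, narrow rule first:", longest_match(narrow_first, url_path))
```

Only the first-match parser with the broad rule first blocks the URL. Writing the narrower Allow above the broader Disallow keeps both kinds of parser in agreement.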

4. User-agent: * with too-broad Disallow

A Disallow under User-agent: * hits Google and every other legitimate crawler that has no group of its own. Use specific user-agents to target only the crawlers you mean.

5. Forgetting to add a sitemap directive

Adding Sitemap: https://yoursite.com/sitemap.xml to robots.txt is one of the highest-impact one-line changes you can make for indexation, because it tells every crawler that reads the file where to find your URL list.
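A quick way to confirm the directive is present is to parse the file and list the sitemaps it declares. This sketch uses Python's standard urllib.robotparser (site_maps() needs Python 3.8 or later). The declared_sitemaps helper and the sample file contents are hypothetical, for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Sample file contents, made up for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
"""

def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt string (may be empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

sitemaps = declared_sitemaps(ROBOTS_TXT)
if sitemaps:
    print("Declared sitemaps:", sitemaps)
else:
    print("No Sitemap directive found")
```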

6. Using robots.txt to hide pages from search results

robots.txt blocks crawling, not indexing. A blocked page can still appear in search results if other sites link to it; Google just shows it without a snippet. To hide a page from search results, use a noindex meta tag instead, and leave the page crawlable so Google can actually see the tag.
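To check whether a page actually sends a noindex signal, you can look at both the robots meta tag and the X-Robots-Tag response header. The sketch below is a rough illustration: has_noindex and the plain headers dict are hypothetical, and it assumes you already have the HTML and headers in hand and that the header key is spelled exactly "X-Robots-Tag".

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

def has_noindex(html, headers=None):
    """True if the HTML or the X-Robots-Tag header carries a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = (headers or {}).get("X-Robots-Tag", "")
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    return "noindex" in parser.directives or "noindex" in header_directives

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))                                   # meta tag case
print(has_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))  # header case
```

If the same URL is also disallowed in robots.txt, neither signal will ever be read, because the crawler never fetches the page.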

AI crawlers

AI crawler robots.txt: what to allow, what to block

Most sites are still configured for Googlebot only. AI engines like ChatGPT, Claude, Perplexity, and Gemini use different crawlers, and they're the ones that decide whether your brand gets cited in AI answers. Configure these correctly:

  • GPTBot: OpenAI's crawler, used to collect training data for its models. ALLOW if you want your content to surface in ChatGPT answers.

  • ClaudeBot: Anthropic's crawler. ALLOW for Claude citations.

  • PerplexityBot: Perplexity's crawler. ALLOW for Perplexity citations.

  • Google-Extended: not a separate crawler but a robots.txt control token (Google's existing crawlers do the fetching) that governs whether your content is used for Gemini. It does not change how Googlebot crawls or indexes your site for Search. ALLOW for Gemini citations.

  • CCBot: Common Crawl's crawler (its dataset is used by many LLMs). ALLOW for general LLM exposure.

Blocking these means you opt out of AI citation entirely while your competitors opt in. The common 2024 default of "allow Google, block everything AI" is actively harmful to GEO.
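To see at a glance which of the user-agents listed above a robots.txt lets in, a rough check with Python's standard urllib.robotparser could look like this. The sample file, the example.com URL and the ai_crawler_access helper are made up for illustration. The standard-library parser only does plain prefix matching in file order (no wildcards, no longest-match), so treat it as a quick check for blanket rules rather than a faithful copy of how each crawler behaves.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]

# Hypothetical robots.txt that blocks two AI crawlers outright.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

def ai_crawler_access(robots_txt, url):
    """Map each AI user-agent to True (may fetch `url`) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_CRAWLERS}

report = ai_crawler_access(SAMPLE_ROBOTS_TXT, "https://example.com/blog/post")
for bot, allowed in report.items():
    print(f"{bot:16} {'ALLOW' if allowed else 'BLOCK'}")
```

With this sample file, GPTBot and CCBot come back blocked, while the other three fall through to the User-agent: * group and are allowed.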

Frequently asked

FAQs about the Robots.txt Checker

robots.txt vs. noindex: robots.txt blocks crawling, so bots don't fetch the page at all. noindex (in a meta tag or an X-Robots-Tag HTTP header) lets bots crawl the page but tells them not to put it in the search index. To hide a page from search results, use noindex. To save crawl budget on never-useful URLs (e.g. infinite calendar pages), use robots.txt.
robots.txt belongs at the root of your domain only: https://yourdomain.com/robots.txt. A robots.txt in a subdirectory is ignored. Each subdomain needs its own robots.txt at its own root.
Google generally caches robots.txt for up to 24 hours, so changes are not picked up instantly. To prompt a faster refresh, request a recrawl from the robots.txt report in Search Console, which replaced the older Robots.txt Tester.
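You can block URLs with query parameters by using wildcards: Disallow: /*?session= or Disallow: /*?utm_ (a trailing * is optional, since rules are prefix matches). As written, these only match when the parameter comes first in the query string. Test the patterns with this checker to make sure they catch what you intend and don't accidentally block valid URLs.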
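If disallowed pages still show up in Google, either Google only recently fetched the disallow rule (allow up to 48 hours for indexed pages to drop), or the pages were already indexed before the rule was added. In the second case, lift the disallow and add noindex so Google can crawl the pages and see the directive, then re-disallow once they're out of the index.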
Block AI crawlers only if you have a specific reason (paywalled content, copyright concerns). For most marketing sites, blocking them means losing GEO citation visibility, which is increasingly a major traffic source.
User-agent: * introduces rules that apply to all crawlers that don't have their own specific block. Use it for broad, site-wide disallow patterns. Use specific user-agent blocks (User-agent: Googlebot) for crawler-specific rules. The Sitemap line is independent of user-agent groups, so it applies no matter where in the file it sits.
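On the location rule in the FAQ above (root of each host only): given any page URL, the robots.txt that governs it can be derived by keeping the scheme and host and swapping the path. A minimal Python sketch, where robots_txt_url is a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL that applies to `page_url`.

    The scheme and host (including any subdomain and port) are kept;
    the path, query string and fragment are dropped.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://yourdomain.com/blog/post?x=1"))
print(robots_txt_url("https://shop.yourdomain.com/cart"))
```

The second call shows why subdomains need their own file: shop.yourdomain.com resolves to a different robots.txt than yourdomain.com.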
Beyond tools
Need an audit, not a checker?
These tools spot problems. I solve them. Book a free strategy call and I'll review your site live and give you a prioritised fix list.