
robots.txt Syntax and Examples

8 min read · by DevToolBox

The robots.txt file is the first thing search engine crawlers check when visiting your website. A properly configured robots.txt can protect private content, save crawl budget, and improve SEO. This guide covers every directive, wildcard pattern, and real-world example you need.

What is robots.txt?

robots.txt is a plain text file placed at the root of your website (e.g., https://example.com/robots.txt) that tells web crawlers which URLs they are allowed or forbidden to access. It follows the Robots Exclusion Protocol (REP), first proposed in 1994 and now an internet standard (RFC 9309).

How crawlers use robots.txt:

  1. A crawler arrives at your domain and requests /robots.txt before crawling any other page.
  2. If the file exists, the crawler parses rules for its specific User-agent.
  3. The crawler follows matching Disallow and Allow directives when deciding which URLs to fetch.
  4. If no robots.txt is found (404), the crawler assumes everything is allowed.

Important: robots.txt is advisory, not enforceable. Well-behaved bots (Googlebot, Bingbot) respect it, but malicious scrapers may ignore it entirely. For truly private content, use authentication or server-side access controls.
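The check described in steps 1–4 can be reproduced with Python's standard-library urllib.robotparser. A minimal sketch (the rules and URLs here are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Model of steps 2-3: parse the rules, then consult them for each URL.
# In a real crawler you would use rp.set_url(".../robots.txt") + rp.read(),
# which also models step 4: a 404 response makes every URL allowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/private/report"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/blog/post"))       # True
```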

Syntax Rules & Directives

A robots.txt file is composed of one or more rule groups. Each group starts with a User-agent line followed by directives:

  • User-agent — Specifies which crawler the rules apply to. Use * for all crawlers.
  • Disallow — Blocks a URL path. An empty value (Disallow:) means nothing is blocked.
  • Allow — Explicitly permits a URL path, overriding a broader Disallow. Supported by Googlebot and most modern crawlers.
  • Sitemap — Points to your XML sitemap. Can appear anywhere in the file, outside rule groups.
  • Crawl-delay — Requests a delay (in seconds) between successive requests. Supported by Bing and Yandex but NOT by Google.

Formatting rules:

  • One directive per line.
  • Lines starting with # are comments.
  • Blank lines separate rule groups.
  • Paths are case-sensitive (/Admin is different from /admin).
  • The file must be named exactly robots.txt and placed at the domain root.

Basic syntax example:

# This is a comment
User-agent: *
Disallow: /private/
Disallow: /tmp/
Allow: /private/public-page.html

# Slow down Bingbot
User-agent: Bingbot
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
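To make the group structure concrete, here is a minimal sketch of how a parser splits such a file into rule groups (parse_robots is a hypothetical helper, not a standard API; it handles only User-agent, Disallow, and Allow):

```python
def parse_robots(text):
    """Split robots.txt text into (user_agents, rules) groups.

    Minimal sketch: strips # comments and ignores other fields
    (Sitemap, Crawl-delay).
    """
    groups, agents, rules = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # comments run to end of line
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:                          # a new group begins here
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif field in ("disallow", "allow"):
            rules.append((field, value))
    if agents or rules:
        groups.append((agents, rules))
    return groups

example = """\
# This is a comment
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
"""
print(parse_robots(example))
# [(['*'], [('disallow', '/private/'), ('allow', '/private/public-page.html')])]
```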

Wildcard Patterns

Google, Bing, and most major crawlers support two wildcard characters in robots.txt paths:

  • * (asterisk) — Matches any sequence of characters (including empty).
  • $ (dollar sign) — Anchors the match at the end of the URL.

Wildcard pattern examples:

User-agent: *

# Block all URLs containing "?sort="
Disallow: /*?sort=

# Block all .pdf files
Disallow: /*.pdf$

# Block all URLs with query parameters
Disallow: /*?*

# Block all .json API responses
Disallow: /*.json$

# Block paths containing /temp/ anywhere
Disallow: /*/temp/

# Allow specific .xml files (sitemaps)
Allow: /*.xml$

Note: The original robots.txt specification did not include wildcards. They are extensions supported by major search engines. Always test your patterns to ensure they match what you intend.
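The two wildcards translate directly into a regular expression, which is a convenient way to test patterns locally before deploying. A sketch (robots_pattern_to_regex and matches are hypothetical helpers):

```python
import re

def robots_pattern_to_regex(pattern):
    """Convert a robots.txt path pattern to a compiled regex.

    '*' matches any character sequence; a trailing '$' anchors the match
    at the end of the URL. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def matches(pattern, path):
    # Rules match from the start of the path, hence re.match (not search)
    return robots_pattern_to_regex(pattern).match(path) is not None

print(matches("/*.pdf$", "/docs/report.pdf"))      # True
print(matches("/*.pdf$", "/docs/report.pdf?v=2"))  # False ($ anchors the end)
print(matches("/*?sort=", "/shoes?sort=price"))    # True
```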

Common robots.txt Examples

Block all crawlers from the entire site:

User-agent: *
Disallow: /

Allow all crawlers (explicit):

User-agent: *
Disallow:

Block a specific crawler:

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Block a specific directory:

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /staging/

Block all but allow a specific path:

User-agent: *
Disallow: /api/
Allow: /api/public/
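When Allow and Disallow both match a URL, Google picks the most specific (longest) matching rule, which is why Allow: /api/public/ wins over Disallow: /api/ above. A minimal sketch of that precedence for plain prefix rules without wildcards (is_allowed is a hypothetical helper):

```python
def is_allowed(path, rules):
    """Apply longest-match-wins precedence to prefix rules.

    rules: list of (directive, pattern) pairs, e.g. ("disallow", "/api/").
    Simplifications: no wildcard support, and length ties are not broken
    in favour of Allow as Google does.
    """
    best_directive, best_len = "allow", 0   # default: everything allowed
    for directive, pattern in rules:
        if pattern and path.startswith(pattern) and len(pattern) > best_len:
            best_directive, best_len = directive, len(pattern)
    return best_directive == "allow"

rules = [("disallow", "/api/"), ("allow", "/api/public/")]
print(is_allowed("/api/internal/users", rules))  # False
print(is_allowed("/api/public/status", rules))   # True
print(is_allowed("/blog/post", rules))           # True (no rule matches)
```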

Multiple rule groups:

# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /private/

# Googlebot gets special access
User-agent: Googlebot
Disallow: /admin/
Allow: /private/google-partner/

# Block aggressive SEO bots entirely
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

WordPress robots.txt

WordPress sites have specific directories and files that should typically be blocked from crawlers to save crawl budget and prevent indexing of admin/utility pages:

User-agent: *
# Block WordPress admin and login
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

# Block WordPress includes
Disallow: /wp-includes/

# Block XML-RPC (security best practice)
Disallow: /xmlrpc.php

# Block trackbacks and pingbacks
Disallow: /trackback/
Disallow: /*/trackback/

# Block comment feeds
Disallow: /*/feed/
Disallow: /*/comments/

# Block search results pages
Disallow: /?s=
Disallow: /search/

# Block author archives (optional)
Disallow: /author/

# Block tag pages with thin content (optional)
Disallow: /tag/

# Allow all media uploads
Allow: /wp-content/uploads/

Sitemap: https://example.com/sitemap_index.xml

Tip: Do not block /wp-content/uploads/ — that is where your media files live, and you want them indexed. Also, never block your CSS/JS files as Google needs them for rendering.

Next.js / React SPA robots.txt

Modern JavaScript frameworks serve static assets and API routes that should not be indexed. Here is a recommended robots.txt for Next.js applications:

User-agent: *
# Block Next.js internal routes
Disallow: /_next/
Disallow: /api/

# Block internal utility pages
Disallow: /404
Disallow: /500

# Block query parameter variations
Disallow: /*?*

# Allow static assets that search engines need
Allow: /_next/static/
Allow: /_next/image/

# Allow public API endpoints (if any)
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml

In Next.js 13+ (App Router), you can generate robots.txt programmatically by creating an app/robots.ts file whose default export returns a robots metadata object (MetadataRoute.Robots). For static sites, place robots.txt in the public/ directory.

E-commerce robots.txt

E-commerce sites need careful robots.txt configuration to prevent crawling of user accounts, checkout flows, and faceted navigation that creates duplicate content:

User-agent: *
# Block user account pages
Disallow: /account/
Disallow: /my-account/
Disallow: /login/
Disallow: /register/
Disallow: /password-reset/

# Block checkout and cart
Disallow: /cart/
Disallow: /checkout/
Disallow: /order-confirmation/

# Block wishlist and compare
Disallow: /wishlist/
Disallow: /compare/

# Block internal search results
Disallow: /search/
Disallow: /*?q=
Disallow: /*?search=

# Block faceted navigation (duplicate content)
Disallow: /*?sort=
Disallow: /*?order=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?price=
Disallow: /*?page=

# Block review/rating sort variations
Disallow: /*?rating=

# Allow product and category pages
Allow: /products/
Allow: /category/
Allow: /collections/

Sitemap: https://example.com/sitemap.xml

Warning: Do not block product pages or category pages — those are your money pages. Only block utility paths, user-specific pages, and duplicate content generators like sort/filter parameters.

Block AI Crawlers

With the rise of large language models, many site owners want to prevent AI training crawlers from scraping their content. Here are the known AI crawler user agents and how to block them:

Known AI crawler user agents:

  • GPTBot — OpenAI's crawler for training data
  • ChatGPT-User — OpenAI's user agent for ChatGPT browsing requests
  • Google-Extended — Google's control token for AI training (Gemini); it does not affect Google Search crawling
  • anthropic-ai — user agent associated with Anthropic's training data collection
  • ClaudeBot — Anthropic's web crawler
  • CCBot — Common Crawl's bot (its corpus is used by many AI companies)
  • Bytespider — ByteDance's crawler
  • Amazonbot — Amazon's crawler

# Block all known AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

# Still allow regular search engine crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Note: Blocking AI crawlers in robots.txt does not retroactively remove content already collected. It only prevents future crawling. Also, new AI crawlers appear regularly, so review your robots.txt periodically.
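Because the block list needs periodic updates, it can help to generate it from a single array rather than edit the file by hand. A sketch (block_ai_bots is a hypothetical helper; the bot names are the ones listed above):

```python
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai",
    "ClaudeBot", "CCBot", "Bytespider", "Amazonbot",
]

def block_ai_bots(bots, sitemap_url=None):
    """Emit a robots.txt fragment with one Disallow-all group per bot."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    if sitemap_url:
        groups.append(f"Sitemap: {sitemap_url}")
    return "\n\n".join(groups) + "\n"

print(block_ai_bots(AI_BOTS, "https://example.com/sitemap.xml"))
```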

Sitemap Directive

The Sitemap directive tells crawlers where to find your XML sitemap. This is especially useful because it does not require a specific User-agent block — it applies globally:

# Single sitemap
Sitemap: https://example.com/sitemap.xml

# Multiple sitemaps
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

# Sitemap index file
Sitemap: https://example.com/sitemap_index.xml

Sitemap directive rules:

  • Must use a full absolute URL (including https://).
  • Can list multiple Sitemap directives for multiple sitemaps.
  • Can be placed anywhere in the file (not tied to a User-agent group).
  • Supports sitemap index files that reference other sitemaps.

Benefit: Even if Google discovers your sitemap through Search Console, including it in robots.txt ensures all crawlers can find it automatically.
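Since Sitemap lines can sit anywhere in the file, extracting them is a simple line scan. A sketch (extract_sitemaps is a hypothetical helper; it assumes sitemap URLs contain no # character):

```python
def extract_sitemaps(robots_txt):
    """Collect Sitemap URLs from robots.txt text.

    The field name is case-insensitive and the directive may appear
    anywhere in the file, so every line is scanned.
    """
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls

example = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
"""
print(extract_sitemaps(example))
# ['https://example.com/sitemap-pages.xml', 'https://example.com/sitemap-posts.xml']
```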

Testing & Validation

Always test your robots.txt before deploying to production. A single typo can accidentally block your entire site from search engines.

  • Google Search Console — the robots.txt report shows how Google fetched and parsed your file and whether specific URLs are blocked.
  • curl command — quickly check if your robots.txt is accessible:
  • Online validators — tools like Merkle's robots.txt analyzer.

# Check if robots.txt is accessible
curl -I https://example.com/robots.txt

# View the full content
curl https://example.com/robots.txt

# Check what Googlebot sees (simulate Googlebot)
curl -A "Googlebot" https://example.com/robots.txt

# Test a specific URL against robots.txt (Python)
pip install robotexclusionrulesparser
python -c "
import robotexclusionrulesparser as rerp
rp = rerp.RobotExclusionRulesParser()
rp.fetch('https://example.com/robots.txt')
print(rp.is_allowed('Googlebot', '/private/page'))
"

Common mistakes to avoid:

  • Placing robots.txt in a subdirectory instead of the domain root.
  • Using relative URLs in the Sitemap directive (must be absolute).
  • Blocking CSS/JS files that search engines need for rendering.
  • Forgetting that Disallow: / blocks EVERYTHING including your homepage.
  • Not testing after changes — always verify with Google Search Console.
  • Using incorrect line endings or BOM characters (use UTF-8 without BOM).
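Several of these mistakes can be caught automatically. A minimal lint sketch (lint_robots is a hypothetical helper, and the Disallow: / warning is informational, since blocking everything is sometimes intentional):

```python
def lint_robots(text):
    """Flag common robots.txt mistakes. Minimal sketch, not exhaustive."""
    problems = []
    if text.startswith("\ufeff"):
        problems.append("file starts with a BOM; save as UTF-8 without BOM")
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            url = line.split(":", 1)[1].strip()
            if not url.startswith(("http://", "https://")):
                problems.append(f"line {n}: Sitemap URL must be absolute")
        if line.lower().replace(" ", "") == "disallow:/":
            problems.append(f"line {n}: 'Disallow: /' blocks the entire site")
    return problems

print(lint_robots("User-agent: *\nDisallow: /\nSitemap: /sitemap.xml"))
# ["line 2: 'Disallow: /' blocks the entire site", 'line 3: Sitemap URL must be absolute']
```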

robots.txt vs meta robots vs X-Robots-Tag

There are three main ways to control crawler behavior. Each serves a different purpose:

  • robots.txt — Scope: entire paths/directories. Location: /robots.txt at the domain root. Prevents crawling: yes. Prevents indexing: no (URLs can still appear in search results). Granularity: URL path level.
  • meta robots — Scope: individual pages. Location: HTML <head> tag. Prevents crawling: no (the page must be crawled to see the tag). Prevents indexing: yes (noindex). Granularity: per page.
  • X-Robots-Tag — Scope: any resource (PDF, image, etc.). Location: HTTP response header. Prevents crawling: no (the resource must be fetched to see the header). Prevents indexing: yes (noindex). Granularity: per resource.

Tip: To truly prevent a page from appearing in search results, use meta robots noindex or X-Robots-Tag noindex. robots.txt alone does NOT prevent indexing — Google can still index URLs it discovers through links, even if it cannot crawl them.


FAQ

Does robots.txt prevent pages from appearing in Google search results?

No. robots.txt prevents crawling but not indexing. If other sites link to your blocked page, Google may still show the URL in search results (without a snippet). To prevent indexing, use a meta robots noindex tag or X-Robots-Tag: noindex HTTP header instead.

Where should I place the robots.txt file?

The robots.txt file must be placed at the root of your domain: https://example.com/robots.txt. It does not work in subdirectories. For subdomains (e.g., blog.example.com), you need a separate robots.txt at the subdomain root.

Can I use robots.txt to block AI crawlers like GPTBot and ClaudeBot?

Yes. Add User-agent: GPTBot followed by Disallow: / to block OpenAI's crawler. Similarly, use User-agent: ClaudeBot with Disallow: / for Anthropic's crawler. However, this only prevents future crawling and does not remove previously collected data.

What happens if my site has no robots.txt file?

If a crawler receives a 404 response for /robots.txt, it assumes there are no restrictions and will crawl all accessible pages on your site. This is the default behavior defined in the Robots Exclusion Protocol.

What is the difference between Disallow: and Disallow: / in robots.txt?

Disallow: (with an empty value) means nothing is disallowed — crawlers can access everything. Disallow: / (with a forward slash) means the entire site is blocked. This single-character difference is critical and a common source of misconfiguration.
