Does FirecrawlAgent respect robots.txt?

Yes. FirecrawlAgent respects Disallow and Allow directives in robots.txt using the token FirecrawlAgent. Crawl-delay support is not documented, so you may need server-side rate limiting for fine-grained control.

Does Firecrawl use my content for AI training?

This is unclear. Firecrawl's robots.txt signals ai-train=yes, but the company doesn't explicitly state whether it trains its own models on crawled content. Content is delivered to API consumers as markdown, so downstream usage depends on each integrator.

How can I identify FirecrawlAgent in my server logs?

Look for the FirecrawlAgent token in user-agent strings. Identification can be difficult because Firecrawl doesn't publish a single stable user-agent string or IP ranges. If you're unsure, contact Firecrawl support for verification.
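A minimal sketch of the log check described above, in Python. It assumes your access log uses the common "combined" format, where the user-agent is the last quoted field on each line. The sample lines, IP addresses, and user-agent strings are made up for illustration; Firecrawl does not publish a canonical string, so the match is a case-insensitive search for the token only.

```python
import re
from collections import Counter

# Case-insensitive match on the token; the full user-agent string is not
# documented as stable, so we do not match anything more specific.
FIRECRAWL_TOKEN = re.compile(r"firecrawlagent", re.IGNORECASE)

# Assumption: combined log format, user-agent is the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<ua>[^"]*)"\s*$')


def firecrawl_hits(lines):
    """Count requests per client IP whose user-agent contains the token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and FIRECRAWL_TOKEN.search(match.group("ua")):
            hits[match.group("ip")] += 1
    return hits


# Hypothetical sample lines (documentation IP ranges, invented user-agents).
sample = [
    '203.0.113.7 - - [10/Oct/2025:13:55:36 +0000] "GET /docs HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; FirecrawlAgent)"',
    '198.51.100.2 - - [10/Oct/2025:13:55:40 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for ip, count in firecrawl_hits(sample).items():
        print(ip, count)
```

Because the IPs may belong to shared or proxy pools, treat the per-IP counts as a rough view of request volume, not as a list of addresses to block.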

Can I block FirecrawlAgent by IP address?

IP-based blocking is unreliable because Firecrawl may use multi-tenant or proxy IP pools. No official IP range list is published. Use robots.txt as your primary blocking method.

Should I allow FirecrawlAgent on my site?

If you want your content available to AI agents and RAG pipelines built on Firecrawl's platform, yes. Source URLs are passed through to integrators, so there's potential for referral traffic. Block it if you have concerns about unclear downstream data usage.

Does FirecrawlAgent render JavaScript?

Yes. FirecrawlAgent supports full JavaScript rendering, so it can access content generated by client-side frameworks. This means dynamically loaded content is not hidden from the crawler.

FirecrawlAgent

AI User Initiated

Firecrawl agent fetching pages during user-initiated web scraping tasks.

What does FirecrawlAgent do?

FirecrawlAgent is the crawler behind Firecrawl, an infrastructure API that scrapes, searches, and interacts with live web pages to produce LLM-ready data. It powers AI agent workflows, RAG pipelines, and automated extraction tasks for Firecrawl's customers. Whether your content drives referral traffic depends on how downstream integrators surface source URLs and citations in their products.

Should I allow and optimize for FirecrawlAgent to drive organic growth?

FirecrawlAgent feeds content into AI agents and RAG pipelines built by Firecrawl's customers. Source URLs and full-page markdown are included in API responses, so integrators can surface clickable citations linking back to your site. The actual referral traffic depends on how each integrator builds their product. Allowing FirecrawlAgent means your content is available to a growing ecosystem of AI-powered applications, which can increase your visibility across multiple downstream products.

Here's how to optimize for FirecrawlAgent:

Allow FirecrawlAgent in your robots.txt to remain visible in Firecrawl-powered applications
Use clean semantic HTML since Firecrawl converts pages to markdown for LLM consumption
Include descriptive title tags and meta descriptions to improve content extraction quality
Add structured data (JSON-LD) to help automated extraction identify key entities and relationships
Ensure your most valuable content renders in the initial page load, not behind lazy-loading triggers
Keep canonical URLs consistent so downstream integrators link to the correct source
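The structured-data and canonical-URL tips above can be sketched as a small Python helper that builds a JSON-LD script tag. The function name, the choice of the schema.org Article type, and the field values are assumptions for illustration; use whichever schema.org type fits your pages.

```python
import json


def article_json_ld(headline, canonical_url, description):
    """Build a JSON-LD <script> tag for a page (hypothetical helper)."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "description": description,
        # Use the same canonical URL here as in <link rel="canonical">.
        "url": canonical_url,
    }
    # Escape "</" so the payload cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    print(article_json_ld(
        "Example guide",
        "https://example.com/guide",
        "A short description of the page.",
    ))
```

Emitting the tag in the server-rendered HTML, not from a late-running script, keeps it in line with the advice to have key content present in the initial page load.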

Data Usage & Training

It's unclear whether Firecrawl uses crawled content to train its own models. Firecrawl's robots.txt includes a Content-Signal directive with ai-train=yes, search=yes, ai-input=yes, but the company doesn't explicitly state whether it trains on collected data or simply passes it through to customers. Crawled content is delivered to Firecrawl's API consumers as markdown with source URLs, so downstream usage varies by integrator.
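To inspect signals like these yourself, the following Python sketch pulls Content-Signal values out of a robots.txt body. The exact line layout in the example string is an assumption (a single `Content-Signal:` line with comma-separated key=value pairs); only the three signal values come from the text above. The parser also ignores which User-agent group a line belongs to, so it is a quick inspection aid and not a full robots.txt parser.

```python
def parse_content_signals(robots_txt):
    """Collect key=value pairs from every Content-Signal line."""
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = val.strip().lower()
    return signals


# Assumed layout, for illustration only.
example = (
    "User-agent: *\n"
    "Content-Signal: ai-train=yes, search=yes, ai-input=yes\n"
    "Allow: /\n"
)

if __name__ == "__main__":
    print(parse_content_signals(example))
```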

How FirecrawlAgent Accesses Content

Here's how FirecrawlAgent accesses your site and understands your content:

Fetches pages via HTTP requests with full JavaScript rendering
Supports prompt-driven extraction and automated crawl tasks
Operates on-demand through API endpoints (scrape, crawl, agent)
May route requests through multi-tenant or proxy IP pools
User-agent string is not documented as a single stable value

Crawling is primarily on-demand and user-initiated through Firecrawl's API endpoints. Scheduled monitoring is also supported through the /monitor feature, which can produce recurring crawls.

How to Block or Control FirecrawlAgent

To block FirecrawlAgent via robots.txt:

User-agent: FirecrawlAgent
Disallow: /

IP-based blocking is unreliable because Firecrawl may route requests through multi-tenant or proxy IP pools. No published IP range list is available. If robots.txt rules aren't being respected, contact Firecrawl support through the channels listed at docs.firecrawl.dev.
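As a sanity check on the block rule above, this Python sketch uses the standard library's `urllib.robotparser` to confirm that a robots.txt body disallows the FirecrawlAgent token while leaving other agents alone. The `User-agent: *` group, the example.com URL, and the `SomeOtherBot` name are illustrative assumptions. The check only shows how Python's parser reads the rules; it does not prove how Firecrawl itself interprets them.

```python
from urllib.robotparser import RobotFileParser

# Rules under test: block FirecrawlAgent, allow everyone else.
ROBOTS_TXT = """\
User-agent: FirecrawlAgent
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URL on your own site.
url = "https://example.com/pricing"

blocked = not parser.can_fetch("FirecrawlAgent", url)
others_allowed = parser.can_fetch("SomeOtherBot", url)

if __name__ == "__main__":
    print("FirecrawlAgent blocked:", blocked)
    print("Other agents allowed:", others_allowed)
```

Running the same check against your deployed file (read it from disk or fetch it, then pass the lines to `parse`) can catch ordering or typo mistakes before you rely on the rule.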

Common Issues & Troubleshooting

Watch out for these common problems when working with FirecrawlAgent:

Firecrawl requests may be hard to identify in logs because the user-agent string is not documented as a single stable value
IP-based blocking is unreliable due to multi-tenant and proxy IP pools
Cloudflare and similar bot protection services may inconsistently block or allow requests
Content behind login walls or CAPTCHAs is inaccessible to the crawler
Crawl-delay directives are not documented as supported, so rate limiting may require server-side controls
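Since Crawl-delay is not documented as supported, one option for the server-side control mentioned above is a token bucket keyed on the user-agent token. The Python sketch below is framework-independent; the class and function names, the rate, and the burst capacity are all illustrative assumptions. It keys on the user-agent and not on IP, because the IP pools may be shared.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One shared bucket for all requests carrying the token (assumed policy).
_firecrawl_bucket = TokenBucket(rate_per_sec=1.0, capacity=5)


def should_serve(user_agent, bucket=_firecrawl_bucket):
    """Return False when a FirecrawlAgent request exceeds the budget."""
    if "firecrawlagent" not in user_agent.lower():
        return True  # other clients are not throttled by this sketch
    return bucket.allow()
```

In a request handler, a `False` result would typically translate to an HTTP 429 response, ideally with a Retry-After header; whether FirecrawlAgent backs off in response is not documented.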