Does ApifyWebsiteContentCrawler respect robots.txt?

Only when the user running the Actor enables the respectRobotsTxtFile input setting. This is not enabled by default. If you rely on robots.txt for blocking, be aware that compliance depends on how each individual user configures their crawl.
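
To see what a compliant run (one with respectRobotsTxtFile enabled) would be allowed to fetch, you can evaluate your robots.txt rules locally. The following is a minimal sketch using Python's standard urllib.robotparser module. The example.com URL, the sample rules, and the helper name is_allowed are illustrative assumptions, and the result says nothing about runs where the setting is left off.

```python
from urllib.robotparser import RobotFileParser

# Token taken from this page; treat it as an assumption, because the exact
# user-agent string is not publicly documented.
CRAWLER_TOKEN = "apifywebsitecontentcrawler"

# Hypothetical robots.txt: block the crawler, allow everyone else.
EXAMPLE_RULES = """\
User-agent: apifywebsitecontentcrawler
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, url: str, agent: str = CRAWLER_TOKEN) -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed(EXAMPLE_RULES, "https://example.com/docs"))
    print(is_allowed(EXAMPLE_RULES, "https://example.com/docs", agent="SomeOtherBot"))
```

With these sample rules, the first check reports the crawler as disallowed and the second reports a different agent as allowed.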

Is my content used to train AI models?

Apify's documentation does not clarify whether crawled content is used to train Apify-owned models. The primary purpose is delivering cleaned content to customers for their own LLM workflows. Contact hello@apify.com if you need a definitive answer.

Can I block ApifyWebsiteContentCrawler by IP address?

IP-based blocking is difficult because the Actor can route requests through Apify's residential and datacenter proxy pools. Published IP ranges are not available. Server-level user-agent blocking or contacting Apify directly are more reliable options.

What user-agent string does ApifyWebsiteContentCrawler use?

The exact user-agent string is not publicly documented. It may include the substring "apifywebsitecontentcrawler" but can also be customized per run. Check your server logs for this substring to identify requests.
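
The log check described above can be sketched in a few lines of Python. This assumes a text access log in which the user-agent appears somewhere on each line. The sample lines and their user-agent values are invented for illustration and are not observed Apify traffic. Matching is case-insensitive because the exact string is not documented.

```python
import re
from typing import Iterable, Iterator

# Substring named above, matched without regard to case.
UA_PATTERN = re.compile(r"apifywebsitecontentcrawler", re.IGNORECASE)

# Made-up sample lines in combined log format (hypothetical values only).
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0 (compatible; ApifyWebsiteContentCrawler)"',
    '198.51.100.4 - - [01/Jan/2025:10:00:05 +0000] "GET /about HTTP/1.1" 200 734 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def matching_requests(log_lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines that contain the crawler substring."""
    for line in log_lines:
        if UA_PATTERN.search(line):
            yield line.rstrip("\n")


if __name__ == "__main__":
    for hit in matching_requests(SAMPLE_LOG):
        print(hit)
```

To scan a real file, pass an open file object instead of SAMPLE_LOG. Because the user-agent can be customized per run, an empty result does not prove the Actor never visited.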

Will allowing this crawler drive traffic to my site?

Potentially, but indirectly. The Actor outputs include source URLs and metadata. If the downstream application (an AI assistant, search UI, or chatbot) displays source links, users may click through to your site. The referral value depends entirely on what the end product does with the data.

Does ApifyWebsiteContentCrawler render JavaScript?

Yes. It supports full browser rendering via Playwright for JavaScript-heavy sites. It also has a raw HTTP mode for faster crawling of static pages, though this mode will miss dynamically loaded content.

ApifyWebsiteContentCrawler

AI User Initiated

Apify actor that crawls websites and extracts text content for AI models, LLM apps, and RAG pipelines.

What does ApifyWebsiteContentCrawler do?

ApifyWebsiteContentCrawler deep-crawls websites and extracts cleaned text, Markdown, and HTML optimized for AI and LLM consumption. It powers Apify platform integrations and feeds into downstream AI workflows like RAG pipelines, vector databases, LangChain, and LlamaIndex. Referral traffic depends on the downstream application: if the end product includes source URLs or citation links, your site can receive clicks from AI-powered search UIs and assistants built on the crawled data.

Should I allow and optimize for ApifyWebsiteContentCrawler to drive organic growth?

ApifyWebsiteContentCrawler doesn't directly surface your content to end users. The value depends on what downstream applications do with the crawled data. If someone builds a RAG-powered search assistant or AI chatbot using this Actor, and that application includes source URLs in its output, your site can receive referral traffic. Allowing the crawler ensures your content is available for these AI-powered products. Blocking it removes your content from any workflows built on top of it.

Here's how to optimize for ApifyWebsiteContentCrawler:

Allow apifywebsitecontentcrawler in your robots.txt if you want your content included in Apify-powered AI workflows
Use semantic HTML and clean markup so the text extraction produces high-quality output
Include an XML sitemap to help the crawler discover all relevant pages
Ensure your pages load quickly, even with JavaScript rendering enabled
Add descriptive meta tags and structured data (JSON-LD) to improve content extraction quality
Keep important content in the main body, not hidden behind modals or lazy-loaded elements
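
As one illustration of the structured-data tip, the sketch below builds a JSON-LD script tag for an article page using only Python's standard library. The function name article_json_ld and the chosen schema.org fields are assumptions made for this example. Whether a given run of the Actor actually reads JSON-LD depends on how it is configured.

```python
import json


def article_json_ld(headline: str, url: str, description: str, date_published: str) -> str:
    """Return a <script> tag holding schema.org Article metadata as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "description": description,
        "datePublished": date_published,
    }
    # Escape "</" so a value can never close the surrounding <script> element;
    # "<\/" is still valid JSON.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    print(article_json_ld(
        "Example headline",
        "https://example.com/article",
        "A short summary of the page.",
        "2025-01-01",
    ))
```

Place the returned tag in the page head so it is present in the initial HTML and does not depend on JavaScript.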

Data Usage & Training

It is unclear from Apify's public documentation whether content crawled by this Actor is used to train Apify-owned AI models. The documentation focuses on delivering cleaned content to customers for use in their own LLM workflows. If this concerns you, contact Apify directly at hello@apify.com or https://apify.com/contact for clarification.

How ApifyWebsiteContentCrawler Accesses Content

Here's how ApifyWebsiteContentCrawler accesses your site and understands your content:

Supports full browser rendering (Playwright) for JavaScript-heavy sites
Also offers a raw HTTP mode for faster crawling of static pages
Can use sitemaps to discover URLs
Supports login cookies and session-based authentication
Routes requests through Apify Proxy (residential and datacenter pools)
Outputs structured records including text, Markdown, raw HTML, and screenshots

Crawling is on-demand only: Actors run when started by a user or triggered by a schedule or integration, not as a continuous crawl.

How to Block or Control ApifyWebsiteContentCrawler

To block this crawler via robots.txt, add:

User-agent: apifywebsitecontentcrawler
Disallow: /

This only works if the person running the Actor has enabled the respectRobotsTxtFile input setting, which is not on by default. You can also attempt to block the user-agent string containing "apifywebsitecontentcrawler" at the server level. IP-based blocking is less effective because requests may route through Apify's residential and datacenter proxy pools. For a formal opt-out or whitelist request, contact Apify at hello@apify.com or https://apify.com/contact.
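
Server-level user-agent blocking can take many forms depending on your stack. As one hedged illustration, the sketch below is a small WSGI middleware in Python that answers 403 when the User-Agent header contains the substring. The names block_crawler and demo_app are hypothetical, and demo_app only stands in for a real application.

```python
from typing import Callable, Iterable

# Substring from this page, compared in lowercase.
BLOCKED_SUBSTRING = "apifywebsitecontentcrawler"


def block_crawler(app: Callable) -> Callable:
    """Wrap a WSGI app and return 403 when the User-Agent contains the substring."""
    def middleware(environ: dict, start_response: Callable) -> Iterable[bytes]:
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_SUBSTRING in user_agent.lower():
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Any other client is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def demo_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Hypothetical stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


application = block_crawler(demo_app)
```

This only stops requests that announce themselves with that substring. A run with a customized user-agent passes straight through, which is why the approach is described above as an attempt and not a guarantee.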

Common Issues & Troubleshooting

Watch out for these common problems when working with ApifyWebsiteContentCrawler:

robots.txt is only respected when the Actor's respectRobotsTxtFile input is explicitly enabled by the user running the crawl
Requests routed through Apify Proxy (residential IPs) make IP-based blocking unreliable
Anti-bot protections like CAPTCHAs and WAFs may block requests, but the Actor can switch to full browser mode to bypass some of these
Raw HTTP mode fails on JavaScript-heavy pages, so content may be incomplete if browser rendering is not enabled
The user-agent string is not always consistent and may be customized per run, making UA-based blocking less reliable
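
To judge whether a page would come out incomplete in raw HTTP mode, you can check whether a key phrase is present in the HTML your server sends before any JavaScript runs. The sketch below uses Python's standard html.parser module. The helper names and the two sample documents are assumptions for illustration, and this is not how the Actor itself extracts text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text a non-rendering client could read from raw HTML."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def phrase_in_raw_html(raw_html: str, phrase: str) -> bool:
    """Case-insensitive check that `phrase` is in the unrendered page text."""
    return phrase.lower() in visible_text(raw_html).lower()


# Hypothetical pages: one server-rendered, one that fills in content via script.
STATIC_PAGE = "<html><body><main><h1>Pricing</h1><p>Plans start at $10.</p></main></body></html>"
DYNAMIC_PAGE = '<html><body><div id="root"></div><script>render("Plans start at $10.")</script></body></html>'
```

If the phrase is missing from the raw HTML of a real page, a crawl without browser rendering is likely to miss it as well.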

Quick Reference

Platform

Apify

Agent Category

AI User Initiated

Growth Value

Official Documentation

apify.com/apify/website-content-crawler

User Agent String

apifywebsitecontentcrawler

robots.txt Entry

User-agent: apifywebsitecontentcrawler
Disallow: /

Similar Agents & Bots

ChatGPT-User

OpenAI browsing agent fetching pages at user request.

Claude-User

User-initiated fetches triggered by Claude sessions.

DuckAssistBot

DuckDuckGo assistant fetching content for answers.

ExaBot

Exa agent fetching pages at user request for search and research.
