
ApifyWebsiteContentCrawler

Apify actor that crawls websites and extracts text content for AI models, LLM apps, and RAG pipelines.

What does ApifyWebsiteContentCrawler do?

ApifyWebsiteContentCrawler deep-crawls websites and extracts cleaned text, Markdown, and HTML optimized for AI and LLM consumption. It powers Apify platform integrations and feeds into downstream AI workflows like RAG pipelines, vector databases, LangChain, and LlamaIndex. Referral traffic depends on the downstream application: if the end product includes source URLs or citation links, your site can receive clicks from AI-powered search UIs and assistants built on the crawled data.

Should I allow and optimize for ApifyWebsiteContentCrawler to drive organic growth?

ApifyWebsiteContentCrawler doesn't directly surface your content to end users. The value depends on what downstream applications do with the crawled data. If someone builds a RAG-powered search assistant or AI chatbot using this Actor, and that application includes source URLs in its output, your site can receive referral traffic. Allowing the crawler ensures your content is available for these AI-powered products. Blocking it removes your content from any workflows built on top of it.

Here's how to optimize for ApifyWebsiteContentCrawler:

  • Allow apifywebsitecontentcrawler in your robots.txt if you want your content included in Apify-powered AI workflows
  • Use semantic HTML and clean markup so the text extraction produces high-quality output
  • Include an XML sitemap to help the crawler discover all relevant pages
  • Ensure your pages load quickly, even with JavaScript rendering enabled
  • Add descriptive meta tags and structured data (JSON-LD) to improve content extraction quality
  • Keep important content in the main body, not hidden behind modals or lazy-loaded elements
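Before relying on the first item in the list above, you can check what your robots.txt actually says about this crawler. The following is a minimal sketch using Python's standard urllib.robotparser. The sample rule sets, the example.com URL, and the helper name is_allowed are illustrative assumptions, and the check only tells you what a robots.txt-respecting run would do.

```python
from urllib.robotparser import RobotFileParser

# User-agent token as listed on this page.
AGENT = "apifywebsitecontentcrawler"


def is_allowed(robots_txt: str, url: str, agent: str = AGENT) -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rule set that explicitly allows the crawler.
ALLOW_RULES = """\
User-agent: apifywebsitecontentcrawler
Allow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical rule set that blocks the crawler site-wide.
BLOCK_RULES = """\
User-agent: apifywebsitecontentcrawler
Disallow: /
"""

if __name__ == "__main__":
    page = "https://example.com/docs/page"  # placeholder URL
    print("allow rules:", is_allowed(ALLOW_RULES, page))
    print("block rules:", is_allowed(BLOCK_RULES, page))
```

To test a live site, read its /robots.txt and pass the text to is_allowed in place of the sample rules.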

Data Usage & Training

It is unclear from Apify's public documentation whether content crawled by this Actor is used to train Apify-owned AI models. The documentation focuses on delivering cleaned content to customers for use in their own LLM workflows. If this concerns you, ask Apify directly via https://apify.com/contact for clarification.

How ApifyWebsiteContentCrawler Accesses Content

Here's how ApifyWebsiteContentCrawler accesses your site and understands your content:

  • Supports full browser rendering (Playwright) for JavaScript-heavy sites
  • Also offers a raw HTTP mode for faster crawling of static pages
  • Can use sitemaps to discover URLs
  • Supports login cookies and session-based authentication
  • Routes requests through Apify Proxy (residential and datacenter pools)
  • Outputs structured records including text, Markdown, raw HTML, and screenshots

Crawling is on-demand only: an Actor runs when a user starts it or when a schedule or integration triggers it, not as a continuous crawl.

How to Block or Control ApifyWebsiteContentCrawler

To block this crawler via robots.txt, add:

User-agent: apifywebsitecontentcrawler
Disallow: /

This only works if the person running the Actor has enabled the respectRobotsTxtFile input setting, which is not on by default. You can also try blocking requests whose user-agent string contains "apifywebsitecontentcrawler" at the server level. IP-based blocking is less effective because requests may route through Apify's residential and datacenter proxy pools. For a formal opt-out or whitelist request, contact Apify at https://apify.com/contact.
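Server-level user-agent blocking can be sketched as a small WSGI middleware. This is an illustration under stated assumptions, not a recommended production setup: the names block_crawler, demo_app, and status_for are hypothetical, and the sample user-agent strings in the demo are invented for the example. Because the string can be customized per run, a filter like this catches only requests that still carry the default token.

```python
# Token to look for in the User-Agent header (as listed on this page).
BLOCKED_TOKEN = "apifywebsitecontentcrawler"


def block_crawler(app, token=BLOCKED_TOKEN):
    """Wrap a WSGI app and answer 403 when the User-Agent contains `token`."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if token in user_agent:
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Any other client is passed through to the wrapped app.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only for this illustration."""
    body = b"Hello"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(user_agent):
    """Send one fake request through the middleware; return the status line."""
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["status"] = status

    environ = {
        "REQUEST_METHOD": "GET",
        "PATH_INFO": "/",
        "HTTP_USER_AGENT": user_agent,
    }
    block_crawler(demo_app)(environ, start_response)
    return seen["status"]


if __name__ == "__main__":
    # Both user-agent strings below are made up for the demo.
    print(status_for("Mozilla/5.0 (compatible; ApifyWebsiteContentCrawler)"))
    print(status_for("Mozilla/5.0 (X11; Linux x86_64)"))
```

The same substring match can be expressed in a reverse proxy or WAF rule; the middleware form is used here only because it is self-contained.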

Common Issues & Troubleshooting

Watch out for these common problems when working with ApifyWebsiteContentCrawler:

  • robots.txt is only respected when the Actor's respectRobotsTxtFile input is explicitly enabled by the user running the crawl
  • Requests routed through Apify Proxy (residential IPs) make IP-based blocking unreliable
  • Anti-bot protections like CAPTCHAs and WAFs may block requests, but the Actor can switch to full browser mode to bypass some of these
  • Raw HTTP mode fails on JavaScript-heavy pages, so content may be incomplete if browser rendering is not enabled
  • The user-agent string is not always consistent and may be customized per run, making UA-based blocking less reliable
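To gauge whether a page would come out incomplete in raw HTTP mode, you can check that its key phrases appear in the static HTML without running any JavaScript. The sketch below uses Python's standard html.parser; the class and function names and both sample pages are hypothetical, and it only approximates what a real text extractor does.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Return the text visible in the HTML without executing JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html: str, phrases: list[str]) -> list[str]:
    """List the phrases that do not appear in the static text."""
    text = static_text(html).lower()
    return [p for p in phrases if p.lower() not in text]


# Hypothetical pages: one server-rendered, one that renders only via JavaScript.
STATIC_PAGE = "<html><body><main><h1>Pricing</h1><p>Plans start at $10.</p></main></body></html>"
JS_PAGE = "<html><body><div id='root'></div><script>render('Pricing: Plans start at $10.')</script></body></html>"

if __name__ == "__main__":
    wanted = ["Pricing", "Plans start"]
    print(missing_phrases(STATIC_PAGE, wanted))
    print(missing_phrases(JS_PAGE, wanted))
```

If phrases are missing from the static HTML of your own pages, that content is reachable only when the crawl uses browser rendering.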

Quick Reference

Platform
Agent Category
Growth Value
User Agent String: apifywebsitecontentcrawler
robots.txt Entry:
User-agent: apifywebsitecontentcrawler
Disallow: /
