Web & Communication Tool

The Web Crawler Tool

Go beyond a single result: crawl multiple pages within a domain and extract their text, titles, and URLs — with polite delays and page limits — so an agent can ingest a whole site’s content, on infrastructure you control.

Multi-page: Whole sections, not one page

Polite: Delays and per-site limits

Clean: Extracted text, titles, URLs

100%: On-prem crawling

The Depth Problem

One page rarely holds the whole answer

A single search result is a starting point, but the real content — a documentation site, a knowledge base, a competitor’s product pages — spans many pages. Reading them one by one doesn’t scale, and naive scraping gets you blocked.

Content spans pages

The full picture lives across a whole section of a site.

Manual page-by-page

Visiting each page by hand is slow and incomplete.

Naive scraping gets blocked

Hammering a site without delays is rude, and it quickly gets the crawler rate-limited or blocked.

Messy HTML

Raw pages are full of markup an agent has to wade through.

How the Tool Works

A site’s content, ingested cleanly

Breadth

Crawl across the site

Many pages, one call.

The tool follows links within a domain and extracts the text, titles, and URLs of multiple pages, so an agent can ingest a whole documentation set or product section rather than a single page.

Multi-page within a domain
Text, titles, and URLs
Per-site page limits
Up to five starting URLs
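
How the tool decides which links belong to a domain is not documented here, so the following is only a minimal sketch of one plausible rule: resolve each link against the current page and keep it only if the hostname matches exactly. The function name `same_domain_links` and the exact-host comparison are assumptions made for illustration, not the tool's implementation.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def same_domain_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep only http(s) links on the same host.

    Hypothetical helper: the exact-hostname rule is an assumption.
    """
    host = urlparse(page_url).netloc.lower()
    kept: list[str] = []
    seen: set[str] = set()
    for href in hrefs:
        # Turn relative links into absolute ones and drop any #fragment.
        absolute = urldefrag(urljoin(page_url, href)).url
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc.lower() != host:
            continue  # skip links that leave the starting domain
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

A stricter or looser rule (for example, treating subdomains as the same site) would change only the hostname comparison.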

Site

Multi-Page

Whole sections

Pages, Text, Titles, URLs

Politeness

Crawls responsibly

Delays and limits built in.

Configurable delays between requests and a cap on pages per site mean the crawler behaves politely and stays within bounds — getting the content without getting blocked.
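
The combination of a randomized pause and a page cap can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the function `polite_fetch_all`, the injected `fetch` and `sleep` callables, and the choice of a uniform random delay between the minimum and maximum are all hypothetical.

```python
import random
import time
from typing import Callable, Iterable


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_pages: int = 10,
    delay_min: float = 1.0,
    delay_max: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch at most max_pages URLs, pausing a random interval between requests."""
    if delay_min > delay_max:
        raise ValueError("delay_min must not exceed delay_max")
    results: dict[str, str] = {}
    for url in urls:
        if len(results) >= max_pages:
            break  # per-site page cap reached
        if results:
            # Pause before every request except the first one.
            # Assumption: the delay is drawn uniformly from [delay_min, delay_max].
            sleep(random.uniform(delay_min, delay_max))
        results[url] = fetch(url)
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in normal use the default `time.sleep` applies.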

Polite

Rate-Limited

Delays + caps

Delays, Limits, Responsible, Stable

Governance

On-premise crawling

Under your control.

Crawling runs through a controlled tool with audit logging on infrastructure you operate, so external content ingestion is governed rather than ad-hoc.

100%

On-Prem

Governed, logged

On-prem, Governed, Audit log, Controlled

Inputs

Parameters

The web_crawler tool accepts these inputs when an agent calls it. Required inputs are flagged.

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| urls | array | (none) | Required | List of URLs to crawl (1–5). |
| max_pages_per_site | integer | 10 | Optional | Maximum pages to crawl per website (1–20). |
| delay_min | number | 1 | Optional | Minimum delay between requests in seconds (0.5–10). |
| delay_max | number | 3 | Optional | Maximum delay between requests in seconds (1–20). |
| max_text_length | integer | 5000 | Optional | Maximum text length to extract per page (1000–10000). |
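
To make the documented ranges and defaults concrete, here is a small sketch that assembles and validates an argument set for `web_crawler` before an agent submits it. The parameter names, defaults, and ranges come from the list above; the helper `build_crawler_args`, the plain-dictionary shape, and the check that `delay_min` does not exceed `delay_max` are assumptions, since this page does not describe how the call is transmitted or validated.

```python
def build_crawler_args(
    urls,
    max_pages_per_site=10,
    delay_min=1.0,
    delay_max=3.0,
    max_text_length=5000,
):
    """Return a dict of web_crawler inputs after checking the documented ranges."""

    def check(name, value, low, high):
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")

    urls = list(urls)
    check("number of urls", len(urls), 1, 5)
    check("max_pages_per_site", max_pages_per_site, 1, 20)
    check("delay_min", delay_min, 0.5, 10)
    check("delay_max", delay_max, 1, 20)
    check("max_text_length", max_text_length, 1000, 10000)
    # Assumption: the page does not state this rule, but a minimum above
    # the maximum would be contradictory, so the sketch rejects it.
    if delay_min > delay_max:
        raise ValueError("delay_min must not exceed delay_max")

    return {
        "urls": urls,
        "max_pages_per_site": max_pages_per_site,
        "delay_min": delay_min,
        "delay_max": delay_max,
        "max_text_length": max_text_length,
    }
```

For example, `build_crawler_args(["https://docs.example.com/"], max_pages_per_site=20)` would request the largest per-site crawl the documented range allows while leaving the other inputs at their defaults.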

Where it pays back

Where the web crawler pays back

Documentation ingest

Pull a whole docs site into text for retrieval.

Competitive research

Extract a competitor’s product pages for analysis.

Knowledge building

Feed crawled content into a private RAG index.

Content audits

Gather a site’s pages to review coverage.

Monitoring

Re-crawl pages to track changes over time.

Agent research

Let a research agent ingest a site, not just a page.

How VDF AI connects it

Assigned to agents, orchestrated as networks

On VDF AI, an industry’s use cases map to agents, and you assign tools like this one to those agents. Compose multiple agents into a governed, on-premise network.

Industry: Your sector. Finance, healthcare, telecom, government, and more.

Use Case: A job to be done. Concrete workflows the business needs solved.

Agent: A specialized worker. Governed AI agents that execute the use case.

Tool: Web Crawler. The capability you assign to an agent.

Network: Agents, orchestrated. Many use cases and agents, working as one.

ROI Snapshot

What changes after you assign it

Whole-site

Content ingested at depth

Clean

Text without the markup

Polite

No blocking or abuse

100%

Crawled on-prem

FAQ

Questions about the Web Crawler tool

What does the web crawler do?

It crawls multiple pages within a domain and extracts their text, titles, and URLs, with polite delays and a per-site page limit, so an agent can ingest a whole site section rather than a single page.
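
As a rough illustration of what "text, titles, and URLs" can look like for a single page, the sketch below pulls a title and visible text out of raw HTML with Python's standard `html.parser` and truncates the text to a maximum length. The class `PageTextExtractor`, the function `extract_page`, and the result shape are hypothetical; the tool's actual extraction logic and output format are not described on this page.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collect the <title> and visible text, ignoring <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self._chunks: list[str] = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces, then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_page(url: str, html: str, max_text_length: int = 5000) -> dict:
    """Return a hypothetical per-page record: url, title, and truncated text."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "title": parser.title.strip(),
        "text": parser.text()[:max_text_length],
    }
```

Joining chunks with spaces is a simplification: it keeps words from adjacent elements apart, at the cost of occasionally inserting a space around inline tags.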

Does it crawl across domains?

No. Each crawl stays within the domain of its starting URL. The tool accepts up to five starting URLs per call, which keeps crawls focused and responsible.

How does it avoid getting blocked?

Configurable minimum and maximum delays between requests plus a cap on pages per site keep it polite and within bounds.

Is crawling governed?

Yes. It runs through a controlled tool with audit logging on infrastructure you operate.

How is it used by agents?

Research and knowledge agents use it to ingest sites into a private RAG index, often paired with Web Search for discovery and Federated Vector Search for retrieval afterward.

Agents that use it

Assign Web Crawler to these agents

These VDF AI agents can be assigned this tool. Open an agent to see the full toolkit it can run.

Related tools

Tools that work well alongside this one

Keep exploring

Where this tool delivers value

Private Research Assistant

Private Customer Support RAG

Telecommunications

Finance & Banking

Ingest whole sites, not single pages

See the web crawler feed a research agent’s knowledge base — on infrastructure you control.

Assign Web Crawler to these agents

AI Content Planning Agent

AI Image Generation Agent

AI Writing Assistant

AI PR Coordinator

Tools that work well alongside this one

Web Search

Federated Vector Search

RAG Vector Query

Document Generator

Sentiment Analysis
