Web & Communication Tool

The Web Crawler Tool

Go beyond a single result: crawl multiple pages within a domain and extract their text, titles, and URLs — with polite delays and page limits — so an agent can ingest a whole site’s content, on infrastructure you control.

Multi-page: Whole sections, not one page
Polite: Delays and per-site limits
Clean: Extracted text, titles, URLs
100%: On-prem crawling
The Depth Problem

One page rarely holds the whole answer

A single search result is a starting point, but the real content — a documentation site, a knowledge base, a competitor’s product pages — spans many pages. Reading them one by one doesn’t scale, and naive scraping gets you blocked.

01

Content spans pages

The full picture lives across a whole section of a site.

02

Manual page-by-page

Visiting each page by hand is slow and incomplete.

03

Naive scraping gets blocked

Hammering a site without delays is rude and quickly gets the crawler rate-limited.

04

Messy HTML

Raw pages are full of markup an agent has to wade through.

How the Tool Works

A site’s content, ingested cleanly

Breadth

Crawl across the site

Many pages, one call.

The tool follows links within a domain and extracts the text, titles, and URLs of multiple pages, so an agent can ingest a whole documentation set or product section rather than a single page.

  • Multi-page within a domain
  • Text, titles, and URLs
  • Per-site page limits
  • Up to five starting URLs
Site
Multi-Page

Whole sections

Pages · Text · Titles · URLs
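
The crawl behaviour described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's actual implementation: the names `PageParser`, `crawl_site`, and the injected `fetch` callable are hypothetical, and it assumes that "within a domain" means "same host as the starting URL" and that links are followed breadth-first.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects the <title>, visible text, and href links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def crawl_site(start_url, fetch, max_pages=10, max_text_length=5000):
    """Breadth-first crawl that never leaves the starting URL's host.

    `fetch` is a hypothetical callable: URL in, HTML string (or None) out.
    """
    host = urlparse(start_url).netloc
    queue, seen, results = deque([start_url]), {start_url}, []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # fetch failed or page missing
            continue
        parser = PageParser()
        parser.feed(html)
        text = " ".join(parser.text_parts)[:max_text_length]
        results.append({"url": url, "title": parser.title, "text": text})
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # resolve, drop #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real fetcher would issue HTTP requests and apply the delays described in the next section.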

Politeness

Crawls responsibly

Delays and limits built in.

Configurable delays between requests and a cap on pages per site keep the crawler polite and within bounds, which lowers the risk of being rate-limited or blocked while it collects the content.

Polite
Rate-Limited

Delays + caps

Delays · Limits · Responsible · Stable
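
One common way to apply a minimum and maximum delay is to sleep for a random interval between the two before each request. The sketch below assumes that behaviour; the source does not say whether the tool randomizes the wait, and `polite_delay` and `fetch_all` are hypothetical names used only for illustration.

```python
import random
import time


def polite_delay(delay_min=1.0, delay_max=3.0, sleep=time.sleep, rng=random.uniform):
    """Pause for a random interval between delay_min and delay_max seconds."""
    if delay_min > delay_max:
        raise ValueError("delay_min must not exceed delay_max")
    wait = rng(delay_min, delay_max)
    sleep(wait)
    return wait


def fetch_all(urls, fetch, delay_min=1.0, delay_max=3.0, sleep=time.sleep):
    """Fetch URLs in order, pausing before every request except the first."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            polite_delay(delay_min, delay_max, sleep=sleep)
        pages.append(fetch(url))
    return pages
```

With the documented defaults (1 and 3 seconds) and the default cap of 10 pages, a single site would take roughly 9 to 27 seconds of waiting under this scheme, since there are nine pauses between ten requests.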

Governance

On-premise crawling

Under your control.

Crawling runs through a controlled tool with audit logging on infrastructure you operate, so external content ingestion is governed rather than ad-hoc.

100%
On-Prem

Governed, logged

On-prem · Governed · Audit log · Controlled
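
To make "audit logging" concrete, here is a minimal sketch of a wrapper that records one JSON line per tool call. It is purely illustrative: the source does not describe VDF AI's log format or mechanism, and `audited`, the record fields, and the in-memory `log` list are all assumptions.

```python
import json
import time


def audited(tool_name, tool_fn, log):
    """Wrap tool_fn so each call appends one JSON audit record to log."""

    def wrapper(**kwargs):
        # Hypothetical record layout: tool name, arguments, timing, outcome.
        record = {"tool": tool_name, "arguments": kwargs, "started_at": time.time()}
        try:
            result = tool_fn(**kwargs)
            record["status"] = "ok"
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = str(exc)
            raise
        finally:
            # Runs on success and failure, so every call leaves a trace.
            record["duration_s"] = round(time.time() - record["started_at"], 3)
            log.append(json.dumps(record))

    return wrapper
```

Writing the record in a `finally` block means failed crawls are logged as well as successful ones, which is usually what an audit trail needs.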
Inputs

Parameters

The web_crawler tool accepts these inputs when an agent calls it. Required inputs are flagged.

Name                Type     Required  Default  Description
urls                array    Required  (none)   List of URLs to crawl (1–5).
max_pages_per_site  integer  Optional  10       Maximum pages to crawl per website (1–20).
delay_min           number   Optional  1        Minimum delay between requests, in seconds (0.5–10).
delay_max           number   Optional  3        Maximum delay between requests, in seconds (1–20).
max_text_length     integer  Optional  5000     Maximum text length to extract per page (1000–10000).
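
As an illustration of how these inputs fit together, the sketch below builds an argument dictionary for a `web_crawler` call and checks each value against the documented defaults and ranges. The helper `build_crawler_args` and the `LIMITS` table are hypothetical, and rejecting `delay_min > delay_max` is an assumption; the source does not say how the tool handles that case or how the call is actually transmitted.

```python
# Documented (default, minimum, maximum) for each optional parameter.
LIMITS = {
    "max_pages_per_site": (10, 1, 20),
    "delay_min": (1, 0.5, 10),
    "delay_max": (3, 1, 20),
    "max_text_length": (5000, 1000, 10000),
}


def build_crawler_args(urls, **options):
    """Return a validated argument dict for a web_crawler call."""
    if not 1 <= len(urls) <= 5:
        raise ValueError("urls must contain between 1 and 5 entries")
    unknown = set(options) - set(LIMITS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    args = {"urls": list(urls)}
    for name, (default, low, high) in LIMITS.items():
        value = options.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        args[name] = value
    # Assumption: an inverted delay window is treated as an error here.
    if args["delay_min"] > args["delay_max"]:
        raise ValueError("delay_min must not exceed delay_max")
    return args
```

For example, `build_crawler_args(["https://example.com/docs"], max_pages_per_site=20)` fills in the remaining defaults, while `max_pages_per_site=21` raises a `ValueError` because it is outside the documented 1–20 range.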
Where it pays back

Where the web crawler pays back

Documentation ingest

Pull a whole docs site into text for retrieval.

Competitive research

Extract a competitor’s product pages for analysis.

Knowledge building

Feed crawled content into a private RAG index.

Content audits

Gather a site’s pages to review coverage.

Monitoring

Re-crawl pages to track changes over time.

Agent research

Let a research agent ingest a site, not just a page.

How VDF AI connects it

Assigned to agents, orchestrated as networks

On VDF AI, an industry’s use cases map to agents, and you assign tools like this one to those agents. Compose multiple agents into a governed, on-premise network.

ROI Snapshot

What changes after you assign it

Whole-site: Content ingested at depth
Clean: Text without the markup
Polite: No blocking or abuse
100%: Crawled on-prem
FAQ

Questions about the Web Crawler tool

What does the web crawler do?

It crawls multiple pages within a domain and extracts their text, titles, and URLs, with polite delays and a per-site page limit, so an agent can ingest a whole site section rather than a single page.

Does it crawl across domains?

It stays within the same domain for each starting URL and accepts up to five starting URLs, keeping crawls focused and responsible.

How does it avoid getting blocked?

Configurable minimum and maximum delays between requests plus a cap on pages per site keep it polite and within bounds.

Is crawling governed?

Yes. It runs through a controlled tool with audit logging on infrastructure you operate.

How is it used by agents?

Research and knowledge agents use it to ingest sites into a private RAG index, often pairing it with web search for discovery beforehand and federated vector search for retrieval afterward.

Ingest whole sites, not single pages

See the web crawler feed a research agent’s knowledge base — on infrastructure you control.