Your Crawler Doesn't Know What You Asked For

Daniel Campos · 27 min read

Tags: crawling, agents, infrastructure, search, zipf, algorithms

Breadth-first crawling is what happens when your crawler has no idea what you asked for.

You say "find partner bios." The crawler fetches the homepage, then the blog, then the careers page, then 20 portfolio company links, then the privacy policy. Thirty pages later: zero bios. The crawler wasn't broken. It just didn't know what mattered.

Most crawlers work like lawnmowers — start at a URL, fan out in every direction, treat every link equally. The configuration burden falls on the developer: write URL patterns, set depth limits, build allow/deny lists. A 10-page crawl needs 10 lines of config. A 100-page crawl across a site with complex navigation needs a small engineering project.

At Zipf, crawling is one step inside a larger agent loop. Monitors deploy LLM agents that patrol web sources on a schedule — plan a workflow, execute searches and crawls, score the results for what changed. The crawl step is where intent matters most: the agent passes the monitor's description straight to the crawler, which uses that sentence to decide every URL priority. A daily arXiv patrol, a weekly team-page tracker — both depend on the crawler spending its budget wisely.

We built a system where you describe what you want in a single sentence. The system does the rest. This post shows why BFS fails, how badly it fails on real websites, and what we built to replace it. All numbers come from simulations on real website link structures fetched in May 2026.

The Problem with BFS

Breadth-first search is the default algorithm for web crawlers. It is simple, predictable, and wrong for almost every real extraction task.

Consider a concrete example. You want to extract the names and bios of partners at Greylock Partners. Their website has:

  • 1 homepage with global navigation (15 links: Portfolio, News, AI, Cybersecurity, Enterprise, etc.)
  • 1 /team page listing 46 partners (Reid Hoffman, Sarah Guo, Seth Rosenberg, and 43 others)
  • 46 individual /team/{name} pages (your targets)
  • 20 external portfolio company links (Figma, Discord, Coinbase, Databricks, etc.)
  • Social links, blog posts, and legal pages

A BFS crawler with a budget of 30 pages will spend its first level on the homepage (1 page), then fan out to the first links in DOM order at depth 1 — Portfolio, News, About, AI, Cybersecurity. By the time it reaches the team listing, it has also queued the blog index, portfolio page, and 20 external company links. It exhausts its budget on navigation and portfolio pages. Zero partner bios.
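
To make that indifference concrete, here is a minimal sketch of the BFS baseline in Python. The fetch_links helper is a stand-in for a page fetch that returns outlinks in DOM order; it is illustrative, not part of our system.

```python
from collections import deque

def bfs_crawl(seed_url, fetch_links, budget=30):
    """Breadth-first baseline: every link is treated as equally promising."""
    queue = deque([seed_url])
    visited = set()
    fetched = []

    while queue and len(fetched) < budget:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        fetched.append(url)

        # No notion of intent: links are enqueued exactly as they appear
        # in the DOM, so navigation and footer links get crawled first.
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)

    return fetched
```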

BFS crawl on Greylock Partners showing 30-page budget exhausted at depth 1 with 0 targets found

The bug in BFS is not speed. It is indifference. On a real website with navigation, footers, sidebars, and cross-links, the ratio of useful pages to total pages is often below 10%. BFS crawls the haystack when you want the needles.

Why Not Just Use the Sitemap?

The first objection to "BFS is broken" is usually: "Just parse sitemap.xml." If the site publishes a sitemap, it lists every URL — no crawling needed, no budget wasted on navigation.

In theory, yes. In practice, sitemaps are unreliable as a primary discovery mechanism for three reasons.

Most sites don't have useful sitemaps. Small-to-medium company websites (the Greylocks, the startup landing pages, the VC portfolio sites) rarely publish XML sitemaps. They have no SEO team maintaining one. When a sitemap does exist, it is often auto-generated by a CMS and contains every page on the site — blog posts, legal pages, press releases, job listings — with no way to distinguish target content from noise. You still need to filter.

Sitemaps go stale. A sitemap is a static file that gets regenerated on some schedule (or never). New pages may not appear for days. Deleted pages linger as 404s. For monitoring use cases where freshness matters — "find newly published research papers," "detect new team members" — a stale sitemap misses exactly the content you care about.

Sitemaps don't understand intent. Even a perfect, up-to-date sitemap with every URL on the site gives you a flat list. Greylock's hypothetical sitemap would list 80+ URLs: 46 bio pages, 15 navigation pages, 20 portfolio pages, blog posts, legal pages. You still need to figure out which URLs are bio pages. That's the same intent-matching problem, just with a different input format.

Our system does check for sitemaps when they exist — they are a useful supplement to link-graph crawling, especially for discovering pages that are poorly linked. But sitemaps cannot replace intelligent crawling. The hard problem was never "find the list of URLs." It was "decide which URLs to spend your budget on."
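
As a rough sketch of that supplement step (the helper shape and error handling here are illustrative, not the production implementation): fetch the sitemap if it exists, and fall back to link-graph crawling when it does not.

```python
import urllib.request
import xml.etree.ElementTree as ET

def sitemap_urls(base_url):
    """Fetch and parse /sitemap.xml if the site publishes one.

    Returns an empty list when the sitemap is missing, which is the
    common case for small company sites.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/sitemap.xml", timeout=10) as resp:
            tree = ET.parse(resp)
    except Exception:
        return []
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in tree.findall(".//sm:loc", ns)]

# Even a complete sitemap is just a flat list of URLs: you still have to
# decide which of them deserve your fetch budget.
candidates = sitemap_urls("https://example.com")
```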

The Alternative: One Sentence

Intent-driven crawling inverts the problem. Instead of configuring how to crawl (URL patterns, depth limits, allow/deny lists), you describe what you want to find:

"Find individual team member bio pages with their background and role"

The system uses that sentence — and only that sentence — to make every decision during the crawl: which links to follow, which pages to skip, how deep to go, and how to allocate its budget. No configuration required.
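
In practice that is a single crawl request. The sketch below is purely illustrative: the endpoint path, auth header, and most field names are placeholders; the post only commits to intent (and, later, expansion) as real Crawl API parameters.

```python
import requests

# Hypothetical request shape; endpoint and field names are placeholders.
response = requests.post(
    "https://api.example.com/v1/crawl",
    headers={"Authorization": "Bearer <token>"},
    json={
        "seed_url": "https://greylock.com",
        "intent": "Find individual team member bio pages with their background and role",
        "max_pages": 30,
    },
)
pages = response.json()
```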

On the same Greylock crawl where BFS found zero targets — all 29 non-seed pages wasted on navigation and portfolio links — intent-driven crawling yields 20 partner bio pages out of a 30-page budget.

The Receipts

We ran intent-driven crawling against BFS on five real websites with different structural challenges. Each site was modeled from production page fetches (May 2026) with real link counts, DOM ordering, and category distributions. Both crawlers got identical page budgets of 30 pages.

How to read the table: "Targets on site" is the total number of pages matching the extraction intent. "Found" is how many the crawler reached within its 30-page budget. The intent-driven crawler sometimes stops early when it runs out of high-scoring candidates, so it may not use all 30 pages.

Site          Total links   Targets on site   BFS found   Intent found
arXiv cs.CL   2,548         30                0           19
Hacker News   201           22                2           11
Greylock      43            46                0           20
Python Docs   261           17                0           17
ACL 2024      26,558        50                0           20
Total                                         2           87

At a budget of 30 pages each, BFS finds 2 targets across all five sites. Intent-driven finds 87.

In 4 of 5 sites, BFS finds zero targets. It exhausts its budget on navigation, format variants, and author links before reaching any content matching the extraction intent. The only site where BFS finds any targets (Hacker News) is the smallest and flattest, where some target links happen to appear early in DOM order.

The rest of this post explains what each experiment looks like in detail, why BFS fails in different ways on each site, and how the system works.

Experiments: Five Sites, Five Failure Modes

The individual experiments below use site-appropriate budgets that stress-test each failure mode (e.g., a larger budget for HN to show that even with more room, BFS still struggles). The summary table above standardizes at budget 30 for clean comparison.

Experiment 1: arXiv — Finding LLM Efficiency Papers

Intent: "Find research papers about large language model efficiency, including distillation, quantization, pruning, and inference optimization"

Site structure: 2,548 links on the cs.CL/recent listing. 250 unique papers, each generating ~9 links (1 abstract + 3 format variants + ~5 author search links). Only 30 papers match the LLM efficiency intent — 1.2% of total links.

Budget: 30 pages

arXiv experiment: BFS 0% vs Intent-Driven 85.7% harvest rate

arXiv is the most brutal test for BFS. Each paper generates ~9 links in DOM order: /abs/, /pdf/, /html/, /format/, and 2-10 author search links. BFS follows them serially, so 29 non-seed pages barely cover 3 complete paper blocks — and with only 30/250 papers matching the intent, the probability of hitting a match in the first 3 is low. The intent scorer filters all 2,298 format/author/nav links and focuses exclusively on abstract pages with relevant titles.

Experiment 2: Hacker News — Finding AI Startup Show HN Posts

Intent: "Find Show HN posts about AI startups launching new developer tools or APIs"

Site structure: 201 links on the Show HN page. 30 stories, each with ~7 links (external URL + 2 item links + user link + vote link + action links). 22 match the AI startup intent, but they are interleaved with action links in DOM order.

Budget: 40 pages

Hacker News experiment: BFS 10% vs Intent-Driven 37.9% harvest rate

Hacker News is the most action-link-dense site in our experiments. Every story generates vote, hide, flag, user, and timestamp links. BFS follows all of them in DOM order and reaches only ~4 matching posts before budget exhaustion. The intent scorer filters 160 action/author/nav links and scores remaining posts by title relevance to "AI startups."

Experiment 3: Greylock Partners — Extracting Partner Bios

Intent: "Find individual partner biography pages with their name, role, investment focus areas, and career background"

Site structure: 43 links on Greylock's homepage (1 /team hub, 15 navigation, 20 portfolio company external links, 4 social, 3 blog). Team page has 46 bio targets (Reid Hoffman, Sarah Guo, Seth Rosenberg, Saam Motamedi, Jerry Chen, and 41 others) + 13 nav + 7 other. The hub /team is 1 of 43 links — a needle in a haystack.

Budget: 30 pages

Greylock Partners experiment: BFS 0% vs Intent-Driven 90.9% harvest rate

Greylock is the canonical case for intent-driven crawling. BFS finds the /team hub but then follows all 15 navigation links from the homepage — Portfolio, News, About, AI sector page, Cybersecurity — before bio pages become reachable at depth 2. It exhausts its budget on navigation and external portfolio links. The intent scorer identifies /team as a hub via listing pattern detection, skips 28 navigation and external links, and spends 60% of its budget on the 46 bio pages behind the hub.

Experiment 4: Python Docs — Finding Asyncio Documentation

Intent: "Find asyncio API documentation including runners, tasks, streams, synchronization primitives, event loops, and subprocesses"

Site structure: 261 links on the stdlib index page. 164 module documentation links (only 17 are asyncio-related), 92 section anchors, 18 navigation links. Asyncio modules are in the "Networking and Interprocess Communication" section — approximately the 90th module in DOM order, buried after text processing, data structures, numeric, functional, file, persistence, compression, crypto, OS, and concurrent modules.

Budget: 20 pages

Python docs experiment: BFS 0% vs Intent-Driven 92.9% harvest rate

This experiment demonstrates a different failure mode: not link-type pollution (arXiv) or action links (HN), but positional burial. All 164 module links look identical to BFS — they are all the same type (module documentation). But the ones you want (asyncio) happen to be in the 90th position, buried after 12 other standard library sections. The intent scorer matches "asyncio" in URL paths and anchor text, jumping directly to the relevant modules regardless of DOM position.

Experiment 5: ACL 2024 — Finding RLHF Research Papers

Intent: "Find papers about reinforcement learning from human feedback (RLHF), preference optimization, and alignment techniques for language models"

Site structure: 26,558 links on the ACL 2024 proceedings page. This is the most link-dense page in our experiments. 2,757 paper links + 2,757 PDF links + 2,794 BibTeX links + 14,732 author links + 73 volume links + 2,753 anchors + 35 external + 6 nav. 50 of the 2,757 papers (~1.8% of papers) match the RLHF intent — but those 50 papers represent just 0.19% of the 26,558 total links on the page, because each paper generates ~5 non-paper links (PDF, BibTeX, author search, etc.).

Budget: 30 pages

ACL 2024 experiment: BFS 0% vs Intent-Driven 95.2% harvest rate

ACL 2024 proceedings is the extreme case: 26,558 links on a single page, with 58% being author search links. BFS processes only 0.11% of the frontier before budget exhaustion. The junk pre-filter removes 20,936 links (79%), and the intent scorer surfaces RLHF papers by matching title keywords like "preference optimization," "alignment," and "human feedback."

Harvest Rate Across All Five Sites

Harvest rate comparison chart across all five experiment sites

The pattern is consistent. The web is mostly navigation. Agents need content.

Budget Sweep: How Quickly Do You Reach Coverage?

The experiments above used fixed budgets. A natural follow-up: how much budget does each approach need to find all the targets?

Budget efficiency curves showing intent-driven reaches coverage faster at every budget level

Intent-driven produces results from page one. At budget 5, the intent scorer finds 3-4 targets on every site. BFS finds zero on 4 of 5 sites. BFS believes every link is equally promising. Intent-driven believes the user's sentence.

BFS has a dead zone. On arXiv, BFS finds zero papers until budget 75. On ACL, zero even at budget 100. On Python docs, zero until budget 100. The dead zone exists because BFS must process all links in DOM order before reaching targets, and on link-dense pages, relevant content is buried deep in the queue.

Intent-driven hits diminishing returns gracefully. Once all targets are found, additional budget goes to off-topic pages. But the target count plateaus at the maximum — the system found everything there was to find. BFS barely reaches the same plateau even at 3-5x the budget.

Normalized: What Fraction of All Targets?

The absolute chart above hides something. ACL has 50 targets and arXiv has 30 — so finding 20 on each means very different things (40% vs 67%). Normalizing to fraction of total targets tells a cleaner convergence story:

Coverage as fraction of total possible targets across all sites and budgets

Python docs saturates at budget 30 — the smallest target set on the smallest site. Greylock reaches 100% at budget 75. ACL and HN reach 100% at budget 100. arXiv is the outlier: it asymptotes at 97% because the last paper is buried in 2,548 format-variant and author-search links. At budget 30, intent-driven covers 40–100% of targets across all five sites. BFS covers 0% on four of five.

How It Works: Four Layers of Intelligence

Now that you have seen the results, here is the architecture. The system combines four layers, each addressing a different failure mode of naive crawling.

Four algorithmic layers of intent-driven crawling: junk filter, graph scoring, LLM intent scoring, budget allocation

Each layer is cheap on its own. The junk filter is pure pattern matching. Graph scoring is O(links). LLM scoring batches 20 URLs into a single call. The budget allocator is arithmetic. Together, they eliminate 80-95% of irrelevant URLs before a single page is fetched.

Layer 1: The Junk Pre-Filter

Before any scoring happens, the system runs every discovered URL through a four-tier pre-filter. The goal: eliminate URLs that are structurally unlikely to contain useful content, at zero cost.

Tier 1 — Hard structural rejection. Wikipedia meta-namespaces (Special:, Talk:, User:), forum action links (/vote, /flag, /reply), authentication pages, print/export variants, GitHub platform navigation (/features, /pricing, /copilot), privacy policy pages, RSS feeds, and asset directories (/_next/, /wp-content/). These URLs almost never contain extraction targets.

Tier 2 — Domain-level filtering. When the intent is clearly non-social (contains words like "research," "patent," "regulatory," "API documentation"), social media domains are filtered. When the intent mentions "community sentiment" or "twitter reactions," they pass through. This is intent-adaptive, not a static deny list.

Tier 3 — Soft rejection. Navigation category URLs (/about, /contact, /search), excessive path depth (>6 segments), and URLs with complex query strings (>3 non-tracking parameters). These are probably not useful, but not certainly.

Tier 4 — Anchor-text rescue. Soft-rejected URLs get a second chance if their anchor text matches keywords from the intent. A link to /about with anchor text "Meet our research team" will be rescued when the intent mentions "team members." This prevents over-filtering on sites where content lives at unexpected paths.
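
A condensed sketch of the four tiers, assuming the filter sees each URL with its anchor text and the crawl intent. The specific regexes, domains, and thresholds below are illustrative examples, not the production rule set.

```python
import re
from urllib.parse import urlparse, parse_qs

HARD_REJECT = re.compile(
    r"/(vote|flag|reply|login|signup|print)\b"
    r"|/wp-content/|/_next/|/feed\.xml$|Special:|Talk:|User:",
    re.IGNORECASE,
)
SOFT_REJECT_PATHS = ("/about", "/contact", "/search")
SOCIAL_DOMAINS = ("twitter.com", "x.com", "facebook.com", "linkedin.com")

def passes_prefilter(url, anchor_text, intent):
    parsed = urlparse(url)
    intent_lower = intent.lower()

    # Tier 1: hard structural rejection (action links, auth, assets).
    if HARD_REJECT.search(url):
        return False

    # Tier 2: intent-adaptive domain filtering. Social domains are dropped
    # unless the intent explicitly asks for social content.
    if parsed.netloc.endswith(SOCIAL_DOMAINS) and not any(
        kw in intent_lower for kw in ("twitter", "sentiment", "community")
    ):
        return False

    # Tier 3: soft rejection (nav pages, deep paths, complex query strings).
    soft_rejected = (
        parsed.path.startswith(SOFT_REJECT_PATHS)
        or len([seg for seg in parsed.path.split("/") if seg]) > 6
        or len(parse_qs(parsed.query)) > 3
    )

    # Tier 4: anchor-text rescue. A soft-rejected URL survives if its anchor
    # text overlaps with keywords from the intent.
    if soft_rejected:
        keywords = set(intent_lower.split())
        return bool(keywords & set(anchor_text.lower().split()))

    return True
```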

In production, the pre-filter removes approximately 60% of harvested URLs before any scoring occurs.

Layer 2: Graph-Based Scoring

Once junk URLs are removed, the remaining candidates are scored using three graph algorithms that run concurrently during the crawl.

OPIC (Streaming PageRank). Online Page Importance Computation is a streaming approximation of PageRank that computes importance as the crawl progresses, without a complete link graph. Each URL has a cash value (priority) and a history value (cumulative importance). When a page is crawled, its cash is distributed to outgoing links, scaled by a damping factor of 0.85.

OPIC streaming PageRank showing cash distribution and LLM-gated dampening

The key innovation is LLM-gated OPIC dampening. Traditional OPIC gives high scores to structurally important pages regardless of relevance — a site's /blog index might outrank any individual bio page. Our system caps the OPIC cash of URLs scored below 0.1 by the LLM intent scorer, preventing structurally important but semantically irrelevant pages from consuming budget. PageRank tells you what the site cares about. Intent tells you what the user cares about.
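
In code, gated cash distribution is only a few lines. This sketch assumes pre-fetch LLM scores are already available for the discovered URLs; the cap value is illustrative.

```python
def distribute_opic_cash(page_cash, outlinks, intent_scores,
                         damping=0.85, gate_threshold=0.1, cash_cap=0.05):
    """Distribute a crawled page's OPIC cash to its outlinks, capping the
    cash of links the LLM scorer considers irrelevant."""
    if not outlinks:
        return {}
    share = damping * page_cash / len(outlinks)
    distributed = {}
    for url in outlinks:
        cash = share
        # LLM gate: structurally important but semantically irrelevant URLs
        # cannot accumulate enough cash to dominate the frontier.
        if intent_scores.get(url, 0.5) < gate_threshold:
            cash = min(cash, cash_cap)
        distributed[url] = cash
    return distributed
```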

HITS (Hub and Authority Detection). HITS classifies pages into hubs (link to many relevant pages) and authorities (are the relevant pages). A team listing page is a hub. An individual bio is an authority. Hubness is computed from four signals:

Signal            Weight         What it detects
Out-degree        0 – 0.30       Pages with many outlinks (20+ links → max score)
Listing pattern   -0.1 or +0.3   URL paths like /team, /blog, /products, /docs
Template slugs    0 – 0.25       Repeated URL prefixes in outlinks (e.g., /team/alice, /team/bob)
Link diversity    0 – 0.15       Low diversity (links point to same section) = good hub signal

Hub pages get priority during the hub discovery phase (first 30% of budget), ensuring the system finds listing pages early and can expand into their children.
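
A sketch of how the four signals might combine into one hubness score. The weights mirror the table above; the internal thresholds and the exact combination are illustrative.

```python
from collections import Counter
from urllib.parse import urlparse

LISTING_PATTERNS = ("/team", "/blog", "/products", "/docs")

def hubness_score(url, outlinks):
    path = urlparse(url).path.rstrip("/")
    out_paths = [urlparse(u).path for u in outlinks]
    n = max(len(outlinks), 1)

    # Out-degree: saturates at 20+ outlinks (0 - 0.30).
    out_degree = min(len(outlinks), 20) / 20 * 0.30

    # Listing pattern: /team-style paths get a bonus, others a small penalty.
    listing = 0.3 if path.endswith(LISTING_PATTERNS) else -0.1

    # Template slugs: share of outlinks under the most common prefix (0 - 0.25).
    prefixes = Counter(p.rsplit("/", 1)[0] for p in out_paths if "/" in p)
    template = prefixes.most_common(1)[0][1] / n * 0.25 if prefixes else 0.0

    # Link diversity: outlinks concentrated in one section = strong hub (0 - 0.15).
    sections = {p.split("/")[1] for p in out_paths if p.count("/") >= 1 and len(p) > 1}
    diversity = 0.15 * (1 - len(sections) / n) if sections else 0.0

    return out_degree + listing + template + diversity
```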

QMin (Parent Quality Propagation). Prevents wasting budget on subtrees that have already proven low-quality. Tracks the minimum content quality score along the path from seed to each URL: qmin(child) = min(qmin(parent), quality(parent)). If a page is off-topic, all URLs discovered through it inherit a low QMin, pruning the entire subtree.
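
QMin itself is a one-liner; the example values below are illustrative.

```python
def qmin(parent_qmin, parent_quality):
    """A child can never score higher on QMin than the worst page on its
    path from the seed."""
    return min(parent_qmin, parent_quality)

# An off-topic parent (quality 0.2) drags every URL discovered through it
# down to 0.2, effectively pruning that subtree.
child_qmin = qmin(parent_qmin=0.9, parent_quality=0.2)  # -> 0.2
```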

Layer 3: LLM Pre-Fetch Intent Scoring

This is the most distinctive layer: scoring URL relevance before fetching the page, using only the URL path, anchor text, and parent page context.

When the crawler discovers new links, it batches up to 20 candidate URLs and sends them to Claude with the crawl intent. The model scores each URL 0.0–1.0:

  • 0.8–1.0: Very likely matches intent (path and anchor directly suggest target content)
  • 0.5–0.7: Probably relevant (some signals match)
  • 0.2–0.4: Might be relevant (weak signals)
  • 0.0–0.1: Likely irrelevant (navigation, login, unrelated section)

The model handles format deduplication (arXiv /abs/ vs /pdf/), intent exclusions, and utility page detection. The batch approach is efficient — the total LLM budget is max_pages / 2, so a 30-page crawl makes at most 15 scoring calls.
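
A sketch of that batch-scoring step. Here call_llm is a stand-in for whatever model client is used, and the prompt wording is illustrative rather than the production prompt.

```python
import json

def score_urls(candidates, intent, call_llm, batch_size=20):
    """Score candidate URLs 0.0-1.0 against the crawl intent, before fetching,
    using only URL path and anchor text."""
    scores = {}
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i + batch_size]
        lines = "\n".join(
            f'{j}. url={c["url"]} anchor="{c["anchor"]}"'
            for j, c in enumerate(batch)
        )
        prompt = (
            f"Crawl intent: {intent}\n"
            "Score each URL from 0.0 to 1.0 for how likely it is to match the\n"
            "intent, judging only from the URL path and anchor text.\n"
            "Return a JSON object mapping index to score.\n\n" + lines
        )
        result = json.loads(call_llm(prompt))
        for j, c in enumerate(batch):
            scores[c["url"]] = float(result.get(str(j), 0.0))
    return scores
```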

Search ranks pages. Crawlers spend money. Every page you fetch is a budget decision, and the LLM scorer is the only signal that understands meaning, not just structure.

Layer 4: Budget Allocation

The final layer enforces disciplined spending. The page budget is split into three slices:

Budget allocation showing 30/60/10 split across hub discovery, detail expansion, and exploration

This split reflects how real websites are structured. Content is rarely at depth 0 — it is behind listing pages, category indexes, and navigation hubs. The system spends its first 30% finding those hubs, then turns 60% of its budget loose on the links those hubs contain.

When a budget slice is exhausted, overflow borrows from the exploration reserve. When the exploration reserve is also exhausted, the crawl stops — even if the frontier still has candidates. Budget discipline prevents runaway crawls.
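
A sketch of the allocator's bookkeeping under the 30/60/10 defaults. The phase names and overflow rule follow the text; the class shape is ours.

```python
from dataclasses import dataclass, field

@dataclass
class BudgetAllocator:
    max_pages: int
    hub_share: float = 0.30
    detail_share: float = 0.60
    remaining: dict = field(init=False)

    def __post_init__(self):
        hub = int(self.max_pages * self.hub_share)
        detail = int(self.max_pages * self.detail_share)
        # Whatever is left over (about 10%) is the exploration reserve.
        self.remaining = {"hub": hub, "detail": detail,
                          "explore": self.max_pages - hub - detail}

    def spend(self, phase):
        """Charge one page to a phase; overflow borrows from the exploration
        reserve. Returns False when the crawl should stop."""
        if self.remaining[phase] > 0:
            self.remaining[phase] -= 1
            return True
        if self.remaining["explore"] > 0:
            self.remaining["explore"] -= 1
            return True
        return False

# A 30-page crawl: 9 hub pages, 18 detail pages, 3 exploration pages.
allocator = BudgetAllocator(max_pages=30)
```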

The Seven-Signal Blend

All four layers feed into a single scoring function. Seven signals are blended with configurable weights:

Seven-signal URL scoring weights in intent-driven mode

The intent-driven preset gives 30% weight to LLM semantic relevance — the single strongest signal. Parent quality (20%) and path potential (15%) together account for another 35%, capturing the idea that good content clusters: if a parent page was relevant, its children are more likely to be too. OPIC gets only 5% weight. This is deliberate — structural importance and intent relevance are often orthogonal.
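
A sketch of the blend. The weights for LLM relevance, parent quality, path potential, and OPIC come from the text above; the remaining signal names and weights are placeholders chosen only to bring the total to 1.0.

```python
# Weights marked "assumed" are illustrative, not the production preset.
INTENT_DRIVEN_WEIGHTS = {
    "llm_relevance": 0.30,
    "parent_quality": 0.20,
    "path_potential": 0.15,
    "hubness": 0.15,       # assumed
    "qmin": 0.10,          # assumed
    "anchor_match": 0.05,  # assumed
    "opic": 0.05,
}

def blend_score(signals, weights=INTENT_DRIVEN_WEIGHTS):
    """One priority score per URL: a weighted sum of seven signals,
    each already normalized to [0, 1]."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)
```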

Why Four Layers? Ablation and Cost

The four-layer architecture is not arbitrary. Each layer addresses a specific failure mode, and removing any one causes catastrophic degradation — but on different site types. We removed each layer individually and measured the impact across three sites representing different structural challenges.

But the ablation also reveals something about cost. Each layer carries a price tag, and the relationship between cost and criticality is not what you might expect.

Ablation study showing targets found and layer cost when each layer is removed across three site types

The link counts in the ablation differ from the experiment totals because the ablation measures links after the seed page is fetched (i.e., the frontier the crawler must process), while the experiment descriptions count all links on the seed page including navigation and anchors.

No single layer is universally critical — but every layer is critical somewhere.

Removing the junk pre-filter (cost: free, pure regex) has no effect on Greylock (only 43 links). But on arXiv targets drop 65%, and on ACL they drop 83%. The LLM intent scorer processes URLs in DOM order with a fixed capacity (~300 URLs per crawl). Without the junk filter, that capacity is consumed by format variants and author search links. A free layer protects the expensive one from wasting its budget.

Removing graph scoring (cost: free, O(links)) is irrelevant on flat sites (arXiv, ACL have no hubs). But on Greylock it is catastrophic: zero targets. The /team hub has no keyword overlap with the intent — the word "team" doesn't appear in "find individual partner biography pages." Without HITS hub detection, /team scores lower than blog posts with partner names in their titles. The system never discovers the 46 bio links behind the hub.

Removing LLM scoring (cost: ~15 API calls per 30-page crawl) destroys performance on content-dense flat sites: zero targets on both arXiv and ACL. Keyword matching alone cannot distinguish "Efficient Distillation for Small Language Models" (target) from "Bias in Large Language Models" (non-target) — both share "language" and "models." The LLM understands the intent asks for efficiency techniques, not general LLM research. On hub-dependent sites like Greylock, LLM scoring is unnecessary — graph + budget layers do all the work for zero LLM cost.

Removing budget allocation (cost: free, arithmetic) collapses Greylock to 1 target (down 95%). The hub is found, but without phase splits, the 46 bio links compete in a flat queue with 24 external portfolio links and 3 blog posts. The system fetches portfolio companies alongside bios, wasting 28 of 29 non-seed pages.

The cost story is counterintuitive: three of four layers are computationally free (pattern matching, graph traversal, arithmetic), and removing any of them is catastrophic on the right site type. The only layer with real cost — LLM scoring — is also the only one that understands meaning rather than structure. But it is only critical on flat, link-dense sites. On hub-dependent sites, the free layers handle everything.

The junk filter and LLM scoring handle the horizontal problem (thousands of links on one page, most irrelevant). Graph scoring and budget allocation handle the vertical problem (content buried behind hubs, requiring phased exploration through depth levels). Real websites combine both problems, which is why all four layers are needed — but not all four layers cost money.

Where This Fails

Intent-driven crawling is not magic. It has clear limitations.

JavaScript-rendered content. The system scores URLs based on what appears in the HTML link graph. Single-page applications that load content via JavaScript API calls have no links to score. The crawler sees a page with zero outlinks and stops. This affects sites built entirely on client-side rendering (some React SPAs, Angular apps). Sites with server-rendered HTML or progressive enhancement work fine.

Intent ambiguity. A vague intent like "find interesting content" gives the LLM scorer nothing to work with — everything scores 0.4-0.6 and the system degrades to graph-only scoring, which is a marginal improvement over BFS. The system works best when the intent names specific content types, topics, or structural patterns. "Find partner bios" is good. "Find stuff about this company" is not.

Adversarial site structures. Sites that deliberately hide content behind authentication, paywalls, or CAPTCHAs cannot be crawled regardless of scoring intelligence. The system also struggles with sites that use identical URL patterns for different content types (e.g., a CMS where /page/123 could be a blog post, a product page, or a legal notice). The LLM scorer relies on URL paths and anchor text containing some semantic signal.

Small budgets on deep sites. If the target content requires traversing 4+ depth levels and the budget is under 20 pages, even intent-driven crawling may not reach it. The 30/60/10 budget split means only 6 pages (30% of 20) go to hub discovery. If the path to content requires discovering multiple intermediate hubs, that may not be enough. The fix is straightforward: increase the budget.

Sites where BFS already works. On small sites with flat structures and high target density (most pages are relevant), BFS performs comparably. The overhead of LLM scoring provides no benefit when the baseline approach already finds targets. We see this on sites with fewer than 50 links and >30% target density.

Why Not Just Search?

The natural objection: why crawl at all? Just search for "RLHF papers on arXiv" or "AI startups on Hacker News."

Search has three structural problems for this use case:

  1. Temporal drift. Searching "agentic search papers arxiv today" works sometimes. Tomorrow the temporal qualifier slips, results from last week leak in, and you are back to manual filtering. The page arxiv.org/list/cs.AI/recent is always today's papers. The URL is stable. The content is not.

  2. Source ambiguity. You already know where the content lives. You don't need a search engine to discover arXiv — you need a crawler to filter it. Search adds an unnecessary indirection that introduces noise from other sources.

  3. Snippets, not content. For structured extraction (paper titles, abstracts, author lists, or startup names, product descriptions, pricing), you need the full page. Search gives you titles and 160-character snippets.

The deeper issue is that search is designed for discovery — finding sources you don't know about. Intent-driven crawling is designed for extraction — pulling specific content from sources you already know. These are different problems.

From Crawling to Monitoring

The experiments in this post are one-shot crawls. But the architecture was built for something else: persistent monitoring.

Consider the arXiv experiment. Every day, arxiv.org/list/cs.CL/recent publishes new papers. The page structure is identical — same link patterns, same ~2,500 links with format variants and author search links, same 60% junk ratio. But the papers are different. If your intent is "find papers about agentic search and context management", that intent doesn't change. The desired results do.

This is the core pattern behind Zipf monitors: the website is stable, the intent is stable, but the content is temporally variant. The same crawl, with the same intent and the same seed URL, run on a schedule, becomes a topic-aware feed — without ever writing a search query.

The same pattern applies to Hacker News. The Show HN page structure doesn't change — 30 stories, ~7 links per story, action links interleaved with content. But the stories rotate daily. An intent like "find posts about AI developer tools and infrastructure" applied to the same URL every 6 hours gives you a filtered feed of relevant launches. With expansion: "external", the crawler follows the product links off HN to the actual startup pages, turning HN into a discovery hub and the external sites into extraction targets. Same intent, same seed — richer data.
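
A monitor under this pattern might be defined like the following. The field names are hypothetical; only the intent, the expansion setting, and the 6-hour cadence come from the text.

```python
# Illustrative monitor definition, not the actual Zipf configuration schema.
show_hn_monitor = {
    "seed_url": "https://news.ycombinator.com/show",
    "intent": "Find posts about AI developer tools and infrastructure",
    "expansion": "external",   # follow product links off HN to the startup pages
    "max_pages": 30,
    "schedule": "every 6 hours",
}
```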

Search can't do this cleanly. A daily search for "AI developer tools Hacker News" returns a mix of yesterday's results, blog posts about HN, and HN comments that mention the keywords. A daily crawl of news.ycombinator.com/show with an intent filter returns exactly the new posts that match — and nothing else.

Production Evidence: A/B Experiments

We are currently running controlled A/B experiments on the crawl algorithm in production, routing each crawl to one of three algorithm profiles:

  • Control (C): Baseline configuration (30/60/10 budget split, OPIC damping 0.85, intent-driven scorer weights)
  • T1 (Aggressive Depth): 20/70/10 split, higher path-potential weight, 34% fewer LLM calls
  • T3 (Precision Efficient): Balanced profile testing whether T1's cost savings are achievable without quality loss

From the first round (7-day window, staging and production):

Metric                   Control   T1 (Aggressive)   T3 (Precision)
Useful pages per crawl   1.10      1.04 (94.5%)      1.05 (95.2%)
Harvest ratio            0.325     0.310 (95.3%)     0.306 (94.1%)
LLM calls per crawl      1.4       0.9 (66.2%)       1.4 (99.1%)
Failure rate             6.1%      6.2%              6.2%

The control profile remains strongest on harvest quality. T1 trades 5.5% quality for 34% fewer LLM calls — a worthwhile trade for high-volume monitoring where cost matters more than per-crawl precision. These experiments run continuously, with new profiles tested every 1-2 weeks.

What We Learned

Three lessons from shipping this.

Graph algorithms alone are not enough. OPIC and HITS are necessary but insufficient. PageRank tells you what is structurally important; HITS tells you what is a hub. Neither tells you what is relevant to this user's intent. The LLM scorer carries the most weight (30%) because it is the only signal that understands meaning, not just structure.

Pre-filtering matters more than scoring. Removing 60% of URLs before any scoring happens is the single largest efficiency gain. The junk pre-filter is pure pattern matching — zero LLM calls, sub-millisecond per URL — but it prevents the scorer from wasting capacity on structurally impossible URLs. The anchor-text rescue in Tier 4 prevents over-filtering from killing recall.

Budget discipline prevents runaway crawls. Without the 30/60/10 split, the crawler over-explores hubs. A team listing links to 40 bios, but also the blog, portfolio, and careers. Without budget phases, it follows all of those and never reaches detail pages. The allocator forces the transition from hub discovery to detail expansion at 30% of budget spent.

Looking Ahead

We are actively experimenting with:

  • Learned weight presets that adapt to site structure during the crawl. If the first few pages reveal a deep site, shift weight from hubness to path potential. If the site is flat, increase OPIC weight.
  • Cross-crawl learning that reuses successful path patterns from previous crawls of the same domain. If /team/{name} was consistently high-quality on prior crawls, pre-score those URL patterns higher next time.
  • Intent decomposition that breaks complex extraction goals into sub-intents, each with its own budget slice and scoring weights.

The goal remains the same: a single sentence of intent should be enough to find what you want on any website. The algorithms underneath get more sophisticated, but the interface stays simple.


All experiments described in this post use the crawl system that powers Zipf AI workflow monitors. The system is available via the Crawl API with intent and expansion parameters.

Daniel Campos

Building persistent web monitoring for AI agents. Previously at Microsoft Bing, Snowflake, Neeva, Neural Magic, and Walmart.
