
Crawl API

Agent-driven web crawling with AI-powered classification, structured data extraction, and recursive link following.

Credits: 1/page (basic) | 2/page (with classify_documents, extraction_schema, or classifiers) | FREE (summary)

POST /api/v1/crawls

Deploy an agent to crawl and extract from web sources.

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| urls | string[] | required | 1-100 seed URLs to crawl |
| max_pages | int | 10 | Maximum pages to crawl (1-1000) |
| processing_mode | string | "async" | "async", "sync", or "webhook" |
| classify_documents | bool | false | AI classification (triggers advanced pricing) |
| extraction_schema | object | - | Custom fields to extract (triggers advanced pricing) |
| generate_summary | bool | false | AI summary of crawled docs (FREE) |
| expansion | string | "none" | URL expansion: "internal", "external", "both", "none" |
| intent | string | - | Natural language crawl goal (10-2000 chars); guides URL prioritization with expansion |
| auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions |
| session_id | UUID | - | Link to an existing session |
| dry_run | bool | false | Validate without executing |
| webhook_url | URL | - | Required if processing_mode="webhook" |

Advanced Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| follow_links | bool | false | Recursive crawling (legacy; prefer expansion) |
| link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination |
| classifiers[] | array | - | AI filters, e.g. `{type: "url", ...}` |
| budget_config | object | - | max_pages, max_depth, max_credits |
| use_cache | bool | false | Global cache (50% savings on hits) |
| cache_max_age | int | 86400 | Cache TTL in seconds |
| premium_proxy | bool | false | Residential IPs |
| ultra_premium | bool | false | Mobile residential IPs (highest success rate) |
| retry_strategy | string | "auto" | "none", "auto" (escalate proxy tiers), "premium_only" |
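
For example, a crawl that reads from the global cache, caps spend with budget_config, and escalates proxy tiers on failures could look like this (the parameter names come from the tables above; the values are illustrative):

# Cached crawl with a hard budget and automatic proxy escalation
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "use_cache": true,
    "cache_max_age": 3600,
    "retry_strategy": "auto",
    "budget_config": {"max_pages": 50, "max_depth": 3, "max_credits": 100}
  }'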

Processing Modes

| Mode | Behavior | Best For |
|------|----------|----------|
| async | Returns immediately; poll for results | Large crawls |
| sync | Waits for completion (max_pages ≤ 10) | Small crawls, real-time |
| webhook | Returns immediately; POSTs results to webhook_url | Event-driven |
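
A minimal async workflow, sketched below: create the crawl, then poll GET /api/v1/crawls/{id} until it reaches a terminal state. This assumes the create response carries an id field and the status endpoint returns a top-level status matching the values listed under GET /api/v1/crawls.

# Create an async crawl and capture its ID (assumes an "id" field in the response)
CRAWL_ID=$(curl -s -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 50}' | jq -r '.id')

# Poll until the job is no longer pending or running
while true; do
  STATUS=$(curl -s https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
    -H "Authorization: Bearer $ZIPF_API_KEY" | jq -r '.status')
  [ "$STATUS" != "pending" ] && [ "$STATUS" != "running" ] && break
  sleep 10
done
echo "Crawl finished with status: $STATUS"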

Smart Crawl

When expansion is enabled, the agent prioritizes URLs using intelligent algorithms:

  • OPIC: Streaming PageRank approximation that updates as you crawl
  • HITS: Hub and authority scoring for link aggregators and authoritative sources
  • Anchor text relevance: Link text matching your intent gets priority

Efficiency: 135x average discovery ratio vs traditional BFS crawling.

Intent-Driven Crawling

With intent + expansion, the agent uses your goal to guide its patrol:

  1. Depth auto-adjustment: Default depth 1→2 (or 3 with "individual"/"specific"/"detail" keywords)
  2. URL pattern boosting: Intent-matching patterns get +0.3 score boost
  3. Budget allocation: 30% hub discovery, 60% detail expansion, 10% exploration
  4. Navigation penalty recovery: the -0.4 penalty on navigation paths (/about/, /team/) is partially recovered (+0.2) when they match the intent

No extra cost (heuristic-based).

Smart Crawl Response Fields

When Smart Crawl is active, GET /api/v1/crawls/{id} includes a smart_crawl section:

| Field | Description |
|-------|-------------|
| smart_crawl.enabled | Whether Smart Crawl was active |
| smart_crawl.algorithm | Algorithm used (opic) |
| smart_crawl.stats | total_nodes, total_edges, avg_inlinks, graph_density |
| smart_crawl.top_pages[] | url, importance, inlinks |
| smart_crawl.domain_authority | Domain → authority score map |
| smart_crawl.efficiency | coverage_ratio, importance_captured, avg_importance_per_page |
| smart_crawl.hubs[] | url, score |
| smart_crawl.authorities[] | url, score |

Each result also includes: importance_score, inlink_count, outlink_count, hub_score, authority_score, anchor_texts[].
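
For instance, to pull the graph statistics and the five highest-importance pages from a finished crawl ($CRAWL_ID is the crawl's ID; the jq paths follow the field table above, though the exact response nesting is assumed):

# Inspect Smart Crawl output: algorithm, graph stats, top pages
curl -s https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  | jq '{algorithm: .smart_crawl.algorithm,
         stats: .smart_crawl.stats,
         top: .smart_crawl.top_pages[:5]}'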

Examples

# Basic crawl (5 credits)
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'

# Intent-driven crawl with extraction
curl -X POST https://www.zipf.ai/api/v1/crawls \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/team"],
    "max_pages": 30,
    "expansion": "internal",
    "intent": "Find individual team member bio pages with background and role",
    "extraction_schema": {
      "name": "Extract the person full name",
      "role": "Extract job title or role"
    }
  }'

Extraction Schema

Extract structured data using AI. Triggers advanced pricing.

Format: {"field_name": "Extraction instruction (min 10 chars)"}. Field names: [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.

Response: extracted_data (field→value), extraction_metadata (fields_extracted, confidence_scores).
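
Once a crawl with a schema completes, the extracted values can be read back per result. A sketch, assuming results are returned in a results[] array with a url on each entry (extracted_data and extraction_metadata are as documented above):

# Read extracted fields and confidence scores from a completed crawl
curl -s https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  | jq '.results[] | {url, extracted_data, confidence: .extraction_metadata.confidence_scores}'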


GET /api/v1/crawls/suggest-schema

Get info about the suggest-schema endpoint.

POST /api/v1/crawls/suggest-schema

Analyze a URL and suggest extraction fields. 2 credits.

Response: detected_page_type, suggested_schema, field_metadata with confidence scores.

Page types: E-commerce, Blog, News, Documentation, Job Listing, Recipe, Event, Company About
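
A hypothetical request (the endpoint is documented to analyze a URL; the url body field is an assumption):

# Ask the API to propose an extraction schema for a page (2 credits)
curl -X POST https://www.zipf.ai/api/v1/crawls/suggest-schema \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products/widget"}'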

GET /api/v1/crawls/preview-extraction

Get info about the preview-extraction endpoint.

POST /api/v1/crawls/preview-extraction

Test extraction schema on a single URL. 1 credit.
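
A sketch of a preview call, assuming the body takes a url plus a schema in the Extraction Schema format described above:

# Dry-run a schema against a single page before paying for a full crawl (1 credit)
curl -X POST https://www.zipf.ai/api/v1/crawls/preview-extraction \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget",
    "extraction_schema": {"product_name": "Extract the product title as shown"}
  }'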


GET /api/v1/crawls

List crawl jobs. Query: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
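
For example, to fetch the first 20 completed jobs:

# List completed crawl jobs (first page of 20)
curl -s "https://www.zipf.ai/api/v1/crawls?limit=20&offset=0&status=completed" \
  -H "Authorization: Bearer $ZIPF_API_KEY"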

GET /api/v1/crawls/{id}

Get detailed crawl status, results, classifications, summary, and Smart Crawl analysis.

Response includes: access_level (owner/org_viewer/public), smart_crawl section (when expansion was used).

DELETE /api/v1/crawls/{id}

Cancel a running/scheduled crawl. Releases reserved credits.
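
For example:

# Cancel a running crawl and release its reserved credits
curl -X DELETE https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
  -H "Authorization: Bearer $ZIPF_API_KEY"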


Link Graph

Detailed link graph analysis for a crawl. FREE.

| Parameter | Default | Description |
|-----------|---------|-------------|
| include_nodes | true | Include node data |
| include_edges | false | Include edge data (can be large) |
| node_limit | 100 | Max nodes (1-1000) |
| sort_by | pagerank | pagerank, inlinks, authority, hub |
| min_importance | 0 | Filter by minimum PageRank |

Response: link_graph.stats (total_nodes, total_edges, avg_inlinks, avg_outlinks, graph_density, top_domains), domain_authority, nodes[] (url, pagerank, inlinks, outlinks, hub_score, authority_score), edges[] (from, to, anchor), pagination.

Session-level link graphs: see Sessions API.


GET /api/v1/crawls/{id}/execute

Get execution options and history.

POST /api/v1/crawls/{id}/execute

Execute a scheduled crawl immediately. Params: force, webhook_url, browser_automation, scraperapi

POST /api/v1/crawls/{id}/share

Toggle sharing. Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}

Crawls in sessions inherit org sharing from parent session.
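
For example, to enable a public link or share with an organization (the organization_id value is a placeholder):

# Enable a public share link
curl -X POST https://www.zipf.ai/api/v1/crawls/$CRAWL_ID/share \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"public_share": true}'

# Or share with an organization
curl -X POST https://www.zipf.ai/api/v1/crawls/$CRAWL_ID/share \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"shared_with_org": true, "organization_id": "00000000-0000-0000-0000-000000000000"}'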

GET /api/v1/public/crawl/{id}

Access publicly shared crawl (no auth).

GET /api/v1/search/jobs/{id}/crawls

Get all crawls created from a search job's URLs.


Workflow Recrawl API

Importance-based recrawl prioritization for patrol workflows. Prerequisites: a workflow attached to a session, with Smart Crawl enabled.

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/v1/workflows/{id}/recrawl | Get recrawl config and status |
| POST | /api/v1/workflows/{id}/recrawl | Enable or update recrawl |
| DELETE | /api/v1/workflows/{id}/recrawl | Disable recrawl |
| PATCH | /api/v1/workflows/{id}/recrawl | Trigger an immediate recrawl |

POST Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| strategy | "importance" | "importance", "time", or "hybrid" |
| min_interval_hours | 24 | Minimum hours between recrawls |
| max_interval_hours | 720 | Maximum hours before a recrawl |
| importance_multiplier | 2.0 | How much importance affects frequency |
| change_detection | true | Track content changes via SHA-256 |
| priority_threshold | 0.0 | Only recrawl URLs above this score |
| execute_now | false | Execute immediately after enabling |
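
For example, enabling importance-based recrawl with an immediate first pass (values illustrative):

# Enable importance-based recrawl on a workflow
curl -X POST https://www.zipf.ai/api/v1/workflows/$WORKFLOW_ID/recrawl \
  -H "Authorization: Bearer $ZIPF_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "strategy": "importance",
    "min_interval_hours": 12,
    "max_interval_hours": 336,
    "importance_multiplier": 2.0,
    "priority_threshold": 0.1,
    "execute_now": true
  }'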

Interval formula: max_interval - (importance × multiplier × (max_interval - min_interval))
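
Worked example with the defaults (min 24 h, max 720 h, multiplier 2.0): a URL with importance 0.25 is recrawled every 720 - (0.25 × 2.0 × 696) = 372 hours, while importance 0.5 already drives the interval down to the 24-hour floor; higher scores presumably clamp to min_interval_hours.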
