Crawl API
Agent-driven web crawling with AI-powered classification, structured data extraction, and recursive link following.
Credits: 1/page (basic) | 2/page (with classify_documents, extraction_schema, or classifiers) | FREE (summary)
POST /api/v1/crawls
Deploy an agent to crawl and extract from web sources.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | 1-100 seed URLs to crawl |
| max_pages | int | 10 | Maximum pages to crawl (1-1000) |
| processing_mode | string | "async" | "async", "sync", or "webhook" |
| classify_documents | bool | false | AI classification (triggers advanced pricing) |
| extraction_schema | object | - | Custom fields to extract (triggers advanced pricing) |
| generate_summary | bool | false | AI summary of crawled docs (FREE) |
| expansion | string | "none" | URL expansion: "internal", "external", "both", "none" |
| intent | string | - | Natural language crawl goal (10-2000 chars). Guides URL prioritization when expansion is enabled |
| auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions |
| session_id | UUID | - | Link to existing session |
| dry_run | bool | false | Validate without executing |
| webhook_url | URL | - | Required if processing_mode="webhook" |
Advanced Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| follow_links | bool | false | Recursive crawling (legacy; prefer expansion) |
| link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination |
| classifiers[] | array | - | AI filters: `{type: "url" \| …}` |
| budget_config | object | - | max_pages, max_depth, max_credits |
| use_cache | bool | false | Global cache (50% savings on hits) |
| cache_max_age | int | 86400 | Cache TTL in seconds |
| premium_proxy | bool | false | Residential IPs |
| ultra_premium | bool | false | Mobile residential IPs (highest success rate) |
| retry_strategy | string | "auto" | "none", "auto" (escalate tiers), "premium_only" |
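A hedged sketch combining several of the advanced parameters above (caching plus proxy escalation). The endpoint and parameter names come from the tables in this section; the target URL and values are illustrative.
# Cached crawl with premium proxy and automatic retry escalation
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"max_pages": 20,
"use_cache": true,
"cache_max_age": 3600,
"premium_proxy": true,
"retry_strategy": "auto"
}'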
Processing Modes
| Mode | Behavior | Best For |
|---|---|---|
| async | Returns immediately, poll for results | Large crawls |
| sync | Waits for completion (max_pages ≤ 10) | Small crawls, real-time |
| webhook | Returns immediately, POSTs results to URL | Event-driven |
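For example, a webhook-mode crawl returns immediately and POSTs the results to your endpoint when finished; the webhook URL below is a placeholder.
# Webhook-mode crawl
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 50, "processing_mode": "webhook", "webhook_url": "https://your-app.example.com/hooks/crawl-complete"}'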
Smart Crawl
When expansion is enabled, the agent prioritizes which URLs to crawl next using the following scoring signals:
- OPIC: Streaming PageRank approximation that updates as you crawl
- HITS: Hub and authority scoring for link aggregators and authoritative sources
- Anchor text relevance: Link text matching your intent gets priority
Efficiency: 135x average discovery ratio vs traditional BFS crawling.
Intent-Driven Crawling
With intent + expansion, the agent uses your goal to guide its patrol:
- Depth auto-adjustment: Default depth 1→2 (or 3 with "individual"/"specific"/"detail" keywords)
- URL pattern boosting: Intent-matching patterns get +0.3 score boost
- Budget allocation: 30% hub discovery, 60% detail expansion, 10% exploration
- Navigation penalty recovery: the -0.4 penalty on /about/, /team/ is partially recovered (+0.2) when matching intent
No extra cost (heuristic-based).
Smart Crawl Response Fields
When active, GET /api/v1/crawls/{id} includes smart_crawl:
| Field | Description |
|---|---|
| smart_crawl.enabled | Whether Smart Crawl was active |
| smart_crawl.algorithm | Algorithm used (opic) |
| smart_crawl.stats | total_nodes, total_edges, avg_inlinks, graph_density |
| smart_crawl.top_pages[] | url, importance, inlinks |
| smart_crawl.domain_authority | Domain → authority score map |
| smart_crawl.efficiency | coverage_ratio, importance_captured, avg_importance_per_page |
| smart_crawl.hubs[] | url, score |
| smart_crawl.authorities[] | url, score |
Each result also includes: importance_score, inlink_count, outlink_count, hub_score, authority_score, anchor_texts[].
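A minimal sketch for inspecting the Smart Crawl output of a finished crawl, assuming the smart_crawl section sits at the top level of the GET response and that jq is available locally.
# List the highest-importance pages discovered by Smart Crawl (field paths from the table above)
curl -s https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
-H "Authorization: Bearer $ZIPF_API_KEY" \
| jq '.smart_crawl.top_pages[] | {url, importance, inlinks}'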
Examples
# Basic crawl (5 credits)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'
# Intent-driven crawl with extraction
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/team"],
"max_pages": 30,
"expansion": "internal",
"intent": "Find individual team member bio pages with background and role",
"extraction_schema": {
"name": "Extract the person full name",
"role": "Extract job title or role"
}
}'
Extraction Schema
Extract structured data using AI. Triggers advanced pricing.
Format: {"field_name": "Extraction instruction (min 10 chars)"}. Field names: [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.
Response: extracted_data (field→value), extraction_metadata (fields_extracted, confidence_scores).
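An illustrative shape for these two response fields, reusing the name/role schema from the intent-driven example above; the values and the exact nesting inside each result object are hypothetical.
# Hypothetical per-result extraction fields
{
"extracted_data": {"name": "Jane Doe", "role": "CTO"},
"extraction_metadata": {
"fields_extracted": 2,
"confidence_scores": {"name": 0.95, "role": 0.88}
}
}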
GET /api/v1/crawls/suggest-schema
Get info about the suggest-schema endpoint.
POST /api/v1/crawls/suggest-schema
Analyze a URL and suggest extraction fields. 2 credits.
Response: detected_page_type, suggested_schema, field_metadata with confidence scores.
Page types: E-commerce, Blog, News, Documentation, Job Listing, Recipe, Event, Company About
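A hedged request sketch for schema suggestion; the request body field name is not documented here, so the "url" key below is an assumption.
# Suggest an extraction schema for a page (2 credits)
curl -X POST https://www.zipf.ai/api/v1/crawls/suggest-schema \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/product/123"}'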
GET /api/v1/crawls/preview-extraction
Get info about the preview-extraction endpoint.
POST /api/v1/crawls/preview-extraction
Test extraction schema on a single URL. 1 credit.
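A hedged request sketch for previewing a schema; the body field names below ("url", "extraction_schema") are assumptions based on the endpoint description.
# Test an extraction schema against one URL (1 credit)
curl -X POST https://www.zipf.ai/api/v1/crawls/preview-extraction \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/team/jane", "extraction_schema": {"name": "Extract the person full name"}}'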
GET /api/v1/crawls
List crawl jobs. Query: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
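For example, using the documented query parameters:
# List the 20 most recent completed crawls
curl -s "https://www.zipf.ai/api/v1/crawls?limit=20&offset=0&status=completed" \
-H "Authorization: Bearer $ZIPF_API_KEY"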
GET /api/v1/crawls/{id}
Get detailed crawl status, results, classifications, summary, and Smart Crawl analysis.
Response includes: access_level (owner/org_viewer/public), smart_crawl section (when expansion was used).
DELETE /api/v1/crawls/{id}
Cancel a running/scheduled crawl. Releases reserved credits.
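For example:
# Cancel a running crawl and release its reserved credits
curl -X DELETE https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
-H "Authorization: Bearer $ZIPF_API_KEY"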
Link Graph API
Detailed link graph analysis. FREE.
GET /api/v1/crawls/{id}/link-graph
| Parameter | Default | Description |
|---|---|---|
| include_nodes | true | Include node data |
| include_edges | false | Include edge data (can be large) |
| node_limit | 100 | Max nodes (1-1000) |
| sort_by | pagerank | pagerank, inlinks, authority, hub |
| min_importance | 0 | Filter by minimum PageRank |
Response: link_graph.stats (total_nodes, total_edges, avg_inlinks, avg_outlinks, graph_density, top_domains), domain_authority, nodes[] (url, pagerank, inlinks, outlinks, hub_score, authority_score), edges[] (from, to, anchor), pagination.
Session-level link graphs: see Sessions API.
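A sketch using the query parameters from the table above:
# Fetch the top 50 nodes of a crawl's link graph, sorted by authority, including edges
curl -s "https://www.zipf.ai/api/v1/crawls/$CRAWL_ID/link-graph?include_edges=true&node_limit=50&sort_by=authority" \
-H "Authorization: Bearer $ZIPF_API_KEY"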
GET /api/v1/crawls/{id}/execute
Get execution options and history.
POST /api/v1/crawls/{id}/execute
Execute a scheduled crawl immediately. Params: force, webhook_url, browser_automation, scraperapi
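For example, forcing immediate execution with a result webhook (the webhook URL is a placeholder):
# Execute a scheduled crawl now
curl -X POST https://www.zipf.ai/api/v1/crawls/$CRAWL_ID/execute \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"force": true, "webhook_url": "https://your-app.example.com/hooks/crawl-complete"}'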
POST /api/v1/crawls/{id}/share
Toggle sharing. Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}
Crawls in sessions inherit org sharing from parent session.
GET /api/v1/public/crawl/{id}
Access publicly shared crawl (no auth).
GET /api/v1/search/jobs/{id}/crawls
Get all crawls created from a search job's URLs.
Workflow Recrawl API
Importance-based recrawl prioritization for patrol workflows. Prerequisites: Workflow with session, Smart Crawl enabled.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/workflows/{id}/recrawl | Get recrawl config and status |
| POST | /api/v1/workflows/{id}/recrawl | Enable/update recrawl |
| DELETE | /api/v1/workflows/{id}/recrawl | Disable recrawl |
| PATCH | /api/v1/workflows/{id}/recrawl | Trigger immediate recrawl |
POST Parameters
| Parameter | Default | Description |
|---|---|---|
| strategy | "importance" | "importance", "time", "hybrid" |
| min_interval_hours | 24 | Min hours between recrawls |
| max_interval_hours | 720 | Max hours before recrawl |
| importance_multiplier | 2.0 | How much importance affects frequency |
| change_detection | true | Track content changes via SHA-256 |
| priority_threshold | 0.0 | Only recrawl URLs above this score |
| execute_now | false | Execute immediately after enabling |
Interval formula: max_interval - (importance × multiplier × (max_interval - min_interval))
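A hedged example of enabling importance-based recrawl with the documented parameters, followed by one worked instance of the interval formula; the importance value is illustrative.
# Enable importance-based recrawl with change detection
curl -X POST https://www.zipf.ai/api/v1/workflows/$WORKFLOW_ID/recrawl \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"strategy": "importance", "min_interval_hours": 24, "max_interval_hours": 720, "importance_multiplier": 2.0, "change_detection": true}'
# Worked interval example with these settings: a page with importance 0.4 gets
# 720 - (0.4 × 2.0 × (720 - 24)) = 720 - 556.8 ≈ 163 hours between recrawls.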