Crawl API
Agent-driven web crawling with AI-powered classification, structured data extraction, and recursive link following.
Credits: 1/page (basic) | 2/page (with classify_documents, extraction_schema, or classifiers) | FREE (summary)
POST /api/v1/crawls
Deploy an agent to crawl and extract from web sources.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | 1-100 seed URLs to crawl |
| max_pages | int | 10 | Maximum pages to crawl (1-1000) |
| processing_mode | string | "async" | "async", "sync", or "webhook" |
| classify_documents | bool | false | AI classification (triggers advanced pricing) |
| extraction_schema | object | - | Custom fields to extract (triggers advanced pricing) |
| generate_summary | bool | false | AI summary of crawled docs (FREE) |
| expansion | string | "none" | URL expansion: "internal", "external", "both", "none" |
| intent | string | - | Natural language crawl goal (10-2000 chars). Guides URL prioritization when expansion is enabled |
| auto_generate_intent | bool | false | Auto-generate intent from extraction_schema field descriptions |
| session_id | UUID | - | Link to existing session |
| dry_run | bool | false | Validate without executing |
| webhook_url | URL | - | Required if processing_mode="webhook" |
Advanced Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| follow_links | bool | false | Recursive crawling (legacy; prefer expansion) |
| link_extraction_config | object | - | max_depth (1-10), url_patterns[], exclude_patterns[], detect_pagination |
| classifiers[] | array | - | AI filters: `{type: "url" \| …}` |
| budget_config | object | - | max_pages, max_depth, max_credits |
| use_cache | bool | false | Global cache (50% savings on hits) |
| cache_max_age | int | 86400 | Cache TTL in seconds |
| premium_proxy | bool | false | Residential IPs |
| ultra_premium | bool | false | Mobile residential IPs (highest success rate) |
| retry_strategy | string | "auto" | "none", "auto" (escalate tiers), "premium_only" |
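A hedged sketch combining several of the advanced parameters above (caching plus proxy escalation). The endpoint and parameter names come from the tables in this section; the target URL and values are illustrative.
# Cached crawl with premium proxy and automatic retry escalation
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"max_pages": 20,
"use_cache": true,
"cache_max_age": 3600,
"premium_proxy": true,
"retry_strategy": "auto"
}'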
Processing Modes
| Mode | Behavior | Best For |
|---|---|---|
| async | Returns immediately, poll for results | Large crawls |
| sync | Waits for completion (max_pages ≤ 10) | Small crawls, real-time |
| webhook | Returns immediately, POSTs results to URL | Event-driven |
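For example, a webhook-mode crawl returns immediately and POSTs the results to your endpoint when finished; the webhook URL below is a placeholder.
# Webhook-mode crawl
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 50, "processing_mode": "webhook", "webhook_url": "https://your-app.example.com/hooks/crawl-complete"}'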
Smart Crawl
When expansion is enabled, the agent prioritizes which URLs to crawl next using the following scoring signals:
- OPIC: Streaming PageRank approximation that updates as you crawl
- HITS: Hub and authority scoring for link aggregators and authoritative sources
- Anchor text relevance: Link text matching your intent gets priority
Efficiency: 135x average discovery ratio vs traditional BFS crawling.
Intent-Driven Crawling
With intent + expansion, the agent uses your goal to guide its patrol:
- Depth auto-adjustment: Default depth 1→2 (or 3 with "individual"/"specific"/"detail" keywords)
- URL pattern boosting: Intent-matching patterns get +0.3 score boost
- Budget allocation: 30% hub discovery, 60% detail expansion, 10% exploration
- Navigation penalty recovery: the -0.4 penalty on /about/, /team/ is partially recovered (+0.2) when matching intent
No extra cost (heuristic-based).
Smart Crawl Response Fields
When active, GET /api/v1/crawls/{id} includes smart_crawl:
| Field | Description |
|---|---|
| smart_crawl.enabled | Whether Smart Crawl was active |
| smart_crawl.algorithm | Algorithm used (opic) |
| smart_crawl.stats | total_nodes, total_edges, avg_inlinks, graph_density |
| smart_crawl.top_pages[] | url, importance, inlinks |
| smart_crawl.domain_authority | Domain → authority score map |
| smart_crawl.efficiency | coverage_ratio, importance_captured, avg_importance_per_page |
| smart_crawl.hubs[] | url, score |
| smart_crawl.authorities[] | url, score |
Each result also includes: importance_score, inlink_count, outlink_count, hub_score, authority_score, anchor_texts[].
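A minimal sketch for inspecting the Smart Crawl output of a finished crawl, assuming the smart_crawl section sits at the top level of the GET response and that jq is available locally.
# List the highest-importance pages discovered by Smart Crawl (field paths from the table above)
curl -s https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
-H "Authorization: Bearer $ZIPF_API_KEY" \
| jq '.smart_crawl.top_pages[] | {url, importance, inlinks}'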
Examples
# Basic crawl (5 credits)
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"], "max_pages": 5, "classify_documents": false}'
# Intent-driven crawl with extraction
curl -X POST https://www.zipf.ai/api/v1/crawls \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/team"],
"max_pages": 30,
"expansion": "internal",
"intent": "Find individual team member bio pages with background and role",
"extraction_schema": {
"name": "Extract the person full name",
"role": "Extract job title or role"
}
}'
Extraction Schema
Extract structured data using AI. Triggers advanced pricing.
Format: {"field_name": "Extraction instruction (min 10 chars)"}. Field names: [a-zA-Z_][a-zA-Z0-9_]*, max 20 fields.
Response: extracted_data (field→value), extraction_metadata (fields_extracted, confidence_scores).
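An illustrative shape for these two response fields, reusing the name/role schema from the intent-driven example above; the values and the exact nesting inside each result object are hypothetical.
# Hypothetical per-result extraction fields
{
"extracted_data": {"name": "Jane Doe", "role": "CTO"},
"extraction_metadata": {
"fields_extracted": 2,
"confidence_scores": {"name": 0.95, "role": 0.88}
}
}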
GET /api/v1/crawls/suggest-schema
Get info about the suggest-schema endpoint.
POST /api/v1/crawls/suggest-schema
Analyze a URL and suggest extraction fields. 2 credits.
Response: detected_page_type, suggested_schema, field_metadata with confidence scores.
Page types: E-commerce, Blog, News, Documentation, Job Listing, Recipe, Event, Company About
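A hedged request sketch for schema suggestion; the request body field name is not documented here, so the "url" key below is an assumption.
# Suggest an extraction schema for a page (2 credits)
curl -X POST https://www.zipf.ai/api/v1/crawls/suggest-schema \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/product/123"}'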
GET /api/v1/crawls/preview-extraction
Get info about the preview-extraction endpoint.
POST /api/v1/crawls/preview-extraction
Test extraction schema on a single URL. 1 credit.
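A hedged request sketch for previewing a schema; the body field names below ("url", "extraction_schema") are assumptions based on the endpoint description.
# Test an extraction schema against one URL (1 credit)
curl -X POST https://www.zipf.ai/api/v1/crawls/preview-extraction \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/team/jane", "extraction_schema": {"name": "Extract the person full name"}}'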
GET /api/v1/crawls
List crawl jobs. Query: limit (1-100), offset, status (pending/running/completed/failed/cancelled)
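For example, using the documented query parameters:
# List the 20 most recent completed crawls
curl -s "https://www.zipf.ai/api/v1/crawls?limit=20&offset=0&status=completed" \
-H "Authorization: Bearer $ZIPF_API_KEY"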
GET /api/v1/crawls/{id}
Get detailed crawl status, results, classifications, summary, and Smart Crawl analysis.
Response includes: access_level (owner/org_viewer/public), smart_crawl section (when expansion was used).
DELETE /api/v1/crawls/{id}
Cancel a running/scheduled crawl. Releases reserved credits.
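For example:
# Cancel a running crawl and release its reserved credits
curl -X DELETE https://www.zipf.ai/api/v1/crawls/$CRAWL_ID \
-H "Authorization: Bearer $ZIPF_API_KEY"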
Link Graph API
Detailed link graph analysis. FREE.
GET /api/v1/crawls/{id}/link-graph
| Parameter | Default | Description |
|---|---|---|
| include_nodes | true | Include node data |
| include_edges | false | Include edge data (can be large) |
| node_limit | 100 | Max nodes (1-1000) |
| sort_by | pagerank | pagerank, inlinks, authority, hub |
| min_importance | 0 | Filter by minimum PageRank |
Response: link_graph.stats (total_nodes, total_edges, avg_inlinks, avg_outlinks, graph_density, top_domains), domain_authority, nodes[] (url, pagerank, inlinks, outlinks, hub_score, authority_score), edges[] (from, to, anchor), pagination.
Session-level link graphs: see Sessions API.
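A sketch using the query parameters from the table above:
# Fetch the top 50 nodes of a crawl's link graph, sorted by authority, including edges
curl -s "https://www.zipf.ai/api/v1/crawls/$CRAWL_ID/link-graph?include_edges=true&node_limit=50&sort_by=authority" \
-H "Authorization: Bearer $ZIPF_API_KEY"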
GET /api/v1/crawls/{id}/execute
Get execution options and history.
POST /api/v1/crawls/{id}/execute
Execute a scheduled crawl immediately. Params: force, webhook_url, browser_automation, scraperapi
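For example, forcing immediate execution with a result webhook (the webhook URL is a placeholder):
# Execute a scheduled crawl now
curl -X POST https://www.zipf.ai/api/v1/crawls/$CRAWL_ID/execute \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"force": true, "webhook_url": "https://your-app.example.com/hooks/crawl-complete"}'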
POST /api/v1/crawls/{id}/share
Toggle sharing. Body: {"public_share": true} or {"shared_with_org": true, "organization_id": "..."}
Crawls in sessions inherit org sharing from parent session.
GET /api/v1/public/crawl/{id}
Access publicly shared crawl (no auth).
GET /api/v1/search/jobs/{id}/crawls
Get all crawls created from a search job's URLs.
Workflow Recrawl API
Importance-based recrawl prioritization for patrol workflows. Prerequisites: Workflow with session, Smart Crawl enabled.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/v1/workflows/{id}/recrawl | Get recrawl config and status |
| POST | /api/v1/workflows/{id}/recrawl | Enable/update recrawl |
| DELETE | /api/v1/workflows/{id}/recrawl | Disable recrawl |
| PATCH | /api/v1/workflows/{id}/recrawl | Trigger immediate recrawl |
POST Parameters
| Parameter | Default | Description |
|---|---|---|
| strategy | "importance" | "importance", "time", "hybrid" |
| min_interval_hours | 24 | Min hours between recrawls |
| max_interval_hours | 720 | Max hours before recrawl |
| importance_multiplier | 2.0 | How much importance affects frequency |
| change_detection | true | Track content changes via SHA-256 |
| priority_threshold | 0.0 | Only recrawl URLs above this score |
| execute_now | false | Execute immediately after enabling |
Interval formula: max_interval - (importance × multiplier × (max_interval - min_interval))
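A hedged example of enabling importance-based recrawl with the documented parameters, followed by one worked instance of the interval formula; the importance value is illustrative.
# Enable importance-based recrawl with change detection
curl -X POST https://www.zipf.ai/api/v1/workflows/$WORKFLOW_ID/recrawl \
-H "Authorization: Bearer $ZIPF_API_KEY" \
-H "Content-Type: application/json" \
-d '{"strategy": "importance", "min_interval_hours": 24, "max_interval_hours": 720, "importance_multiplier": 2.0, "change_detection": true}'
# Worked interval example with these settings: a page with importance 0.4 gets
# 720 - (0.4 × 2.0 × (720 - 24)) = 720 - 556.8 ≈ 163 hours between recrawls.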