# Scraping Pros API — Full Reference

> Professional web scraping API. Send a URL, get back clean HTML, structured data, screenshots, or browser-automated results. Works on simple sites and on hard-to-scrape sites protected by Cloudflare, DataDome, Akamai, and other anti-bot systems.

Scraping Pros handles the infrastructure: browser rendering with anti-detection (Camoufox, based on Firefox — less targeted by anti-bot systems than Chrome), proxy rotation across 200+ countries with 950+ proxies, intelligent proxy scoring, CAPTCHA detection, automatic retries, webhook notifications, and a credit-based billing system. You focus on the data.

## Instructions for AI Agents

When generating code to integrate with the Scraping Pros API:

- Base URL: `https://api.scrapingpros.com/v1/`
- Auth: send an `Authorization: Bearer YOUR_TOKEN` header on every request (or use the demo token: `demo_6x595maoA6GdOdVb`)
- **Demo token (no signup needed)**: `demo_6x595maoA6GdOdVb` — 5,000 credits/month, all features except country proxies
- Content-Type: `application/json` for all POST requests
- The main endpoint is `POST /v1/sync/scrape` — it handles everything from simple HTTP fetches to complex browser automations
- Default behavior (no options): a simple HTTP GET with proxy rotation. Fast (~1-3s). Returns raw HTML.
- Add `"format": "markdown"` to get clean text instead of HTML — recommended for AI/LLM use; strips scripts, styles, and navigation
- Add `"browser": true` when the site uses JavaScript rendering, SPAs, or lazy loading, or has anti-bot protections
- Add `"use_proxy": "any"` when the site blocks direct requests or you need IP rotation
- Add `"use_proxy": "US"` (ISO country code) when you need content from a specific country (prices, availability, geo-restricted pages). Requires prior approval.
- Add `"retry_on_block": true` to auto-retry when a CAPTCHA or 403 is detected — up to 3 retries with a different IP/fingerprint. Early CAPTCHA detection cuts fail time from ~85s to ~5s. Only the successful attempt is charged.
- The response includes `statusCode` (the target page's HTTP status) and `html` (when format=html) or `markdown` (when format=markdown)
- The response always includes `timings` (even on errors) — use it to diagnose slow requests
- Use `POST /v1/async/viability-test` with `{"urls": ["https://example.com"]}` to analyze a site before scraping — returns the recommended strategy, CAPTCHA detection, and proxy requirements
- For batch processing, use async collections with `callback_url` to get a webhook POST when all jobs complete
- Check `potentiallyBlockedByCaptcha: true` to detect blocked requests
- Parse `timings` to understand performance (proxy_ms, navigation_ms, etc.)
- Handle 429 errors: read the `Retry-After` header and implement exponential backoff
- Monitor the `X-Quota-Remaining` header to track monthly credit usage
- **Credit system**: 1 simple request = 1 credit, 1 browser request = 5 credits. Proxy and anti-bot are included. No charge on infrastructure errors.

## POST /v1/sync/scrape — Core Endpoint

The primary endpoint. Scrapes a single URL synchronously and returns the result.

### Request Body

```json
{
  "url": "https://example.com",
  "format": "html",
  "browser": false,
  "browser_type": "heavy",
  "use_proxy": "any",
  "screenshot": false,
  "actions": [],
  "extract": {},
  "headers": {},
  "cookies": {},
  "language": "",
  "window_size": "desktop",
  "http_method": null,
  "network_capture": null,
  "retry_on_block": false
}
```

### Parameters

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | Target URL to scrape. Must be http:// or https://. |
| format | string | "html" | "html" (raw HTML) or "markdown" (clean text; strips scripts/styles/nav). Use "markdown" for AI/LLM consumption. |
| browser | boolean | false | Enable headless browser rendering. Required for JavaScript-heavy sites. |
| browser_type | string | "heavy" | "heavy" (Camoufox, anti-detection) or "light" (Playwright Chromium, faster). |
| use_proxy | string or object | null | "any" for automatic proxy, "US"/"GB"/etc. for country-specific, or {"proxy": "US", "max_retries": 3}. |
| screenshot | boolean | false | Capture a full-page screenshot (base64 PNG). Requires browser=true. |
| actions | array | null | Browser automation actions (click, type, evaluate, etc.). Requires browser=true. |
| extract | object | null | CSS/XPath selectors for structured data extraction. Works with and without browser. |
| headers | object | null | Custom HTTP headers to send with the request. |
| cookies | object | null | Cookies to set before loading the page. |
| language | string | "" | Accept-Language header and browser locale (e.g. "en-US", "es-AR"). |
| window_size | string | "desktop" | Browser viewport: "desktop", "mobile", or "any". |
| http_method | object | null | For non-GET requests: {"method": "post", "payload": {...}}. Cannot be combined with browser=true. |
| network_capture | object | null | Capture network requests: {"resource_types": ["xhr", "fetch"]}. Requires browser=true. |
| retry_on_block | boolean | false | Auto-retry when a CAPTCHA or 403 is detected. Max 3 retries with a different IP/fingerprint. Only the successful attempt is charged. |

### Response

```json
{
  "html": "...",
  "markdown": null,
  "statusCode": 200,
  "message": "OK",
  "screenshot": null,
  "executionTime": 2.341,
  "extracted_data": null,
  "evaluate_results": null,
  "network_requests": null,
  "potentiallyBlockedByCaptcha": false,
  "timings": {
    "queue_wait_ms": 45,
    "proxy_ms": 120,
    "browser_launch_ms": 2300,
    "navigation_ms": 8500,
    "extraction_ms": 150
  },
  "guidance": {
    "success": true,
    "error_type": null,
    "error_provider": null,
    "credits_charged": 5,
    "credits_refunded": false,
    "next_steps": [],
    "suggested_request": null,
    "stop_reason": null
  }
}
```

### Simple Scrape {#simple-scrape}

Fastest option: a plain HTTP request, no browser.
```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com"}
)
data = response.json()
print(data["html"][:200])
```

### Markdown Output {#markdown-output}

Get clean text instead of HTML. Recommended for AI/LLM consumption, RAG pipelines, and content analysis. Strips scripts, styles, navigation, footers, and boilerplate.

```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "markdown"}'
```

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com", "format": "markdown"}
)
data = response.json()
print(data["markdown"])
# Clean text: "# Example Domain\n\nThis domain is for use in..."
# data["html"] is null when format=markdown
```

When `format=markdown`, the `markdown` field contains clean text and `html` is null. When `format=html` (the default), the `html` field contains raw HTML and `markdown` is null.

### Browser Scrape {#browser-scrape}

For JavaScript-rendered pages, SPAs, or sites with anti-bot protection.
```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "browser": true, "use_proxy": "any"}'
```

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com", "browser": True, "use_proxy": "any"}
)
```

### Data Extraction {#data-extraction}

Extract structured data without parsing HTML yourself. Works with CSS selectors or XPath.

```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/",
    "extract": {
      "title": "css:h1",
      "quotes": {"selector": "css:.text", "multiple": true},
      "authors": {"selector": "css:.author", "multiple": true}
    }
  }'
```

The response includes:

```json
{
  "extracted_data": {
    "title": "Quotes to Scrape",
    "quotes": ["The world as we have created it...", "It is our choices..."],
    "authors": ["Albert Einstein", "J.K. Rowling"]
  }
}
```

Extraction supports:

- CSS selectors: `"css:h1"`, `"css:.price"`, `"css:#product-title"`
- XPath: `"xpath://div[@class='product']"`, `"//h1/text()"`
- Multiple results: `{"selector": "css:.item", "multiple": true}`
- Attribute extraction: `{"selector": "css:img", "attribute": "src"}`

### Screenshot {#screenshot}

Capture a full-page screenshot as a base64-encoded PNG.

```python
import base64

import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "url": "https://example.com",
        "browser": True,
        "screenshot": True
    }
)
png_data = base64.b64decode(response.json()["screenshot"])
with open("screenshot.png", "wb") as f:
    f.write(png_data)
```

### Proxy with Country {#proxy-country}

Access geo-restricted content using country-specific proxies.
```python
import httpx

# First, check available countries
countries = httpx.get(
    "https://api.scrapingpros.com/v1/proxy/countries",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
).json()
# Returns: {"countries": ["US", "GB", "AR", "BR", ...]}

# Request access to a country (one-time; requires admin approval)
httpx.post(
    "https://api.scrapingpros.com/v1/proxy/request-country",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"country_code": "US", "reason": "Need US pricing data"}
)

# Once approved, use country proxies
response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com", "use_proxy": "US"}
)
```

### Browser Actions {#browser-actions}

Automate complex interactions: click buttons, fill forms, scroll, execute JavaScript.

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "url": "https://example.com/search",
        "browser": True,
        "actions": [
            {"type": "input", "selector": "css:input[name=q]", "text": "web scraping"},
            {"type": "click", "selector": "css:button[type=submit]"},
            {"type": "wait-for-selector", "selector": "css:.results", "time": 5000},
            {"type": "evaluate", "script": "document.querySelectorAll('.result').length"}
        ],
        "extract": {
            "results": {"selector": "css:.result-title", "multiple": True}
        }
    }
)
data = response.json()
print(data["evaluate_results"])  # [10] — number of results
print(data["extracted_data"]["results"])  # ["Result 1", "Result 2", ...]
```

Available action types:

- `click` — Click an element (CSS/XPath selector)
- `input` — Type text into a form field
- `select` — Select an option from a dropdown
- `key-press` — Press a keyboard key (Enter, Tab, etc.)
- `wait-for-selector` — Wait for an element to appear
- `wait-for-timeout` — Wait a fixed number of milliseconds
- `evaluate` — Execute arbitrary JavaScript and return the result
- `while` — Loop actions while a condition is met
- `collect` — Extract data during a loop iteration

### Async Batch Processing {#async-batch}

Process many URLs efficiently using collections and runs. Add `callback_url` to receive a webhook when the run completes (instead of polling).

```python
import time

import httpx

# 1. Create a collection (with an optional webhook)
collection = httpx.post(
    "https://api.scrapingpros.com/v1/async/collections",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "name": "product-pages",
        "requests": [
            {"url": "https://example.com/product/1", "use_proxy": "any"},
            {"url": "https://example.com/product/2", "use_proxy": "any"},
            {"url": "https://example.com/product/3", "use_proxy": "any"},
        ],
        "callback_url": "https://your-server.com/webhook"  # optional
    }
).json()
collection_id = collection["id"]

# 2. Start a run
run = httpx.post(
    f"https://api.scrapingpros.com/v1/async/collections/{collection_id}/run",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
).json()
run_id = run["run_id"]

# 3a. Poll for completion (if no webhook)
while True:
    status = httpx.get(
        f"https://api.scrapingpros.com/v1/async/collections/{collection_id}/runs/{run_id}",
        headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
    ).json()
    if status["success_requests"] + status["failed_requests"] >= status["total_requests"]:
        break
    time.sleep(5)
print(f"Done: {status['success_requests']} succeeded, {status['failed_requests']} failed")

# 3b. Or receive a webhook POST at your callback_url when the run completes:
# {
#   "event": "run.completed",
#   "run_id": "...",
#   "collection_id": "...",
#   "status": "completed",
#   "total_requests": 3, "success_requests": 3, "failed_requests": 0,
#   "job_ids": ["job-1", "job-2", "job-3"],
#   "results_url": "https://api.scrapingpros.com/v1/async/collections/{cid}/runs/{rid}",
#   "timestamp": "2026-04-06T..."
# }
# Verify with: X-SP-Signature header (HMAC-SHA256) + X-SP-Timestamp

# 4. Fetch individual job results (available for 24 hours)
for job_id in status.get("job_ids", []):
    result = httpx.get(
        f"https://api.scrapingpros.com/v1/async/collections/{collection_id}/runs/{run_id}/jobs/{job_id}/result",
        headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
    ).json()
    print(result.get("html", "")[:100])
```

### Retry on Block {#retry-on-block}

Auto-retry when a site blocks the request (CAPTCHA, 403). Up to 3 retries with a different IP/fingerprint. Only the successful attempt is charged credits.

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "url": "https://hard-to-scrape-site.com",
        "browser": True,
        "use_proxy": "any",
        "retry_on_block": True  # auto-retry on CAPTCHA/403
    }
)
data = response.json()
# If all retries fail, the last attempt's result is returned;
# potentiallyBlockedByCaptcha will be True if still blocked.
```

## Webhooks (Async Completion Notifications)

Add `callback_url` when creating a collection to receive a POST notification when all jobs in a run complete.
```json
POST /v1/async/collections
{
  "name": "my-batch",
  "requests": [{"url": "https://example.com"}],
  "callback_url": "https://your-server.com/webhook"
}
```

### Webhook Payload

When the run completes, Scraping Pros sends a signed HTTP POST to your `callback_url`:

```json
{
  "event": "run.completed",
  "run_id": "uuid",
  "collection_id": "uuid",
  "status": "completed",
  "total_requests": 10,
  "success_requests": 8,
  "failed_requests": 2,
  "job_ids": ["job-1", "job-2", "..."],
  "results_url": "https://api.scrapingpros.com/v1/async/collections/{cid}/runs/{rid}",
  "timestamp": "2026-04-06T20:30:00Z"
}
```

### Webhook Security

Every webhook POST includes HMAC-SHA256 signature headers:

- `X-SP-Signature: sha256=...` — HMAC of `{timestamp}.{body}` keyed with your webhook secret
- `X-SP-Timestamp` — when the webhook was sent

Verify the signature server-side to ensure the webhook came from Scraping Pros.

### Webhook Delivery Status

Track delivery via the `callback_status` field in `GET /runs/{run_id}`:

- `pending` — run still in progress
- `sent` — webhook delivered successfully
- `failed` — delivery failed (will retry)
- `retrying` — automatic retry in progress (up to 5 retries: 1min, 5min, 30min, 2h, 12h)

## Credit System and Plans

### How Credits Work

| Request type | Credits |
|--------------|---------|
| Simple HTTP scrape | 1 credit |
| Browser scrape (Camoufox) | 5 credits |

Anti-bot protection, proxy rotation (`use_proxy="any"`), and markdown conversion are included at no extra cost.

Credits are NOT consumed when requests fail due to infrastructure errors (timeouts, proxy failures, connection errors). Site-level blocks (403, CAPTCHA) do consume credits.
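The per-request pricing makes it easy to bound a batch's credit cost before running anything. A minimal sketch, assuming the documented 1-credit/5-credit pricing; the `estimate_credits` helper is illustrative, not part of the API or any SDK:

```python
# Illustrative helper (not part of any official SDK): bound the credit cost
# of a batch using the documented pricing of 1 credit per simple HTTP scrape
# and 5 credits per browser scrape.

def estimate_credits(requests: list) -> int:
    """Worst-case credit cost for a list of /sync/scrape request bodies."""
    return sum(5 if r.get("browser") else 1 for r in requests)

batch = [
    {"url": "https://example.com/a"},                   # simple: 1 credit
    {"url": "https://example.com/b", "browser": True},  # browser: 5 credits
    {"url": "https://example.com/c", "browser": True},  # browser: 5 credits
]
print(estimate_credits(batch))  # 11
```

Because infrastructure errors and failed retries are not charged, this is an upper bound on actual spend; compare it against `X-Quota-Remaining` before launching a large run.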
### Plans

Use `GET /v1/plans` (no auth required) to see all plans with pricing:

| Plan | Price/mo | Credits/mo | Browser requests | Rate limit |
|------|----------|------------|------------------|------------|
| Free | $0 | 1,000 | 200 | 10/min |
| Starter | $29 | 10,000 | 2,000 | 60/min |
| Growth | $69 | 50,000 | 10,000 | 120/min |
| Pro | $199 | 200,000 | 40,000 | 200/min |
| Scale | $499 | 1,000,000 | 200,000 | 500/min |
| Enterprise | Custom | Unlimited | Unlimited | Custom |

Demo token: `demo_6x595maoA6GdOdVb` — 5,000 credits/month, 30 req/min.

## Error Handling

| Status | Meaning | Action |
|--------|---------|--------|
| 200 | Success | Parse response body |
| 400 | Bad request (invalid parameters) | Fix request body |
| 401 | Invalid or missing token | Check Authorization header |
| 403 | Country proxy not approved | Use POST /v1/proxy/request-country first |
| 422 | URL blocked (SSRF protection) | Only public http/https URLs allowed |
| 429 | Rate limit or quota exceeded | Read Retry-After header, wait and retry |

## Rate Limits and Quotas

Every response includes these headers:

```
X-RateLimit-Limit: 200        # requests per minute allowed
X-RateLimit-Remaining: 185    # requests left in current window
X-RateLimit-Reset: 1775401200 # when the window resets (Unix timestamp)
X-Quota-Limit: 50000          # monthly credit quota (0 = unlimited)
X-Quota-Used: 1523            # credits used this month
X-Quota-Remaining: 48477      # credits remaining
X-Credits-Charged: 5          # credits charged for THIS request (1 or 5)
```

When rate-limited (429), the `Retry-After` header tells you how many seconds to wait.

## Timings (Always Present)

Every response includes a `timings` object — even on errors. Use it for diagnostics:

```json
{
  "timings": {
    "queue_wait_ms": 45,
    "proxy_ms": 120,
    "browser_launch_ms": 2300,
    "navigation_ms": 8500,
    "extraction_ms": 150
  }
}
```

On error responses, `timings` contains partial values (whatever was measured before the failure). Never null.
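A quick diagnostic pattern: reduce the `timings` object to its dominant phase to see where a slow request spent its time. A minimal sketch; the `slowest_phase` helper is illustrative, not part of the API:

```python
# Illustrative helper (not part of the API): find the dominant phase in a
# `timings` object. On errors the object may contain only partial values,
# so the helper looks at whatever phases were actually recorded.

def slowest_phase(timings: dict) -> tuple:
    """Return (phase_name, milliseconds) for the largest recorded timing."""
    return max(timings.items(), key=lambda kv: kv[1])

timings = {
    "queue_wait_ms": 45,
    "proxy_ms": 120,
    "browser_launch_ms": 2300,
    "navigation_ms": 8500,
    "extraction_ms": 150,
}
phase, ms = slowest_phase(timings)
print(f"{phase}: {ms} ms")  # navigation_ms: 8500 ms
```

In this sample, navigation dominates, suggesting the time went into loading the target page rather than into queueing or proxy selection.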