# Scraping Pros API — Full Reference

> Professional web scraping API. Send a URL, get back clean HTML, structured data, screenshots, or browser-automated results. Works on simple sites and on hard-to-scrape sites protected by Cloudflare, DataDome, Akamai, and other anti-bot systems.

Scraping Pros handles the infrastructure: browser rendering with anti-detection (Camoufox, based on Firefox — less targeted by anti-bot systems than Chrome), proxy rotation across 200+ countries with 950+ proxies, intelligent proxy scoring, CAPTCHA detection, automatic retries, webhook notifications, and a credit-based billing system. You focus on the data.

## Instructions for AI Agents

When generating code to integrate with the Scraping Pros API:

- Base URL: `https://api.scrapingpros.com/v1/`
- Auth: send an `Authorization: Bearer YOUR_TOKEN` header on every request (or use the demo token: `demo_6x595maoA6GdOdVb`)
- **Demo token (no signup needed)**: `demo_6x595maoA6GdOdVb` — 5,000 credits/month, all features except country proxies
- Content-Type: `application/json` for all POST requests
- The main endpoint is `POST /v1/sync/scrape` — it handles everything from simple HTTP fetches to complex browser automations
- Default behavior (no options): a simple HTTP GET with proxy rotation. Fast (~1-3s). Returns raw HTML.
- Add `"format": "markdown"` to get clean text instead of HTML — recommended for AI/LLM use; strips scripts, styles, and navigation
- Add `"browser": true` when the site uses JavaScript rendering, SPAs, or lazy loading, or has anti-bot protections
- Add `"use_proxy": "any"` when the site blocks direct requests or you need IP rotation
- Add `"use_proxy": "US"` (ISO country code) when you need content from a specific country (prices, availability, geo-restricted pages). Requires prior approval.
- Add `"retry_on_block": true` to auto-retry when a CAPTCHA or 403 is detected — up to 3 retries with a different IP/fingerprint. Early CAPTCHA detection cuts fail time from ~85s to ~5s. Only the successful attempt is charged.
- The response includes `statusCode` (the target page's HTTP status) and `html` (when format=html) or `markdown` (when format=markdown)
- The response always includes `timings` (even on errors) — use it to diagnose slow requests
- Use `POST /v1/async/viability-test` with `{"urls": ["https://example.com"]}` to analyze a site before scraping — returns the recommended strategy, CAPTCHA detection, and proxy requirements
- For batch processing, use async collections with `callback_url` to get a webhook POST when all jobs complete
- Check `potentiallyBlockedByCaptcha: true` to detect blocked requests
- Parse `timings` to understand performance (proxy_ms, navigation_ms, etc.)
- Handle 429 errors: read the `Retry-After` header and implement exponential backoff
- Monitor the `X-Quota-Remaining` header to track monthly credit usage
- **Credit system**: 1 simple request = 1 credit, 1 browser request = 5 credits. Proxy and anti-bot are included. No charge on infrastructure errors.

## POST /v1/sync/scrape — Core Endpoint

The primary endpoint. Scrapes a single URL synchronously and returns the result.

### Request Body

```json
{
  "url": "https://example.com",
  "format": "html",
  "browser": false,
  "browser_type": "heavy",
  "use_proxy": "any",
  "screenshot": false,
  "actions": [],
  "extract": {},
  "headers": {},
  "cookies": {},
  "language": "",
  "window_size": "desktop",
  "http_method": null,
  "network_capture": null,
  "retry_on_block": false
}
```

### Parameters

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| url | string | required | Target URL to scrape. Must be http:// or https://. |
| format | string | "html" | "html" (raw HTML) or "markdown" (clean text; strips scripts/styles/nav). Use "markdown" for AI/LLM consumption. |
| browser | boolean | false | Enable headless browser rendering. Required for JavaScript-heavy sites. |
| browser_type | string | "heavy" | "heavy" (Camoufox, anti-detection) or "light" (Playwright Chromium, faster). |
| use_proxy | string or object | null | "any" for automatic proxy, "US"/"GB"/etc. for country-specific, or {"proxy": "US", "max_retries": 3}. |
| screenshot | boolean | false | Capture a full-page screenshot (base64 PNG). Requires browser=true. |
| actions | array | null | Browser automation actions (click, type, evaluate, etc.). Requires browser=true. |
| extract | object | null | CSS/XPath selectors for structured data extraction. Works with and without browser. |
| headers | object | null | Custom HTTP headers to send with the request. |
| cookies | object | null | Cookies to set before loading the page. |
| language | string | "" | Accept-Language header and browser locale (e.g. "en-US", "es-AR"). |
| window_size | string | "desktop" | Browser viewport: "desktop", "mobile", or "any". |
| http_method | object | null | For non-GET requests: {"method": "post", "payload": {...}}. Cannot be combined with browser=true. |
| network_capture | object | null | Capture network requests: {"resource_types": ["xhr", "fetch"]}. Requires browser=true. |
| retry_on_block | boolean | false | Auto-retry when a CAPTCHA or 403 is detected. Max 3 retries with a different IP/fingerprint. Only the successful attempt is charged. |

### Response

```json
{
  "html": "...",
  "markdown": null,
  "statusCode": 200,
  "message": "OK",
  "screenshot": null,
  "executionTime": 2.341,
  "extracted_data": null,
  "evaluate_results": null,
  "network_requests": null,
  "potentiallyBlockedByCaptcha": false,
  "timings": {
    "queue_wait_ms": 45,
    "proxy_ms": 120,
    "browser_launch_ms": 2300,
    "navigation_ms": 8500,
    "extraction_ms": 150
  },
  "guidance": {
    "success": true,
    "error_type": null,
    "error_provider": null,
    "credits_charged": 5,
    "credits_refunded": false,
    "next_steps": [],
    "suggested_request": null,
    "stop_reason": null
  }
}
```

### Simple Scrape {#simple-scrape}

Fastest option: a plain HTTP request, no browser.
```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com"}
)
data = response.json()
print(data["html"][:200])
```

### Markdown Output {#markdown-output}

Get clean text instead of HTML. Recommended for AI/LLM consumption, RAG pipelines, and content analysis. Strips scripts, styles, navigation, footers, and boilerplate.

```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "markdown"}'
```

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com", "format": "markdown"}
)
data = response.json()
print(data["markdown"])
# Clean text: "# Example Domain\n\nThis domain is for use in..."
# data["html"] is null when format=markdown
```

When `format=markdown`, the `markdown` field contains clean text and `html` is null. When `format=html` (the default), the `html` field contains raw HTML and `markdown` is null.

### Browser Scrape {#browser-scrape}

For JavaScript-rendered pages, SPAs, or sites with anti-bot protection.
```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "browser": true, "use_proxy": "any"}'
```

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com", "browser": True, "use_proxy": "any"}
)
```

### Data Extraction {#data-extraction}

Extract structured data without parsing HTML yourself. Works with CSS selectors or XPath.

```bash
curl -X POST https://api.scrapingpros.com/v1/sync/scrape \
  -H "Authorization: Bearer demo_6x595maoA6GdOdVb" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/",
    "extract": {
      "title": "css:h1",
      "quotes": {"selector": "css:.text", "multiple": true},
      "authors": {"selector": "css:.author", "multiple": true}
    }
  }'
```

The response includes:

```json
{
  "extracted_data": {
    "title": "Quotes to Scrape",
    "quotes": ["The world as we have created it...", "It is our choices..."],
    "authors": ["Albert Einstein", "J.K. Rowling"]
  }
}
```

Extraction supports:

- CSS selectors: `"css:h1"`, `"css:.price"`, `"css:#product-title"`
- XPath: `"xpath://div[@class='product']"`, `"//h1/text()"`
- Multiple results: `{"selector": "css:.item", "multiple": true}`
- Attribute extraction: `{"selector": "css:img", "attribute": "src"}`

### Screenshot {#screenshot}

Capture a full-page screenshot as a base64-encoded PNG.

```python
import base64

import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "url": "https://example.com",
        "browser": True,
        "screenshot": True
    }
)
png_data = base64.b64decode(response.json()["screenshot"])
with open("screenshot.png", "wb") as f:
    f.write(png_data)
```

### Proxy with Country {#proxy-country}

Access geo-restricted content using country-specific proxies.
```python
import httpx

# First, check available countries
countries = httpx.get(
    "https://api.scrapingpros.com/v1/proxy/countries",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
).json()
# Returns: {"countries": ["US", "GB", "AR", "BR", ...]}

# Request access to a country (one-time; requires admin approval)
httpx.post(
    "https://api.scrapingpros.com/v1/proxy/request-country",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"country_code": "US", "reason": "Need US pricing data"}
)

# Once approved, use country proxies
response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={"url": "https://example.com", "use_proxy": "US"}
)
```

### Browser Actions {#browser-actions}

Automate complex interactions: click buttons, fill forms, scroll, execute JavaScript.

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "url": "https://example.com/search",
        "browser": True,
        "actions": [
            {"type": "input", "selector": "css:input[name=q]", "text": "web scraping"},
            {"type": "click", "selector": "css:button[type=submit]"},
            {"type": "wait-for-selector", "selector": "css:.results", "time": 5000},
            {"type": "evaluate", "script": "document.querySelectorAll('.result').length"}
        ],
        "extract": {
            "results": {"selector": "css:.result-title", "multiple": True}
        }
    }
)
data = response.json()
print(data["evaluate_results"])  # [10] — number of results
print(data["extracted_data"]["results"])  # ["Result 1", "Result 2", ...]
```

Available action types:

- `click` — Click an element (CSS/XPath selector)
- `input` — Type text into a form field
- `select` — Select an option from a dropdown
- `key-press` — Press a keyboard key (Enter, Tab, etc.)
- `wait-for-selector` — Wait for an element to appear
- `wait-for-timeout` — Wait a fixed number of milliseconds
- `evaluate` — Execute arbitrary JavaScript and return the result
- `while` — Loop actions while a condition is met
- `collect` — Extract data during a loop iteration

### Async Batch Processing {#async-batch}

Process many URLs efficiently using collections and runs. Add `callback_url` to receive a webhook when the run completes (instead of polling).

```python
import time

import httpx

# 1. Create a collection (with an optional webhook)
collection = httpx.post(
    "https://api.scrapingpros.com/v1/async/collections",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "name": "product-pages",
        "requests": [
            {"url": "https://example.com/product/1", "use_proxy": "any"},
            {"url": "https://example.com/product/2", "use_proxy": "any"},
            {"url": "https://example.com/product/3", "use_proxy": "any"},
        ],
        "callback_url": "https://your-server.com/webhook"  # optional
    }
).json()
collection_id = collection["id"]

# 2. Start a run
run = httpx.post(
    f"https://api.scrapingpros.com/v1/async/collections/{collection_id}/run",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
).json()
run_id = run["run_id"]

# 3a. Poll for completion (if no webhook)
while True:
    status = httpx.get(
        f"https://api.scrapingpros.com/v1/async/collections/{collection_id}/runs/{run_id}",
        headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
    ).json()
    if status["success_requests"] + status["failed_requests"] >= status["total_requests"]:
        break
    time.sleep(5)
print(f"Done: {status['success_requests']} succeeded, {status['failed_requests']} failed")

# 3b. Or receive a webhook POST at your callback_url when the run completes:
# {
#   "event": "run.completed",
#   "run_id": "...",
#   "collection_id": "...",
#   "status": "completed",
#   "total_requests": 3, "success_requests": 3, "failed_requests": 0,
#   "job_ids": ["job-1", "job-2", "job-3"],
#   "results_url": "https://api.scrapingpros.com/v1/async/collections/{cid}/runs/{rid}",
#   "timestamp": "2026-04-06T..."
# }
# Verify with: X-SP-Signature header (HMAC-SHA256) + X-SP-Timestamp

# 4. Fetch individual job results (available for 24 hours)
for job_id in status.get("job_ids", []):
    result = httpx.get(
        f"https://api.scrapingpros.com/v1/async/collections/{collection_id}/runs/{run_id}/jobs/{job_id}/result",
        headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"}
    ).json()
    print(result.get("html", "")[:100])
```

### Retry on Block {#retry-on-block}

Auto-retry when a site blocks the request (CAPTCHA, 403). Up to 3 retries with a different IP/fingerprint. Only the successful attempt is charged credits.

```python
import httpx

response = httpx.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer demo_6x595maoA6GdOdVb"},
    json={
        "url": "https://hard-to-scrape-site.com",
        "browser": True,
        "use_proxy": "any",
        "retry_on_block": True  # auto-retry on CAPTCHA/403
    }
)
data = response.json()
# If all retries fail, the last attempt's result is returned;
# potentiallyBlockedByCaptcha will be True if still blocked.
```

## Webhooks (Async Completion Notifications)

Add `callback_url` when creating a collection to receive a POST notification when all jobs in a run complete.
```json
POST /v1/async/collections
{
  "name": "my-batch",
  "requests": [{"url": "https://example.com"}],
  "callback_url": "https://your-server.com/webhook"
}
```

### Webhook Payload

When the run completes, Scraping Pros sends a signed HTTP POST to your `callback_url`:

```json
{
  "event": "run.completed",
  "run_id": "uuid",
  "collection_id": "uuid",
  "status": "completed",
  "total_requests": 10,
  "success_requests": 8,
  "failed_requests": 2,
  "job_ids": ["job-1", "job-2", "..."],
  "results_url": "https://api.scrapingpros.com/v1/async/collections/{cid}/runs/{rid}",
  "timestamp": "2026-04-06T20:30:00Z"
}
```

### Webhook Security

Every webhook POST includes HMAC-SHA256 signature headers:

- `X-SP-Signature: sha256=...` — HMAC of `{timestamp}.{body}` keyed with your webhook secret
- `X-SP-Timestamp` — when the webhook was sent

Verify the signature server-side to ensure the webhook came from Scraping Pros.

### Webhook Delivery Status

Track delivery via the `callback_status` field in `GET /runs/{run_id}`:

- `pending` — run still in progress
- `sent` — webhook delivered successfully
- `failed` — delivery failed (will retry)
- `retrying` — automatic retry in progress (up to 5 retries: 1min, 5min, 30min, 2h, 12h)

## Credit System and Plans

### How Credits Work

| Request type | Credits |
|--------------|---------|
| Simple HTTP scrape | 1 credit |
| Browser scrape (Camoufox) | 5 credits |

Anti-bot protection, proxy rotation (`use_proxy="any"`), and markdown conversion are included at no extra cost.

Credits are NOT consumed when requests fail due to infrastructure errors (timeouts, proxy failures, connection errors). Site-level blocks (403, CAPTCHA) do consume credits.
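The per-request pricing makes it easy to bound a batch's credit cost before running anything. A minimal sketch, assuming the documented 1-credit/5-credit pricing; the `estimate_credits` helper is illustrative, not part of the API or any SDK:

```python
# Illustrative helper (not part of any official SDK): bound the credit cost
# of a batch using the documented pricing of 1 credit per simple HTTP scrape
# and 5 credits per browser scrape.

def estimate_credits(requests: list) -> int:
    """Worst-case credit cost for a list of /sync/scrape request bodies."""
    return sum(5 if r.get("browser") else 1 for r in requests)

batch = [
    {"url": "https://example.com/a"},                   # simple: 1 credit
    {"url": "https://example.com/b", "browser": True},  # browser: 5 credits
    {"url": "https://example.com/c", "browser": True},  # browser: 5 credits
]
print(estimate_credits(batch))  # 11
```

Because infrastructure errors and failed retries are not charged, this is an upper bound on actual spend; compare it against `X-Quota-Remaining` before launching a large run.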
### Plans

Use `GET /v1/plans` (no auth required) to see all plans with pricing:

| Plan | Price/mo | Credits/mo | Browser requests | Rate limit |
|------|----------|------------|------------------|------------|
| Free | $0 | 1,000 | 200 | 10/min |
| Starter | $29 | 10,000 | 2,000 | 60/min |
| Growth | $69 | 50,000 | 10,000 | 120/min |
| Pro | $199 | 200,000 | 40,000 | 200/min |
| Scale | $499 | 1,000,000 | 200,000 | 500/min |
| Enterprise | Custom | Unlimited | Unlimited | Custom |

Demo token: `demo_6x595maoA6GdOdVb` — 5,000 credits/month, 30 req/min.

## Error Handling

| Status | Meaning | Action |
|--------|---------|--------|
| 200 | Success | Parse response body |
| 400 | Bad request (invalid parameters) | Fix request body |
| 401 | Invalid or missing token | Check Authorization header |
| 403 | Country proxy not approved | Use POST /v1/proxy/request-country first |
| 422 | URL blocked (SSRF protection) | Only public http/https URLs allowed |
| 429 | Rate limit or quota exceeded | Read Retry-After header, wait and retry |

## Rate Limits and Quotas

Every response includes these headers:

```
X-RateLimit-Limit: 200        # requests per minute allowed
X-RateLimit-Remaining: 185    # requests left in current window
X-RateLimit-Reset: 1775401200 # when the window resets (Unix timestamp)
X-Quota-Limit: 50000          # monthly credit quota (0 = unlimited)
X-Quota-Used: 1523            # credits used this month
X-Quota-Remaining: 48477      # credits remaining
X-Credits-Charged: 5          # credits charged for THIS request (1 or 5)
```

When rate-limited (429), the `Retry-After` header tells you how many seconds to wait.

## Timings (Always Present)

Every response includes a `timings` object — even on errors. Use it for diagnostics:

```json
{
  "timings": {
    "queue_wait_ms": 45,
    "proxy_ms": 120,
    "browser_launch_ms": 2300,
    "navigation_ms": 8500,
    "extraction_ms": 150
  }
}
```

On error responses, `timings` contains partial values (whatever was measured before the failure). Never null.
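A quick diagnostic pattern: reduce the `timings` object to its dominant phase to see where a slow request spent its time. A minimal sketch; the `slowest_phase` helper is illustrative, not part of the API:

```python
# Illustrative helper (not part of the API): find the dominant phase in a
# `timings` object. On errors the object may contain only partial values,
# so the helper looks at whatever phases were actually recorded.

def slowest_phase(timings: dict) -> tuple:
    """Return (phase_name, milliseconds) for the largest recorded timing."""
    return max(timings.items(), key=lambda kv: kv[1])

timings = {
    "queue_wait_ms": 45,
    "proxy_ms": 120,
    "browser_launch_ms": 2300,
    "navigation_ms": 8500,
    "extraction_ms": 150,
}
phase, ms = slowest_phase(timings)
print(f"{phase}: {ms} ms")  # navigation_ms: 8500 ms
```

In this sample, navigation dominates, suggesting the time went into loading the target page rather than into queueing or proxy selection.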