What You Can Build with PeterParser

Real scenarios. Actual API calls. Specific features that solve specific problems — not vague “AI-powered document processing” promises.

Automate Accounts Payable & Receivable

Invoice preset + webhooks = zero manual data entry

The Problem

Your AP team manually keys invoices into your ERP. Each invoice takes 3-5 minutes. At 500 invoices/month, that's 25-40 hours of data entry — plus a 4% error rate that causes payment disputes.

How PeterParser Solves It

Send invoice PDFs to the /v2/documents endpoint with the `invoice` preset. PeterParser extracts vendor name, line items, totals, tax, PO numbers, and payment terms into clean JSON. Set a webhook_url and results POST to your system automatically when done.

API Call

curl -X POST https://api.peterparser.com/v2/documents \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "base64": "<invoice_pdf_base64>",
    "document_type": "invoice",
    "extraction_preset": "invoice",
    "mode": "async",
    "webhook_url": "https://yourapp.com/api/invoices/ingest"
  }'

What You Get

  • 16-field invoice preset: vendor, customer, line items, tax, totals, PO numbers
  • 99.5% table accuracy on line item extraction
  • Char-level grounding — click any amount to see where it appears in the PDF
  • Async processing with webhook delivery and HMAC signature verification
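
Verifying the HMAC signature on an incoming webhook takes a few lines. A minimal Python sketch; the header name and signing scheme (HMAC-SHA256 of the raw request body) are assumptions here, so confirm the exact format in the webhook docs:

```python
import hashlib
import hmac

def verify_webhook(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the signature sent with the webhook. Header name and scheme are
    assumed; check the webhook docs for the exact format."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during comparison
    return hmac.compare_digest(expected, signature_header)
```

Reject any delivery that fails this check before touching the payload.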

Impact

3-5 min/invoice → 2 seconds. 99.5% accuracy vs 96% manual.

vs Alternatives

LlamaParse extracts text but doesn't provide structured JSON with custom schemas. Nanonets requires GPU infrastructure. PeterParser gives you a preset + webhook in one API call.

KYC & Identity Verification

ID extraction + PII redaction in a single API call

The Problem

Your onboarding flow requires users to upload government IDs. You need to extract name, DOB, expiry, and document number — but you also need to mask sensitive data before storing it in your logs or audit trail.

How PeterParser Solves It

Use the `identity_document` preset with `pii.detect: true` and `pii.mask: true`. PeterParser extracts all ID fields and returns both the raw extraction AND a PII-masked version. SSN, DOB, and address are automatically detected and redacted with your chosen mask character.

API Call

curl -X POST https://api.peterparser.com/v2/documents \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "base64": "<drivers_license_base64>",
    "document_type": "driver_license",
    "extraction_preset": "identity_document",
    "pii": {
      "detect": true,
      "mask": true,
      "mask_char": "█",
      "types": ["ssn", "date_of_birth", "address"]
    },
    "grounding": { "enabled": true }
  }'

What You Get

  • Supports driver's licenses, passports, national ID cards, and other identity documents
  • 9 PII types detected: SSN, credit card, phone, email, address, name, DOB, bank account, IP
  • Masked output with configurable mask character — store safely in logs
  • Source grounding proves where each field was found on the document
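
Masking happens server-side, but to make the output concrete, here is an illustrative sketch of what `mask: true` with `mask_char: "█"` produces for an SSN (the regex and behavior are ours, not PeterParser's actual implementation):

```python
import re

def mask_ssn(text: str, mask_char: str = "█") -> str:
    """Illustrative only: replaces each digit of an SSN-shaped pattern
    with the mask character, keeping the dashes for readability."""
    return re.sub(
        r"\b(\d{3})-(\d{2})-(\d{4})\b",
        lambda m: "-".join(mask_char * len(g) for g in m.groups()),
        text,
    )
```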

Impact

$0.10/document. PII detection adds $0.002/page.

vs Alternatives

Most parsing APIs extract OR redact, not both. PeterParser returns structured data with grounding AND masks PII in a single pass. No need for a separate PII service.

RAG Pipeline Document Ingestion

Parse → chunk → embed in one call

The Problem

You're building a RAG system and need to ingest thousands of PDFs into your vector store. Raw text extraction loses table structure. Chunking by character count breaks mid-sentence. And you need metadata for filtering.

How PeterParser Solves It

PeterParser preserves table structure and reading order. Enable `chunking.enabled: true` with semantic or sentence-based splitting. Each chunk comes with char offsets for precise retrieval. Use the fast lane (`pre_processing: false`) for text-heavy docs where layout doesn't matter.

API Call

curl -X POST https://api.peterparser.com/v2/documents \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/whitepaper.pdf",
    "output_format": "markdown",
    "chunking": {
      "enabled": true,
      "max_chunk_size": 1500,
      "overlap": 200,
      "strategy": "semantic"
    },
    "classify": { "enabled": true },
    "summarize": true
  }'

What You Get

  • Three chunking strategies: semantic, fixed, sentence-based
  • Configurable chunk size (100-10,000 chars) and overlap (0-500 chars)
  • Auto document classification for metadata filtering in your vector store
  • AI-generated summary for each document
  • Fast lane for text-heavy docs — 10x faster, lower cost
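
To make the parameters concrete, here is a sketch of a fixed-size strategy with `max_chunk_size: 1500` and `overlap: 200`. Illustrative only: the semantic strategy splits on meaning rather than raw character windows.

```python
def fixed_chunks(text: str, max_chunk_size: int = 1500, overlap: int = 200):
    """Consecutive windows of at most max_chunk_size chars, each starting
    `overlap` chars before the previous window ended. Returns
    (start, end, chunk) tuples so retrieved chunks map back to the
    source document by char offset."""
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + max_chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break
    return chunks
```

Keeping the offsets lets your retriever cite the exact span of the source document, not just the chunk text.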

Impact

1,000 docs/hour with the full pipeline. 5,000/hour on fast lane.

vs Alternatives

Unstructured offers chunking but with lower table precision. LlamaParse doesn't chunk natively — you need LlamaIndex. PeterParser handles parse + chunk + classify + summarize in one API call.

Bank Statement Reconciliation at Scale

1000-page statements → structured transactions in minutes

The Problem

Your lending platform processes bank statements for underwriting. Statements range from 2 to 1,000+ pages. Manual extraction is impossible at scale, and most APIs choke on documents over 50 pages.

How PeterParser Solves It

PeterParser automatically routes large documents through chunked parallel extraction. A 1000-page statement gets split into ~34 chunks, each processed in parallel. Transactions are merged, deduplicated, and returned as a single JSON array. The `bank_statement` preset captures account info, balances, and every transaction.

API Call

curl -X POST https://api.peterparser.com/v2/documents \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://secure.bank.com/statement.pdf",
    "document_type": "bank_statement",
    "extraction_preset": "bank_statement",
    "mode": "async",
    "webhook_url": "https://yourapp.com/api/statements/ready"
  }'

# Monitor all your jobs in real-time:
curl -N -H "X-API-Key: pp_live_..." \
  "https://api.peterparser.com/v2/events?ttl=600"

What You Get

  • Chunked parallel extraction — 1000+ pages handled automatically
  • Bank statement preset: account holder, account number, period, opening/closing balance, every transaction with date/description/amount/type/category
  • SSE real-time events — one connection monitors all your async jobs
  • Transactions deduplicated across chunks automatically
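
A sketch of the cross-chunk merge step, assuming transactions arrive as dicts with `date`, `description`, and `amount` keys (PeterParser does this server-side; this just shows the idea):

```python
def merge_transactions(chunks):
    """When adjacent chunks share boundary pages, the same transaction
    can appear twice. Keying on (date, description, amount) keeps the
    first occurrence and drops duplicates. Field names mirror the
    bank_statement preset but are assumptions here."""
    seen = set()
    merged = []
    for chunk in chunks:
        for txn in chunk:
            key = (txn["date"], txn["description"], txn["amount"])
            if key not in seen:
                seen.add(key)
                merged.append(txn)
    return merged
```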

Impact

1000-page statement processed in ~3 minutes (async). $0.75/document flat rate.

vs Alternatives

Most APIs have a 50-page limit or timeout on large documents. Reducto handles large docs but charges per page. PeterParser auto-chunks, merges results, and charges a flat per-document rate for bank statements.

Contract Analysis with Source Grounding

Extract clauses and prove where every value came from

The Problem

Your legal team reviews 200 vendor contracts per quarter. They need to extract payment terms, liability limits, termination clauses, and governing law — and the extraction must be auditable, showing exactly where each value was found.

How PeterParser Solves It

The `contract` preset extracts parties, key terms, dates, and signatures. Enable `grounding.enabled: true` to get char-level source references for every extracted field. Each grounding ref includes the field name, extracted value, source text with context, character positions, page number, and confidence score.

API Call

curl -X POST https://api.peterparser.com/v2/documents \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "base64": "<contract_pdf_base64>",
    "document_type": "contract",
    "extraction_preset": "contract",
    "grounding": {
      "enabled": true,
      "include_source_text": true,
      "include_confidence": true
    }
  }'

# Response includes:
# "grounding": [
#   {
#     "field": "key_terms.payment_amount",
#     "value": 150000,
#     "source_text": "...total compensation of $150,000 per annum...",
#     "char_start": 4521,
#     "char_end": 4529,
#     "page": 3,
#     "confidence": 1.0
#   }
# ]

What You Get

  • Contract preset: parties, roles, effective/expiry dates, auto-renewal, payment terms, termination, liability, governing law, signatures
  • Char-level grounding with page number, character offsets, and surrounding context
  • Confidence scores for each extracted value
  • Handles multi-party contracts with nested party arrays

Impact

$0.05/page. A 30-page contract costs $1.50 with full grounding.

vs Alternatives

No other parsing API offers char-level grounding out of the box. LlamaParse gives you text. Docsumo gives you key-value pairs. PeterParser gives you structured extraction with an audit trail showing exactly where every value came from.

Bulk Tax Form Processing (W2 / 1099)

Process thousands of tax forms during filing season

The Problem

During tax season, your accounting firm receives thousands of W2s and 1099s from clients. Each needs employer info, employee info, all wage boxes, and state tax info extracted into your tax software.

How PeterParser Solves It

Use `document_type: auto` to let PeterParser detect whether each document is a W2 or 1099. The `w2_tax` and `1099_tax` presets extract all IRS fields including employer EIN, SSN (last four only), all wage boxes, federal/state/local tax withholding, and control numbers. Process in bulk with async mode and SSE to monitor progress.

API Call

# Submit batch of tax forms
for file in tax_forms/*.pdf; do
  curl -X POST https://api.peterparser.com/v2/documents/upload \
    -H "X-API-Key: pp_live_..." \
    -F "file=@$file" \
    -F "document_type=auto" \
    -F "pii_detect=true" \
    -F "mode=async"
done

# Monitor all completions from one SSE stream
curl -N -H "X-API-Key: pp_live_..." \
  "https://api.peterparser.com/v2/events?ttl=3600"

What You Get

  • Auto-detection distinguishes W2 from 1099 automatically
  • W2 preset: all wage boxes (1-6), state/local info, employer EIN, control number
  • 1099 preset: payer/recipient TIN, nonemployee compensation, withholding
  • PII detection masks SSN to last-four only in output
  • SSE stream monitors entire batch from one connection
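
A minimal consumer for that SSE stream. It handles standard `text/event-stream` framing; the payload fields (`job_id`, `status`) are assumptions, so inspect a live stream for the real shape:

```python
import json

def iter_sse_events(lines):
    """Accumulates `data:` lines until a blank line ends the event, then
    yields the parsed JSON payload. Pass in an iterable of decoded lines
    from the open HTTP response."""
    data_lines = []
    for line in lines:
        if line.startswith("data:"):
            data_lines.append(line[5:].strip())
        elif line == "" and data_lines:
            yield json.loads("\n".join(data_lines))
            data_lines = []
```

Mark each form done as its completion event arrives, instead of polling per job.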

Impact

$0.30/document flat rate. 1000 W2s = $300, processed in ~20 minutes.

vs Alternatives

Google Document AI charges per page and requires GCP setup. Nanonets needs training data. PeterParser works out of the box with zero training — the W2 preset knows every IRS field.

Website Scraping to Structured Data

Any URL → structured JSON with CSS selectors

The Problem

You need to extract product data, pricing, or content from competitor websites or public listings. Traditional scraping gives you raw HTML. You need structured, typed data.

How PeterParser Solves It

The website parsing endpoint fetches any URL, extracts content, and returns structured data. Use CSS selectors for precision extraction. Crawl depth 1-3 for multi-page scraping. Output as JSON, Markdown, or text — ready for your database or LLM context.

API Call

curl -X POST https://api.peterparser.com/v2/documents/website \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget",
    "extract_links": true,
    "extract_images": true,
    "extract_metadata": true,
    "output_format": "json",
    "custom_selectors": {
      "price": ".product-price",
      "title": "h1.product-title",
      "specs": ".specifications li"
    },
    "max_depth": 2
  }'

What You Get

  • CSS selector-based custom extraction — target exactly the data you need
  • Crawl depth 1-3 for multi-page scraping
  • Extracts links, images, and meta tags automatically
  • Output as JSON, Markdown, or plain text
  • $0.005/page — scrape 10,000 pages for $50

Impact

$0.005/page. Sub-second response for single pages.

vs Alternatives

Firecrawl and Jina focus on scraping but don't offer document parsing. PeterParser handles both websites AND documents through the same API, same key, same billing.

Medical Records Processing (HIPAA-Ready)

Extract clinical data with automatic PII masking

The Problem

Your healthtech platform needs to digitize patient intake forms, lab reports, and clinical notes. You must extract structured medical data while ensuring all PHI (Protected Health Information) is properly handled.

How PeterParser Solves It

The `medical_record` preset extracts patient info, visit details, vitals, diagnoses (with ICD codes), medications, allergies, lab results with reference ranges, and care plans. Enable PII masking to automatically redact patient names, DOBs, MRNs, and addresses before the data hits your storage.

API Call

curl -X POST https://api.peterparser.com/v2/documents \
  -H "X-API-Key: pp_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "base64": "<medical_record_base64>",
    "document_type": "medical",
    "extraction_preset": "medical_record",
    "pii": {
      "detect": true,
      "mask": true,
      "types": ["name", "date_of_birth", "address", "phone", "ssn"]
    },
    "grounding": { "enabled": true }
  }'

What You Get

  • Medical preset: patient demographics, visit type, vitals (BP, HR, temp, weight, height), diagnoses with codes, medications with dosage/frequency/route, allergies, lab results with reference ranges, plan, follow-up
  • PII masking for HIPAA compliance — mask PHI before storage
  • Source grounding for clinical audit trails
  • OCR handles scanned clinical documents and faxed records

Impact

$0.05/page. A 10-page patient record costs $0.50.

vs Alternatives

AWS Textract Medical exists but requires AWS infrastructure and separate PII services. PeterParser handles extraction + PII + grounding in one call with no cloud vendor lock-in.

100 Free Credits. No Credit Card.

Parse your first document in under 60 seconds. Every preset, every feature — available immediately.

Get Your API Key