What You Can Build with PeterParser
Real scenarios. Actual API calls. Specific features that solve specific problems — not vague “AI-powered document processing” promises.
Automate Accounts Payable & Receivable
Invoice preset + webhooks = zero manual data entry
The Problem
Your AP team manually keys invoices into your ERP. Each invoice takes 3-5 minutes. At 500 invoices/month, that's 25-40 hours of data entry — plus a 4% error rate that causes payment disputes.
How PeterParser Solves It
Send invoice PDFs to the /v2/documents endpoint with the `invoice` preset. PeterParser extracts vendor name, line items, totals, tax, PO numbers, and payment terms into clean JSON. Set a webhook_url and results POST to your system automatically when done.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"base64": "<invoice_pdf_base64>",
"document_type": "invoice",
"extraction_preset": "invoice",
"mode": "async",
"webhook_url": "https://yourapp.com/api/invoices/ingest"
}'
What You Get
- 16-field invoice preset: vendor, customer, line items, tax, totals, PO numbers
- 99.5% table accuracy on line item extraction
- Char-level grounding — click any amount to see where it appears in the PDF
- Async processing with webhook delivery and HMAC signature verification
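The HMAC verification mentioned above can be sketched in plain Python; the signature header name and hex-encoded SHA-256 scheme are assumptions here, so check your webhook settings for the exact format.

```python
import hashlib
import hmac

def verify_webhook(secret: str, body: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare it
    to the signature sent with the webhook (scheme is an assumption)."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids timing attacks.
    return hmac.compare_digest(expected, signature_header)

# Simulate a delivery with a known secret and payload:
body = b'{"status": "completed", "document_id": "doc_123"}'
sig = hmac.new(b"whsec_demo", body, hashlib.sha256).hexdigest()
print(verify_webhook("whsec_demo", body, sig))  # True
```

Verify against the raw bytes of the request body, not the re-serialized JSON, or the digests won't match.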
3-5 min/invoice → 2 seconds. 99.5% accuracy vs 96% manual.
LlamaParse extracts text but doesn't provide structured JSON with custom schemas. Nanonets requires GPU infrastructure. PeterParser gives you a preset + webhook in one API call.
KYC & Identity Verification
ID extraction + PII redaction in a single API call
The Problem
Your onboarding flow requires users to upload government IDs. You need to extract name, DOB, expiry, and document number — but you also need to mask sensitive data before storing it in your logs or audit trail.
How PeterParser Solves It
Use the `identity_document` preset with `pii.detect: true` and `pii.mask: true`. PeterParser extracts all ID fields and returns both the raw extraction AND a PII-masked version. SSN, DOB, and address are automatically detected and redacted with your chosen mask character.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"base64": "<drivers_license_base64>",
"document_type": "driver_license",
"extraction_preset": "identity_document",
"pii": {
"detect": true,
"mask": true,
"mask_char": "█",
"types": ["ssn", "date_of_birth", "address"]
},
"grounding": { "enabled": true }
}'
What You Get
- Supports driver's licenses, passports, national ID cards, and other identity documents
- 9 PII types detected: SSN, credit card, phone, email, address, name, DOB, bank account, IP
- Masked output with configurable mask character — store safely in logs
- Source grounding proves where each field was found on the document
$0.10/document. PII detection adds $0.002/page.
Most parsing APIs extract OR redact, not both. PeterParser returns structured data with grounding AND masks PII in a single pass. No need for a separate PII service.
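Building the `base64` payload from the curl example above is a few lines of stdlib Python; this sketch only constructs the request body, sending it is left to your HTTP client.

```python
import base64
import json

def build_kyc_payload(pdf_bytes: bytes) -> str:
    """Assemble the JSON body shown in the curl example above."""
    payload = {
        "base64": base64.b64encode(pdf_bytes).decode("ascii"),
        "document_type": "driver_license",
        "extraction_preset": "identity_document",
        "pii": {
            "detect": True,
            "mask": True,
            "mask_char": "█",
            "types": ["ssn", "date_of_birth", "address"],
        },
        "grounding": {"enabled": True},
    }
    return json.dumps(payload)

body = build_kyc_payload(b"%PDF-1.4 demo bytes")
print(json.loads(body)["document_type"])  # driver_license
```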
RAG Pipeline Document Ingestion
Parse → chunk → embed in one call
The Problem
You're building a RAG system and need to ingest thousands of PDFs into your vector store. Raw text extraction loses table structure. Chunking by character count breaks mid-sentence. And you need metadata for filtering.
How PeterParser Solves It
PeterParser preserves table structure and reading order. Enable `chunking.enabled: true` with semantic or sentence-based splitting. Each chunk comes with char offsets for precise retrieval. Use the fast lane (`pre_processing: false`) for text-heavy docs where layout doesn't matter.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/whitepaper.pdf",
"output_format": "markdown",
"chunking": {
"enabled": true,
"max_chunk_size": 1500,
"overlap": 200,
"strategy": "semantic"
},
"classify": { "enabled": true },
"summarize": true
}'
What You Get
- Three chunking strategies: semantic, fixed, sentence-based
- Configurable chunk size (100-10,000 chars) and overlap (0-500 chars)
- Auto document classification for metadata filtering in your vector store
- AI-generated summary for each document
- Fast lane for text-heavy docs — 10x faster, lower cost
1,000 docs/hour with the full pipeline. 5,000/hour on fast lane.
Unstructured offers chunking but with lower table precision. LlamaParse doesn't chunk natively — you need LlamaIndex. PeterParser handles parse + chunk + classify + summarize in one API call.
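Each returned chunk maps naturally onto a vector-store record. This sketch assumes a response shape with a `chunks` array carrying `text` plus char offsets; the exact field names are assumptions for illustration.

```python
def chunks_to_records(response: dict) -> list[dict]:
    """Flatten a parsed document into records ready for embedding.
    Response field names here are assumptions for illustration."""
    doc_meta = {
        "document_type": response.get("classification", {}).get("type"),
        "summary": response.get("summary"),
    }
    records = []
    for i, chunk in enumerate(response.get("chunks", [])):
        records.append({
            "id": f"{response['document_id']}#{i}",
            "text": chunk["text"],
            # Offsets enable precise retrieval back into the source doc.
            "char_start": chunk["char_start"],
            "char_end": chunk["char_end"],
            **doc_meta,
        })
    return records

sample = {
    "document_id": "doc_42",
    "summary": "A short whitepaper.",
    "classification": {"type": "report"},
    "chunks": [{"text": "Intro...", "char_start": 0, "char_end": 8}],
}
print(chunks_to_records(sample)[0]["id"])  # doc_42#0
```

The `classification` and `summary` fields become filterable metadata on every chunk, which is exactly what vector stores need for scoped retrieval.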
Bank Statement Reconciliation at Scale
1000-page statements → structured transactions in minutes
The Problem
Your lending platform processes bank statements for underwriting. Statements range from 2 to 1,000+ pages. Manual extraction is impossible at scale, and most APIs choke on documents over 50 pages.
How PeterParser Solves It
PeterParser automatically routes large documents through chunked parallel extraction. A 1000-page statement gets split into ~34 chunks, each processed in parallel. Transactions are merged, deduplicated, and returned as a single JSON array. The `bank_statement` preset captures account info, balances, and every transaction.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"url": "https://secure.bank.com/statement.pdf",
"document_type": "bank_statement",
"extraction_preset": "bank_statement",
"mode": "async",
"webhook_url": "https://yourapp.com/api/statements/ready"
}'
# Monitor all your jobs in real-time:
curl -N -H "X-API-Key: pp_live_..." \
"https://api.peterparser.com/v2/events?ttl=600"
What You Get
- Chunked parallel extraction — 1000+ pages handled automatically
- Bank statement preset: account holder, account number, period, opening/closing balance, every transaction with date/description/amount/type/category
- SSE real-time events — one connection monitors all your async jobs
- Transactions deduplicated across chunks automatically
1000-page statement processed in ~3 minutes (async). $0.75/document flat rate.
Most APIs have a 50-page limit or timeout on large documents. Reducto handles large docs but charges per page. PeterParser auto-chunks, merges results, and charges a flat per-document rate for bank statements.
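The SSE stream above delivers `event:`/`data:` frames separated by blank lines. A minimal stdlib parser for those frames looks like this; the event name and payload shape are assumptions.

```python
import json

def parse_sse(stream: str):
    """Yield (event, data) pairs from raw SSE text.
    Frames are separated by a blank line per the SSE format."""
    event, data_lines = "message", []
    for line in stream.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            # Blank line terminates the frame; emit and reset.
            yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []

raw = 'event: document.completed\ndata: {"document_id": "doc_7"}\n\n'
for name, payload in parse_sse(raw):
    print(name, payload["document_id"])  # document.completed doc_7
```

In production you would feed this incrementally from the `curl -N` style streaming response rather than a complete string.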
Contract Analysis with Source Grounding
Extract clauses and prove where every value came from
The Problem
Your legal team reviews 200 vendor contracts per quarter. They need to extract payment terms, liability limits, termination clauses, and governing law — and the extraction must be auditable, showing exactly where each value was found.
How PeterParser Solves It
The `contract` preset extracts parties, key terms, dates, and signatures. Enable `grounding.enabled: true` to get char-level source references for every extracted field. Each grounding ref includes the field name, extracted value, source text with context, character positions, page number, and confidence score.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"base64": "<contract_pdf_base64>",
"document_type": "contract",
"extraction_preset": "contract",
"grounding": {
"enabled": true,
"include_source_text": true,
"include_confidence": true
}
}'
# Response includes:
# "grounding": [
# {
# "field": "key_terms.payment_amount",
# "value": 150000,
# "source_text": "...total compensation of $150,000 per annum...",
# "char_start": 4521,
# "char_end": 4529,
# "page": 3,
# "confidence": 1.0
# }
# ]
What You Get
- Contract preset: parties, roles, effective/expiry dates, auto-renewal, payment terms, termination, liability, governing law, signatures
- Char-level grounding with page number, character offsets, and surrounding context
- Confidence scores for each extracted value
- Handles multi-party contracts with nested party arrays
$0.05/page. A 30-page contract costs $1.50 with full grounding.
No other parsing API offers char-level grounding out of the box. LlamaParse gives you text. Docsumo gives you key-value pairs. PeterParser gives you structured extraction with an audit trail showing exactly where every value came from.
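Grounding refs like the response above let you recover the exact source span, assuming you keep the full extracted text alongside the result. A review-UI highlighter is then just string slicing:

```python
def highlight(full_text: str, ref: dict, context: int = 20) -> str:
    """Return the grounded span in brackets with surrounding context,
    using char offsets from a grounding ref like the response above."""
    start, end = ref["char_start"], ref["char_end"]
    span = full_text[start:end]
    before = full_text[max(0, start - context):start]
    after = full_text[end:end + context]
    return f"...{before}[{span}]{after}..."

# Synthetic document text placing "$150,000" at offset 4521:
text = "x" * 4521 + "$150,000" + " per annum payable monthly"
ref = {"field": "key_terms.payment_amount", "char_start": 4521, "char_end": 4529}
print(highlight(text, ref))
```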
Bulk Tax Form Processing (W2 / 1099)
Process thousands of tax forms during filing season
The Problem
During tax season, your accounting firm receives thousands of W2s and 1099s from clients. Each needs employer info, employee info, all wage boxes, and state tax info extracted into your tax software.
How PeterParser Solves It
Use `document_type: auto` to let PeterParser detect whether each document is a W2 or 1099. The `w2_tax` and `1099_tax` presets extract all IRS fields including employer EIN, SSN (last four only), all wage boxes, federal/state/local tax withholding, and control numbers. Process in bulk with async mode and SSE to monitor progress.
API Call
# Submit batch of tax forms
for file in tax_forms/*.pdf; do
curl -X POST https://api.peterparser.com/v2/documents/upload \
-H "X-API-Key: pp_live_..." \
-F "file=@$file" \
-F "document_type=auto" \
-F "pii_detect=true" \
-F "mode=async"
done
# Monitor all completions from one SSE stream
curl -N -H "X-API-Key: pp_live_..." \
"https://api.peterparser.com/v2/events?ttl=3600"
What You Get
- Auto-detection distinguishes W2s from 1099s with no manual sorting
- W2 preset: all wage boxes (1-6), state/local info, employer EIN, control number
- 1099 preset: payer/recipient TIN, nonemployee compensation, withholding
- PII detection masks SSN to last-four only in output
- SSE stream monitors entire batch from one connection
$0.30/document flat rate. 1000 W2s = $300, processed in ~20 minutes.
Google Document AI charges per page and requires GCP setup. Nanonets needs training data. PeterParser works out of the box with zero training — the W2 preset knows every IRS field.
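The flat-rate math above ($0.30/document, 1000 W2s in ~20 minutes, i.e. roughly 50 docs/minute) makes batch budgeting a one-liner; the throughput constant is derived from that quote, not a guaranteed SLA.

```python
def batch_estimate(num_docs: int, rate_per_doc: float = 0.30,
                   docs_per_minute: float = 50.0) -> tuple[float, float]:
    """Estimate cost (USD) and wall-clock time (minutes) for a batch.
    Defaults come from the figures quoted above."""
    return num_docs * rate_per_doc, num_docs / docs_per_minute

cost, minutes = batch_estimate(1000)
print(f"${cost:.2f} in ~{minutes:.0f} min")  # $300.00 in ~20 min
```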
Website Scraping to Structured Data
Any URL → structured JSON with CSS selectors
The Problem
You need to extract product data, pricing, or content from competitor websites or public listings. Traditional scraping gives you raw HTML. You need structured, typed data.
How PeterParser Solves It
The website parsing endpoint fetches any URL, extracts content, and returns structured data. Use CSS selectors for precision extraction. Crawl depth 1-3 for multi-page scraping. Output as JSON, Markdown, or text — ready for your database or LLM context.
API Call
curl -X POST https://api.peterparser.com/v2/documents/website \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/products/widget",
"extract_links": true,
"extract_images": true,
"extract_metadata": true,
"output_format": "json",
"custom_selectors": {
"price": ".product-price",
"title": "h1.product-title",
"specs": ".specifications li"
},
"max_depth": 2
}'
What You Get
- CSS selector-based custom extraction — target exactly the data you need
- Crawl depth 1-3 for multi-page scraping
- Extracts links, images, and meta tags automatically
- Output as JSON, Markdown, or plain text
- $0.005/page — scrape 10,000 pages for $50
$0.005/page. Sub-second response for single pages.
Firecrawl and Jina focus on scraping but don't offer document parsing. PeterParser handles both websites AND documents through the same API, same key, same billing.
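Selector results presumably come back keyed by the names you chose in `custom_selectors`. A typical post-processing step coerces the raw strings into typed fields; the response shape assumed here is an illustration, not the documented schema.

```python
import re

def normalize_product(extracted: dict) -> dict:
    """Turn raw selector matches into typed fields. Assumes one
    string (or list of strings) per selector key in the response."""
    price_text = extracted.get("price", "")
    # Pull the first numeric token out of e.g. "$1,299.99".
    match = re.search(r"[\d,]+(?:\.\d+)?", price_text)
    return {
        "title": extracted.get("title", "").strip(),
        "price": float(match.group().replace(",", "")) if match else None,
        "specs": [s.strip() for s in extracted.get("specs", [])],
    }

raw = {"price": "$1,299.99", "title": " Widget Pro ", "specs": ["5 kg ", "Steel"]}
print(normalize_product(raw)["price"])  # 1299.99
```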
Medical Records Processing (HIPAA-Ready)
Extract clinical data with automatic PII masking
The Problem
Your healthtech platform needs to digitize patient intake forms, lab reports, and clinical notes. You must extract structured medical data while ensuring all PHI (Protected Health Information) is properly handled.
How PeterParser Solves It
The `medical_record` preset extracts patient info, visit details, vitals, diagnoses (with ICD codes), medications, allergies, lab results with reference ranges, and care plans. Enable PII masking to automatically redact patient names, DOBs, MRNs, and addresses before the data hits your storage.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"base64": "<medical_record_base64>",
"document_type": "medical",
"extraction_preset": "medical_record",
"pii": {
"detect": true,
"mask": true,
"types": ["name", "date_of_birth", "address", "phone", "ssn"]
},
"grounding": { "enabled": true }
}'
What You Get
- Medical preset: patient demographics, visit type, vitals (BP, HR, temp, weight, height), diagnoses with codes, medications with dosage/frequency/route, allergies, lab results with reference ranges, plan, follow-up
- PII masking for HIPAA compliance — mask PHI before storage
- Source grounding for clinical audit trails
- OCR handles scanned clinical documents and faxed records
$0.05/page. A 10-page patient record costs $0.50.
AWS Textract Medical exists but requires AWS infrastructure and separate PII services. PeterParser handles extraction + PII + grounding in one call with no cloud vendor lock-in.
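The masking step replaces each detected PII span in place, and its effect can be reproduced locally. The entity shape with `char_start`/`char_end` offsets assumed below mirrors the grounding format; it is an illustration, not the documented schema.

```python
def apply_masks(text: str, entities: list[dict], mask_char: str = "█") -> str:
    """Replace each detected PII span with the mask character,
    working right-to-left so earlier offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["char_start"], reverse=True):
        start, end = ent["char_start"], ent["char_end"]
        text = text[:start] + mask_char * (end - start) + text[end:]
    return text

note = "Patient Jane Doe, DOB 01/02/1980, presents with..."
entities = [
    {"type": "name", "char_start": 8, "char_end": 16},
    {"type": "date_of_birth", "char_start": 22, "char_end": 32},
]
print(apply_masks(note, entities))
# Patient ████████, DOB ██████████, presents with...
```

Masking preserves string length, so any grounding offsets computed against the original text remain valid on the masked copy.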
Legal Case Chronology Extraction
Any legal document → structured timeline JSON with char-level grounding
The Problem
Your litigation support team manually reads complaints, motions, and case files to build case chronologies. Each document takes hours to review. Approximate dates get lost, Bates number references are inconsistent, and there's no structured format for the timeline — just Word docs and sticky notes.
How PeterParser Solves It
The `legal_timeline` document type extracts a full structured chronology: case summary with parties and jurisdiction, plus a timeline array with every datable event. Each event includes date handling (exact, approximate, range, unknown), event classification, display category (court, communication, evidence, medical, financial, witness, other), legal concept tags, confidence levels, party involvement, monetary amounts, statute citations, and full citation with page numbers, source snippets, and AI-generated summaries for provenance. Use async mode for large filings.
API Call
curl -X POST https://api.peterparser.com/v2/documents \
-H "X-API-Key: pp_live_..." \
-H "Content-Type: application/json" \
-d '{
"base64": "<complaint_pdf_base64>",
"document_type": "legal_timeline",
"mode": "async",
"webhook_url": "https://yourapp.com/api/timelines/ready",
"grounding": { "enabled": true },
"summarize": true
}'
What You Get
- Structured timeline with case summary, parties, jurisdiction, and chronological events
- Approximate date handling — circa dates, date ranges, and unknown precision tracked separately
- Char-level grounding with confidence scores for every extracted event
- Event classification: filing, hearing, deposition, order, judgment, settlement, and more
- Bates numbers, statute citations, and case references extracted automatically
- Async batch processing for large case files with webhook delivery
$0.10/page. A 50-page complaint processed in ~2 minutes (async).
CaseFleet, DISCO, and Everchron are closed SaaS platforms with no API. PeterParser is the only REST API that returns structured timeline JSON with char-level grounding, approximate date handling, and async/webhook support.
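Approximate dates need care when ordering the timeline. A sorting sketch, assuming each event carries an ISO `date` string (possibly null) and a `date_precision` flag; the field names are assumptions based on the precision levels described above.

```python
from datetime import date

PRECISION_RANK = {"exact": 0, "approximate": 1, "range": 2, "unknown": 3}

def sort_timeline(events: list[dict]) -> list[dict]:
    """Order events chronologically; undated events sink to the end,
    and exact dates win ties over approximate ones."""
    def key(ev):
        d = ev.get("date")
        parsed = date.fromisoformat(d) if d else date.max
        return (parsed, PRECISION_RANK.get(ev.get("date_precision"), 3))
    return sorted(events, key=key)

events = [
    {"event": "Hearing", "date": "2023-06-01", "date_precision": "exact"},
    {"event": "Injury", "date": "2023-01-15", "date_precision": "approximate"},
    {"event": "Lost receipt", "date": None, "date_precision": "unknown"},
]
print([e["event"] for e in sort_timeline(events)])
# ['Injury', 'Hearing', 'Lost receipt']
```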
100 Free Credits. No Credit Card.
Parse your first document in under 60 seconds. Every preset, every feature — available immediately.
Get Your API Key