February 27, 2026·5 min read
How PeterParser Handles 1000-Page PDFs
A 1000-page PDF converted to markdown can exceed 2 million characters. Even large-context AI models degrade on prompts that size, and most provider APIs reject them outright.
The Strategy
When a document exceeds 30 pages, PeterParser automatically:
- Splits the markdown into chunks of ~30 pages, breaking on page boundaries
- Sends each chunk to the AI engine independently (up to 5 in parallel)
- Merges results into a single structured output
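The split-and-dispatch step can be sketched in a few lines of asyncio. This is a minimal illustration, not PeterParser's actual code: `split_on_pages` and `extract_all` are hypothetical names, and the `extract` callable stands in for whatever AI call the pipeline makes per chunk.

```python
import asyncio

CHUNK_PAGES = 30   # pages per chunk, matching the ~30-page threshold
MAX_PARALLEL = 5   # at most 5 chunks in flight at once

def split_on_pages(pages: list[str], size: int = CHUNK_PAGES) -> list[str]:
    """Group per-page markdown strings into ~size-page chunks,
    so splits always land on page boundaries."""
    return ["\n\n".join(pages[i:i + size]) for i in range(0, len(pages), size)]

async def extract_all(pages: list[str], extract) -> list[dict]:
    """Run `extract(chunk)` on every chunk, at most MAX_PARALLEL at a time."""
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def run(chunk: str) -> dict:
        async with sem:                 # cap concurrency at 5
            return await extract(chunk)

    # gather() preserves chunk order, which the merge step relies on
    return await asyncio.gather(*(run(c) for c in split_on_pages(pages)))
```

The semaphore is what caps concurrency: all 34 tasks are created up front, but only 5 ever hold the AI engine at once.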
The Merge Strategy
- Lists (entities, line items, transactions) — concatenated and deduplicated
- Scalars (title, date, document type) — first non-null value wins
- Summaries — concatenated across chunks
- Nested objects — deep-merged, with list fields appended
The output includes `_chunks_merged: 34` so you know exactly how many chunks were used.
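The four merge rules above can be expressed as one recursive function. A sketch, not PeterParser's implementation: the function names are invented, and which string fields count as "summaries" is assumed to be known by key name.

```python
import json

SUMMARY_KEYS = {"summary"}  # assumed: string fields that concatenate rather than keep-first

def merge_values(a, b, key=""):
    if a is None:
        return b                                  # scalars: first non-null wins
    if isinstance(a, list) and isinstance(b, list):
        # lists: concatenate, then deduplicate (order-preserving)
        seen, out = set(), []
        for item in a + b:
            fp = json.dumps(item, sort_keys=True, default=str)
            if fp not in seen:
                seen.add(fp)
                out.append(item)
        return out
    if isinstance(a, dict) and isinstance(b, dict):
        # nested objects: deep-merge; list fields inside are appended via recursion
        return {k: merge_values(a.get(k), b.get(k), k) for k in {**a, **b}}
    if key in SUMMARY_KEYS and isinstance(a, str) and isinstance(b, str):
        return a + "\n\n" + b                     # summaries: concatenated across chunks
    return a                                      # everything else: first value wins

def merge_results(chunks: list[dict]) -> dict:
    merged: dict = {}
    for chunk in chunks:
        merged = merge_values(merged, chunk) if merged else dict(chunk)
    merged["_chunks_merged"] = len(chunks)        # record how many chunks were merged
    return merged
```

Deduplicating via a JSON fingerprint lets list items be dicts (line items, entities) rather than just hashable scalars.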
Performance
A 1000-page PDF becomes ~34 chunks, processed 5 at a time. Thanks to parallelism, the extraction stage finishes in roughly the time it takes to process 7 chunks sequentially (⌈34/5⌉ = 7 batches). The bottleneck shifts to the document conversion stage, which is still sequential.
Combined with async processing and SSE notifications, users submit the document and get notified when it's done. No timeouts, no context overflow, no manual splitting.
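For context, the SSE wire format itself is tiny: each notification is a text frame of `event:` and `data:` lines ending in a blank line. The event name and payload fields below are hypothetical, not PeterParser's actual schema.

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame (text/event-stream wire format)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# Hypothetical completion notification a client might receive:
frame = sse_event("done", {"job_id": "abc123", "_chunks_merged": 34})
```

Because the connection stays open from submission to completion, the client never polls and never hits a request timeout, regardless of how long conversion and extraction take.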