February 27, 2026·5 min read
How PeterParser Handles 1000-Page PDFs
A 1000-page PDF converted to markdown can exceed 2 million characters. Even large-context AI models degrade on prompts that size, and most provider APIs reject them outright.
The Strategy
When a document exceeds 30 pages, PeterParser automatically:
- Splits the markdown into chunks of ~30 pages, breaking on page boundaries
- Sends each chunk to the AI engine independently (up to 5 in parallel)
- Merges results into a single structured output
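The split-and-dispatch step can be sketched in a few lines of asyncio. This is a minimal illustration, not PeterParser's actual code: `split_on_pages` and `extract_all` are hypothetical names, and the `extract` callable stands in for whatever AI call the pipeline makes per chunk.

```python
import asyncio

CHUNK_PAGES = 30   # pages per chunk, matching the ~30-page threshold
MAX_PARALLEL = 5   # at most 5 chunks in flight at once

def split_on_pages(pages: list[str], size: int = CHUNK_PAGES) -> list[str]:
    """Group per-page markdown strings into ~size-page chunks,
    so splits always land on page boundaries."""
    return ["\n\n".join(pages[i:i + size]) for i in range(0, len(pages), size)]

async def extract_all(pages: list[str], extract) -> list[dict]:
    """Run `extract(chunk)` on every chunk, at most MAX_PARALLEL at a time."""
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def run(chunk: str) -> dict:
        async with sem:                 # cap concurrency at 5
            return await extract(chunk)

    # gather() preserves chunk order, which the merge step relies on
    return await asyncio.gather(*(run(c) for c in split_on_pages(pages)))
```

The semaphore is what caps concurrency: all 34 tasks are created up front, but only 5 ever hold the AI engine at once.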
The Merge Strategy
- Lists (entities, line items, transactions) — concatenated and deduplicated
- Scalars (title, date, document type) — first non-null value wins
- Summaries — concatenated across chunks
- Nested objects — deep-merged, with list fields appended
The output includes `_chunks_merged: 34` so you know exactly how many chunks were used.
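The four merge rules above can be expressed as one recursive function. A sketch, not PeterParser's implementation: the function names are invented, and which string fields count as "summaries" is assumed to be known by key name.

```python
import json

SUMMARY_KEYS = {"summary"}  # assumed: string fields that concatenate rather than keep-first

def merge_values(a, b, key=""):
    if a is None:
        return b                                  # scalars: first non-null wins
    if isinstance(a, list) and isinstance(b, list):
        # lists: concatenate, then deduplicate (order-preserving)
        seen, out = set(), []
        for item in a + b:
            fp = json.dumps(item, sort_keys=True, default=str)
            if fp not in seen:
                seen.add(fp)
                out.append(item)
        return out
    if isinstance(a, dict) and isinstance(b, dict):
        # nested objects: deep-merge; list fields inside are appended via recursion
        return {k: merge_values(a.get(k), b.get(k), k) for k in {**a, **b}}
    if key in SUMMARY_KEYS and isinstance(a, str) and isinstance(b, str):
        return a + "\n\n" + b                     # summaries: concatenated across chunks
    return a                                      # everything else: first value wins

def merge_results(chunks: list[dict]) -> dict:
    merged: dict = {}
    for chunk in chunks:
        merged = merge_values(merged, chunk) if merged else dict(chunk)
    merged["_chunks_merged"] = len(chunks)        # record how many chunks were merged
    return merged
```

Deduplicating via a JSON fingerprint lets list items be dicts (line items, entities) rather than just hashable scalars.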
Performance
A 1000-page PDF becomes ~34 chunks, processed 5 at a time. Thanks to parallelism, the extraction stage finishes in roughly the time it takes to process 7 chunks sequentially (⌈34/5⌉ = 7 batches). The bottleneck shifts to the document conversion stage, which is still sequential.
Combined with async processing and SSE notifications, users submit the document and get notified when it's done. No timeouts, no context overflow, no manual splitting.
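For context, the SSE wire format itself is tiny: each notification is a text frame of `event:` and `data:` lines ending in a blank line. The event name and payload fields below are hypothetical, not PeterParser's actual schema.

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame (text/event-stream wire format)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# Hypothetical completion notification a client might receive:
frame = sse_event("done", {"job_id": "abc123", "_chunks_merged": 34})
```

Because the connection stays open from submission to completion, the client never polls and never hits a request timeout, regardless of how long conversion and extraction take.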