AI / LLM PDF Data Extraction for Romanian Timber Auctions
A multi-pass extraction pipeline that turns scanned, hand-formatted Romanian timber-auction PDFs into clean, structured data, narrowing the problem one pass at a time so hallucinations have nowhere to hide.
Every forest district publishes its own way
Romanian forest districts are legally required, under HG 715/2017, to publish timber-auction announcements before selling timber. In practice, every district does this differently. The PDFs are photographed or scanned physical papers. Layouts vary wildly across 16+ counties. Some are clear, some are faded, some have tiny fonts or rotated pages.
The data inside is tabular (lot numbers, tree species, timber volumes, assortments, starting prices, locations), but there is no consistent structure to rely on. Single-pass extraction proved prone to hallucinations and unreliable reads.
Narrow the problem one pass at a time
The extraction runs as a multi-pass pipeline. Each pass builds on the previous one and persists its intermediate artifacts, so every step can be audited, debugged, and validated in isolation.
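A minimal sketch of that structure, assuming a pipeline driver like the one below — the pass names, `run_pass`, and the artifact layout are illustrative assumptions, not the project's actual API:

```python
import json
from pathlib import Path

# Illustrative pass names; the real pipeline's passes may differ.
PASSES = ["layout", "table_detect", "cell_extract", "normalize", "validate"]

def run_pass(name: str, payload: dict) -> dict:
    # Stand-in for the real LLM call behind each pass.
    return {**payload, "completed": payload.get("completed", []) + [name]}

def run_pipeline(pdf_path: str, artifact_dir: str = "artifacts") -> dict:
    out_dir = Path(artifact_dir)
    out_dir.mkdir(exist_ok=True)
    payload = {"source": pdf_path, "completed": []}
    for i, name in enumerate(PASSES, start=1):
        payload = run_pass(name, payload)
        # Persist each pass's output so a failure can be inspected
        # at the exact stage it occurred, without re-running everything.
        (out_dir / f"{i:02d}_{name}.json").write_text(json.dumps(payload))
    return payload
```

The numbered filenames make the artifact directory double as an audit trail: reading the files in order replays the pipeline's reasoning.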
Structured output, cross-checks, and a confidence budget
The LLM is Claude: the API in production, and the Claude Code interface during development to save on API costs while iterating. A multi-step validation process catches hallucinations: structured output schemas, cross-referencing between passes, and confidence scoring that triggers a high-DPI re-read pass for values that don't yet look trustworthy.
Because the pipeline is deterministic at the seams (each pass has a well-defined input and output), failures can be inspected at the exact stage they occur, rather than re-running the whole extraction from scratch.
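The confidence-triggered re-read can be sketched as follows; the threshold, field names, and `reread_at_high_dpi` are hypothetical placeholders, not the project's real code:

```python
# Assumed threshold below which a value is re-read from a higher-DPI render.
CONFIDENCE_THRESHOLD = 0.85

def reread_at_high_dpi(field: str) -> dict:
    # Placeholder for re-rendering the page at higher DPI
    # and asking the LLM about this field again.
    return {"value": "rechecked", "confidence": 0.95}

def resolve_fields(extracted: dict) -> dict:
    # Keep high-confidence readings; escalate the rest to a re-read.
    resolved = {}
    for field, reading in extracted.items():
        if reading["confidence"] < CONFIDENCE_THRESHOLD:
            reading = reread_at_high_dpi(field)
        resolved[field] = reading["value"]
    return resolved
```

Because escalation happens per field rather than per document, a single smudged number doesn't force a full re-extraction.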
Integration tests against the real Claude API
Testing uses pytest for both unit and integration tests. The integration tests call the actual Claude API and compare against validated expected outputs. The responses turned out to be stable enough for this to work as a reliable regression suite, which is not something you can take for granted with LLMs.
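A hedged sketch of the golden-output comparison, assuming a helper like the one below (`diff_against_golden` and the field layout are illustrative, not the project's actual test suite):

```python
def diff_against_golden(actual: dict, golden: dict) -> list[str]:
    # Return the keys whose values differ, so a failing integration
    # test pinpoints the bad field instead of dumping two whole dicts.
    return [key for key, expected in golden.items()
            if actual.get(key) != expected]

def test_lot_extraction_matches_golden():
    # In the real suite, `actual` would come from a live Claude API call;
    # here it is stubbed so the sketch is self-contained.
    golden = {"lot": "7A", "species": "fag", "volume_m3": 12.4}
    actual = {"lot": "7A", "species": "fag", "volume_m3": 12.4}
    assert diff_against_golden(actual, golden) == []
```

pytest picks up `test_`-prefixed functions automatically, so the same helper serves both fast stubbed unit tests and the slower live-API integration runs.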
Feeding a SaaS aggregator
The extractor outputs YAML and/or CSV. Built for a SaaS platform that aggregates Romanian forestry data, it will be integrated into the client's scraping pipeline, with similar modules planned for other document types.
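For the CSV side, a minimal serialization sketch using the standard library; the column names are assumptions based on the fields listed in this write-up, not the project's actual schema:

```python
import csv
import io

# Illustrative column set, mirroring the extracted fields.
FIELDS = ["lot", "species", "volume_m3", "assortment",
          "start_price_ron", "location"]

def to_csv(rows: list[dict]) -> str:
    # Serialize extracted lot records into a CSV string the
    # downstream scraping pipeline can ingest.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```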
| surface | covers |
|---|---|
| input | scanned / photographed PDFs · 16+ counties · highly variable formats |
| fields | lot numbers · species · volumes · assortments · starting prices · locations |
| pipeline | 5 passes + merge · artifacts persisted per pass |
| validation | structured schemas · cross-pass checks · confidence-triggered re-read |
| output | YAML · CSV · consumed by the client's scraping pipeline |