~/work/pdf-data-extraction-for-romanian-timber-auctions.md
Case study

AI / LLM PDF Data Extraction for Romanian Timber Auctions

A multi-pass extraction pipeline that turns scanned, hand-formatted Romanian timber-auction PDFs into clean, structured data, narrowing the problem one pass at a time so hallucinations have nowhere to hide.

role
Design & implementation: pipeline, prompting, validation, test harness
sector
Forestry data · Public-auction monitoring · SaaS aggregator
regulation
HG 715/2017 · Romanian forest districts (ocoale silvice)
stack
Python · Claude (API + Claude Code) · pytest · YAML / CSV
input
photographed / scanned PDFs · 16+ counties · highly variable layouts
01 / Problem

Every forest district publishes its own way

Romanian forest districts are legally required, under HG 715/2017, to publish timber-auction announcements before selling timber. In practice, every district does this differently. The PDFs are photographed or scanned physical papers. Layouts vary wildly across 16+ counties. Some are clear, some are faded, some have tiny fonts or rotated pages.

The data inside is tabular (lot numbers, tree species, timber volumes, assortments, starting prices, locations), but there is no consistent structure to rely on. Single-pass extraction proved prone to hallucinations and unreliable reads.

02 / Approach

Narrow the problem one pass at a time

The extraction runs as a multi-pass pipeline. Each pass builds on the previous one and persists its intermediate artifact, so every step can be audited, debugged, and validated in isolation.

pipeline · 5 passes + merge · orientation → raw → table → schema → re-read → merge
claude · structured outputs · confidence scoring

  • Pass 1 · orient: detect & correct page rotation
  • Pass 2 · raw: transcribe text
  • Pass 3 · tables: structured rows
  • Pass 4 · schema: species · volumes · prices
  • Pass 5 · re-read @ 400 dpi: only uncertain values
  • Merge: yaml · csv final output

intermediate artifacts, per pass: rotated pages · raw text · table JSON · schema YAML · confidence scores · re-read values
each pass narrows the problem · uncertainty flags drive a targeted high-DPI re-read · every stage persists its own artifact for audit
Each pass answers one question. Orientation first, then raw text, then structure, then semantics, then a targeted re-read for whatever is still uncertain. Splitting the work this way reliably keeps hallucinations out; some steps can still be combined later if cost or latency pushes back.
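The pass structure can be sketched roughly as below. This is a minimal illustration, not the production code: the pass names, artifact layout, and `run_pass` stub are assumptions, and the real passes call Claude with pass-specific prompts.

```python
import json
from pathlib import Path

PASSES = ["orient", "raw_text", "tables", "schema", "reread"]

def run_pass(name: str, state: dict) -> dict:
    # Placeholder: in the real pipeline each pass calls Claude with a
    # pass-specific prompt and a structured-output schema.
    return {**state, name: f"<{name} result>"}

def run_pipeline(pdf_path: str, workdir: str) -> dict:
    """Run each pass in order, persisting its artifact before the next starts."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    state = {"source": pdf_path}
    for i, name in enumerate(PASSES, start=1):
        state = run_pass(name, state)  # each pass narrows the problem
        # persist the audit trail: one artifact per pass
        (out / f"pass{i}_{name}.json").write_text(json.dumps(state, indent=2))
    return state  # the merge step combines artifacts into final YAML/CSV
```

Because every pass writes its own artifact before the next one starts, a failed run can be resumed or inspected at exactly the pass that broke.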
03 / Catching hallucinations

Structured output, cross-checks, and a confidence budget

The LLM is Claude: the API in production, and the Claude Code interface during development to keep iteration costs down. A multi-step validation process catches hallucinations: structured output schemas, cross-referencing between passes, and confidence scoring that triggers the high-DPI re-read pass for values that don't look trustworthy yet.

Because the pipeline is deterministic at the seams (each pass has a well-defined input and output), failures can be inspected at the exact stage they occur, rather than re-running the whole extraction from scratch.
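The confidence-gated re-read can be illustrated with a small sketch. The threshold value and field names here are assumptions; the case study does not state the actual cutoff.

```python
CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; the real value is not stated

def fields_to_reread(rows: list[dict]) -> list[tuple[int, str]]:
    """Return (row_index, field) pairs scoring below the confidence cutoff.

    Only these cells go back through the 400 dpi re-read pass, so the
    expensive high-resolution call stays targeted at uncertain values.
    """
    flagged = []
    for i, row in enumerate(rows):
        for field, conf in row.get("confidence", {}).items():
            if conf < CONFIDENCE_THRESHOLD:
                flagged.append((i, field))
    return flagged
```

A row like `{"volume_m3": 12.4, "confidence": {"volume_m3": 0.60}}` would flag only `volume_m3` for re-reading, leaving high-confidence cells untouched.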

04 / Testing LLM output like a contract

Integration tests against the real Claude API

Testing uses pytest for both unit and integration tests. The integration tests call the actual Claude API and compare against validated expected outputs; the responses turned out to be stable enough for this to work as a reliable regression suite, which is not something you can take for granted with LLMs.

05 / Output & integration

Feeding a SaaS aggregator

Output is YAML and/or CSV. Built for a SaaS platform aggregating Romanian forestry data, the extractor will be integrated into their scraping pipeline, with similar modules planned for other document types.
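The dual-format output step could be sketched as follows. The field names mirror the columns listed in this case study, but the exact schema is an assumption; the hand-rolled YAML writer is a simplification (production code would more likely use PyYAML, which also handles quoting and escaping).

```python
import csv
import io

# Illustrative field set, matching the data described in the case study
FIELDS = ["lot_number", "species", "volume_m3", "assortment",
          "starting_price_ron", "location"]

def to_yaml(rows: list[dict]) -> str:
    """Emit flat rows as a YAML list of mappings (no library, no escaping)."""
    lines = []
    for row in rows:
        lines.append(f"- {FIELDS[0]}: {row[FIELDS[0]]}")
        lines.extend(f"  {k}: {row[k]}" for k in FIELDS[1:])
    return "\n".join(lines) + "\n"

def to_csv(rows: list[dict]) -> str:
    """Emit the same rows as CSV with a fixed header order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```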

covers
input scanned / photographed PDFs · 16+ counties · highly variable formats
fields lot numbers · species · volumes · assortments · starting prices · locations
pipeline 5 passes + merge · artifacts persisted per pass
validation structured schemas · cross-pass checks · confidence-triggered re-read
output YAML · CSV · consumed by the client's scraping pipeline
06 / Tech stack

Tools

  • Python
  • Claude API
  • Claude Code
  • pytest
  • YAML
  • CSV