~/work/pdf-data-extraction-for-romanian-timber-auctions.md
Case study

AI / LLM PDF Data Extraction for Romanian Timber Auctions

A multi-pass extraction pipeline that turns scanned, hand-formatted Romanian timber-auction PDFs into clean, structured data, narrowing the problem one pass at a time so hallucinations have nowhere to hide.

role
Design & implementation: pipeline, prompting, validation, test harness
sector
Forestry data · Public-auction monitoring · SaaS aggregator
regulation
HG 715/2017 · Romanian forest districts (ocoale silvice)
stack
Python · Claude (API + Claude Code) · pytest · YAML / CSV
input
photographed / scanned PDFs · 16+ counties · highly variable layouts
01 / Problem

Every forest district publishes its own way

Romanian forest districts are legally required, under HG 715/2017, to publish timber-auction announcements before selling timber. In practice, every district does this differently. The PDFs are photographed or scanned physical papers. Layouts vary wildly across 16+ counties. Some are clear, some are faded, some have tiny fonts or rotated pages.

The data inside is tabular (lot numbers, tree species, timber volumes, assortments, starting prices, locations), but there is no consistent structure to rely on. Single-pass extraction proved prone to hallucinations and unreliable reads.

02 / Approach

Narrow the problem one pass at a time

The extraction runs as a multi-pass pipeline. Each pass builds on the previous one and persists its intermediate artifact, so every step can be audited, debugged, and validated in isolation.

pipeline · 5 passes + merge · orientation → raw → table → schema → re-read → merge
claude · structured outputs · confidence scoring

  • Pass 1 · orient: detect & correct page rotation
  • Pass 2 · raw: transcribe text
  • Pass 3 · tables: structured rows
  • Pass 4 · schema: species · volumes · prices
  • Pass 5 · re-read @ 400 dpi: only uncertain values
  • Merge: yaml · csv final output

intermediate artifacts, per pass: rotated pages · raw text · table JSON · schema YAML · confidence scores · re-read values
each pass narrows the problem · uncertainty flags drive a targeted high-DPI re-read · every stage persists its own artifact for audit
Each pass answers one question. Orientation first, then raw text, then structure, then semantics, then a targeted re-read for whatever is still uncertain. Splitting the work this way reliably keeps hallucinations out; some steps can still be combined later if cost or latency pushes back.
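The pass structure can be sketched roughly as below. This is a minimal illustration, not the production code: the pass names, artifact layout, and `run_pass` stub are assumptions, and the real passes call Claude with pass-specific prompts.

```python
import json
from pathlib import Path

PASSES = ["orient", "raw_text", "tables", "schema", "reread"]

def run_pass(name: str, state: dict) -> dict:
    # Placeholder: in the real pipeline each pass calls Claude with a
    # pass-specific prompt and a structured-output schema.
    return {**state, name: f"<{name} result>"}

def run_pipeline(pdf_path: str, workdir: str) -> dict:
    """Run each pass in order, persisting its artifact before the next starts."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    state = {"source": pdf_path}
    for i, name in enumerate(PASSES, start=1):
        state = run_pass(name, state)  # each pass narrows the problem
        # persist the audit trail: one artifact per pass
        (out / f"pass{i}_{name}.json").write_text(json.dumps(state, indent=2))
    return state  # the merge step combines artifacts into final YAML/CSV
```

Because every pass writes its own artifact before the next one starts, a failed run can be resumed or inspected at exactly the pass that broke.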
03 / Catching hallucinations

Structured output, cross-checks, and a confidence budget

The LLM is Claude: the API in production, and the Claude Code interface during development to keep iteration costs down. A multi-step validation process catches hallucinations: structured output schemas, cross-referencing between passes, and confidence scoring that triggers the high-DPI re-read pass for values that don't look trustworthy yet.

Because the pipeline is deterministic at the seams (each pass has a well-defined input and output), failures can be inspected at the exact stage they occur, rather than re-running the whole extraction from scratch.
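The confidence-gated re-read can be illustrated with a small sketch. The threshold value and field names here are assumptions; the case study does not state the actual cutoff.

```python
CONFIDENCE_THRESHOLD = 0.85  # assumed cutoff; the real value is not stated

def fields_to_reread(rows: list[dict]) -> list[tuple[int, str]]:
    """Return (row_index, field) pairs scoring below the confidence cutoff.

    Only these cells go back through the 400 dpi re-read pass, so the
    expensive high-resolution call stays targeted at uncertain values.
    """
    flagged = []
    for i, row in enumerate(rows):
        for field, conf in row.get("confidence", {}).items():
            if conf < CONFIDENCE_THRESHOLD:
                flagged.append((i, field))
    return flagged
```

A row like `{"volume_m3": 12.4, "confidence": {"volume_m3": 0.60}}` would flag only `volume_m3` for re-reading, leaving high-confidence cells untouched.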

04 / Testing LLM output like a contract

Integration tests against the real Claude API

Testing uses pytest for both unit and integration tests. The integration tests call the actual Claude API and compare against validated expected outputs; the responses turned out to be stable enough for this to work as a reliable regression suite, which is not something you can take for granted with LLMs.

05 / Output & integration

Feeding a SaaS aggregator

Output is YAML and/or CSV. Built for a SaaS platform aggregating Romanian forestry data, the extractor will be integrated into their scraping pipeline, with similar modules planned for other document types.
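The dual-format output step could be sketched as follows. The field names mirror the columns listed in this case study, but the exact schema is an assumption; the hand-rolled YAML writer is a simplification (production code would more likely use PyYAML, which also handles quoting and escaping).

```python
import csv
import io

# Illustrative field set, matching the data described in the case study
FIELDS = ["lot_number", "species", "volume_m3", "assortment",
          "starting_price_ron", "location"]

def to_yaml(rows: list[dict]) -> str:
    """Emit flat rows as a YAML list of mappings (no library, no escaping)."""
    lines = []
    for row in rows:
        lines.append(f"- {FIELDS[0]}: {row[FIELDS[0]]}")
        lines.extend(f"  {k}: {row[k]}" for k in FIELDS[1:])
    return "\n".join(lines) + "\n"

def to_csv(rows: list[dict]) -> str:
    """Emit the same rows as CSV with a fixed header order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```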

covers
input scanned / photographed PDFs · 16+ counties · highly variable formats
fields lot numbers · species · volumes · assortments · starting prices · locations
pipeline 5 passes + merge · artifacts persisted per pass
validation structured schemas · cross-pass checks · confidence-triggered re-read
output YAML · CSV · consumed by the client's scraping pipeline
06 / Tech stack

Tools

  • Python
  • Claude API
  • Claude Code
  • pytest
  • YAML
  • CSV