Entity extraction from documents — sounds simple, right? Just find the name, date, and reference numbers.
But in the real world, it’s anything but. PDFs are messy. Layouts vary. Sometimes text is embedded, other times it’s just a scanned image. Add multilingual content, signatures, and stamps, and you have a recipe for disaster if you’re relying on traditional parsers or rule-based extraction.
That’s why we turned to something better: EasyOCR + LLMs, a pipeline that reads scanned documents and interprets their content without hand-written rules.
In this post, we’ll show how this combination gives you:
- Layout-aware, coordinate-rich extraction from scanned PDFs
- Intelligent entity classification and confidence scoring
- Zero manual templates or position rules
The Core Idea
Instead of:
- Parsing PDFs with pdfplumber or PyMuPDF
- Writing fragile rules like “look above the word ‘Name’”
- Predefining layouts with regex or x-y positions
We simply:
- Use EasyOCR to extract all visible text (and coordinates)
- Pass the raw OCR output to a language model (LLM) for smart entity extraction
And it works across healthcare forms, contracts, invoices, and more.
Tech Stack
- EasyOCR: Extracts all text from a PDF page image with bounding boxes and confidence scores
- Any LLM (GPT-3.5, Claude, Gemma-2B, Mistral): Interprets raw OCR content, labels entities, and reports a self-assessed confidence for each one
- PDF2Image: Converts PDF to images for OCR
The Unified Pipeline
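import numpy as np
import easyocr
from pdf2image import convert_from_path
# Step 1: Convert PDF to images (one PIL image per page)
images = convert_from_path("sample.pdf")
# Step 2: OCR with EasyOCR
# readtext() accepts a file path, URL, bytes, or NumPy array, so convert the PIL page first
reader = easyocr.Reader(['en'])
ocr_results = reader.readtext(np.array(images[0]), detail=1)  # list of (box, text, confidence)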
# Step 3: Prepare Data for LLM
ocr_text = "\n".join([text for _, text, _ in ocr_results])
# Step 4: Send to LLM
from openai import OpenAI
client = OpenAI()
prompt = f"""
Below is the raw OCR text from a document:
{ocr_text}
Extract and return structured entities in JSON format:
- patient_name
- age
- date
- doctor_name
- medicine_names
Respond only in JSON. Include confidence for each field.
"""
completion = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "user", "content": prompt}
]
)
print(completion.choices[0].message.content)
A typical response looks like this:
{
"patient_name": {"value": "Amit Sharma", "confidence": 0.94},
"age": {"value": "42", "confidence": 0.91},
"date": {"value": "12/02/2024", "confidence": 0.88},
"doctor_name": {"value": "Dr. S. Mehta", "confidence": 0.93},
"medicine_names": ["Cefixime", "Paracetamol"]
}
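Before using a reply like the one above, it helps to parse it defensively. The sketch below is one possible approach, not part of the original pipeline: `parse_entities` and `low_confidence_fields` are hypothetical helper names, and the 0.9 threshold is an arbitrary assumption. It tolerates a Markdown fence around the JSON (some models add one even when told not to) and flags fields whose self-reported confidence is missing or low.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal one in the source


def parse_entities(raw: str) -> dict:
    """Parse the model's reply into a dict, tolerating a Markdown fence around it."""
    text = raw.strip()
    fenced = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def low_confidence_fields(entities: dict, threshold: float = 0.9) -> list[str]:
    """Return field names whose confidence is missing, non-numeric, or below threshold."""
    flagged = []
    for name, field in entities.items():
        confidence = field.get("confidence") if isinstance(field, dict) else None
        if not isinstance(confidence, (int, float)) or confidence < threshold:
            flagged.append(name)
    return flagged
```

Run on the sample response above, this would flag `date` (0.88) and `medicine_names` (returned as a bare list with no confidence), which are reasonable candidates for human review.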
Why This Works (Even Better Than Traditional Parsers)
Feature | Traditional Parsers | EasyOCR + LLM |
---|---|---|
Handles scanned PDFs | ❌ | ✅ |
Works with variable layouts | ❌ | ✅ |
Requires rules/templates | ✅ | ❌ |
Understands semantic meaning | ❌ | ✅ |
Structured JSON output | ❌ | ✅ |
Coordinates for each entity | Partially | ✅ (EasyOCR) |
You don’t need:
- Template logic
- Keyword matching
- Manual bounding box filtering
LLMs do the reasoning. OCR gives the layout. Together, they create a full document understanding pipeline.
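Note that the pipeline above joins only the text before prompting, so the model never sees where each line sits on the page. If you want the LLM to use layout, one option is to serialize the boxes alongside the text. The following is a minimal sketch under our own assumptions: `ocr_to_layout_lines` is a hypothetical helper, the `x,y,w,h<TAB>text` format is just one convention, and the 0.3 confidence cutoff is arbitrary. It expects tuples shaped like EasyOCR's `detail=1` output, where each box is four `[x, y]` corner points.

```python
def ocr_to_layout_lines(ocr_results, min_confidence=0.3):
    """Render (box, text, confidence) tuples as 'x,y,w,h<TAB>text' lines in rough reading order."""
    rows = []
    for box, text, confidence in ocr_results:
        if confidence < min_confidence:
            continue  # skip likely noise, e.g. fragments of stamps or signatures
        xs = [int(point[0]) for point in box]
        ys = [int(point[1]) for point in box]
        left, top = min(xs), min(ys)
        rows.append((top, left, max(xs) - left, max(ys) - top, text))
    # Sort top-to-bottom, then left-to-right. Lines with slightly different
    # top coordinates are not grouped into the same row in this simple version.
    rows.sort(key=lambda row: (row[0], row[1]))
    return [
        f"{left},{top},{width},{height}\t{text}"
        for top, left, width, height, text in rows
    ]


# Usage with hand-written tuples in EasyOCR's output shape
sample = [
    ([[200, 10], [300, 10], [300, 30], [200, 30]], "Age: 42", 0.91),
    ([[10, 10], [150, 10], [150, 30], [10, 30]], "Name: Amit Sharma", 0.94),
    ([[10, 50], [40, 50], [40, 60], [10, 60]], "~~", 0.12),
]
layout_text = "\n".join(ocr_to_layout_lines(sample))
```

Passing `layout_text` instead of plain text costs more tokens, so it is worth checking whether it actually improves extraction on your documents.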
Use Cases Where This Shines
- Healthcare: Extract patient info, medication, and dates from prescriptions
- Finance: Get invoice numbers, totals, vendor names — even from scanned bills
- Legal: Extract clause titles, dates, and signatures from contracts
- Logistics: Pull tracking IDs, shipment details, receiver names
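Switching between these domains mostly means changing the list of fields in the prompt. As a rough sketch, the prompt from the pipeline can be turned into a small builder. `build_prompt` and the field lists here are hypothetical names chosen for illustration, based on the use cases listed above.

```python
def build_prompt(ocr_text: str, fields: list[str]) -> str:
    """Build the extraction prompt for an arbitrary list of entity names."""
    field_lines = "\n".join(f"- {name}" for name in fields)
    return (
        "Below is the raw OCR text from a document:\n"
        f"{ocr_text}\n"
        "Extract and return structured entities in JSON format:\n"
        f"{field_lines}\n"
        "Respond only in JSON. Include confidence for each field."
    )


# Illustrative field sets; adjust the names to your own schema
PRESCRIPTION_FIELDS = ["patient_name", "age", "date", "doctor_name", "medicine_names"]
INVOICE_FIELDS = ["invoice_number", "total", "vendor_name"]

prompt = build_prompt("Invoice No: 123\nTotal: 450.00", INVOICE_FIELDS)
```

The resulting string can be sent as the `content` of the user message exactly as in the pipeline code.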
Advantages
- Works out-of-the-box on diverse layouts
- Portable and scalable (even on a CPU with smaller LLMs)
- Doesn’t require fine-tuning if using well-prompted LLMs
- Highly explainable: you can show extracted boxes + entity labels to users
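To show boxes next to entity labels, each extracted value has to be matched back to an OCR line, since the LLM returns values rather than coordinates. One simple way to do this, sketched here as an assumption and not as the method used in the pipeline, is a substring check with a fuzzy fallback from the standard library's `difflib`. `locate_entity` is a hypothetical helper and the 0.6 similarity cutoff is arbitrary.

```python
from difflib import SequenceMatcher


def locate_entity(value, ocr_results, min_ratio=0.6):
    """Return (box, text, score) for the OCR line most similar to value, or None."""
    needle = value.strip().lower()
    if not needle:
        return None
    best = None
    for box, text, _confidence in ocr_results:
        haystack = text.lower()
        # A substring hit counts as a perfect match; otherwise use fuzzy similarity,
        # which tolerates small differences such as dropped punctuation.
        if needle in haystack:
            score = 1.0
        else:
            score = SequenceMatcher(None, needle, haystack).ratio()
        if score >= min_ratio and (best is None or score > best[2]):
            best = (box, text, score)
    return best


# Usage with hand-written tuples in EasyOCR's (box, text, confidence) shape
ocr_lines = [
    ([[10, 10], [260, 10], [260, 30], [10, 30]], "Patient Name: Amit Sharma", 0.94),
    ([[10, 40], [140, 40], [140, 60], [10, 60]], "Dr. S. Mehta", 0.93),
]
match = locate_entity("Amit Sharma", ocr_lines)
```

A value that matches no line above the cutoff returns `None`, which can itself be a useful signal that the model produced something not present on the page.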
Final Thoughts
With EasyOCR + LLM, you don’t need complicated parsing pipelines or layout training. Just extract everything, let the LLM think, and you’ll get clean, confident entities — without a single line of rule-based logic.
This approach is production-ready, fast to iterate, and incredibly adaptable to new domains.