Smart Entity Extraction from PDFs with EasyOCR + LLM — No Rules, Just Results

Entity extraction from documents — sounds simple, right? Just find the name, date, and reference numbers.

But in the real world, it’s anything but. PDFs are messy. Layouts vary. Sometimes text is embedded, other times it’s just a scanned image. Add multilingual content, signatures, and stamps, and you have a recipe for disaster if you’re relying on traditional parsers or rule-based extraction.

That’s why we turned to something better: EasyOCR + LLMs — a powerful, no-rules-needed pipeline that handles scanned documents and intelligently understands content.

In this blog, we’ll show how this combo gives you:

  • Layout-aware, coordinate-rich extraction from scanned PDFs
  • Intelligent entity classification and confidence scoring
  • Zero manual templates or position rules

The Core Idea

Instead of:

  • Parsing PDFs with pdfplumber or PyMuPDF
  • Writing fragile rules like “look above the word ‘Name’”
  • Predefining layouts with regex or x-y positions

We simply:

  1. Use EasyOCR to extract all visible text (and coordinates)
  2. Pass the raw OCR output to a language model (LLM) for smart entity extraction

And it works across healthcare forms, contracts, invoices, and more.

Tech Stack

  • EasyOCR: Extracts all text from a PDF page image with bounding boxes and confidence scores
  • Any LLM (GPT-3.5, GPT-4o, Claude, Gemma-2B, Mistral): Interprets the raw OCR content, labels entities, and reports a self-assessed confidence for each one
  • pdf2image: Converts each PDF page to an image for OCR
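
To make the OCR side of the stack concrete: with detail=1, EasyOCR's readtext returns a list of (box, text, confidence) entries, where each box is four [x, y] corner points. The sketch below uses hand-written sample data in that shape (not real OCR output) and a hypothetical helper, split_by_confidence, with an assumed threshold of 0.5, to show how low-confidence fragments could be set aside for review before anything reaches the LLM.

```python
from typing import List, Tuple

# One EasyOCR result with detail=1: (four corner points, text, confidence).
OcrItem = Tuple[List[List[float]], str, float]


def split_by_confidence(results: List[OcrItem], threshold: float = 0.5):
    """Return (kept, flagged); flagged items fall below the threshold.

    The helper name and the 0.5 default are illustrative assumptions.
    """
    kept = [r for r in results if r[2] >= threshold]
    flagged = [r for r in results if r[2] < threshold]
    return kept, flagged


# Illustrative data only, shaped like reader.readtext(..., detail=1) output.
sample = [
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "Patient Name:", 0.97),
    ([[130, 10], [260, 10], [260, 30], [130, 30]], "Amit Sharma", 0.92),
    ([[10, 200], [80, 200], [80, 220], [10, 220]], "Rx#~", 0.21),
]

kept, flagged = split_by_confidence(sample)
print([text for _, text, _ in kept])
print([text for _, text, _ in flagged])
```

Whether to drop flagged fragments or pass them to the model with a warning is a judgment call; a smudged fragment can still carry a usable hint.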

The Unified Pipeline


from pdf2image import convert_from_path

# Step 1: Convert PDF to Image
images = convert_from_path("sample.pdf")

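# Step 2: OCR with EasyOCR
import easyocr
import numpy as np

reader = easyocr.Reader(['en'])
# pdf2image returns PIL images; readtext expects a file path, bytes, or a NumPy array
page = np.array(images[0])  # first page only
ocr_results = reader.readtext(page, detail=1)  # list of (box, text, confidence)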

# Step 3: Prepare Data for LLM
ocr_text = "\n".join([text for _, text, _ in ocr_results])

# Step 4: Send to LLM
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

prompt = f"""
Below is the raw OCR text from a document:

{ocr_text}

Extract and return structured entities in JSON format:

- patient_name
- age
- date
- doctor_name
- medicine_names

Respond only in JSON. Include confidence for each field.
"""

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt}
    ]
)

print(completion.choices[0].message.content)
  
Example output (actual values depend on the document and the model):

{
  "patient_name": {"value": "Amit Sharma", "confidence": 0.94},
  "age": {"value": "42", "confidence": 0.91},
  "date": {"value": "12/02/2024", "confidence": 0.88},
  "doctor_name": {"value": "Dr. S. Mehta", "confidence": 0.93},
  "medicine_names": ["Cefixime", "Paracetamol"]
}
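
In practice, the model's reply arrives as a string, and some models wrap the JSON in a Markdown code fence even when asked not to. Here is a minimal sketch of how the reply could be parsed and tidied. The helper names parse_entities and normalize are our own illustrations, not part of any library, and the sample reply is hand-written.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable


def parse_entities(raw: str) -> dict:
    """Parse a model reply into a dict, tolerating a surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the first line (the fence plus an optional language tag).
        parts = text.split("\n", 1)
        text = parts[1] if len(parts) > 1 else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    data = json.loads(text)  # raises json.JSONDecodeError on malformed replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def normalize(data: dict) -> dict:
    """Give every field the same {"value", "confidence"} shape."""
    out = {}
    for key, field in data.items():
        if isinstance(field, dict) and "value" in field:
            out[key] = {"value": field["value"], "confidence": field.get("confidence")}
        else:
            # Bare values (like the medicine_names list) carry no confidence.
            out[key] = {"value": field, "confidence": None}
    return out


# Hand-written reply for illustration, wrapped in a fence as some models do.
reply = (
    FENCE + "json\n"
    '{"patient_name": {"value": "Amit Sharma", "confidence": 0.94}, '
    '"medicine_names": ["Cefixime", "Paracetamol"]}\n' + FENCE
)
entities = normalize(parse_entities(reply))
print(entities["patient_name"])
print(entities["medicine_names"])
```

Normalizing up front means downstream code does not need a special case for fields where the model skipped the confidence, as it did for medicine_names in the example output.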
  

Why This Works (Even Better Than Layout Models)

Feature                            | Traditional Parsers | EasyOCR + LLM
Handles scanned PDFs               | ❌                  | ✅
Works with variable layouts        | ❌                  | ✅
Requires rules/templates           | ✅                  | ❌
Understands semantic meaning       | ❌                  | ✅
Output format (JSON, clean data)   |                     | ✅
Coordinates for each entity        | Partially           | ✅ (EasyOCR)

You don’t need:

  • Template logic
  • Keyword matching
  • Manual bounding box filtering

LLMs do the reasoning. OCR gives the layout. Together, they create a full document understanding pipeline.
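
One caveat: Step 3 of the pipeline joins only the text, so the model never actually sees the boxes. If layout matters for your documents, one option is to group OCR fragments into rows by vertical position and prefix each row with its coordinates before building the prompt. The sketch below is one way to do that. The helper name to_layout_text and the 10-pixel row tolerance are illustrative assumptions, and the sample data is hand-written in EasyOCR's (box, text, confidence) shape.

```python
def to_layout_text(results, row_tolerance=10):
    """Render OCR fragments top-to-bottom, left-to-right, one row per line."""
    # Use the top-left corner of each box as the fragment's position.
    items = sorted(
        ((box[0][1], box[0][0], text) for box, text, _ in results),
        key=lambda item: item[0],
    )
    rows = []
    for y, x, text in items:
        # Fragments within row_tolerance pixels of the row's first fragment
        # join that row; anything further down starts a new row.
        if rows and abs(y - rows[-1]["y"]) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    lines = []
    for row in rows:
        cells = sorted(row["cells"])  # left-to-right within the row
        joined = " ".join(text for _, text in cells)
        lines.append(f"[x={int(cells[0][0])}, y={int(row['y'])}] {joined}")
    return "\n".join(lines)


# Illustrative data only, deliberately out of reading order.
sample = [
    ([[130, 12], [260, 12], [260, 32], [130, 32]], "Amit Sharma", 0.92),
    ([[10, 60], [90, 60], [90, 80], [10, 80]], "Age: 42", 0.90),
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "Patient Name:", 0.97),
]

print(to_layout_text(sample))
```

The result can replace ocr_text in the prompt. A fixed pixel tolerance is crude and would need tuning for different scan resolutions.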

Use Cases Where This Shines

  • Healthcare: Extract patient info, medication, and dates from prescriptions
  • Finance: Get invoice numbers, totals, vendor names — even from scanned bills
  • Legal: Extract clause titles, dates, and signatures from contracts
  • Logistics: Pull tracking IDs, shipment details, receiver names
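
Switching domains mostly means changing the list of fields in the prompt. As a rough sketch, the prompt from the pipeline can be turned into a small builder. The function name build_prompt, the FIELD_SETS table, and the field names for finance, legal, and logistics are our own hypothetical choices based on the use cases listed; only the healthcare fields come from the example earlier.

```python
# Field names per domain. Only "healthcare" mirrors the pipeline example;
# the others are hypothetical names derived from the use cases above.
FIELD_SETS = {
    "healthcare": ["patient_name", "age", "date", "doctor_name", "medicine_names"],
    "finance": ["invoice_number", "total", "vendor_name"],
    "legal": ["clause_titles", "dates", "signatures"],
    "logistics": ["tracking_id", "shipment_details", "receiver_name"],
}


def build_prompt(ocr_text: str, domain: str) -> str:
    """Build the extraction prompt for one domain's field list."""
    fields = FIELD_SETS[domain]  # raises KeyError for an unknown domain
    field_lines = "\n".join(f"- {name}" for name in fields)
    return (
        "Below is the raw OCR text from a document:\n\n"
        f"{ocr_text}\n\n"
        "Extract and return structured entities in JSON format:\n\n"
        f"{field_lines}\n\n"
        "Respond only in JSON. Include confidence for each field."
    )


print(build_prompt("Invoice No: INV-1001\nTotal: 4,250.00", "finance"))
```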

Advantages

  • Works out-of-the-box on diverse layouts
  • Portable and scalable (even on a CPU with smaller LLMs)
  • Doesn’t require fine-tuning if using well-prompted LLMs
  • Highly explainable: you can show extracted boxes + entity labels to users
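
On the explainability point: the LLM's JSON carries no coordinates, so showing boxes next to entity labels requires mapping each extracted value back to the OCR results. Here is a minimal sketch of that step using plain substring matching. The helper locate_entity is hypothetical, the sample data is hand-written, and a production version would likely need fuzzy matching to cope with OCR misspellings.

```python
def locate_entity(value, results):
    """Return OCR fragments whose text overlaps the extracted value.

    Matching is case-insensitive and whitespace-normalized. A fragment
    matches if it contains the value or is itself a piece of the value.
    """
    needle = " ".join(str(value).lower().split())
    if not needle:
        return []
    matches = []
    for box, text, conf in results:
        hay = " ".join(text.lower().split())
        if not hay:
            continue
        if needle in hay or hay in needle:
            matches.append({"box": box, "text": text, "ocr_confidence": conf})
    return matches


# Illustrative data only: the name is split across two OCR fragments.
sample = [
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "Patient Name:", 0.97),
    ([[130, 10], [180, 10], [180, 30], [130, 30]], "Amit", 0.95),
    ([[185, 10], [260, 10], [260, 30], [185, 30]], "Sharma", 0.93),
    ([[10, 60], [90, 60], [90, 80], [10, 80]], "Age: 42", 0.90),
]

for match in locate_entity("Amit Sharma", sample):
    print(match["text"], match["box"][0], match["ocr_confidence"])
```

The returned boxes can be drawn over the page image, and the OCR confidence can be shown alongside the model's own score.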

Final Thoughts

With EasyOCR + LLM, you don’t need complicated parsing pipelines or layout training. Just extract everything, let the LLM think, and you’ll get clean, confident entities — without a single line of rule-based logic.

This approach is production-ready, fast to iterate, and incredibly adaptable to new domains.

