In enterprise AI, the real challenge isn't just answering questions. It's extracting clean, structured, reliable data (invoice totals, patient records, legal clauses, job applicant details) from all kinds of unstructured documents.
The Bottleneck We Faced
Before adopting Graph RAG, our entity recognition relied heavily on encoder-based models. These models worked well enough, but they came with serious downsides:
- Training was slow and expensive — For every new document format, we needed at least two months of training, fine-tuning, and reinforcement learning to make the model production-ready.
- Poor scalability — Each format (invoice, prescription, contract, etc.) demanded its own tailored training loop.
In fast-moving business environments, this was a blocker. We couldn’t afford to spend two months per document type.
The Need for Speed
We needed a system that could:
- Extract structured data with minimal training
- Work across different domains
- Validate and output data ready for APIs or databases
- Scale rapidly as new formats emerged
Enter Graph RAG + Pydantic
We rebuilt our pipeline around a Graph-based Retrieval-Augmented Generation (Graph RAG) system combined with Pydantic for data validation. Here’s what changed:
1. One Document, Minimal Training
Instead of training encoders for each new format, Graph RAG let us use a single example document to define relationships and structure. That was enough to start extracting meaningful entities.
2. Graph Structure = Contextual Intelligence
In this setup, each document type became a node in a graph, complete with:
- A domain-specific retriever
- A prompt template aligned with expected fields
- A Pydantic model for schema validation
These nodes could talk to each other, allowing for cross-field reasoning and fallback logic.
3. Schema-Aware Prompts + LLMs
We crafted dynamic prompts that told the LLM exactly what fields we were after, based on the schema. For example:
```python
from pydantic import BaseModel, Field

class MedicalEntity(BaseModel):
    patient_name: str = Field(..., description="Full name of the patient")
    age: int = Field(..., description="Age of the patient")
    diagnosis: str = Field(..., description="Medical diagnosis")
    medication: list[str] = Field(..., description="List of prescribed medications")
    date: str = Field(..., description="Date of prescription")
```
Prompt: “Using the context, extract the following fields in JSON format with their associated descriptions: {schema_fields}”
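As a rough sketch, the `{schema_fields}` placeholder can be filled straight from the model's own field descriptions. The `schema_fields_text` helper below is illustrative, not our production code:

```python
from pydantic import BaseModel, Field

class MedicalEntity(BaseModel):
    patient_name: str = Field(..., description="Full name of the patient")
    age: int = Field(..., description="Age of the patient")

def schema_fields_text(model: type[BaseModel]) -> str:
    """Render one 'name: description' line per field from the model's JSON schema."""
    props = model.schema()["properties"]
    return "\n".join(
        f"- {name}: {spec.get('description', '')}" for name, spec in props.items()
    )

PROMPT = (
    "Using the context, extract the following fields in JSON format "
    "with their associated descriptions:\n{schema_fields}"
)
prompt = PROMPT.format(schema_fields=schema_fields_text(MedicalEntity))
```

Because the prompt is generated from the schema, adding a field to the model automatically updates what the LLM is asked for.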
4. Real-Time Validation
The LLM-generated output was passed directly into a Pydantic model:
```python
entity = MedicalEntity.parse_obj(llm_output)
```
If validation failed or a field was missing, we triggered smart retries using sub-queries within the graph: “What date was this prescription issued?”
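A minimal sketch of that retry loop, assuming a callable `ask_followup` that stands in for the graph sub-query (the schema here is trimmed for brevity):

```python
from pydantic import BaseModel, Field, ValidationError

class Prescription(BaseModel):
    patient_name: str = Field(..., description="Full name of the patient")
    date: str = Field(..., description="Date of prescription")

def validate_with_retries(raw: dict, ask_followup, max_retries: int = 3) -> Prescription:
    """Validate LLM output; on a missing field, ask a targeted sub-query and retry."""
    for _ in range(max_retries):
        try:
            return Prescription.parse_obj(raw)
        except ValidationError as err:
            for e in err.errors():
                field = e["loc"][0]  # name of the field that failed validation
                raw[field] = ask_followup(f"What is the {field} of this prescription?")
    raise ValueError("could not repair LLM output after retries")
```

Bounding the retries keeps a stubborn document from looping forever; anything still invalid after that gets surfaced as an error.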
How the System Works
Step-by-step:
- Upload Any Document — PDF, image, or text.
- Text Extraction & Embedding — OCR + embedding (via FAISS, Qdrant, etc.).
- Domain Classification — LLM-based classification to determine domain (e.g., finance, legal).
- Graph Routing — The document is sent to the relevant graph node.
- Prompt + LLM — Context + prompt generates field-level JSON output.
- Pydantic Validation — Ensures data integrity and structure.
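Stitched together, the steps above look roughly like this. Every stage is passed in as a parameter so the sketch stays runnable without a real OCR engine, classifier, or LLM; none of these names are from our actual codebase:

```python
def run_pipeline(document, *, ocr, classify_domain, graph_nodes):
    """Illustrative end-to-end skeleton: extract text, classify, route, validate."""
    text = ocr(document)                        # steps 1-2: text extraction (embedding omitted)
    domain = classify_domain(text)              # step 3: LLM-based domain classification
    node = graph_nodes[domain]                  # step 4: graph routing
    return node.run(query=text, context=text)   # steps 5-6: prompt + LLM + Pydantic validation
```

The real pipeline also embeds the text for retrieval; that stage is omitted here to keep the routing logic visible.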
The Modular Graph Node
```python
class GraphNode:
    def __init__(self, retriever, schema, prompt_template):
        self.retriever = retriever              # domain-specific retriever
        self.schema = schema                    # Pydantic model for validation
        self.prompt_template = prompt_template  # prompt aligned with expected fields

    def run(self, query, context):
        # Build a schema-aware prompt, query the LLM, and validate the output.
        prompt = self.prompt_template.format(schema_fields=self.schema.schema_json())
        result = query_llm(prompt, context)
        return self.schema.parse_obj(result)
```
Adding a new format is as simple as:
- Defining a new Pydantic model
- Creating a retriever
- Registering a graph node
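In practice those three steps fit in a few lines. The registry below is a hypothetical sketch (a plain dict keyed by domain), and the lambda retriever is a stand-in for a real one:

```python
from pydantic import BaseModel, Field

# New format: a minimal invoice schema (fields are illustrative).
class InvoiceEntity(BaseModel):
    vendor: str = Field(..., description="Vendor name")
    total: float = Field(..., description="Invoice total")

# Hypothetical node registry; registering a node makes the format routable.
GRAPH_NODES = {}

def register_node(domain, *, retriever, schema, prompt_template):
    GRAPH_NODES[domain] = {
        "retriever": retriever,
        "schema": schema,
        "prompt_template": prompt_template,
    }

register_node(
    "finance",
    retriever=lambda query: [],  # stand-in retriever
    schema=InvoiceEntity,
    prompt_template="Using the context, extract the following fields in JSON format: {schema_fields}",
)
```

No retraining, no new training loop: the new format goes live as soon as its node is registered.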
Real-World Example: Prescription Extraction
- User uploads a prescription
- OCR converts it to text
- Classifier says: healthcare
- Sent to healthcare graph node
- Prompt: “Extract patient_name, age, diagnosis, date, medications”
- LLM responds, Pydantic validates
- Missing field? Follow-up question triggered automatically
Why This Works So Well
| Feature | Encoder Models | Graph RAG + Pydantic |
| --- | --- | --- |
| Time to production | ~2 months | A few hours |
| Handles multiple domains | ❌ | ✅ |
| Schema-aware extraction | ❌ | ✅ |
| Cross-field reasoning | ❌ | ✅ |
| Smart retry for missing fields | ❌ | ✅ |
| Validated JSON output | ❌ | ✅ |
Final Thoughts
Encoder-based systems got us part of the way, but the cost and rigidity were too high. With Graph RAG + Pydantic, we transformed our pipeline into a fast, flexible, and schema-aware system.
Now, we can:
- Launch new formats with minimal effort
- Deliver production-ready extraction within hours
- Ensure every output is validated, structured, and API-ready
This shift didn’t just make us faster. It made us smarter.