Structuring Multi-Domain Entity Extraction with Graph RAG + Pydantic

In enterprise AI, the real challenge isn’t just answering questions. It’s extracting clean, structured, and reliable data, such as invoice totals, patient records, legal clauses, or job applicant details, from all kinds of unstructured documents.

The Bottleneck We Faced

Before adopting Graph RAG, our entity recognition relied heavily on encoder-based models. These models could work well, but they came with serious downsides:

  • Training was slow and expensive — For every new document format, we needed at least two months of training, fine-tuning, and reinforcement learning to make the model production-ready.

  • Poor scalability — Each format (invoice, prescription, contract, etc.) demanded its own tailored training loop.

In fast-moving business environments, this was a blocker. We couldn’t afford to spend two months per document type.

The Need for Speed

We needed a system that could:

  • Extract structured data with minimal training
  • Work across different domains
  • Validate and output data ready for APIs or databases
  • Scale rapidly as new formats emerged

Enter Graph RAG + Pydantic

We rebuilt our pipeline around a Graph-based Retrieval-Augmented Generation (Graph RAG) system combined with Pydantic for data validation. Here’s what changed:

1. One Document, Minimal Training

Instead of training encoders for each new format, Graph RAG let us use a single example document to define relationships and structure. That was enough to start extracting meaningful entities.

2. Graph Structure = Contextual Intelligence

In this setup, each document type became a node in a graph, complete with:

  • A domain-specific retriever
  • A prompt template aligned with expected fields
  • A Pydantic model for schema validation

These nodes could talk to each other, allowing for cross-field reasoning and fallback logic.
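As a rough sketch of that fallback behaviour (the node names, the FALLBACKS map, and the ask method here are illustrative, not our production API):

FALLBACKS = {"healthcare": "general", "finance": "general"}

def resolve_field(domain: str, field: str, context, nodes):
    # Ask the domain-specific node first; fall back to a generic node if it comes up empty
    value = nodes[domain].ask(field, context)
    if value is None and FALLBACKS.get(domain):
        value = nodes[FALLBACKS[domain]].ask(field, context)
    return value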

3. Schema-Aware Prompts + LLMs

We crafted dynamic prompts that told the LLM exactly what fields we were after, based on the schema. For example:

from pydantic import BaseModel, Field

class MedicalEntity(BaseModel):
    patient_name: str = Field(..., description="Full name of the patient")
    age: int = Field(..., description="Age of the patient")
    diagnosis: str = Field(..., description="Medical diagnosis")
    medication: list[str] = Field(..., description="List of prescribed medications")
    date: str = Field(..., description="Date of prescription")

Prompt: “Using the context, extract the following fields in JSON format with their associated descriptions: {schema_fields}”
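Under the hood, the {schema_fields} placeholder can be filled directly from the model’s field descriptions. A minimal sketch, assuming the Pydantic v1-style .schema() API used elsewhere in this post:

def schema_fields_for_prompt(model_cls) -> str:
    # Turn each field's name and description into a line the LLM can follow
    properties = model_cls.schema()["properties"]
    return "\n".join(
        f"- {name}: {meta.get('description', '')}" for name, meta in properties.items()
    )

# schema_fields_for_prompt(MedicalEntity) produces lines such as:
# - patient_name: Full name of the patient
# - age: Age of the patient
# ...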

4. Real-Time Validation

The LLM-generated output was passed directly into a Pydantic model:

entity = MedicalEntity.parse_obj(llm_output)

If validation failed or a field was missing, we triggered smart retries using sub-queries within the graph: “What date was this prescription issued?”
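A simplified version of that retry loop looks like this (ask_llm_for_field is a hypothetical helper around our LLM call; MedicalEntity is the model defined above):

from pydantic import ValidationError

def validate_with_retries(llm_output: dict, context, max_retries: int = 2) -> MedicalEntity:
    for _ in range(max_retries + 1):
        try:
            return MedicalEntity.parse_obj(llm_output)
        except ValidationError as err:
            # Collect the fields that were missing or failed validation
            missing = {e["loc"][0] for e in err.errors()}
            for field in missing:
                description = MedicalEntity.schema()["properties"][field].get("description", field)
                # Targeted sub-query within the graph, e.g. "What date was this prescription issued?"
                llm_output[field] = ask_llm_for_field(description, context)
    raise ValueError(f"Could not extract required fields: {missing}")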

How the System Works

Step-by-step:

  1. Upload Any Document — PDF, image, or text.
  2. Text Extraction & Embedding — OCR + embedding (via FAISS, Qdrant, etc.).
  3. Domain Classification — LLM-based classification to determine domain (e.g., finance, legal).
  4. Graph Routing — The document is routed to the relevant graph node.
  5. Prompt + LLM — Context + prompt generates field-level JSON output.
  6. Pydantic Validation — Ensures data integrity and structure.
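Stitched together, the flow above looks roughly like this (ocr_extract_text, embed_and_index, classify_domain, the GRAPH_NODES registry, and the retriever’s retrieve method are placeholder names for the OCR, vector-store, classification, and routing components):

def extract_entities(file_path: str):
    text = ocr_extract_text(file_path)              # 1-2. OCR / text extraction
    retriever = embed_and_index(text)               # 2. embeddings into FAISS / Qdrant
    domain = classify_domain(text)                  # 3. LLM-based domain classification
    node = GRAPH_NODES[domain]                      # 4. route to the matching graph node
    context = retriever.retrieve(text)              # gather the most relevant chunks
    return node.run(query=text, context=context)    # 5-6. prompt + LLM + Pydantic validation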

The Modular Graph Node

class GraphNode:
    def __init__(self, retriever, schema, prompt_template):
        self.retriever = retriever              # domain-specific retriever
        self.schema = schema                    # Pydantic model for this document type
        self.prompt_template = prompt_template  # prompt with a {schema_fields} placeholder

    def run(self, query, context):
        # Inject the model's JSON schema into the prompt so the LLM knows the target fields
        prompt = self.prompt_template.format(schema_fields=self.schema.schema_json())
        # query_llm is our LLM wrapper; it is expected to return the model's JSON output as a dict
        result = query_llm(prompt, context)
        # Validate and coerce the raw output into the typed schema
        return self.schema.parse_obj(result)

Adding a new format is as simple as:

  • Defining a new Pydantic model
  • Creating a retriever
  • Registering a graph node
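For instance, onboarding an invoice format could look like this (InvoiceEntity, build_retriever, and the GRAPH_NODES registry are illustrative names, not the exact components above):

class InvoiceEntity(BaseModel):
    vendor: str = Field(..., description="Name of the vendor")
    invoice_number: str = Field(..., description="Invoice identifier")
    total_amount: float = Field(..., description="Total amount due")
    due_date: str = Field(..., description="Payment due date")

invoice_node = GraphNode(
    retriever=build_retriever("invoices"),   # illustrative retriever factory
    schema=InvoiceEntity,
    prompt_template="Using the context, extract the following fields in JSON format: {schema_fields}",
)
GRAPH_NODES["finance"] = invoice_node        # register the node under its domain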

Real-World Example: Prescription Extraction

  • User uploads a prescription
  • OCR converts it to text
  • Classifier says: healthcare
  • Sent to healthcare graph node
  • Prompt: “Extract patient_name, age, diagnosis, medication, date”
  • LLM responds, Pydantic validates
  • Missing field? Follow-up question triggered automatically

Why This Works So Well

Feature                        | Encoder Models | Graph RAG + Pydantic
Time to production             | 2 months       | Few hours
Handles multiple domains       | No             | Yes
Schema-aware extraction        | No             | Yes
Cross-field reasoning          | No             | Yes
Smart retry for missing fields | No             | Yes
Validated JSON output          | No             | Yes

Final Thoughts

Encoder-based systems got us part of the way, but the cost and rigidity were too high. With Graph RAG + Pydantic, we transformed our pipeline into a fast, flexible, and schema-aware system.

Now, we can:

  • Launch new formats with minimal effort
  • Deliver production-ready extraction within hours
  • Ensure every output is validated, structured, and API-ready

This shift didn’t just make us faster. It made us smarter.
