
How Fine-Tuning a Model Solved a Domain-Specific Language Understanding Problem

In the age of large language models (LLMs), generic models like GPT, Mistral, and LLaMA can complete a wide array of tasks. However, they often underperform in specialized domains such as legal, medical, scientific, or industrial documentation. In such cases, fine-tuning becomes essential. In this blog, we’ll walk through a real-world technical problem where fine-tuning a language model on a domain-specific dataset dramatically improved performance, covering the full process, tooling, and lessons learned.

The Problem: Inconsistent and Incorrect Legal Responses

Let’s say you’re developing a legal research assistant — a chatbot for Indian criminal law that helps junior advocates and law students query specific Indian Penal Code (IPC) sections and fetch relevant judgments. You use a state-of-the-art LLM like Mistral-7B, known for its performance on reasoning tasks. However, the moment users ask questions like:

  • “What is the punishment under Section 307 IPC?”
  • “Can you give past judgments for Section 375 IPC?”

…the model starts hallucinating. Sometimes it merges laws from unrelated sections, sometimes it generates punishments that don’t exist, and in many cases, it provides general legal advice rather than specific statutory responses.

To confirm the issue, we first tested our chosen base model, google/gemma-2b-it, on several IPC-related prompts, and it consistently failed to generate accurate responses.

Root Cause: The base model was trained on general-purpose data, not on any authoritative Indian legal corpus. It lacked structured knowledge about Indian criminal law, legal terminology, and statutory sections.

The Goal

To fine-tune the base model so that it:

  • Responds accurately to IPC-related questions.
  • Understands the structure of law sections (title, explanation, punishment).
  • Fetches relevant summaries of historical judgments.
  • Avoids hallucinating or giving irrelevant generic legal content.

Dataset Creation and Structure

We designed a dataset around the Indian Penal Code and past Supreme Court/High Court judgments. The dataset was structured into three columns:

  • Instruction – What the user might ask.
  • Input (Context) – Additional details or context.
  • Response – A well-structured, factual legal response.

This structure followed the instruction tuning format, allowing the model to learn how to respond to task-specific queries given additional context.

Here’s an example:


{
  "instruction": "What is the punishment for attempt to murder under IPC?",
  "input": "",
  "response": "Section 307 IPC provides that whoever does any act with such intention or knowledge that, had the act caused death, it would amount to murder, shall be punished with imprisonment of up to 10 years and shall also be liable to fine. If hurt is caused by such an act, the punishment may extend to imprisonment for life."
}

To enhance judgment-based retrieval, we added hundreds of such examples with summaries of real Indian judgments sourced from free legal repositories, converted into layman-friendly text.
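
A dataset in this shape can be loaded directly with the datasets library; here is a minimal sketch, assuming the records live in ipc_legal_data.json (the file name used in the training config below):

from datasets import load_dataset

# Load the instruction/input/response records from a JSON file
dataset = load_dataset("json", data_files="ipc_legal_data.json", split="train")
print(dataset[0]["instruction"])  # e.g. the attempt-to-murder question above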

Fine-Tuning the Model

We chose google/gemma-2b-it for its balance of efficiency and performance. Rather than a wrapper framework such as Unsloth, we used the Trainer API from HuggingFace Transformers, which provides robust options for custom training, evaluation, and metric logging. Training followed the instruction fine-tuning recipe, where the model learns to respond to clearly defined tasks framed as instructions.

Tools & Libraries

  • Transformers & Datasets – For tokenization, model loading, and Trainer setup
  • PEFT (LoRA) – For low-rank adaptation to avoid full model training
  • WandB – For logging, evaluation, and experiment tracking
  • PyTorch – Core framework for training

Training Configuration (LoRA)

base_model: google/gemma-2b-it
load_in_4bit: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ["q_proj", "v_proj"]
dataset: ipc_legal_data.json
max_seq_length: 2048
num_train_epochs: 3
per_device_train_batch_size: 8
learning_rate: 2e-5
evaluation_strategy: steps
save_strategy: steps
logging_steps: 50
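
Below is a minimal sketch of how this configuration maps onto the transformers, peft, and bitsandbytes APIs. The output directory gemma-ipc-lora is our placeholder; tokenization of the prompt-formatted examples and the final Trainer wiring are omitted for brevity:

import torch
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)

# Load the base model in 4-bit via a bitsandbytes quantization config
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it",
                                             quantization_config=bnb_config,
                                             device_map="auto")

# Attach LoRA adapters to the attention projections, per the config above
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="gemma-ipc-lora",   # placeholder name
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_steps=50,
    report_to="wandb",             # WandB tracking per the tools list
)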

We also added custom evaluation callbacks using the Trainer API to assess model output against gold responses at every checkpoint.
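
A sketch of what such a callback can look like, subclassing TrainerCallback; the substring match against gold responses here is an illustrative stand-in for our actual scoring:

from transformers import TrainerCallback

class GoldResponseCallback(TrainerCallback):
    """At every checkpoint, generate answers for held-out prompts and
    compare them against gold responses (illustrative matching only)."""

    def __init__(self, tokenizer, eval_prompts, gold_responses):
        self.tokenizer = tokenizer
        self.eval_prompts = eval_prompts
        self.gold_responses = gold_responses

    def on_save(self, args, state, control, model=None, **kwargs):
        hits = 0
        for prompt, gold in zip(self.eval_prompts, self.gold_responses):
            inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
            output = model.generate(**inputs, max_new_tokens=256)
            text = self.tokenizer.decode(output[0], skip_special_tokens=True)
            hits += int(gold.strip().lower() in text.lower())
        print(f"step {state.global_step}: {hits}/{len(self.gold_responses)} gold matches")

# Attach with: trainer.add_callback(GoldResponseCallback(tokenizer, prompts, golds))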

Results Before vs. After Fine-Tuning

| Query | Base Model (Before) | Fine-Tuned Model (After) |
| --- | --- | --- |
| “Punishment under Sec 307 IPC?” | “Attempt to murder is a serious offense and usually punished based on severity.” (vague) | “Section 307 IPC prescribes punishment up to 10 years or life imprisonment with fine if hurt is caused.” (accurate) |
| “Give relevant judgments for Section 375 IPC” | “Rape is a punishable crime. Seek legal help.” | “In State of Rajasthan v. N.K., the Supreme Court held that consent under fear is not valid. In another case…” |
| “Explain Sec 34 with an example” | “This section talks about people acting together.” | “Section 34 talks about common intention. E.g., if A and B assault C with a shared plan, both are liable.” |

The fine-tuned model understood statutory structure, referenced legal precedent, and even formatted its responses to suit legal professionals.
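
For reference, the “after” responses come from loading the trained LoRA adapter on top of the base model, roughly as sketched below (gemma-ipc-lora is the placeholder output directory from the training sketch above):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")
model = PeftModel.from_pretrained(base, "gemma-ipc-lora")  # placeholder adapter path
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")

# Query using the same template the model was trained on (shown below)
prompt = ("### Instruction:\nPunishment under Sec 307 IPC?\n"
          "### Context:\n\n"
          "### Response:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))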

Data Challenges Faced

  1. Inconsistent Formatting in Raw Data
     IPC data sourced from different websites had inconsistent styles — some included line breaks, some had Unicode bullets, and others had HTML tags. We created a custom Python parser to standardize all laws into a uniform JSON structure.
  2. Judgment Noise
     Court judgments are long, full of procedural language and party names. We used a combination of sentence segmentation + named entity filtering to isolate just the ratio decidendi (reason for judgment).
  3. Repeating Prompt Issue
     Initially, the model kept repeating the prompt in its output. We solved this by restructuring training examples with the following prompt template (a small formatting helper is sketched after it):
### Instruction:
{instruction}
### Context:
{input}
### Response:
{response}
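
A minimal helper that applies this template to one record might look like the sketch below; appending an end-of-sequence token (gemma-2b-it's tokenizer uses <eos>; verify against your own tokenizer) marks where a response stops, which further discourages prompt repetition:

def format_example(example, eos_token="<eos>"):
    """Render one instruction/input/response record with the template above."""
    return (
        f"### Instruction:\n{example['instruction']}\n"
        f"### Context:\n{example.get('input', '')}\n"
        f"### Response:\n{example['response']}{eos_token}"
    )

# e.g. dataset = dataset.map(lambda ex: {"text": format_example(ex)})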

Evaluation and Testing

We created an internal evaluation set of 50 unseen legal queries across IPC sections, legal maxims, and judgment summaries.

Evaluation metrics included:

  • Factual Accuracy – Checked against law textbooks (a rough automated stand-in is sketched below).
  • Relevance – Measured by counting hallucinated or off-topic sentences (fewer is better).
  • Fluency – Grammatical correctness and coherence.
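
Factual accuracy and fluency were judged manually; as a quick automated proxy during development, a simple string-similarity scorer (a hypothetical stand-in, not our exact metric) can flag obviously wrong answers:

from difflib import SequenceMatcher

def rough_factual_score(prediction: str, reference: str) -> float:
    """Crude proxy: character-level similarity between the model output and
    the gold answer; manual review against textbooks remains the real check."""
    return SequenceMatcher(None, prediction.lower(), reference.lower()).ratio()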

Key Learnings

  1. Domain fine-tuning is not optional for high-risk fields like law or medicine.
     Even powerful LLMs need explicit exposure to structured domain knowledge to perform well.
  2. Good data > Large data.
     A small but well-structured dataset of ~2000 examples gave better results than 50k scraped legal documents.
  3. Prompt formatting matters.
     Clear separation between instruction, context, and response ensures the model learns expected behavior.
  4. Instruction tuning enhances usability.
     Models trained on clearly defined instruction-response pairs are easier to integrate into downstream applications like chatbots.
  5. Trainer API provides modularity.
     Using HuggingFace’s Trainer API gave us better logging, evaluation checkpoints, and reproducibility than running custom training loops.

Conclusion

Fine-tuning a model is not just about getting it to perform a new task — it’s about aligning the model’s behavior with the nuanced requirements of your domain. In this case, by crafting a high-quality legal dataset and carefully fine-tuning a well-suited model on it using HuggingFace’s Trainer API and instruction-based fine-tuning, we turned a hallucination-prone assistant into a reliable legal research tool.

Whether you’re working in finance, engineering, healthcare, or law, domain-specific fine-tuning can be your secret weapon for model accuracy and user trust.

Need help building AI applications or software for your organization? Contact us today to discover the best solutions tailored to your needs!
