OpenHathi: Fine Tuning Hindi LLM With QLoRA on Google Colab


In this article, we talk about OpenHathi, a Hindi large language model, why it is a game changer for India's AI infrastructure, and how to fine-tune the OpenHathi-7B-Hi-v0.1-Base model from HuggingFace using Transformers, QLoRA, BitsandBytes, and PEFT on a Hindi-QA dataset.

What is OpenHathi and Why is it a game changer for India's AI infrastructure?

OpenHathi, a project by Sarvam AI in collaboration with academics from AI4Bharat, matches GPT-3.5's prowess in Indic languages while retaining strong English skills. It aims to promote Indian-language AI through open models and datasets.

The lack of Indic language support in popular open models like Llama and Mistral has hindered innovation, so the project seeks to address this issue by releasing Hindi-specific models. The scarcity of high-quality and diverse Indic language content and challenges related to tokenization are some factors contributing to this gap. 

Overcoming Obstacles

To overcome the above-mentioned obstacles, the project explores ways to efficiently add support for new languages to existing models, using smaller datasets, less computational power during training, and lower resource consumption during inference.

This model is trained under compute and data constraints to demonstrate that we can achieve GPT-3.5-like performance on Indic languages with a frugal budget. It is built on top of Llama2-7B and extends its tokenizer to 48K tokens. 

We divide our training into two phases:

1) Embedding alignment: aligns the randomly initialized Hindi embeddings, and

2) Bilingual language modeling: teaches the model to attend cross-lingually across tokens.

This model performs as well as, if not better than, GPT-3.5 on various Hindi tasks while maintaining its English performance. Along with standard NLG tasks, it has also been evaluated on a variety of non-academic, real-world tasks.

Tests reveal OpenHathi excels over GPT-3.5 in both Devanagari and Latinised scripts, making this breakthrough possible thanks to collaborations with academics from AI4Bharat and conversational data provided by fellow startup KissanAI.

OpenHathi: A game changer for India’s AI movement

OpenHathi’s focus on simplifying and optimizing tokenization for Hindi is a game changer for the India AI movement for several reasons:

Improved accuracy

Tokenization involves dividing text into meaningful units called tokens, such as words or phrases. In Indic languages like Hindi, the grammar and structure differ significantly from Western languages. By simplifying and optimizing this process specifically for Hindi, OpenHathi ensures that AI models can better understand and accurately interpret Hindi texts.
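To make the cost concrete, here is a quick standalone check (independent of any particular model) of why byte-level fallback is expensive for Devanagari text:

```python
hindi = "नमस्ते दुनिया"    # "Hello, world" in Hindi
english = "hello world"

# Each Devanagari code point occupies 3 bytes in UTF-8, versus 1 for ASCII.
# A tokenizer with no Hindi vocabulary often falls back to byte-level pieces,
# so Hindi text can consume roughly 3x the tokens of comparable English text,
# inflating sequence length, latency, and cost.
hindi_bytes = len(hindi.encode("utf-8"))
english_bytes = len(english.encode("utf-8"))
print(len(hindi), "characters ->", hindi_bytes, "UTF-8 bytes")
print(len(english), "characters ->", english_bytes, "UTF-8 bytes")
```

This is exactly the inefficiency that an extended, Hindi-aware vocabulary avoids.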

Increased efficiency

Effective tokenization reduces computational costs by minimizing unnecessary processing steps, leading to faster and more efficient AI models.

Broader applicability

As digital adoption grows nationwide, there’s an increasing demand for AI solutions that cater to local needs and languages. Improved Hindi tokenization in AI applications enhances their reach and utility.

OpenHathi isn't just a language model; it signifies linguistic inclusivity. Handling English, Hindi, and Hinglish, OpenHathi embodies diversity in its very name, a fusion of two words: 'Open' and 'Hathi.'

‘Open’ reflects its open-source nature, accessible on HuggingFace for creators, entrepreneurs, and activists. ‘Hathi,’ Hindi for elephant, highlights the model’s size and capabilities, creatively associating it with strength, intelligence, and power.


Now, we are going to discuss the methodology that will be used for fine-tuning the OpenHathi LLM on the Hindi QA dataset using a custom prompt template from Chat LLaMA.

Our methodology involves utilizing several libraries to optimize the process of fine-tuning large language models (LLMs) for specific downstream tasks.

Here's a breakdown of the tools we're using:


We're implementing Quantized Low-Rank Adaptation (QLoRA), a technique that freezes the pre-trained model weights and introduces trainable rank-decomposition matrices into each layer of the Transformer architecture. QLoRA significantly reduces the number of trainable parameters required for downstream tasks, making it feasible to fine-tune LLMs on a single GPU.
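To see why this shrinks the trainable parameter count, here is a minimal numpy sketch of a rank decomposition (the 4096x4096 layer size and r=16 are illustrative, not OpenHathi's actual shapes):

```python
import numpy as np

d, k, r = 4096, 4096, 16          # hypothetical layer dimensions and LoRA rank

full_params = d * k               # parameters in a full weight update
lora_params = d * r + r * k       # parameters in the rank-r decomposition

W = np.zeros((d, k))              # frozen pre-trained weight (stays fixed)
A = np.random.randn(r, k) * 0.01  # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection; zero init => no-op at start

delta_W = B @ A                   # low-rank update added to W in the forward pass
print(f"full: {full_params:,} vs LoRA: {lora_params:,} trainable parameters")
```

Only A and B receive gradients, so the trainable fraction here is under 1% of the layer.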


The BitsandBytes library provides quantization at both 8-bit and 4-bit precision. With BitsandBytes, we don't need a separate calibration dataset or any post-processing: weights are quantized automatically when the model is loaded, saving significant amounts of memory while maintaining high performance.


The HuggingFace Transformers library simplifies the entire process of loading, training, and saving models. Its flexible, easy-to-use design makes it suitable for a wide range of NLP tasks, including natural language generation and question answering.


Finally, we’re using the Parameter Efficient Fine-Tuning (PEFT) library developed by Hugging Face. PEFT enables us to dramatically decrease both computational and storage expenses associated with fine-tuning LLMs by only fine-tuning a limited subset of parameters rather than every component in the network.

What is QLoRA and how does it work?

As we know, large language models (LLMs) "learn" a great deal during pre-training, but their vast knowledge often lacks the fine edge of specialization for niche tasks. This is where fine-tuning comes into play, tailoring these models to specific needs. However, fine-tuning has its costs, especially given the hefty size of modern LLMs and their GPU memory usage.

So what is the solution?

QLoRA is an efficient fine-tuning approach that reduces memory usage enough to fine-tune a large language model on a single GPU while preserving full 16-bit fine-tuning task performance. The small set of added parameters can also be merged into the model efficiently, which means you can fine-tune on many datasets and swap these "adapters" into your model when necessary.

QLoRA backpropagates gradients through a frozen, 4-bit quantized pre-trained language model into Low-Rank Adapters (LoRA). QLoRA goes three steps further by introducing:

  • 4-bit NormalFloat (NF4) quantization: a data type that improves upon quantile quantization by reducing the quantization error for normally distributed weights.
  • Double quantization: quantizing the quantization constants themselves, which further reduces the memory footprint.
  • Paged optimizers: using NVIDIA unified memory to avoid out-of-memory spikes during gradient checkpointing.
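A rough back-of-the-envelope calculation shows what double quantization saves; the block sizes below are those reported in the QLoRA paper (64 weights per quantization block, 256 blocks per second-level block):

```python
# Memory math for a 7B-parameter model at 4-bit precision (illustrative)
n_params = 7_000_000_000
weights_gib = n_params * 4 / 8 / 2**30   # the 4-bit weights alone, in GiB

block = 64                               # weights per quantization block
const_bits = 32 / block                  # one fp32 scale per block ~ 0.5 extra bits/param

# Double quantization: the fp32 scales are themselves quantized to 8 bits,
# with one fp32 scale per 256 blocks, cutting overhead to ~ 0.127 bits/param
dq_bits = 8 / block + 32 / (block * 256)

print(f"4-bit weights:        {weights_gib:.2f} GiB")
print(f"overhead without DQ:  {n_params * const_bits / 8 / 2**30:.3f} GiB")
print(f"overhead with DQ:     {n_params * dq_bits / 8 / 2**30:.3f} GiB")
```

Roughly 0.37 bits per parameter are saved, which adds up to a few hundred MiB at the 7B scale.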

The main idea behind QLoRA is to keep the language model's weight matrices quantized to 4 bits while training small low-rank adapter matrices on top, reducing the number of bits stored while minimizing the loss in performance. This approach has two primary benefits: reduced storage space and lower computational cost during inference.

Because the compressed weights take up fewer bits than standard floating-point representations, computation is faster and memory usage is lower. Moreover, because QLoRA employs low-rank adapters instead of traditional pruning techniques, there is no loss in representational capability; the original model's expressivity is preserved.

Code Implementation

Let's get started on fine-tuning OpenHathi-Hi-v0.1 in Google Colab.

Install All the Required Packages

!pip install -q -U torch
!pip install -q -U bitsandbytes
!pip install -q -U datasets
!pip install transformers==4.31
!pip install -q -U git+
!pip install -q -U git+
!pip install -q -U git+
!pip install -q -U sentencepiece

To learn more about these packages, check out the Hugging Face documentation.

import pandas as pd
import bitsandbytes as bnb
from functools import partial
import os
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftConfig, PeftModel
import torch
from transformers import AutoModelForCausalLM, LlamaForCausalLM, LlamaTokenizer, AutoTokenizer, set_seed, \
   Trainer, TrainingArguments, BitsAndBytesConfig, DataCollatorForLanguageModeling, TextStreamer
from datasets import load_dataset, Dataset

Prepare the Function Required for Data Preparation and Training

This function loads a model and tokenizer into memory with the bitsandbytes quantization configuration, using Transformers and Torch, dispatching the weights across available devices with a per-GPU memory cap.

# Loading Model and Tokenizer with a GPU limit of at most 8 GB
def load_model(model_name, bnb_config):
   n_gpus = torch.cuda.device_count()
   max_memory = f'{8000}MB'

   model = LlamaForCausalLM.from_pretrained(
       model_name,
       quantization_config=bnb_config,       # Quantize weights on load
       device_map="auto",                    # Efficiently dispatch the model on available resources
       max_memory={i: max_memory for i in range(n_gpus)},
       cache_dir=cache_dir,
   )
   tokenizer = LlamaTokenizer.from_pretrained(model_name, cache_dir=cache_dir)

   # Needed for LLaMA tokenizer
   tokenizer.pad_token = tokenizer.eos_token

   return model, tokenizer

These two functions are used to create the configuration for Bits and Bytes Quantization and LoRA Tuning.

# Create a BitsAndBytesConfig for quantization
def create_bnb_config():
   # Configure BitsAndBytes quantization with specific settings
   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,                     # Load weights in 4-bit format
       bnb_4bit_use_double_quant=True,        # Use double quantization for 4-bit
       bnb_4bit_quant_type="nf4",             # 4-bit quantization type (NormalFloat)
       bnb_4bit_compute_dtype=torch.bfloat16, # Compute data type for 4-bit
   )
   return bnb_config

# Create a Parameter-Efficient Fine-Tuning config for your model
def create_peft_config(modules):
   """
   Create a Parameter-Efficient Fine-Tuning config for your model
   :param modules: Names of the modules to apply LoRA to
   """
   # Configure LoRA (Parameter-Efficient Fine-Tuning) with specific settings
   config = LoraConfig(
       r=16,                   # Dimension of the updated matrices
       lora_alpha=64,          # Parameter for scaling
       target_modules=modules, # Names of the modules to apply LoRA to
       lora_dropout=0.05,      # Dropout probability
       bias="none",            # Bias type
       task_type="CAUSAL_LM",  # Task type (Causal Language Modeling in this case)
   )
   return config


This function creates the prompt template required for instruction fine-tuning of the model using the Llama chat format; the formatted data is then tokenized with the tokenizer.

def create_prompt_formats(sample):
   """
   Format the fields of the sample ('question', 'context', 'answer')
   into a Llama-2 chat-style prompt
   :param sample: Sample dictionary
   """
   # System prompt (Hindi): "You are an assistant that provides accurate and
   # concise answers. Please find the information in the provided text and answer
   # the question concisely. If you don't know the answer, simply say in no more
   # than one sentence that you don't know."
   system_prompt = '''तुम एक सहायक हो जो सटीक और संक्षेपित उत्तर प्रदान करता है। कृपया प्रदान किए गए पाठ में सूचना ढूंढ़ें और सवाल का संक्षेप में उत्तर दें। अगर आपको उत्तर नहीं पता है, तो एक से ज्यादा वाक्य में बस बताएं कि आप नहीं जानते।'''

   B_INST, E_INST = "[INST]", "[/INST]"
   B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n"

   user_prompt = sample['question']
   context = sample['context']
   response = sample['answer']

   prompt = f"{B_INST} {B_SYS} {system_prompt.strip()} {E_SYS} {user_prompt.strip()} {E_INST} \n\n Response: {response}"

   return prompt

def generate_and_tokenize_prompt(data_point):
   full_prompt = create_prompt_formats(data_point)
   tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True, max_length=1024)
   return tokenized_full_prompt
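To sanity-check the template, here is a standalone sketch that reproduces the formatting step on a toy sample (the Hindi system prompt is shortened, and the question/answer strings are hypothetical):

```python
def format_sample(sample):
    # Llama-2 chat-style delimiters, as in create_prompt_formats
    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n"
    system_prompt = "तुम एक सहायक हो।"  # shortened system prompt: "You are an assistant."
    return (f"{B_INST} {B_SYS} {system_prompt} {E_SYS} "
            f"{sample['question'].strip()} {E_INST} \n\n Response: {sample['answer']}")

sample = {"question": "भारत की राजधानी क्या है?",  # "What is the capital of India?"
          "answer": "नई दिल्ली"}                    # "New Delhi"
prompt = format_sample(sample)
print(prompt)
```

The instruction (system prompt plus question) sits inside the [INST] ... [/INST] markers, and the answer follows as the response the model learns to produce.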


These functions find all trainable linear modules in the model (the targets for the LoRA config) and print the number of trainable parameters in the model.

def find_all_linear_names(model):
   cls = bnb.nn.Linear4bit  # if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
   lora_module_names = set()
   for name, module in model.named_modules():
       if isinstance(module, cls):
           names = name.split('.')
           lora_module_names.add(names[0] if len(names) == 1 else names[-1])

   if 'lm_head' in lora_module_names:  # needed for 16-bit
       lora_module_names.remove('lm_head')
   return list(lora_module_names)

def print_trainable_parameters(model, use_4bit=False):
   """
   Prints the number of trainable parameters in the model.
   """
   trainable_params = 0
   all_param = 0
   for _, param in model.named_parameters():
       num_params = param.numel()
       # if using DS Zero 3 and the weights are initialized empty
       if num_params == 0 and hasattr(param, "ds_numel"):
           num_params = param.ds_numel

       all_param += num_params
       if param.requires_grad:
           trainable_params += num_params
   if use_4bit:
       trainable_params //= 2
   print(
       f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
   )


Enough with writing functions; let's put them to use for real!

dataset = load_dataset("HydraIndicLM/Hindi_Train_ClosedDomainQA")

cache_dir = "/content/drive/My Drive/hugging_cache" # Model Location

model_name = "sarvamai/OpenHathi-7B-Hi-v0.1-Base"
bnb_config = create_bnb_config() # Creating Configuration

model, tokenizer = load_model(model_name, bnb_config)

training_data = dataset["train"].shuffle().map(generate_and_tokenize_prompt)

Here we have loaded the dataset using the load_dataset function from the datasets library, loaded the model and tokenizer with the bitsandbytes configuration, and mapped the prompt formatting and tokenization over the training split.

Let's get our hands dirty with the training process!

def train(model, tokenizer, dataset, output_dir):
   # Apply preprocessing to the model to prepare it by
   # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
   model.gradient_checkpointing_enable()

   # 2 - Using the prepare_model_for_kbit_training method from PEFT
   model = prepare_model_for_kbit_training(model)

   # Get LoRA module names
   modules = find_all_linear_names(model)

   # Create PEFT config for these modules and wrap the model to PEFT
   peft_config = create_peft_config(modules)
   model = get_peft_model(model, peft_config)

   # Print information about the percentage of trainable parameters
   print_trainable_parameters(model)

   # Training parameters (batch size, epochs, and learning rate are illustrative;
   # tune them for your dataset and GPU)
   trainer = Trainer(
       model=model,
       train_dataset=dataset,
       args=TrainingArguments(
           per_device_train_batch_size=1,
           gradient_accumulation_steps=4,
           num_train_epochs=1,
           learning_rate=2e-4,
           logging_steps=10,
           output_dir=output_dir,
           optim="paged_adamw_8bit",
           lr_scheduler_type="cosine",
           warmup_ratio=0.03,
       ),
       data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
   )

   model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs

   # Verifying the datatypes before training
   dtypes = {}
   for _, p in model.named_parameters():
       dtype = p.dtype
       if dtype not in dtypes: dtypes[dtype] = 0
       dtypes[dtype] += p.numel()
   total = 0
   for k, v in dtypes.items(): total += v
   for k, v in dtypes.items():
       print(k, v, v / total)

   do_train = True

   # Launch training
   if do_train:
       train_result = trainer.train()
       metrics = train_result.metrics
       trainer.log_metrics("train", metrics)
       trainer.save_metrics("train", metrics)

   # Saving model
   print("Saving last checkpoint of the model...")
   os.makedirs(output_dir, exist_ok=True)
   trainer.model.save_pretrained(output_dir)

   # Free memory for merging weights
   # del model
   del trainer
   import gc
   gc.collect()
   torch.cuda.empty_cache()

Let's start the training!

output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, training_data, output_dir)


Here is the loss during model training.

Unfortunately, the latest weights may not be the best ones. To solve this, you can add an EarlyStoppingCallback from transformers during fine-tuning. This lets you regularly evaluate the model on a validation set, if you have one, and keep only the best weights; it does require an eval dataset, which you can create by splitting the data into train and test sets with the datasets library.

Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!


In this blog, we have discussed India's first Hindi LLM, OpenHathi, and fine-tuned it on a Hindi-QA dataset in Google Colab using the QLoRA PEFT technique for efficient GPU usage and inference. Sarvam AI's OpenHathi-Hi-v0.1 represents a significant leap forward in the field of language models, catering to the unique linguistic needs of the Indian market. Its superior performance in Hindi, combined with English proficiency, positions it as a game changer for various industries, especially those seeking to tap into India's diverse linguistic landscape.
