Fine-Tuning Large Language Models (LLMs) with DistilGPT-2

Transfer Learning and Fine-Tuning Large Language Models

In this post, we will explore the concept of Transfer Learning, its connection to fine-tuning large language models (LLMs), and step-by-step instructions to fine-tune DistilGPT-2.


What is Transfer Learning?

Transfer Learning is a powerful machine learning technique where a model trained on one task is reused as the starting point for another, often related, task.

  • It enables leveraging pre-trained models trained on large datasets to solve new but related problems.
  • Transfer Learning helps reduce training time, data requirements, and computational costs significantly.
  • Example: Using a language model trained on Wikipedia to generate news articles.

The following diagram shows the flow of Transfer Learning:

Target Dataset + Pretrained LLM → Fine-tuned LLM

In this process, the original model parameters are updated during fine-tuning, which can be computationally expensive but highly effective.
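The idea that fine-tuning continues to update the pretrained weights, rather than starting from scratch, can be sketched with a toy one-parameter model (a simplified illustration only; the actual updates are run by the Hugging Face Trainer later in this post):

```python
# Toy illustration: fine-tuning starts gradient descent from a
# "pretrained" weight instead of a random one.

def mse_grad(w, xs, ys):
    # Gradient of mean squared error for the model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# "Pretrained" weight, learned on a large generic dataset (here: y = 2x).
w = 2.0

# Small domain-specific dataset whose best weight is closer to 2.5.
xs, ys = [1.0, 2.0, 3.0], [2.5, 5.0, 7.5]

lr = 0.01
for _ in range(200):              # a few fine-tuning steps
    w -= lr * mse_grad(w, xs, ys)

print(round(w, 2))  # the pretrained weight has shifted toward 2.5
```

Because the starting weight is already close to the target, only a short training run is needed, which is exactly the economy that transfer learning buys.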


Connection with Fine-Tuning LLMs

  • Large Language Models (LLMs) like GPT-2 and DistilGPT-2 are typically pre-trained on extensive text corpora.
  • Fine-tuning means adjusting the weights of a pre-trained LLM to specialize it for a specific task or dataset.
  • Fine-tuning is a type of Transfer Learning where the base model’s knowledge is refined to improve performance on domain-specific tasks.
  • This process enhances the model’s capability to generate accurate and context-relevant responses for specialized applications.

Fine-Tuning DistilGPT-2

DistilGPT-2 is a lightweight version of GPT-2 that is faster and requires fewer computational resources while retaining most of GPT-2’s performance.

Steps to Fine-Tune DistilGPT-2:

  • Load the pre-trained DistilGPT-2 model.
  • Prepare your custom text dataset.
  • Use the Hugging Face Trainer API for fine-tuning.

This makes DistilGPT-2 a great choice for applications where computational resources are limited.


Fine-Tuning Steps (1–4)

Step 1: Import Required Libraries

from datasets import Dataset
from transformers import GPT2Tokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling

Step 2: Prepare Dataset

You can create a dataset using a Python dictionary:

data = {'text': ["Example sentence 1.", "Example sentence 2."]}
dataset = Dataset.from_dict(data)

Step 3: Load and Configure Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token

Step 4: Load Model

model = AutoModelForCausalLM.from_pretrained('distilgpt2')
model.resize_token_embeddings(len(tokenizer))

Fine-Tuning Steps (5–7)

Step 5: Tokenize Dataset

Create a tokenization function that applies padding and truncation, then map it over the dataset (dropping the raw text column so only token ids reach the trainer):

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=64)

dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])
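What `padding='max_length'` and `truncation=True` do to each sequence can be illustrated in plain Python (a conceptual sketch; the real work happens inside the tokenizer):

```python
def pad_or_truncate(ids, max_length, pad_id):
    # Truncate sequences longer than max_length,
    # then right-pad shorter ones with pad_id.
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

# Two token-id sequences of different lengths (hypothetical ids).
short = [12, 7, 99]
long = [5, 8, 13, 21, 34, 55]

print(pad_or_truncate(short, 5, 0))  # [12, 7, 99, 0, 0]
print(pad_or_truncate(long, 5, 0))   # [5, 8, 13, 21, 34]
```

After this step every example has the same length (here `max_length=64`), which is what lets the data collator stack them into a batch tensor.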

Step 6: Prepare Data Collator

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) 
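With `mlm=False` the collator prepares causal-language-modeling labels: roughly, the labels are a copy of `input_ids` with padding positions replaced by -100 so the loss ignores them. A plain-Python sketch of that behavior:

```python
def causal_lm_labels(input_ids, pad_id):
    # Copy the inputs as labels; mask padding with -100 so the
    # cross-entropy loss skips those positions.
    return [tok if tok != pad_id else -100 for tok in input_ids]

batch = [5, 8, 13, 50256, 50256]       # 50256 is GPT-2's eos/pad id
print(causal_lm_labels(batch, 50256))  # [5, 8, 13, -100, -100]
```

Note that because we set `pad_token = eos_token` earlier, genuine end-of-sequence tokens get masked along with the padding; for small demos this is an accepted trade-off.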

Step 7: Define Training Arguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_steps=500,
    logging_steps=500
)

Fine-Tuning Steps (8–10)

Step 8: Initialize Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)

Step 9: Train the Model

trainer.train() 

Step 10: Save the Fine-Tuned Model

model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

Why Fine-Tuning is Important

Fine-tuning allows developers to:

  • Specialize large models for niche applications.
  • Improve performance with limited data.
  • Optimize models for low-resource environments using smaller versions like DistilGPT-2.
  • Save time and computational resources by building on pre-trained knowledge.

This is especially useful for applications in healthcare, legal, education, and custom chatbots where domain-specific knowledge is essential.


Advantages of Transfer Learning and Fine-Tuning

  • Reduces the need for large labeled datasets.
  • Enables faster convergence during training.
  • Achieves higher accuracy on specialized tasks compared to training from scratch.
  • Allows for the reuse of state-of-the-art models like GPT-2, BERT, and their smaller variants.

Complete Code for Fine-Tuning

from datasets import Dataset
from transformers import GPT2Tokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Prepare the dataset
data = {
    'text': [
        "Bindeshwar Singh Kushwaha is passionate about artificial intelligence and machine learning.",
        "He belongs to Basuhari village, post Deoria, in Ghazipur district.",
        "Bindeshwar is dedicated to developing research in the field of AI and robotics.",
        "He loves teaching complex mathematical concepts and making them easy to understand.",
        "Bindeshwar has experience working with Python, PyTorch, and deep learning models.",
        "His goal is to contribute impactful research papers in the field of generative AI.",
        "Bindeshwar aims to build lightweight models that can run efficiently on limited hardware.",
        "He is working on fine-tuning language models using small datasets on CPU-based systems.",
        "Coming from Ghazipur, he is motivated to bring cutting-edge AI research to rural areas.",
        "Bindeshwar Singh Kushwaha dreams of training and inspiring the next generation of AI researchers."
    ]
}

# Load the dataset
dataset = Dataset.from_dict(data)

# Load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

# Add padding token
tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained('distilgpt2')
model.resize_token_embeddings(len(tokenizer))

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=64)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])

# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # For causal language modeling
)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results-distilgpt2',
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    save_steps=10,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_steps=5,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the fine-tuned model
model.save_pretrained('./finetuned-distilgpt2-custom')
tokenizer.save_pretrained('./finetuned-distilgpt2-custom')

_________________________________________________________

Complete Code for Inference

from transformers import GPT2Tokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer
model_path = './finetuned-distilgpt2-custom'
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Text generation function
def generate_text(prompt, max_length=100, num_return_sequences=1):
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(
        inputs['input_ids'],
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,  # sampling must be enabled, otherwise temperature/top_p/top_k are ignored
        temperature=0.90,
        top_p=0.95,
        top_k=150
    )
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

# Example prompt
prompt = "Coming from Ghazipur, His name is Bindeshwar "
generated_texts = generate_text(prompt)

# Print the generated text
for idx, text in enumerate(generated_texts):
    print(f"Generated Text {idx + 1}:\n{text}\n")
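The sampling parameters passed to `generate` reshape the next-token distribution. A rough plain-Python sketch of what temperature scaling and top-k filtering do to a handful of logits (the real implementation lives inside `model.generate`; top-p nucleus filtering works similarly on the sorted cumulative probabilities):

```python
import math

def top_k_temperature_probs(logits, k, temperature):
    # Scale logits by temperature, keep only the k largest,
    # and renormalize with a softmax over the survivors.
    scaled = [x / temperature for x in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    kept = [x if x >= cutoff else float('-inf') for x in scaled]
    exps = [math.exp(x) if x != float('-inf') else 0.0 for x in kept]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_temperature_probs(logits, k=2, temperature=0.9)
print([round(p, 3) for p in probs])  # only the top-2 tokens keep any mass
```

Lower temperatures sharpen the distribution toward the most likely token, while larger `k` (or `top_p`) values let more of the tail survive, trading coherence for variety.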

 


Follow PostNetwork Academy


Thank You!

Stay connected to learn more about Transfer Learning, Fine-Tuning, and cutting-edge AI techniques.

© PostNetwork Academy. All rights reserved.