In this blog post, I will explore how to fine-tune a large language model using LoRA (Low-Rank Adaptation). I will use the bigscience/bloom-3b model from Hugging Face and fine-tune it on the SQuAD v2 question-answering dataset.
Setup and Installation
First, I need to install the necessary libraries: bitsandbytes, datasets, accelerate, loralib, and peft.
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/peft.git
!pip install -q git+https://github.com/huggingface/transformers.git
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Model Preparation
Load the bloom-3b model and tokenizer, then prepare the model for fine-tuning: freeze the base weights, cast the layer norms to float32 for numerical stability, enable gradient checkpointing to save memory, and cast the output of the language-model head to float32.
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",
    torch_dtype=torch.float16,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")
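Note that bitsandbytes is installed and imported above but is not otherwise used in this float16 setup. If GPU memory is tight, it can instead be used to load the base model with 8-bit weights. A hedged alternative to the from_pretrained call above (model_8bit is just an illustrative name; the rest of this post sticks with the float16 model):

# Optional: 8-bit loading via bitsandbytes to reduce GPU memory usage
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-3b",
    load_in_8bit=True,
    device_map='auto'
)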
# Freeze all base-model parameters; the LoRA adapters added later will be the only trainable weights
for param in model.parameters():
    param.requires_grad = False
    if param.ndim == 1:
        # Cast 1-D parameters (layer norms, biases) to float32 for numerical stability
        param.data = param.data.to(torch.float32)

# Reduce activation memory during training and make inputs require grads for checkpointing
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

# Cast the output of the language-model head to float32 for a stable loss computation
class CastOutputFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputFloat(model.lm_head)
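As an aside, recent peft versions bundle most of these preparation steps into a single helper. Depending on your peft version the function is called prepare_model_for_kbit_training (older releases name it prepare_model_for_int8_training); a minimal sketch of using it instead of the manual steps above:

from peft import prepare_model_for_kbit_training

# Roughly equivalent to the manual freezing / casting / gradient-checkpointing setup above
model = prepare_model_for_kbit_training(model)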
LoRA Configuration
Configure the model to use LoRA for fine-tuning. This involves setting up a LoraConfig and wrapping the model with get_peft_model.
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable parameters: {trainable_params} || all parameters: {all_param} || percentage: {trainable_params/all_param*100:.2f}%")
print_trainable_parameters(model)
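If you prefer not to write this helper yourself, the PEFT-wrapped model also exposes an equivalent method in recent peft releases, which prints the same kind of summary:

model.print_trainable_parameters()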
Dataset Preparation
Load the SQuAD v2 dataset and preprocess it for training. Each example is turned into a single prompt string containing the context, question, and answer; unanswerable questions (SQuAD v2 includes these) map to "Cannot answer".
from datasets import load_dataset
qa_dataset = load_dataset("squad_v2")
def create_prompt(context, question, answer):
    if len(answer["text"]) < 1:
        answer_text = "Cannot answer"
    else:
        answer_text = answer["text"][0]
    prompt_template = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n{answer_text}</s>"
    return prompt_template

mapped_qa_dataset = qa_dataset.map(
    lambda samples: tokenizer(
        create_prompt(samples['context'], samples['question'], samples['answers'])
    )
)
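To sanity-check the prompt formatting before training, it can help to print one example. A quick sketch using the first training row (field names as provided by squad_v2):

sample = qa_dataset["train"][0]
print(create_prompt(sample["context"], sample["question"], sample["answers"]))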
Training the Model
Set up the training arguments and train the model using the Trainer class from Hugging Face Transformers.
import transformers
trainer = transformers.Trainer(
    model=model,
    train_dataset=mapped_qa_dataset["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,  # equals max_steps, so the LR is still warming up for this entire short run
        max_steps=100,
        learning_rate=1e-3,
        fp16=True,
        logging_steps=1,
        output_dir='outputs',
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoids warnings with gradient checkpointing; re-enable for inference
trainer.train()
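After training, the LoRA adapter can also be saved locally. With peft, save_pretrained writes only the small adapter files rather than a full model checkpoint (the directory name below is just an example):

model.save_pretrained("squad-bloom-3b-lora")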
Model Deployment
Log in to Hugging Face and push the fine-tuned adapter to the Hugging Face Hub. Because the PEFT model saves only the LoRA adapter weights, the upload is on the order of megabytes rather than the full 3B-parameter checkpoint.
HUGGING_FACE_USER_NAME = "Mohammedxo51"
from huggingface_hub import notebook_login
notebook_login()
model_name = "squad-bloom-3b"
model.push_to_hub(f"{HUGGING_FACE_USER_NAME}/{model_name}", use_auth_token=True)
Inference
Load the base model and tokenizer, attach the fine-tuned LoRA adapter from the Hub, and run inference to answer questions based on a provided context.
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
peft_model_id = f"{HUGGING_FACE_USER_NAME}/{model_name}"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=False, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
qa_model = PeftModel.from_pretrained(model, peft_model_id)
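Depending on the peft version, the adapter can also be folded into the base weights so inference runs without the PEFT wrapper; a hedged sketch:

qa_model = qa_model.merge_and_unload()  # merges the LoRA weights into the base model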
from IPython.display import display, Markdown
def make_inference(context, question):
    prompt = f"### CONTEXT\n{context}\n\n### QUESTION\n{question}\n\n### ANSWER\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(qa_model.device) for k, v in inputs.items()}
    with torch.cuda.amp.autocast():
        output_tokens = qa_model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
    answer = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
    display(Markdown(answer))
context = "Some context"
question = "A question about the context?"
make_inference(context, question)
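For a concrete run, here is a hypothetical example (the passage and question below are invented purely for illustration):

context = "The Eiffel Tower was completed in 1889 and stands in Paris, France."
question = "When was the Eiffel Tower completed?"
make_inference(context, question)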
Conclusion
In this blog post, I covered the steps to fine-tune a large language model using LoRA. I demonstrated how to set up the environment, prepare the model, configure LoRA, preprocess the dataset, train the model, deploy it to the Hugging Face Hub, and perform inference. This approach allows for efficient fine-tuning with a significantly reduced number of trainable parameters.