Fine-Tuning Llama2

Introduction

In this tutorial, I will walk through the steps to fine-tune Llama 2 using the Hugging Face Transformers library, together with LoRA (Low-Rank Adaptation) and 4-bit quantization to make the process efficient enough to run on a single GPU.

Setting Up the Environment

Start by importing the necessary libraries. This assumes the transformers, datasets, peft, trl, and bitsandbytes packages are installed.

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Loading the Dataset

Load the dataset used for fine-tuning; the name indicates a 1k-example Guanaco subset already formatted with the Llama 2 prompt template.

dataset_name = "mlabonne/guanaco-llama2-1k"
dataset = load_dataset(dataset_name, split="train")
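
As a quick optional check, it helps to confirm that the dataset exposes the "text" column the trainer relies on later. This is just a sanity-check sketch for this particular dataset:

# Optional sanity check: the SFTTrainer below reads the "text" column.
print(dataset)                    # row count and column names
print(dataset[0]["text"][:250])   # preview the start of the first example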

Configuring the Model for Fine-Tuning

BitsAndBytes Configuration

Configure BitsAndBytes to load the base model in 4-bit precision, which is what makes fine-tuning a 7B model feasible on a single GPU.

model_name = "NousResearch/Llama-2-7b-hf"
use_4bit = True                      # load the base model weights in 4-bit precision
bnb_4bit_compute_dtype = "float16"   # dtype used for computations on the 4-bit weights
bnb_4bit_quant_type = "nf4"          # NormalFloat4 quantization
use_nested_quant = False             # nested (double) quantization disabled

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
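
Since the training arguments further down enable bf16, it can be worth checking up front whether the GPU actually supports bfloat16 (Ampere or newer). A small optional check:

# Optional: bf16 (enabled later in TrainingArguments) requires compute capability >= 8.
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("GPU supports bfloat16.")
    else:
        print("GPU does not support bfloat16; consider fp16 instead.")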

Loading the Base Model and Tokenizer

Load the base Llama 2 model with the quantization config defined above, along with its tokenizer.

device_map = {"": 0}  # place the entire model on GPU 0

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False     # the KV cache is not needed during training
model.config.pretraining_tp = 1    # use the standard (non-sliced) linear layers

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no dedicated pad token
tokenizer.padding_side = "right"           # pad on the right for causal LM training
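
At this point the quantized model is already on the GPU; a quick, optional way to see how much memory it occupies (get_memory_footprint is part of the Transformers model API):

# Optional: report the memory used by the 4-bit quantized model.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")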

Applying LoRA Configuration

Configure LoRA for parameter-efficient fine-tuning.

lora_r = 64          # rank of the low-rank update matrices
lora_alpha = 16      # scaling factor for the LoRA update (applied as lora_alpha / r)
lora_dropout = 0.1   # dropout applied to the LoRA layers

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
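
To see why this is parameter-efficient: with rank r = 64, each adapted projection gains only two small matrices instead of a full weight update. A rough back-of-the-envelope sketch, assuming the 4096 hidden size of Llama-2-7B:

# Rough estimate: a rank-64 adapter on one 4096x4096 projection adds
# two small matrices (64x4096 and 4096x64) instead of updating ~16.8M weights.
hidden_size = 4096                       # Llama-2-7B hidden dimension (assumption stated above)
lora_params = 2 * hidden_size * lora_r   # ~524k trainable parameters per adapted projection
full_params = hidden_size * hidden_size  # ~16.8M frozen parameters in the original weight
print(f"LoRA adds {lora_params:,} trainable params per projection vs {full_params:,} frozen")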

Defining Training Arguments

Set up the training arguments for the fine-tuning process.

output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = True                          # requires an Ampere-or-newer GPU
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
save_steps = 0
logging_steps = 25
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"          # paged optimizer to better handle memory spikes
lr_scheduler_type = "cosine"
max_steps = -1                       # -1 means train for num_train_epochs instead
warmup_ratio = 0.03
group_by_length = True               # batch similar-length sequences to reduce padding

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=0.3,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)
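
One thing worth keeping in mind is that the batch size the optimizer effectively sees is the per-device batch size multiplied by the gradient accumulation steps (and the number of GPUs, if more than one is used). A small sketch:

# Effective batch size seen by each optimizer step.
n_gpus = max(torch.cuda.device_count(), 1)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
print(f"Effective batch size: {effective_batch_size}")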

Fine-Tuning the Model

Set up the trainer and start the fine-tuning process.

max_seq_length = None   # fall back to the trainer's default maximum sequence length
packing = False         # do not pack multiple short examples into one sequence

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

trainer.train()
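
Since report_to is set to "tensorboard", the loss curves end up in the output directory, but they can also be read straight from the trainer; trainer.state.log_history is part of the standard Trainer API and holds one entry per logging step:

# Optional: print the loss recorded every `logging_steps` steps.
for entry in trainer.state.log_history:
    if "loss" in entry:
        print(f"step {entry['step']}: loss {entry['loss']:.4f}")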

Saving the Fine-Tuned Model

Save the trained LoRA adapter weights for later use.

new_model = "Llama-2-7b-chat-finetune"
trainer.model.save_pretrained(new_model)
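
Because the trainer was given a peft_config, what gets saved here is the LoRA adapter rather than a full model. If you want a standalone model, one optional follow-up (a sketch, assuming enough memory is available and an illustrative output path) is to reload the base model in fp16 and merge the adapter into it using PeftModel, which was imported at the top:

# Optional: merge the LoRA adapter into the base model for adapter-free inference.
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged weights and the tokenizer together (path chosen for illustration).
merged_model.save_pretrained("Llama-2-7b-chat-finetune-merged")
tokenizer.save_pretrained("Llama-2-7b-chat-finetune-merged")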

Testing the Fine-Tuned Model

Generate text with the fine-tuned model, using the Llama 2 instruction format, to verify that it responds sensibly.

logging.set_verbosity(logging.CRITICAL)

prompt = "How to fly a plane?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Conclusion

I have fine-tuned the Llama 2 model using LoRA and 4-bit quantization with the Hugging Face Transformers, PEFT, and TRL libraries. Because only the small adapter matrices are trained while the quantized base model stays frozen, this approach makes model adaptation practical even with limited computational resources.