Fine-Tuning LLaMA 3.1 with Unsloth

Introduction

In this post, I will walk through how to fine-tune Meta's LLaMA 3.1 8B model using Unsloth, a library optimized for efficient LLM training. I will cover everything from installing dependencies to training and saving the fine-tuned model.

1. Setting Up the Environment

Before we begin fine-tuning, we need to install the required packages:

%%capture
!pip install unsloth
!pip install datasets
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
  • %%capture suppresses the installation output in the notebook.
  • unsloth is the primary library used for loading and fine-tuning the LLaMA model efficiently.
  • datasets handles loading and preprocessing the text dataset.
  • We then uninstall and reinstall unsloth from the latest GitHub commit to pick up the newest features and bug fixes.

2. Model Configuration and Loading

from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 8192  # Context length: each training example can span up to 8192 tokens
dtype = torch.bfloat16  # bfloat16 precision (supported on Ampere and newer GPUs)
load_in_4bit = False  # Set True to load 4-bit quantized weights and save VRAM

model_name = "unsloth/Meta-Llama-3.1-8B"

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
  • We define a sequence length of 8192, so each training example can span up to 8192 tokens of context.
  • dtype = torch.bfloat16 sets bfloat16 as the precision type, reducing memory usage relative to float32.
  • The model is loaded using FastLanguageModel.from_pretrained(), which fetches Meta LLaMA 3.1 8B.
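
As a quick aside (my own addition, not part of the original walkthrough): it can help to check available GPU memory before choosing between full-precision and 4-bit loading, since the 8B model in bfloat16 needs roughly 16 GB for the weights alone.

import torch

gpu = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu.name}, total memory: {gpu.total_memory / 1024**3:.1f} GB")
# The 8B model in bfloat16 takes ~16 GB for weights alone;
# on smaller GPUs, set load_in_4bit = True instead.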

3. Applying LoRA for Efficient Fine-Tuning

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,        # Scaling factor for the LoRA updates
    lora_dropout = 0,       # 0 is optimized in Unsloth
    bias = "none",          # "none" is optimized in Unsloth
    use_gradient_checkpointing = "unsloth",  # Cuts memory usage for long contexts
    random_state = 3407,
    use_rslora = False,     # Rank-stabilized LoRA disabled
    loftq_config = None,    # No LoftQ quantization
)

We apply LoRA (Low-Rank Adaptation) to reduce the number of trainable parameters:

  • r = 16: The rank of LoRA updates (trade-off between memory and adaptability).
  • lora_alpha = 16: A scaling factor for LoRA layers.
  • use_gradient_checkpointing = "unsloth": Reduces memory usage during training.
  • This allows efficient fine-tuning without modifying the entire model (see the quick check after this list).
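
To see just how few parameters are actually trained, you can print the trainable-parameter summary. This assumes the object returned by get_peft_model behaves like a standard PEFT model, which exposes print_trainable_parameters:

# Prints trainable vs. total parameter counts; with r=16 on these
# target modules, well under 1% of the 8B parameters are trainable.
model.print_trainable_parameters()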

4. Loading and Preprocessing the Dataset

from datasets import Dataset

# Load dataset from JSON file
dataset = Dataset.from_json("dataset.json")
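
The formatting step in the next section expects each record to have prompt and response fields. A quick way to confirm the dataset loaded as expected (the example record is illustrative, not from the original post):

# dataset.json is expected to look like:
# [{"prompt": "What is LoRA?", "response": "LoRA is a low-rank adaptation method..."}, ...]
print(dataset.column_names)  # should include "prompt" and "response"
print(dataset[0])            # inspect the first record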

5. Formatting the Dataset

custom_prompt = """Below is a prompt. Write a response that appropriately completes the prompt.

### Prompt:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Append EOS so the model learns when to stop generating

def formatting_prompts_func(examples):
    prompts = examples["prompt"]
    responses = examples["response"]
    texts = []
    for prompt, response in zip(prompts, responses):
        text = custom_prompt.format(prompt, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
  • We format the dataset into a prompt-response structure, appending EOS_TOKEN so the model learns where a completion ends.
  • This function ensures that our training data follows a consistent structure for fine-tuning; you can print a sample to verify, as shown below.
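
A quick verification (my addition): print the first formatted example to confirm the template and EOS token were applied.

# The output should show the "### Prompt:" / "### Response:" template
# with the EOS token appended at the end.
print(dataset[0]["text"])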

6. Tokenizing the Dataset

def tokenize_function(examples):
    # Truncate to the model's context window and pad to a fixed length
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=max_seq_length)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Save tokenized dataset for fine-tuning
tokenized_dataset.save_to_disk("tokenized_dataset")

print("Dataset preprocessing complete. Ready for fine-tuning!")
  • The function tokenizes our formatted dataset, ensuring each sample fits within the 8192-token limit.
  • We truncate longer inputs and pad shorter ones to maintain a uniform length.
  • The dataset is then saved to disk so it can be reused across sessions, as the sketch below shows.
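
The saved dataset can be reloaded later with load_from_disk, which is handy when preprocessing and training happen in separate sessions (a small sketch of my own):

from datasets import load_from_disk

# Reload the preprocessed dataset in a later session
tokenized_dataset = load_from_disk("tokenized_dataset")
print(tokenized_dataset)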

7. Fine-Tuning the Model

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=2,  
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  
    ),
)
trainer_stats = trainer.train()
  • We use SFTTrainer (Supervised Fine-Tuning) from trl to manage training.
  • gradient_accumulation_steps=4 accumulates gradients over 4 steps before each optimizer update, keeping per-step memory low.
  • learning_rate=2e-4 is a common starting point for LoRA fine-tuning.
  • The adamw_8bit optimizer stores its state in 8-bit, substantially reducing optimizer memory compared to 32-bit AdamW.
  • The model is trained for 2 epochs with a batch size of 2 per GPU, for an effective batch size of 8 (see the note after this list).
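
The effective batch size and the training summary can be checked directly (my addition; trainer.train() returns a TrainOutput whose metrics include runtime and final loss):

# Effective batch size per optimizer step (per GPU):
# per_device_train_batch_size * gradient_accumulation_steps = 2 * 4 = 8
print(trainer_stats.metrics)  # e.g. train_runtime, train_loss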

8. Saving the Fine-Tuned Model

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
  • Merges the LoRA adapters back into the base weights and saves the result in 16-bit format.
  • The model is now ready for inference and deployment, as in the sketch below.
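
A minimal inference sketch with the merged model (my own addition; it assumes the merged checkpoint in the model directory can be reloaded with the same from_pretrained call, and the prompt is illustrative):

from unsloth import FastLanguageModel

# Reload the merged model saved in the "model" directory
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="model",
    max_seq_length=8192,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path

# Format an input with the same template used for training (empty response)
prompt = custom_prompt.format("Your question here", "")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))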

Conclusion

Fine-tuning LLaMA 3.1 with Unsloth offers a powerful and memory-efficient way to adapt LLMs for custom use cases. By using LoRA, structured dataset preparation, and an optimized training approach, we can achieve high-quality results with limited resources.