Fine-Tuning Meta's LLaMA 3.1 8B with Unsloth
In this post, I will walk through how to fine-tune Meta's LLaMA 3.1 8B model using Unsloth, a library optimized for efficient LLM training. I will cover everything from installing dependencies to training and saving the fine-tuned model.
1. Setting Up the Environment
Before we begin fine-tuning, we need to install the required packages:
%%capture
!pip install unsloth
!pip install datasets
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
- unsloth is the primary library used for loading and fine-tuning the LLaMA model efficiently.
- datasets helps us handle and preprocess text datasets.
- We uninstall and reinstall unsloth from its latest GitHub version (with --no-deps, so the dependencies pulled in by the first install are kept) to ensure we have the newest features and bug fixes. An optional environment check is sketched below.
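Before loading an 8B model, it can be worth confirming that a GPU is actually visible. This is a minimal, optional sketch using only standard PyTorch calls; it is not part of the original setup:

import torch

# Confirm a CUDA GPU is visible; fine-tuning an 8B model on CPU is impractical.
print(torch.__version__, torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))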
2. Model Configuration and Loading
from unsloth import FastLanguageModel
import torch
# Configuration
max_seq_length = 8192 # Setting the context length to 8192
dtype = torch.bfloat16
load_in_4bit = False
model_name = "unsloth/Meta-Llama-3.1-8B"
# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
- We define a sequence length of 8192, which means the model can process long-context data.
- dtype = torch.bfloat16 sets bfloat16 as the precision type, reducing memory usage relative to float32; a sketch for choosing this based on GPU support follows this list.
- load_in_4bit = False keeps the weights in 16-bit precision; setting it to True would load a 4-bit quantized model to save further memory.
- The model and tokenizer are loaded using FastLanguageModel.from_pretrained(), which fetches Meta LLaMA 3.1 8B.
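Hardware support for bfloat16 varies (it requires an Ampere-class GPU or newer), so instead of hard-coding the dtype, an optional one-line sketch can pick it automatically; this is a convenience I am assuming here, not part of the original configuration:

# Fall back to float16 on GPUs without bfloat16 support.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16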
3. Applying LoRA for Efficient Fine-Tuning
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
We apply LoRA (Low-Rank Adaptation) to reduce the number of trainable parameters:
- r = 16: the rank of the LoRA update matrices (a trade-off between memory and adaptability).
- lora_alpha = 16: a scaling factor for the LoRA layers.
- target_modules: the attention and MLP projection layers that receive LoRA adapters.
- use_gradient_checkpointing = "unsloth": reduces memory usage during training by recomputing activations in the backward pass.
- This allows efficient fine-tuning without modifying the entire model; the sketch below verifies how few parameters remain trainable.
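To see the effect of LoRA concretely, we can count trainable versus total parameters with plain PyTorch (a minimal sketch; with r = 16 the trainable share is only a small fraction of the 8B weights):

# Count parameters that require gradients (the LoRA adapters) vs. the total.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")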
4. Loading and Preprocessing the Dataset
from datasets import Dataset
# Load dataset from JSON file
dataset = Dataset.from_json("dataset.json")
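The exact contents of dataset.json depend on your data; the formatting step in the next section assumes each record has "prompt" and "response" fields. Here is a hypothetical example of that structure (the records are illustrative only, not from the original post):

import json

# Illustrative records only; a real dataset.json would contain your own data.
sample = [
    {"prompt": "What is the capital of France?",
     "response": "The capital of France is Paris."},
    {"prompt": "Explain gradient checkpointing in one sentence.",
     "response": "Gradient checkpointing saves memory by recomputing activations during the backward pass instead of storing them."},
]
with open("dataset.json", "w") as f:
    json.dump(sample, f, indent=2)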
5. Formatting the Dataset
custom_prompt = """Below is a prompt and its corresponding response. Write a completion that adheres to the response.
### Prompt:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    prompts = examples["prompt"]
    responses = examples["response"]
    texts = []
    for prompt, response in zip(prompts, responses):
        text = custom_prompt.format(prompt, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)
- We format the dataset into a prompt-response structure, appending the EOS token at the end of each example so the model learns where a completion ends.
- This function ensures that our training data follows a consistent structure for fine-tuning; a quick way to verify this is shown below.
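An optional sanity check is to print the first formatted example and confirm the template and EOS token were applied as expected:

# Inspect one formatted training example.
print(dataset[0]["text"])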
6. Tokenizing the Dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=8192)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Save tokenized dataset for fine-tuning
tokenized_dataset.save_to_disk("tokenized_dataset")
print("Dataset preprocessing complete. Ready for fine-tuning!")
- The function tokenizes the formatted dataset, truncating longer inputs to the 8192-token limit and padding shorter ones to a uniform length.
- Note that SFTTrainer below also receives dataset_text_field="text" and tokenizes that field itself, so this step mainly serves to validate lengths and keep a preprocessed copy on disk.
- The dataset is then saved for training. A sketch for checking raw token lengths follows this list.
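To know whether the 8192-token limit actually truncates anything, a short optional sketch can report token lengths before padding (this re-tokenizes the text, so it may take a moment on large datasets):

# Token lengths without padding, to spot examples longer than the limit.
lengths = [len(tokenizer(t)["input_ids"]) for t in dataset["text"]]
print(f"Longest example: {max(lengths)} tokens; over 8192: {sum(l > 8192 for l in lengths)}")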
7. Fine-Tuning the Model
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)
trainer_stats = trainer.train()
- We use SFTTrainer (supervised fine-tuning) from trl to manage training.
- gradient_accumulation_steps=4 accumulates gradients over four forward passes, so the effective batch size is 2 × 4 = 8 while per-step memory stays low.
- learning_rate=2e-4 sets the learning rate for gradual updates.
- The 8-bit optimizer adamw_8bit is used for memory efficiency.
- The model is trained for 2 epochs with a batch size of 2 per GPU; a sketch for checking peak GPU memory after the run follows this list.
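After training, it can be useful to see how much GPU memory the run actually needed, for example when sizing future runs. This optional sketch uses standard PyTorch memory statistics:

# Peak GPU memory reserved during training, in GiB.
print(f"Peak reserved memory: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GiB")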
8. Saving the Fine-Tuned Model
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
- save_method="merged_16bit" merges the LoRA adapters back into the base weights and saves the result in 16-bit format to optimize storage.
- The model is now ready for inference and further deployment; a quick smoke test is sketched below.
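As a quick smoke test, we can generate a completion with the same prompt template used for training. This is a minimal sketch: the prompt text is illustrative, and FastLanguageModel.for_inference enables Unsloth's faster generation path on the in-memory model:

from unsloth import FastLanguageModel

# Switch the model into Unsloth's inference mode and generate one completion.
FastLanguageModel.for_inference(model)
inputs = tokenizer(
    custom_prompt.format("What is the capital of France?", ""),
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))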
Conclusion
Fine-tuning LLaMA 3.1 with Unsloth offers a powerful and memory-efficient way to adapt LLMs for custom use cases. By using LoRA, structured dataset preparation, and an optimized training approach, we can achieve high-quality results with limited resources.