Hosting Llama 3.1 Locally with Dual RTX 4090

Introduction

Hosting large language models (LLMs) like Llama on local hardware gives you the flexibility to keep sensitive data in-house while getting the most out of high-end consumer GPUs like the RTX 4090. In this post, I'll walk through how to set up Llama (Meta-Llama-3.1-70B) on an Ubuntu server with dual RTX 4090 GPUs, using vLLM as the backend for efficient model serving.

Creating a Virtual Environment

python3 -m venv .venv
source .venv/bin/activate

These commands create and activate a virtual environment named .venv, which isolates the project's dependencies and packages from the system Python installation.

Installing vLLM

pip install vllm

vLLM is a high-performance backend server specifically designed for efficient serving of large language models. It allows you to expose LLMs as APIs that can be queried from external applications. Unlike traditional model-serving frameworks, vLLM is optimized for low-latency inference and supports advanced features like tensor parallelism (for distributing models across multiple GPUs).

By installing vllm, I am setting up the infrastructure needed to handle model requests and interact with the Llama model via an API.

Installing Tokenizers

pip install tokenizers==0.19.0

The tokenizers library is crucial for breaking down input text into tokens that the model understands. Different models use different tokenization schemes, so it's essential to have the correct version. Here, I am installing version 0.19.0 to ensure compatibility with the Llama model.
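
As a quick sanity check, you can confirm from a Python shell that both packages import cleanly and report their versions (the exact vLLM version depends on when you run pip install):

# Quick sanity check that both packages are installed and importable.
import tokenizers
import vllm

print("vllm:", vllm.__version__)
print("tokenizers:", tokenizers.__version__)  # should print 0.19.0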

Launching the vLLM Server

python3 -m vllm.entrypoints.openai.api_server \
  --model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 \
  --served-model-name meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.98 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192

This command launches the vLLM server with the necessary parameters to serve the Llama model. Let's break down each part in detail:

  1. Loading the Quantized Model
--model neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16

Quantization reduces the precision of the model weights (in this case, 4-bit weights with 16-bit activations) to shrink the memory footprint and speed up inference without significantly impacting accuracy. The neuralmagic/Meta-Llama-3.1-70B-Instruct-quantized.w4a16 model is a quantized version of the 70B Llama model, which makes it feasible to host on consumer hardware like dual RTX 4090 GPUs, where the full-precision weights would not come close to fitting.
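
To get a feel for why this matters, here is a rough back-of-the-envelope estimate of the weight memory alone, ignoring the KV cache and runtime overhead (approximate numbers, not an exact accounting):

# Rough weight-memory estimate for a 70B-parameter model (weights only,
# ignoring KV cache, activations, and framework overhead).
params = 70e9

fp16_gb = params * 2 / 1e9    # 16-bit weights: ~140 GB
w4_gb = params * 0.5 / 1e9    # 4-bit weights:  ~35 GB

print(f"FP16 weights:  ~{fp16_gb:.0f} GB")  # far beyond 2 x 24 GB of VRAM
print(f"W4A16 weights: ~{w4_gb:.0f} GB")    # fits across two RTX 4090s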

  2. Defining the Model Name for Serving
--served-model-name meta-llama/Meta-Llama-3.1-70B-Instruct

This parameter specifies the name by which the model will be identified when making API requests. It is especially useful if you plan to serve multiple models in the future, as each can be uniquely named and referenced.
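
Because vLLM exposes an OpenAI-compatible API, you can verify the served name once the server is running by listing the available models. A minimal sketch, assuming the server is reachable on localhost:8000 (the host and port set above):

# List the models the server advertises via the OpenAI-compatible /v1/models endpoint.
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()

for model in resp.json()["data"]:
    print(model["id"])  # should print meta-llama/Meta-Llama-3.1-70B-Instruct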

  3. Setting Tensor Parallelism for Multi-GPU Use
--tensor-parallel-size 2

Since I have two RTX 4090 GPUs, I set the tensor parallel size to 2. Tensor parallelism splits the model's weights and computation across multiple GPUs, sharing the load and accelerating inference. At 70B parameters, the model is too large for a single 24 GB card even after 4-bit quantization, so this kind of parallelism is essential here.

  4. Configuring GPU Memory Utilization
--gpu-memory-utilization 0.98

This parameter lets vLLM use up to 98% of each GPU's memory for the model weights and the KV cache. A high value maximizes the memory available for the KV cache, which in turn allows larger batches and longer sequences, at the cost of leaving very little headroom for anything else running on the GPUs. Since these cards are dedicated to serving the model, that trade-off is fine here.
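
As a rough sketch of how that budget breaks down per GPU, assuming the ~35 GB of quantized weights shard evenly across the two cards (approximate numbers only):

# Approximate per-GPU memory budget with --gpu-memory-utilization 0.98
# and --tensor-parallel-size 2 (weights assumed to shard evenly).
vram_per_gpu_gb = 24.0              # RTX 4090
budget_gb = vram_per_gpu_gb * 0.98  # what vLLM may allocate per GPU
weights_per_gpu_gb = 35.0 / 2       # ~17.5 GB of 4-bit weights per GPU

print(f"Per-GPU budget:    ~{budget_gb:.1f} GB")
print(f"Weights per GPU:   ~{weights_per_gpu_gb:.1f} GB")
print(f"Left for KV cache: ~{budget_gb - weights_per_gpu_gb:.1f} GB")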

  5. Setting the Host Address
--host 0.0.0.0

By setting the host to 0.0.0.0, the server listens on all available network interfaces. This is crucial if you want to access the model API from external machines on your network, such as other servers or workstations.

  6. Setting the Port
--port 8000

The port number defines where the API server will be accessible. Port 8000 is a common choice, but you can change it based on your network configuration or preferences.

  7. Defining Maximum Model Length
--max-model-len 8192

This parameter sets the maximum context length, in tokens, that the server will accept for a single request (prompt plus generated output). Llama 3.1 supports much longer contexts, but capping the length at 8192 keeps the KV cache small enough to fit comfortably within the GPU memory budget while still leaving room for fairly long inputs.
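
With the server running, any OpenAI-compatible client can query it. Here is a minimal sketch using the requests library, assuming the server is reachable at localhost:8000 and was started with the command above (the prompt and sampling parameters are just placeholders):

# Minimal chat-completion request against the local vLLM server.
import requests

payload = {
    # Must match the name passed to --served-model-name
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
        {"role": "user", "content": "Summarize the benefits of hosting an LLM locally in two sentences."}
    ],
    "max_tokens": 200,
    "temperature": 0.7,
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
resp.raise_for_status()

print(resp.json()["choices"][0]["message"]["content"])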

Conclusion

By following the steps above, I have successfully hosted the Llama model on an Ubuntu server using dual RTX 4090 GPUs. With vLLM acting as the backend server, I now have a scalable, high-performance API that can handle real-time inference requests. This setup leverages GPU parallelism and model quantization to optimize performance while ensuring the hardware is fully utilized.

Now, I can integrate this locally hosted model into various applications, whether for research, development, or deployment in production environments.