Performance Optimization
This guide provides tips and techniques for optimizing the performance of local_llm_kit.
Hardware Considerations
GPU Selection
The choice of GPU significantly impacts inference performance:
NVIDIA RTX 30/40 Series: Excellent performance with consumer GPUs
NVIDIA A100/H100: Enterprise-grade performance for production deployments
AMD GPUs: Works with ROCm backend (limited support)
Recommended VRAM for different model sizes:
| Model Size | Recommended VRAM |
|---|---|
| 7B | 8GB+ (16GB optimal) |
| 13B | 16GB+ (24GB optimal) |
| 30B+ | 24GB+ (80GB optimal) |
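As a rough cross-check on these figures, weight memory can be estimated from the parameter count and the number of bits stored per weight. The sketch below is a back-of-the-envelope calculation only: the function name `estimate_weight_memory_gib` and the default 20% overhead factor are illustrative assumptions, not part of local_llm_kit, and real usage also grows with context length and batch size.

```python
def estimate_weight_memory_gib(params_billion: float, bits_per_weight: int,
                               overhead: float = 1.2) -> float:
    """Estimate the memory needed to hold model weights, in GiB.

    `overhead` is an assumed fudge factor for buffers and activations;
    it is a guess for illustration, not a measured value.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1024**3


# Compare FP16 (16-bit) against 4-bit quantized weights for common sizes.
for size in (7, 13, 30):
    fp16 = estimate_weight_memory_gib(size, 16)
    int4 = estimate_weight_memory_gib(size, 4)
    print(f"{size}B: ~{fp16:.1f} GiB at FP16, ~{int4:.1f} GiB at 4-bit")
```

Estimates like these suggest why the lower bounds in the table are usually only reachable with quantized weights.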
CPU Optimization
For CPU inference:
Use CPUs with AVX2/AVX512 instruction sets
Allocate at least 16GB of RAM for medium-sized models
Set appropriate thread counts based on CPU cores
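One way to pick a thread count is to start from the number of cores the operating system reports and leave some headroom for the rest of the application. The helper below is a minimal sketch using only the standard library; `pick_thread_count` and its `reserve` parameter are hypothetical names introduced for illustration.

```python
import os


def pick_thread_count(reserve: int = 1) -> int:
    """Return a thread count that leaves `reserve` cores free, never below 1."""
    available = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, available - reserve)


num_threads = pick_thread_count()
print(f"Using {num_threads} inference threads")
```

Note that `os.cpu_count()` reports logical cores. On CPUs with simultaneous multithreading, a value closer to the physical core count may perform better, so treat the result as a starting point for measurement. The value could then be passed to a setting such as the `num_cpu_threads` argument shown under Advanced Configuration.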
Memory Optimization
Quantization
Quantizing a model reduces memory usage substantially; 4-bit weights take roughly a quarter of the space of FP16 weights:
from local_llm_kit import LLMClient

# Using GPTQ quantized model with Transformers backend
client = LLMClient(
    model="llama2-7b-4bit",
    backend="transformers",
    model_path="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization_config={
        "bits": 4,
        "group_size": 128
    }
)

# Using GGUF quantized model with llama.cpp backend
client = LLMClient(
    model="llama2-7b-q4_k_m",
    backend="llama.cpp",
    model_path="/path/to/llama-2-7b-chat.q4_k_m.gguf"
)
Efficient KV Cache Management
To bound the memory used by the key-value (KV) cache:
from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    kv_cache_config={
        "max_cache_size_mb": 1024,     # Maximum KV cache size in MB
        "enable_cache_cleaning": True  # Automatically clear old entries
    }
)

# For long-running applications, periodically clear the cache
client.clear_kv_cache()
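To choose a sensible cache budget, it helps to know how much memory the KV cache needs per token. The sketch below uses the standard formula for a decoder-only transformer (one key tensor and one value tensor per layer). The function `kv_cache_bytes` is a hypothetical helper, and the model shape (32 layers, 32 KV heads, head dimension 128, FP16 values) is an assumption matching the published Llama-2-7B configuration; check your own model's config before relying on it.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed shape for a Llama-2-7B style model with FP16 cache values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(f"Per token: {per_token / 1024**2:.2f} MiB")
print(f"4096-token context: {full_context / 1024**3:.2f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so a 1024 MB budget holds about 2048 tokens at batch size 1.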
Batch Processing
Process multiple prompts efficiently with batching:
import concurrent.futures

from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    max_batch_size=32  # Set based on GPU memory
)

prompts = [
    "Write a poem about mountains.",
    "Explain quantum physics.",
    "What is the capital of France?",
    # ... more prompts
]

# Option 1: Built-in batching
responses = client.batch_generate(
    prompts=prompts,
    max_tokens=100
)

# Option 2: Manual parallelization with threading
def process_prompt(prompt):
    return client.chat.completions.create(
        model="llama2",
        messages=[{"role": "user", "content": prompt}]
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_prompt, prompts))
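When the prompt list is longer than the configured batch size, it can be split into fixed-size chunks before submission. Whether `batch_generate` already does this internally is not stated here, so the helper below is a defensive sketch; `chunked` is a hypothetical name and uses only the standard library.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# 70 placeholder prompts split for a batch size of 32.
example_prompts = [f"Prompt {i}" for i in range(70)]
batch_sizes = [len(batch) for batch in chunked(example_prompts, 32)]
print(batch_sizes)  # [32, 32, 6]
```

Each chunk could then be passed to `client.batch_generate(prompts=batch, ...)` in turn, which keeps peak memory bounded by the batch size.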
GPU Optimizations
Utilize Tensor Parallelism
For multi-GPU setups, distribute the model across GPUs:
client = LLMClient(
    model="llama2-70b",
    tensor_parallel_size=4,  # Use 4 GPUs
    device="cuda"            # Automatically distribute across available GPUs
)
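Many tensor-parallel implementations split attention heads evenly across devices, which requires the head count to be divisible by the parallel size. Whether local_llm_kit enforces this is an assumption, and `valid_tensor_parallel_sizes` is a hypothetical helper, but a quick check like the following can rule out configurations that are unlikely to work.

```python
from typing import List


def valid_tensor_parallel_sizes(num_attention_heads: int,
                                num_gpus: int) -> List[int]:
    """List GPU counts up to `num_gpus` that divide the head count evenly."""
    return [n for n in range(1, num_gpus + 1)
            if num_attention_heads % n == 0]


# Assumed: a Llama-2-70B style model with 64 attention heads, 8 GPUs available.
print(valid_tensor_parallel_sizes(64, 8))  # [1, 2, 4, 8]
```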
Flash Attention
Enable flash attention for faster computation:
client = LLMClient(
    model="llama2",
    backend="transformers",
    use_flash_attention=True
)
Mixed Precision
Use FP16 or BFloat16 for faster computation:
client = LLMClient(
    model="llama2",
    backend="transformers",
    dtype="bfloat16"  # Or "float16" based on GPU support
)
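The choice between the two formats can be made at runtime by probing the GPU. The sketch below assumes the Transformers backend runs on PyTorch and uses `torch.cuda.is_bf16_supported()` when PyTorch is installed; `pick_dtype` is a hypothetical helper, and the `"float32"` fallback for CPU-only machines is an assumption about what the `dtype` argument accepts.

```python
def pick_dtype() -> str:
    """Pick a dtype string based on what the local GPU supports."""
    try:
        import torch
    except ImportError:
        return "float32"  # PyTorch not installed: nothing to probe
    if not torch.cuda.is_available():
        return "float32"  # CPU only: half precision is often slower here
    if torch.cuda.is_bf16_supported():
        return "bfloat16"  # Wider exponent range than float16
    return "float16"


print(f"Selected dtype: {pick_dtype()}")
```

The returned string could then be passed as `dtype=pick_dtype()` when constructing the client.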
Streaming Optimization
For streaming responses, optimize chunk size:
# Balance between latency and throughput with chunk size
for chunk in client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    chunk_token_size=16  # Smaller for lower latency, larger for better throughput
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
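To compare chunk sizes objectively, measure time to first chunk alongside total streaming time. The sketch below works on any iterable of text pieces, so it uses a stand-in generator rather than a real model; `measure_stream` and `fake_stream` are hypothetical names introduced for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (seconds to first chunk, total seconds, concatenated text)."""
    start = time.perf_counter()
    first_chunk_time = None
    parts = []
    for text in chunks:
        if first_chunk_time is None:
            first_chunk_time = time.perf_counter() - start
        parts.append(text)
    total_time = time.perf_counter() - start
    if first_chunk_time is None:  # Empty stream: no chunk ever arrived
        first_chunk_time = total_time
    return first_chunk_time, total_time, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response; sleeps to simulate generation."""
    for word in ("Once ", "upon ", "a ", "time"):
        time.sleep(0.01)
        yield word


ttft, total, text = measure_stream(fake_stream())
print(f"First chunk after {ttft:.3f}s, finished in {total:.3f}s: {text!r}")
```

With a real client, the generator passed in would yield `chunk.choices[0].delta.content or ""` for each streamed chunk.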
Performance Benchmarking
Measure and optimize performance:
import time

from local_llm_kit import LLMClient

client = LLMClient(model="llama2")
prompt = "Explain the theory of relativity in simple terms."

# Warmup: the first call often includes one-time loading and compilation costs
client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=10
)

# Benchmark (perf_counter is monotonic and higher resolution than time.time)
start_time = time.perf_counter()
response = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=100
)
end_time = time.perf_counter()

# Calculate metrics
generation_time = end_time - start_time
output_tokens = response.usage.completion_tokens
tokens_per_second = output_tokens / generation_time
print(f"Generation time: {generation_time:.2f}s")
print(f"Output tokens: {output_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")
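A single timed run can be noisy, so it is usually worth repeating the measurement and reporting the median. The sketch below wraps that pattern around any callable that performs one generation and returns the number of output tokens. `benchmark` and `fake_generate` are hypothetical names; the stand-in generator sleeps instead of calling a model so the example runs anywhere.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(generate: Callable[[], int], runs: int = 5,
              warmup: int = 1) -> Dict[str, float]:
    """Time `generate` repeatedly and summarize tokens per second."""
    for _ in range(warmup):
        generate()  # Discard warmup runs
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    return {
        "median_tps": statistics.median(rates),
        "min_tps": min(rates),
        "max_tps": max(rates),
    }


def fake_generate() -> int:
    """Stand-in for a model call: pretends to produce 100 tokens."""
    time.sleep(0.01)
    return 100


result = benchmark(fake_generate, runs=3)
print(f"Median: {result['median_tps']:.0f} tokens/s "
      f"(min {result['min_tps']:.0f}, max {result['max_tps']:.0f})")
```

With a real client, `generate` would issue the request and return `response.usage.completion_tokens`.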
Common Performance Issues
Out of Memory: Reduce model size, enable quantization, or increase VRAM
Slow Inference: Try mixed precision, flash attention, or a faster backend
High CPU Usage: Limit thread count or switch to GPU inference
Batch Processing Bottlenecks: Tune batch size, use async processing
Advanced Configuration
For production deployments:
client = LLMClient(
    model="llama2",
    # Memory optimization
    max_memory_mapping={
        0: "24GiB",  # GPU 0: 24GB
        1: "24GiB"   # GPU 1: 24GB
    },
    # Computation optimization
    compute_dtype="bfloat16",
    use_flash_attention=True,
    # Cache settings
    disk_cache_config={
        "enable": True,
        "cache_dir": "/path/to/cache",
        "max_size_gb": 100
    },
    # Thread and batch settings
    num_cpu_threads=8,
    max_batch_size=16
)