Performance Optimization

This guide provides tips and techniques for optimizing the performance of local_llm_kit.

Hardware Considerations

GPU Selection

The choice of GPU significantly impacts inference performance:

NVIDIA RTX 30/40 Series: Excellent performance with consumer GPUs
NVIDIA A100/H100: Enterprise-grade performance for production deployments
AMD GPUs: Work with the ROCm backend (limited support)

Recommended VRAM for different model sizes:

Model Size	Recommended VRAM
7B	8GB+ (16GB optimal)
13B	16GB+ (24GB optimal)
30B+	24GB+ (80GB optimal)
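
The figures in the table can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is illustrative only. `estimate_weight_memory_gb` is a hypothetical helper, not part of local_llm_kit, and it ignores the KV cache, activations, and framework overhead, all of which add to the real requirement.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billions * 1e9 * bytes_per_param / 1e9


for size in (7, 13, 30):
    fp16 = estimate_weight_memory_gb(size, 16)
    int4 = estimate_weight_memory_gb(size, 4)
    print(f"{size}B: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit (weights only)")
```

By this estimate a 7B model needs about 14 GB for 16-bit weights alone, which suggests the 8GB figure above is only realistic with a quantized model.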

CPU Optimization

For CPU inference:

Use CPUs with AVX2/AVX512 instruction sets
Allocate at least 16GB of RAM for medium-sized models
Set appropriate thread counts based on CPU cores
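
A common starting point for the thread count is one thread per physical core. The sketch below assumes two hardware threads per core (typical with SMT/Hyper-Threading), which may not match your CPU. `pick_thread_count` is a hypothetical helper, not a local_llm_kit function, and the result should be treated as a first guess to benchmark against.

```python
import os


def pick_thread_count(logical_cores: int, threads_per_core: int = 2, reserved: int = 0) -> int:
    """Suggest a worker thread count close to the number of physical cores."""
    physical = max(1, logical_cores // threads_per_core)
    # Optionally leave some cores free for the rest of the application.
    return max(1, physical - reserved)


logical = os.cpu_count() or 1  # os.cpu_count() can return None
print(f"Logical cores: {logical}, suggested threads: {pick_thread_count(logical)}")
```

The suggested value could then be passed to the `num_cpu_threads` option shown under Advanced Configuration.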

Memory Optimization

Quantization

Quantizing a model substantially reduces its memory footprint. For example, 4-bit weights take roughly a quarter of the space of 16-bit weights:

from local_llm_kit import LLMClient

# Using GPTQ quantized model with Transformers backend
client = LLMClient(
    model="llama2-7b-4bit",
    backend="transformers",
    model_path="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization_config={
        "bits": 4,
        "group_size": 128
    }
)

# Using GGUF quantized model with llama.cpp backend
client = LLMClient(
    model="llama2-7b-q4_k_m",
    backend="llama.cpp",
    model_path="/path/to/llama-2-7b-chat.q4_k_m.gguf"
)

Efficient KV Cache Management

To optimize the key-value cache:

from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    kv_cache_config={
        "max_cache_size_mb": 1024,   # Maximum KV cache size in MB
        "enable_cache_cleaning": True  # Automatically clear old entries
    }
)

# For long-running applications, periodically clear the cache
client.clear_kv_cache()
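
To choose a sensible cache limit, it helps to estimate how large the KV cache gets for a given context length. The sketch below uses the standard formula for a transformer decoder (keys plus values, per layer, per head). `kv_cache_bytes` is a hypothetical helper, and how local_llm_kit actually accounts for cache memory is not specified here, so treat the numbers as a rough guide.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Size of the key and value tensors cached across all layers."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128), 16-bit cache.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache at 4096 tokens: {size / 1024**2:.0f} MiB")  # prints 2048 MiB
```

Under these assumptions the cache grows by 0.5 MiB per token, so a 1024 MB limit would hold about 2048 tokens of context for a single sequence.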

Batch Processing

Process multiple prompts efficiently with batching:

from local_llm_kit import LLMClient
import concurrent.futures

client = LLMClient(
    model="llama2",
    max_batch_size=32  # Set based on GPU memory
)

prompts = [
    "Write a poem about mountains.",
    "Explain quantum physics.",
    "What is the capital of France?",
    # ... more prompts
]

# Option 1: Built-in batching
responses = client.batch_generate(
    prompts=prompts,
    max_tokens=100
)

# Option 2: Manual parallelization with threads
# (assumes the client can be shared safely across threads)
def process_prompt(prompt):
    return client.chat.completions.create(
        model="llama2",
        messages=[{"role": "user", "content": prompt}]
    )

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_prompt, prompts))
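
When there are more prompts than fit in one batch, the list can be split into batches no larger than `max_batch_size` before submission. Whether `batch_generate` already does this internally is not stated, so the sketch below assumes it does not. `chunked` is a hypothetical helper written in plain Python.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = [f"Prompt {i}" for i in range(70)]
for batch in chunked(prompts, batch_size=32):
    # In real use, each batch could be passed to client.batch_generate(prompts=batch, ...)
    print(len(batch))
```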

GPU Optimizations

Utilize Tensor Parallelism

For multi-GPU setups, distribute the model across GPUs:

from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2-70b",
    tensor_parallel_size=4,  # Use 4 GPUs
    device="cuda"  # Automatically distribute across available GPUs
)
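
Before picking a `tensor_parallel_size`, it is worth checking that each GPU's share of the weights fits in its VRAM. The sketch below assumes the weights split evenly across GPUs and ignores the KV cache, activations, and communication buffers. `per_gpu_weight_gb` is a hypothetical helper, not a local_llm_kit function.

```python
def per_gpu_weight_gb(params_billions: float, bits_per_param: int,
                      tensor_parallel_size: int) -> float:
    """Approximate weight memory each GPU holds when the model is split evenly."""
    total_gb = params_billions * bits_per_param / 8  # billions of params * bytes each = GB
    return total_gb / tensor_parallel_size


share = per_gpu_weight_gb(70, 16, tensor_parallel_size=4)
print(f"~{share:.1f} GB of weights per GPU")
```

For a 70B model in 16-bit precision on 4 GPUs this comes to about 35 GB per GPU, so 24GB cards would also need quantization or more GPUs.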

Flash Attention

Enable flash attention for faster computation:

from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    backend="transformers",
    use_flash_attention=True
)

Mixed Precision

Use FP16 or BFloat16 for faster computation:

from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    backend="transformers",
    dtype="bfloat16"  # Or "float16" based on GPU support
)
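
The two 16-bit formats trade off differently: float16 has more precision but a small range (largest finite value 65504), while bfloat16 keeps the range of float32 with fewer mantissa bits. The sketch below illustrates this with the standard library only. It emulates bfloat16 by truncating a float32 encoding, which is a simplification (hardware typically rounds to nearest), and both helpers are hypothetical.

```python
import struct


def fits_float16(value: float) -> bool:
    """True if value can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except (OverflowError, struct.error):
        return False


def to_bfloat16(value: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding (truncation)."""
    top_two_bytes = struct.pack(">f", value)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


print(fits_float16(65504.0))   # largest finite float16 value
print(fits_float16(70000.0))   # outside the float16 range
print(to_bfloat16(70000.0))    # representable in bfloat16, with reduced precision
```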

Streaming Optimization

For streaming responses, optimize chunk size:

from local_llm_kit import LLMClient

client = LLMClient(model="llama2")

# Balance between latency and throughput with chunk size
for chunk in client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,
    chunk_token_size=16  # Smaller for lower latency, larger for better throughput
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
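
To judge whether a chunk size improves perceived latency, measure the time to the first chunk separately from the total time. The sketch below works on any iterable of chunks. `timed_stream` is a hypothetical helper, and `fake_stream` is a stand-in for a real streaming response so that the example runs without a model.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def timed_stream(chunks: Iterable[str]) -> Tuple[List[str], float, float]:
    """Consume a stream; return (chunks, seconds to first chunk, total seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    received = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        received.append(chunk)
    total = time.perf_counter() - start
    # An empty stream has no first chunk; report the total time instead.
    return received, (first_chunk_at if first_chunk_at is not None else total), total


def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming response; sleeps to simulate generation."""
    for word in ("Once", " upon", " a", " time"):
        time.sleep(0.01)
        yield word


chunks, ttft, total = timed_stream(fake_stream())
print(f"{len(chunks)} chunks, first after {ttft:.3f}s, total {total:.3f}s")
```

Smaller chunks should lower the first-chunk time, while the total time shows what that costs in throughput.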

Performance Benchmarking

Measure generation throughput so that configurations can be compared:

import time
from local_llm_kit import LLMClient

client = LLMClient(model="llama2")

prompt = "Explain the theory of relativity in simple terms."

# Warmup
client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=10
)

# Benchmark (perf_counter is a monotonic, high-resolution clock meant for timing)
start_time = time.perf_counter()
response = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=100
)
end_time = time.perf_counter()

# Calculate metrics
generation_time = end_time - start_time
output_tokens = response.usage.completion_tokens
tokens_per_second = output_tokens / generation_time

print(f"Generation time: {generation_time:.2f}s")
print(f"Output tokens: {output_tokens}")
print(f"Tokens per second: {tokens_per_second:.2f}")
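
A single timed run can be noisy, so repeating the measurement and reporting the median gives a steadier number. The sketch below is one way to do that. `benchmark` is a hypothetical helper, and `fake_generate` stands in for a real model call so the example runs on its own. In practice the callable would wrap `client.chat.completions.create(...)` and return `response.usage.completion_tokens`.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(generate: Callable[[], int], runs: int = 5, warmup: int = 1) -> Dict[str, float]:
    """Time generate() several times; generate must return the number of output tokens."""
    for _ in range(warmup):
        generate()  # Warmup runs are not timed
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)
    return {"median_tps": statistics.median(rates), "min_tps": min(rates), "max_tps": max(rates)}


def fake_generate() -> int:
    """Stand-in for a model call: pretends to produce 100 tokens."""
    time.sleep(0.01)
    return 100


print(benchmark(fake_generate, runs=3))
```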

Common Performance Issues

Out of Memory: Use a smaller model, enable quantization, or move to a GPU with more VRAM
Slow Inference: Try mixed precision, flash attention, or a faster backend
High CPU Usage: Limit the thread count or switch to GPU inference
Batch Processing Bottlenecks: Tune the batch size or use asynchronous processing

Advanced Configuration

For production deployments:

from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",

    # Memory optimization
    max_memory_mapping={
        0: "24GiB",  # GPU 0: up to 24 GiB
        1: "24GiB"   # GPU 1: up to 24 GiB
    },

    # Computation optimization
    compute_dtype="bfloat16",
    use_flash_attention=True,

    # Cache settings
    disk_cache_config={
        "enable": True,
        "cache_dir": "/path/to/cache",
        "max_size_gb": 100
    },

    # Thread and batch settings
    num_cpu_threads=8,
    max_batch_size=16
)