Performance Optimization
=====================

This guide provides tips and techniques for optimizing the performance of ``local_llm_kit``.

Hardware Considerations
--------------------

GPU Selection
~~~~~~~~~~

The choice of GPU significantly impacts inference performance:

- **NVIDIA RTX 30/40 Series**: Excellent performance with consumer GPUs
- **NVIDIA A100/H100**: Enterprise-grade performance for production deployments
- **AMD GPUs**: Works with ROCm backend (limited support)

Recommended VRAM for different model sizes:

+-------------+----------------------+
| Model Size  | Recommended VRAM     |
+=============+======================+
| 7B          | 8GB+ (16GB optimal)  |
+-------------+----------------------+
| 13B         | 16GB+ (24GB optimal) |
+-------------+----------------------+
| 30B+        | 24GB+ (80GB optimal) |
+-------------+----------------------+

CPU Optimization
~~~~~~~~~~~~

For CPU inference:

- Use CPUs with AVX2/AVX512 instruction sets
- Allocate at least 16GB of RAM for medium-sized models
- Set appropriate thread counts based on CPU cores

Memory Optimization
----------------

Quantization
~~~~~~~~~

Quantizing models significantly reduces memory usage:

.. code-block:: python

   from local_llm_kit import LLMClient
   
   # Using GPTQ quantized model with Transformers backend
   client = LLMClient(
       model="llama2-7b-4bit",
       backend="transformers",
       model_path="TheBloke/Llama-2-7B-Chat-GPTQ",
       quantization_config={
           "bits": 4,
           "group_size": 128
       }
   )
   
   # Using GGUF quantized model with llama.cpp backend
   client = LLMClient(
       model="llama2-7b-q4_k_m",
       backend="llama.cpp",
       model_path="/path/to/llama-2-7b-chat.q4_k_m.gguf"
   )

Efficient KV Cache Management
~~~~~~~~~~~~~~~~~~~~~~~~~

To optimize the key-value cache:

.. code-block:: python

   client = LLMClient(
       model="llama2",
       kv_cache_config={
           "max_cache_size_mb": 1024,   # Maximum KV cache size in MB
           "enable_cache_cleaning": True  # Automatically clear old entries
       }
   )

   # For long-running applications, periodically clear the cache
   client.clear_kv_cache()

Batch Processing
-------------

Process multiple prompts efficiently with batching:

.. code-block:: python

   from local_llm_kit import LLMClient
   import concurrent.futures
   
   client = LLMClient(
       model="llama2",
       max_batch_size=32  # Set based on GPU memory
   )
   
   prompts = [
       "Write a poem about mountains.",
       "Explain quantum physics.",
       "What is the capital of France?",
       # ... more prompts
   ]
   
   # Option 1: Built-in batching
   responses = client.batch_generate(
       prompts=prompts,
       max_tokens=100
   )
   
   # Option 2: Manual parallelization with threading
   def process_prompt(prompt):
       return client.chat.completions.create(
           model="llama2",
           messages=[{"role": "user", "content": prompt}]
       )
   
   with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
       results = list(executor.map(process_prompt, prompts))

GPU Optimizations
--------------

Utilize Tensor Parallelism
~~~~~~~~~~~~~~~~~~~~~~~

For multi-GPU setups, distribute model across GPUs:

.. code-block:: python

   client = LLMClient(
       model="llama2-70b",
       tensor_parallel_size=4,  # Use 4 GPUs
       device="cuda"  # Automatically distribute across available GPUs
   )

Flash Attention
~~~~~~~~~~~

Enable flash attention for faster computation:

.. code-block:: python

   client = LLMClient(
       model="llama2",
       backend="transformers",
       use_flash_attention=True
   )

Mixed Precision
~~~~~~~~~~~~

Use FP16 or BFloat16 for faster computation:

.. code-block:: python

   client = LLMClient(
       model="llama2",
       backend="transformers",
       dtype="bfloat16"  # Or "float16" based on GPU support
   )

Streaming Optimization
------------------

For streaming responses, optimize chunk size:

.. code-block:: python

   # Balance between latency and throughput with chunk size
   for chunk in client.chat.completions.create(
       model="llama2",
       messages=[{"role": "user", "content": "Write a story"}],
       stream=True,
       chunk_token_size=16  # Smaller for lower latency, larger for better throughput
   ):
       print(chunk.choices[0].delta.content or "", end="", flush=True)

Performance Benchmarking
--------------------

Measure and optimize performance:

.. code-block:: python

   import time
   from local_llm_kit import LLMClient
   
   client = LLMClient(model="llama2")
   
   prompt = "Explain the theory of relativity in simple terms."
   
   # Warmup
   client.chat.completions.create(
       model="llama2",
       messages=[{"role": "user", "content": "Hello"}],
       max_tokens=10
   )
   
   # Benchmark
   start_time = time.time()
   response = client.chat.completions.create(
       model="llama2",
       messages=[{"role": "user", "content": prompt}],
       max_tokens=100
   )
   end_time = time.time()
   
   # Calculate metrics
   generation_time = end_time - start_time
   output_tokens = response.usage.completion_tokens
   tokens_per_second = output_tokens / generation_time
   
   print(f"Generation time: {generation_time:.2f}s")
   print(f"Output tokens: {output_tokens}")
   print(f"Tokens per second: {tokens_per_second:.2f}")

Common Performance Issues
---------------------

1. **Out of Memory**: Reduce model size, enable quantization, or increase VRAM
2. **Slow Inference**: Try mixed precision, flash attention, or a faster backend
3. **High CPU Usage**: Limit thread count or switch to GPU inference
4. **Batch Processing Bottlenecks**: Tune batch size, use async processing

Advanced Configuration
------------------

For production deployments:

.. code-block:: python

   client = LLMClient(
       model="llama2",
       
       # Memory optimization
       max_memory_mapping={
           0: "24GiB",  # GPU 0: 24GB
           1: "24GiB"   # GPU 1: 24GB
       },
       
       # Computation optimization
       compute_dtype="bfloat16",
       use_flash_attention=True,
       
       # Cache settings
       disk_cache_config={
           "enable": True,
           "cache_dir": "/path/to/cache",
           "max_size_gb": 100
       },
       
       # Thread and batch settings
       num_cpu_threads=8,
       max_batch_size=16
   )