Supported Models
================

This page documents the models supported by ``local_llm_kit`` and their configurations.

Model Backends
--------------

Transformers Backend
~~~~~~~~~~~~~~~~~~~~

The Transformers backend supports models from Hugging Face's Transformers library.

Supported Model Types:

- LLaMA and LLaMA-2
- Mistral
- Falcon
- MPT
- GPTQ quantized models

Configuration:

.. code-block:: python

    client = LLMClient(
        model="llama2",
        backend="transformers",
        model_path="meta-llama/Llama-2-7b-chat-hf",
        device="cuda",  # or "cpu"
        dtype="float16",  # or "float32", "bfloat16"
        trust_remote_code=True
    )

llama.cpp Backend
~~~~~~~~~~~~~~~~~

The llama.cpp backend supports GGUF format models.

Supported Features:

- 4-bit, 5-bit, and 8-bit quantization
- GPU acceleration
- Metal support on macOS
- Efficient CPU inference

Configuration:

.. code-block:: python

    client = LLMClient(
        model="llama2",
        backend="llama.cpp",
        model_path="/path/to/model.gguf",
        n_gpu_layers=32,  # Number of layers to offload to GPU
        n_ctx=2048,  # Context window size
        n_batch=512  # Batch size for prompt processing
    )

Model Configuration
-------------------

Common Parameters
~~~~~~~~~~~~~~~~~

These parameters work with all model backends:

.. code-block:: python

    client = LLMClient(
        model="llama2",
        temperature=0.7,  # Randomness in generation (0.0 to 1.0)
        top_p=0.9,  # Nucleus sampling parameter
        top_k=40,  # Top-k sampling parameter
        repetition_penalty=1.1,  # Penalty for repeating tokens
        max_tokens=100,  # Maximum tokens to generate
    )

Memory Requirements
~~~~~~~~~~~~~~~~~~~

Approximate memory requirements for different model sizes:

+-------------+------------------+------------------+
| Model Size  | FP16 (GPU)       | 4-bit Quantized  |
+=============+==================+==================+
| 7B          | ~14 GB           | ~4 GB            |
+-------------+------------------+------------------+
| 13B         | ~26 GB           | ~7 GB            |
+-------------+------------------+------------------+
| 70B         | ~140 GB          | ~35 GB           |
+-------------+------------------+------------------+

Performance Tips
----------------

GPU Acceleration
~~~~~~~~~~~~~~~~

For optimal GPU performance:

1. Use CUDA devices when available
2. Enable flash attention if supported
3. Use appropriate batch sizes
4. Monitor GPU memory usage

.. code-block:: python

    client = LLMClient(
        model="llama2",
        device="cuda",
        use_flash_attention=True,
        max_batch_size=32
    )

CPU Optimization
~~~~~~~~~~~~~~~~

For CPU inference:

1. Use quantized models
2. Set appropriate thread count
3. Enable CPU optimizations

.. code-block:: python

    client = LLMClient(
        model="llama2",
        device="cpu",
        threads=8,
        use_mmap=True,
        use_avx2=True
    )

Model Selection Guide
---------------------

Choosing the right model depends on your use case:

1. Resource-Constrained Environments

   - Use 4-bit quantized 7B models
   - Consider CPU-optimized models
   - Reduce context length if possible

2. High-Performance Requirements

   - Use larger models (13B+)
   - Enable GPU acceleration
   - Optimize batch processing

3. Balanced Setup

   - Use 7B models with 8-bit quantization
   - Balance GPU/CPU usage
   - Adjust parameters based on workload

Custom Model Integration
------------------------

You can integrate custom models by:

1. Converting to GGUF format for llama.cpp
2. Using Hugging Face's model format
3. Implementing custom tokenizers

Example:

.. code-block:: python

    from local_llm_kit import LLMClient, CustomTokenizer

    # Custom tokenizer implementation
    class MyTokenizer(CustomTokenizer):
        def encode(self, text):
            # Implementation
            pass

        def decode(self, tokens):
            # Implementation
            pass

    # Use custom model
    client = LLMClient(
        model="custom",
        tokenizer=MyTokenizer(),
        model_path="/path/to/custom/model"
    )

Troubleshooting
---------------

Common Issues:

1. Out of Memory

   - Reduce batch size
   - Use quantization
   - Decrease context length

2. Slow Performance

   - Check device utilization
   - Optimize model parameters
   - Consider model quantization

3. Model Loading Errors

   - Verify model path
   - Check format compatibility
   - Ensure sufficient resources
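
When working through an out-of-memory error, it can help to estimate how much memory the model weights alone will need before loading anything. The figures in the memory requirements table are consistent with a simple rule of thumb: parameter count multiplied by bytes per weight. The sketch below illustrates that arithmetic. The helper ``estimate_weight_memory_gb`` is hypothetical and not part of ``local_llm_kit``, and it counts weights only, so the KV cache, activations, and runtime overhead come on top of the result.

.. code-block:: python

    def estimate_weight_memory_gb(num_params_billions, bits_per_weight):
        """Rough memory for model weights only, in decimal gigabytes.

        Hypothetical helper for illustration; not a local_llm_kit API.
        """
        bytes_per_weight = bits_per_weight / 8
        # (billions of params * 1e9) * bytes per weight / 1e9 bytes per GB:
        # the two factors of 1e9 cancel.
        return num_params_billions * bytes_per_weight


    for size in (7, 13, 70):
        fp16 = estimate_weight_memory_gb(size, 16)
        q4 = estimate_weight_memory_gb(size, 4)
        print(f"{size}B: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")

If the estimate already exceeds the available GPU or system memory, a smaller model or a lower-bit quantization is likely a better starting point than tuning batch size or context length.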