Supported Models
This page documents the models supported by local_llm_kit and their configurations.
Model Backends
Transformers Backend
The Transformers backend supports models from Hugging Face’s Transformers library.
Supported Model Types:
LLaMA and LLaMA-2
Mistral
Falcon
MPT
GPTQ quantized models
Configuration:
from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    backend="transformers",
    model_path="meta-llama/Llama-2-7b-chat-hf",
    device="cuda",  # or "cpu"
    dtype="float16",  # or "float32", "bfloat16"
    trust_remote_code=True,
)
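One way to choose `device` and `dtype` at startup is to derive them from what the machine offers. The following is a minimal sketch under stated assumptions: `pick_transformers_settings` is a hypothetical helper, not part of local_llm_kit, and its policy (half precision on GPU, full precision on CPU) is a common convention rather than a documented default of the library.

```python
def pick_transformers_settings(cuda_available: bool, bf16_supported: bool = False) -> dict:
    """Return device/dtype keyword arguments for the Transformers backend.

    Hypothetical helper for illustration; the policy below is an assumption.
    """
    if not cuda_available:
        # Half precision is often slow or unsupported on CPU, so use float32.
        return {"device": "cpu", "dtype": "float32"}
    # On GPU, prefer bfloat16 when the hardware supports it, else float16.
    return {"device": "cuda", "dtype": "bfloat16" if bf16_supported else "float16"}


settings = pick_transformers_settings(cuda_available=True)
print(settings)
```

With PyTorch installed, the two flags could be taken from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the resulting dictionary unpacked into the constructor with `**settings`.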
llama.cpp Backend
The llama.cpp backend supports GGUF format models.
Supported Features:
4-bit, 5-bit, and 8-bit quantization
GPU acceleration
Metal support on macOS
Efficient CPU inference
Configuration:
client = LLMClient(
    model="llama2",
    backend="llama.cpp",
    model_path="/path/to/model.gguf",
    n_gpu_layers=32,  # Number of layers to offload to GPU
    n_ctx=2048,  # Context window size
    n_batch=512,  # Batch size for prompt processing
)
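The context window set by `n_ctx` has to hold the prompt and the generated tokens together, so a long prompt leaves less room for output. A minimal sketch of that arithmetic follows; `generation_budget` is a hypothetical helper, not a local_llm_kit function.

```python
def generation_budget(n_ctx: int, prompt_tokens: int, max_tokens: int) -> int:
    """Return how many tokens can be generated without overflowing the context window."""
    if prompt_tokens >= n_ctx:
        raise ValueError("prompt alone fills or exceeds the context window")
    # Whatever the prompt does not use is available for generation.
    return min(max_tokens, n_ctx - prompt_tokens)


print(generation_budget(n_ctx=2048, prompt_tokens=1900, max_tokens=512))  # 148
```

If the budget is too small for the task, either shorten the prompt or raise `n_ctx`, keeping in mind that a larger context window uses more memory.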
Model Configuration
Common Parameters
These parameters work with all model backends:
client = LLMClient(
    model="llama2",
    temperature=0.7,  # Randomness in generation (0.0 to 1.0)
    top_p=0.9,  # Nucleus sampling parameter
    top_k=40,  # Top-k sampling parameter
    repetition_penalty=1.1,  # Penalty for repeating tokens
    max_tokens=100,  # Maximum tokens to generate
)
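To make the sampling parameters concrete, here is a small pure-Python sketch of how temperature, top-k, and top-p are commonly combined when picking the next token. It illustrates the general technique only; the order of the filters and the function name `sample_token` are assumptions, and the backends may implement these steps differently.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=random):
    """Pick a token index from raw logits using temperature, top-k, then top-p."""
    if temperature <= 0:
        # Treat a zero temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    # Top-k: keep only the k most likely tokens (0 or less disables the filter).
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample among the survivors; choices() renormalizes the weights.
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept])[0]


print(sample_token([2.0, 1.0, 0.1], rng=random.Random(0)))
```

Lower `temperature`, smaller `top_k`, or smaller `top_p` all narrow the set of candidate tokens and make output more deterministic.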
Memory Requirements
Approximate memory requirements for different model sizes:
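As a rough rule of thumb, the memory needed for the weights alone is the parameter count multiplied by the bytes stored per parameter. The sketch below applies that arithmetic; `weight_memory_gb` is a hypothetical helper, and the result excludes the KV cache, activations, and runtime overhead, so actual usage will be higher. Quantized formats also store scaling metadata, which adds a little to the nominal bit width.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Estimates only; these are computed from the formula, not measured.
for size in (7, 13):
    for bits in (16, 8, 4):
        print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```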
Performance Tips
GPU Acceleration
For optimal GPU performance:
Use CUDA devices when available
Enable flash attention if supported
Use appropriate batch sizes
Monitor GPU memory usage
client = LLMClient(
    model="llama2",
    device="cuda",
    use_flash_attention=True,
    max_batch_size=32,
)
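For monitoring GPU memory, one option on NVIDIA hardware is to poll `nvidia-smi` and parse its CSV output. This is a sketch, not a feature of local_llm_kit: the function names are hypothetical, and the sample text is illustrative rather than a real measurement.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' MiB pairs, one line per GPU."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory() -> list[tuple[int, int]]:
    """Ask nvidia-smi for per-GPU memory usage (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


# Illustrative sample text, not a real measurement.
sample = "10240, 24576\n512, 24576\n"
for used, total in parse_gpu_memory(sample):
    print(f"{used}/{total} MiB ({used / total:.0%})")
```

Calling `query_gpu_memory()` before and after loading a model gives a quick view of how much memory the model and its batch size consume.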
CPU Optimization
For CPU inference:
Use quantized models
Set appropriate thread count
Enable CPU optimizations
client = LLMClient(
    model="llama2",
    device="cpu",
    threads=8,
    use_mmap=True,
    use_avx2=True,
)
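A starting value for `threads` can be derived from the number of CPUs the machine reports. The heuristic below, which leaves one logical CPU free for the rest of the system, is an assumption for illustration; the best value depends on the workload and on physical versus logical core counts, so it is worth measuring.

```python
import os


def default_thread_count(reserve: int = 1) -> int:
    """Suggest a thread count, leaving `reserve` logical CPUs free (hypothetical heuristic)."""
    logical = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical - reserve)


print(default_thread_count())
```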
Model Selection Guide
Choosing the right model depends on your use case:
Resource-Constrained Environments:
- Use 4-bit quantized 7B models
- Consider CPU-optimized models
- Reduce context length if possible
High-Performance Requirements:
- Use larger models (13B+)
- Enable GPU acceleration
- Optimize batch processing
Balanced Setup:
- Use 7B models with 8-bit quantization
- Balance GPU/CPU usage
- Adjust parameters based on workload
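The guide above can be turned into a simple rule that picks the largest setup whose estimated weights fit a memory budget. This is only a sketch: `choose_setup` is hypothetical, reading "13B+" as 13B at 16-bit is an assumption, and the 20% headroom factor is an arbitrary illustrative margin for the KV cache and runtime overhead.

```python
# Candidate setups from the guide, largest first: (label, billions of params, bits per weight).
# Treating "13B+" as 13B at 16-bit is an assumption made for this sketch.
CANDIDATES = [
    ("13B, 16-bit", 13, 16),
    ("7B, 8-bit", 7, 8),
    ("7B, 4-bit", 7, 4),
]


def choose_setup(memory_budget_gb: float, headroom: float = 1.2):
    """Return the first candidate whose estimated weights fit the budget, else None."""
    for label, params_b, bits in CANDIDATES:
        weights_gb = params_b * bits / 8  # billions of params * bytes per param
        if weights_gb * headroom <= memory_budget_gb:
            return label
    return None


for budget in (6, 12, 40):
    print(budget, "GB ->", choose_setup(budget))
```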
Custom Model Integration
You can integrate custom models by:
Converting to GGUF format for llama.cpp
Using Hugging Face’s model format
Implementing custom tokenizers
Example:
from local_llm_kit import LLMClient, CustomTokenizer


# Custom tokenizer implementation
class MyTokenizer(CustomTokenizer):
    def encode(self, text):
        # Implementation
        pass

    def decode(self, tokens):
        # Implementation
        pass


# Use custom model
client = LLMClient(
    model="custom",
    tokenizer=MyTokenizer(),
    model_path="/path/to/custom/model",
)
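To show what filled-in `encode` and `decode` methods might look like, here is a toy byte-level tokenizer. It is written as a standalone class so it runs without the library; in real use it would subclass `CustomTokenizer`. The return types (a list of integer ids from `encode`, a string from `decode`) are assumptions, since the interface is not spelled out here, and a real custom model needs a tokenizer that matches its own vocabulary.

```python
class ByteTokenizer:
    """Toy tokenizer: one token per UTF-8 byte (vocabulary of 256 ids)."""

    def encode(self, text: str) -> list[int]:
        # Each byte of the UTF-8 encoding becomes one token id in 0..255.
        return list(text.encode("utf-8"))

    def decode(self, tokens: list[int]) -> str:
        # errors="replace" keeps decoding robust if a sequence is cut mid-character.
        return bytes(tokens).decode("utf-8", errors="replace")


tok = ByteTokenizer()
ids = tok.encode("héllo")
print(ids, tok.decode(ids))
```

Whatever the tokenizer, a useful sanity check is that `decode(encode(text))` returns the original text.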
Troubleshooting
Common Issues:
Out of Memory:
- Reduce batch size
- Use quantization
- Decrease context length
Slow Performance:
- Check device utilization
- Optimize model parameters
- Consider model quantization
Model Loading Errors:
- Verify model path
- Check format compatibility
- Ensure sufficient resources
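For model loading errors with the llama.cpp backend, the path and format checks can be scripted. GGUF files begin with the four ASCII bytes `GGUF`, so a quick look at the file header catches many mismatches. The helper `check_model_file` below is hypothetical and only inspects the header; it does not validate the rest of the file.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four ASCII bytes


def check_model_file(path: str) -> str:
    """Return a short diagnosis for a llama.cpp model path."""
    p = Path(path)
    if not p.is_file():
        return "missing: path does not point to a file"
    with p.open("rb") as f:
        magic = f.read(4)
    if magic != GGUF_MAGIC:
        return "wrong format: file does not start with the GGUF magic bytes"
    return "ok: looks like a GGUF file"


print(check_model_file("/path/to/model.gguf"))
```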