Supported Models

This page documents the models supported by local_llm_kit and their configurations.

Model Backends

Transformers Backend

The Transformers backend supports models from Hugging Face’s Transformers library.

Supported Model Types:

LLaMA and LLaMA-2
Mistral
Falcon
MPT
GPTQ-quantized models

Configuration:

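from local_llm_kit import LLMClient

client = LLMClient(
    model="llama2",
    backend="transformers",
    model_path="meta-llama/Llama-2-7b-chat-hf",
    device="cuda",  # or "cpu"
    dtype="float16",  # or "float32", "bfloat16"
    trust_remote_code=True  # Runs code shipped in the model repo; enable only for trusted sources
)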

llama.cpp Backend

The llama.cpp backend supports GGUF format models.

Supported Features:

4-bit, 5-bit, and 8-bit quantization
GPU acceleration
Metal support on macOS
Efficient CPU inference

Configuration:

client = LLMClient(
    model="llama2",
    backend="llama.cpp",
    model_path="/path/to/model.gguf",
    n_gpu_layers=32,  # Number of layers to offload to GPU
    n_ctx=2048,  # Context window size
    n_batch=512  # Batch size for prompt processing
)

Model Configuration

Common Parameters

These parameters work with all model backends:

client = LLMClient(
    model="llama2",
    temperature=0.7,  # Randomness in generation (0.0 to 1.0)
    top_p=0.9,  # Nucleus sampling parameter
    top_k=40,  # Top-k sampling parameter
    repetition_penalty=1.1,  # Penalty for repeating tokens
    max_tokens=100,  # Maximum tokens to generate
)
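
To make these parameters concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are commonly applied to a model's raw scores (logits) when choosing the next token. It illustrates the general technique and is not local_llm_kit's internal code: the function name sample_next_token and the toy logits are hypothetical, repetition penalty is left out, and the exact order of the filtering steps varies between implementations.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9, rng=None):
    """Pick a token index from raw logits using temperature, top-k, and top-p."""
    rng = rng or random.Random()
    if temperature <= 0.0:
        # Treat zero temperature as greedy decoding: always take the best token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample from the surviving tokens; random.choices normalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [1.0, 5.0, 2.0]  # hypothetical scores for a three-token vocabulary
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

Lower temperature, smaller top_k, and smaller top_p all narrow the set of likely tokens, which makes output more deterministic at the cost of variety.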

Memory Requirements

Approximate memory requirements for different model sizes:
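
As a rough rule of thumb, the weights alone need about (number of parameters) × (bits per weight) / 8 bytes. The sketch below applies that arithmetic to a few common sizes and precisions. The helper name estimate_weight_memory_gib is hypothetical and not part of local_llm_kit, and the result is a lower bound: the KV cache, activations, and runtime overhead add to it, and quantized formats usually store extra scaling data, so their effective bits per weight is somewhat above the nominal figure.

```python
def estimate_weight_memory_gib(n_params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


for size in (7, 13):
    for label, bits in (("float16", 16), ("8-bit", 8), ("4-bit", 4)):
        gib = estimate_weight_memory_gib(size, bits)
        print(f"{size}B {label}: ~{gib:.1f} GiB")
```

By this estimate a 7B model needs roughly 13 GiB for its weights at float16 and a little over 3 GiB at 4-bit, which is why quantization matters so much on consumer hardware.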

Performance Tips

GPU Acceleration

For optimal GPU performance:

Use CUDA devices when available
Enable flash attention if supported
Choose a batch size that fits in GPU memory
Monitor GPU memory usage

client = LLMClient(
    model="llama2",
    device="cuda",
    use_flash_attention=True,
    max_batch_size=32
)
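
For the "monitor GPU memory usage" tip, one option on NVIDIA hardware is to poll the nvidia-smi command-line tool. The sketch below is independent of local_llm_kit and assumes nvidia-smi is on the PATH; the function names are hypothetical. It returns an empty list when the tool is missing or its output cannot be parsed.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB rows (one per GPU) into a list of tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] if the query is not possible."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        )
        return parse_gpu_memory(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        return []


for index, (used, total) in enumerate(query_gpu_memory()):
    percent = 100 * used / total if total else 0
    print(f"GPU {index}: {used} / {total} MiB ({percent:.0f}% used)")
```

Running this before and after loading a model shows how much headroom is left for larger batches or a longer context.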

CPU Optimization

For CPU inference:

Use quantized models
Set appropriate thread count
Enable CPU optimizations such as memory-mapped loading and AVX2

client = LLMClient(
    model="llama2",
    device="cpu",
    threads=8,
    use_mmap=True,
    use_avx2=True
)
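
Rather than hard-coding the thread count, it can be derived from the machine at startup. This is a small sketch using only the standard library; suggest_thread_count is a hypothetical helper, and reserving one CPU for the rest of the system is an assumption, not a library recommendation. Note that os.cpu_count() reports logical CPUs, and on machines with hyper-threading the physical core count is often the better starting point.

```python
import os


def suggest_thread_count(reserve=1, logical_cpus=None):
    """Suggest a thread count, leaving `reserve` CPUs for the rest of the system."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical_cpus - reserve)


print(f"Suggested threads: {suggest_thread_count()}")
```

The result could then be passed as the threads value in the configuration above.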

Model Selection Guide

Choosing the right model depends on your use case:

Resource-Constrained Environments
    - Use 4-bit quantized 7B models
    - Consider CPU-optimized models
    - Reduce context length if possible
High-Performance Requirements
    - Use larger models (13B+)
    - Enable GPU acceleration
    - Optimize batch processing
Balanced Setup
    - Use 7B models with 8-bit quantization
    - Balance GPU/CPU usage
    - Adjust parameters based on workload
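
The guide above can be turned into a simple memory-based check. The sketch below walks a list of candidate setups from most to least demanding and returns the first one whose estimated footprint fits. Everything in it is illustrative: the candidate list, the 1.2 headroom factor, and the name pick_setup are assumptions, and the footprint uses the rough parameters × bits / 8 estimate for the weights.

```python
# Candidate setups, most demanding first: (label, billions of parameters, bits per weight).
CANDIDATES = [
    ("13B, 8-bit", 13, 8),
    ("7B, 8-bit", 7, 8),
    ("7B, 4-bit", 7, 4),
]

HEADROOM = 1.2  # assumed allowance for KV cache and runtime overhead


def pick_setup(available_gib):
    """Return the most demanding candidate whose estimated footprint fits, or None."""
    for label, params_billion, bits in CANDIDATES:
        weights_gib = params_billion * 1e9 * bits / 8 / 2**30
        if weights_gib * HEADROOM <= available_gib:
            return label
    return None


for memory_gib in (4, 8, 16):
    print(f"{memory_gib} GiB available -> {pick_setup(memory_gib)}")
```

A check like this is only a starting point; the actual workload and context length should drive the final choice.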

Custom Model Integration

You can integrate custom models by:

Converting to GGUF format for llama.cpp
Using Hugging Face’s model format
Implementing custom tokenizers

Example:

from local_llm_kit import LLMClient, CustomTokenizer

# Custom tokenizer implementation
class MyTokenizer(CustomTokenizer):
    def encode(self, text):
        # Implementation
        pass

    def decode(self, tokens):
        # Implementation
        pass

# Use custom model
client = LLMClient(
    model="custom",
    tokenizer=MyTokenizer(),
    model_path="/path/to/custom/model"
)
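
As an illustration of what the encode and decode stubs might contain, here is a toy byte-level tokenizer that maps each UTF-8 byte to one token ID. It is written as a standalone class so it runs without local_llm_kit installed; in practice it would subclass CustomTokenizer as shown above. That the library expects encode to return a list of integers is an assumption.

```python
class ByteTokenizer:
    """Toy tokenizer: one token ID per UTF-8 byte (vocabulary size 256)."""

    def encode(self, text):
        # str -> list of integer token IDs in the range 0..255
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        # list of token IDs -> str; invalid byte sequences become U+FFFD
        return bytes(tokens).decode("utf-8", errors="replace")


tokenizer = ByteTokenizer()
ids = tokenizer.encode("héllo")
print(ids)
print(tokenizer.decode(ids))
```

A useful sanity check for any custom tokenizer is the round trip: decode(encode(text)) should return the original text.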

Troubleshooting

Common Issues:

Out of Memory
    - Reduce batch size
    - Use quantization
    - Decrease context length
Slow Performance
    - Check device utilization
    - Optimize model parameters
    - Consider model quantization
Model Loading Errors
    - Verify model path
    - Check format compatibility
    - Ensure sufficient resources
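
For model loading errors, the path and format checks can be scripted before handing the path to the client. The sketch below relies on two facts about the formats themselves: GGUF files begin with the four bytes "GGUF", and Hugging Face checkpoints are directories containing a config.json. The function name check_model_path and its return strings are hypothetical, and the check only covers local paths, not Hugging Face Hub identifiers.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def check_model_path(model_path):
    """Return a short diagnostic string for a local model path."""
    path = Path(model_path)
    if not path.exists():
        return "missing: path does not exist"
    if path.is_dir():
        # Hugging Face checkpoints are directories containing config.json.
        if (path / "config.json").is_file():
            return "ok: directory with config.json"
        return "unknown: directory without config.json"
    with path.open("rb") as handle:
        magic = handle.read(4)
    if magic == GGUF_MAGIC:
        return "ok: GGUF file"
    return "unknown: file does not start with the GGUF magic bytes"


print(check_model_path("/path/to/model.gguf"))
```

A "missing" result points to a wrong path, while an "unknown" result suggests the file is in a format the chosen backend may not accept.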