Custom Models
=============

This guide shows how to extend ``local_llm_kit`` to support custom models.

Adding Support for Custom Models
--------------------------------

``local_llm_kit`` is designed to be extensible: you can plug in your own tokenizer, inference backend, and prompt formatter, as described in the sections below.

Custom Tokenizer
----------------

You can implement a custom tokenizer by creating a class that implements the necessary encoding and decoding methods:

.. code-block:: python

    from local_llm_kit import LLMClient
    from local_llm_kit.backends.base import BaseTokenizer


    class MyCustomTokenizer(BaseTokenizer):
        def __init__(self, model_path=None):
            super().__init__()
            # Initialize your tokenizer here.
            # This could use a pretrained tokenizer or your own implementation.
            self.model_path = model_path

        def encode(self, text):
            # Convert text to token IDs and return them as a list.
            raise NotImplementedError

        def decode(self, token_ids):
            # Convert token IDs back to text and return a string.
            raise NotImplementedError

        def get_vocab_size(self):
            # Return the vocabulary size of your tokenizer.
            raise NotImplementedError


    # Use your custom tokenizer
    client = LLMClient(
        model="custom-model",
        tokenizer=MyCustomTokenizer(model_path="/path/to/tokenizer"),
        model_path="/path/to/model/weights"
    )

Custom Backend
--------------

For more advanced customization, you can implement a custom backend:

.. code-block:: python

    from local_llm_kit import LLMClient
    from local_llm_kit.backends.base import BaseBackend
    from local_llm_kit.llm import LLM


    class MyCustomBackend(BaseBackend):
        def __init__(self, model_path, **kwargs):
            super().__init__()
            # Initialize your model here: load the weights or set up
            # your inference engine.
            self.model_path = model_path

        def generate(self, prompt, max_tokens=100, temperature=0.7, **kwargs):
            # Implement the generation logic for your model.
            # Return a string containing the generated text.
            raise NotImplementedError

        def get_prompt_tokens(self, prompt):
            # Return the number of tokens in the prompt.
            raise NotImplementedError

        def get_completion_tokens(self, completion):
            # Return the number of tokens in the completion.
            raise NotImplementedError


    # Register your custom backend
    LLM.register_backend("my-custom-backend", MyCustomBackend)

    # Use your custom backend
    client = LLMClient(
        model="custom-model",
        backend="my-custom-backend",
        model_path="/path/to/model/weights"
    )

Custom Prompt Formatting
------------------------

You can also define custom prompt templates for your models:

.. code-block:: python

    from local_llm_kit import LLMClient
    from local_llm_kit.prompt_formatting import register_prompt_formatter


    def my_custom_formatter(messages, add_generation_prompt=True):
        """
        Format chat messages for a custom model architecture.
        """
        formatted_prompt = ""

        for message in messages:
            role = message["role"]
            content = message["content"]

            if role == "system":
                formatted_prompt += f"<|system|>\n{content}\n"
            elif role == "user":
                formatted_prompt += f"<|user|>\n{content}\n"
            elif role == "assistant":
                formatted_prompt += f"<|assistant|>\n{content}\n"
            elif role == "function":
                formatted_prompt += f"<|function|>\n{content}\n"

        # Open an assistant turn so the model continues from there.
        if add_generation_prompt:
            formatted_prompt += "<|assistant|>\n"

        return formatted_prompt


    # Register your custom formatter
    register_prompt_formatter("my-custom-model", my_custom_formatter)

    # Use your custom formatter
    client = LLMClient(
        model="my-custom-model",
        # Other parameters...
    )

Example: Integrating with vLLM
------------------------------

Here's an example of integrating with the vLLM inference engine:

.. code-block:: python

    from local_llm_kit import LLMClient
    from local_llm_kit.backends.base import BaseBackend
    from local_llm_kit.llm import LLM


    class VLLMBackend(BaseBackend):
        def __init__(self, model_path, **kwargs):
            super().__init__()
            # Import vLLM lazily so it does not become a hard dependency.
            # The alias avoids confusion with local_llm_kit.llm.LLM.
            from vllm import LLM as VLLMEngine

            # Initialize the vLLM engine
            self.engine = VLLMEngine(
                model=model_path,
                tensor_parallel_size=kwargs.get("tensor_parallel_size", 1),
                gpu_memory_utilization=kwargs.get("gpu_memory_utilization", 0.9),
                # Other vLLM parameters...
            )

        def generate(self, prompt, max_tokens=100, temperature=0.7, **kwargs):
            from vllm import SamplingParams

            # Set up sampling parameters
            sampling_params = SamplingParams(
                temperature=temperature,
                max_tokens=max_tokens,
                top_p=kwargs.get("top_p", 1.0),
                # Other sampling parameters...
            )

            # Generate text with vLLM
            outputs = self.engine.generate(prompt, sampling_params)

            # Extract the text of the first completion for the first prompt
            generated_text = outputs[0].outputs[0].text
            return generated_text


    # Register the vLLM backend
    LLM.register_backend("vllm", VLLMBackend)

    # Use the vLLM backend
    client = LLMClient(
        model="llama2",
        backend="vllm",
        model_path="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=4  # For multi-GPU inference
    )
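
The tokenizer skeleton under "Custom Tokenizer" leaves ``encode``, ``decode``, and ``get_vocab_size`` empty. As a rough illustration of what filled-in methods might look like, the sketch below implements a toy byte-level tokenizer. The ``ByteTokenizer`` name is hypothetical, and the class is deliberately standalone so it runs without ``local_llm_kit`` installed; in a real integration it would presumably subclass ``BaseTokenizer`` as the skeleton does.

.. code-block:: python

    class ByteTokenizer:
        """Toy tokenizer (hypothetical): one token per UTF-8 byte."""

        def encode(self, text):
            # Each UTF-8 byte becomes one token ID in the range 0..255.
            return list(text.encode("utf-8"))

        def decode(self, token_ids):
            # Rebuild the byte string; invalid sequences become U+FFFD
            # instead of raising an exception.
            return bytes(token_ids).decode("utf-8", errors="replace")

        def get_vocab_size(self):
            # A single byte can take 256 distinct values.
            return 256


    tok = ByteTokenizer()
    ids = tok.encode("hi")
    print(ids)              # [104, 105]
    print(tok.decode(ids))  # hi

A round trip through ``encode`` and ``decode`` returning the original string is a useful first check for any tokenizer before wiring it into ``LLMClient``.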