Best Practices
==============

This guide outlines best practices for using ``local_llm_kit`` effectively and efficiently.

Model Selection
---------------

Choosing the Right Model
~~~~~~~~~~~~~~~~~~~~~~~~

1. Consider your requirements:

   * Task complexity
   * Response quality needs
   * Performance constraints
   * Resource availability

2. Match model size to available resources:

   * 7B models for most use cases
   * 13B+ models for higher quality
   * Quantized models when memory or compute is constrained

3. Consider specialization:

   * Chat models for dialogue
   * Code models for programming
   * Multilingual models for language tasks

Performance Optimization
------------------------

Memory Management
~~~~~~~~~~~~~~~~~

1. Token Management:

   .. code-block:: python

      from local_llm_kit import LLMClient

      client = LLMClient(model="llama2")
      client.enable_memory(max_tokens=1000)

      # Regularly clear memory
      client.clear_memory()

      # Monitor token usage
      response = client.chat.completions.create(...)
      print(f"Used tokens: {response.usage.total_tokens}")

2. Batch Processing:

   .. code-block:: python

      # Process multiple prompts, capping the length of each response
      responses = []
      for prompt in prompts:
          response = client.chat.completions.create(
              model="llama2",
              messages=[{"role": "user", "content": prompt}],
              max_tokens=50  # Limit response length
          )
          responses.append(response)

GPU Utilization
~~~~~~~~~~~~~~~

1. Optimal Settings:

   .. code-block:: python

      client = LLMClient(
          model="llama2",
          device="cuda",
          use_flash_attention=True,
          max_batch_size=32,
          dtype="float16"
      )

2. Memory Monitoring:

   * Use GPU monitoring tools (for example, ``nvidia-smi``)
   * Adjust batch size based on available memory
   * Consider gradient checkpointing when fine-tuning; it lowers training memory and has no effect on inference

Error Handling
--------------

Robust Implementation
~~~~~~~~~~~~~~~~~~~~~

1. Basic Error Handling:

   .. code-block:: python

      import logging

      logger = logging.getLogger("local_llm_kit")

      try:
          response = client.chat.completions.create(
              model="llama2",
              messages=[{"role": "user", "content": "Hello"}]
          )
      except Exception as e:
          logger.error(f"Chat completion failed: {e}")
          # Implement fallback behavior

2. Specific Error Types:

   .. code-block:: python

      from local_llm_kit.exceptions import (
          ModelNotFoundError,
          TokenLimitError,
          InvalidRequestError
      )

      try:
          pass  # Your code here
      except ModelNotFoundError:
          pass  # Handle missing model
      except TokenLimitError:
          pass  # Handle token limit exceeded
      except InvalidRequestError:
          pass  # Handle invalid parameters

Retry Logic
~~~~~~~~~~~

.. code-block:: python

   from tenacity import retry, stop_after_attempt, wait_exponential

   # Retry up to three times with exponential backoff between attempts
   @retry(stop=stop_after_attempt(3), wait=wait_exponential())
   def get_completion(prompt):
       return client.chat.completions.create(
           model="llama2",
           messages=[{"role": "user", "content": prompt}]
       )

Prompt Engineering
------------------

Effective Prompts
~~~~~~~~~~~~~~~~~

1. Clear Instructions:

   .. code-block:: python

      messages = [
          {
              "role": "system",
              "content": "You are a helpful assistant. Provide clear, concise answers."
          },
          {
              "role": "user",
              "content": "What is machine learning? Explain in simple terms."
          }
      ]

2. Context Management:

   .. code-block:: python

      # Add relevant context by including earlier turns of the conversation
      messages = [
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": "Remember this: I'm John."},
          {"role": "assistant", "content": "Hello John!"},
          {"role": "user", "content": "What's my name?"}
      ]

Temperature Control
~~~~~~~~~~~~~~~~~~~

1. For Deterministic Responses:

   .. code-block:: python

      response = client.chat.completions.create(
          model="llama2",
          messages=messages,
          temperature=0.0,  # More deterministic
          top_p=1.0
      )

2. For Creative Responses:

   .. code-block:: python

      response = client.chat.completions.create(
          model="llama2",
          messages=messages,
          temperature=0.8,  # More creative
          top_p=0.9
      )

Security Considerations
-----------------------

Input Validation
~~~~~~~~~~~~~~~~

1. Sanitize Inputs:

   .. code-block:: python

      def sanitize_input(text):
          # Placeholder: replace with sanitization suited to your application
          cleaned_text = text.strip()
          return cleaned_text

      # raw_input holds the untrusted text received from the user
      user_input = sanitize_input(raw_input)
      response = client.chat.completions.create(
          model="llama2",
          messages=[{"role": "user", "content": user_input}]
      )

2. Content Filtering:

   .. code-block:: python

      class SecurityError(Exception):
          """Application-defined error raised when input is rejected."""

      def is_safe_content(text):
          # Implement content safety checks and return True or False
          raise NotImplementedError

      if not is_safe_content(user_input):
          raise SecurityError("Unsafe content detected")

Model Security
~~~~~~~~~~~~~~

1. Model Access Control:

   .. code-block:: python

      import os

      # Use environment variables for sensitive paths
      model_path = os.getenv("LOCAL_LLM_KIT_MODEL_PATH")
      client = LLMClient(
          model="llama2",
          model_path=model_path
      )

2. Rate Limiting:

   .. code-block:: python

      from ratelimit import limits, sleep_and_retry

      @sleep_and_retry
      @limits(calls=10, period=60)  # 10 calls per minute
      def rate_limited_completion(prompt):
          return client.chat.completions.create(
              model="llama2",
              messages=[{"role": "user", "content": prompt}]
          )

Monitoring and Logging
----------------------

Logging Setup
~~~~~~~~~~~~~

1. Basic Logging:

   .. code-block:: python

      import logging

      logging.basicConfig(level=logging.INFO)
      logger = logging.getLogger("local_llm_kit")

      logger.info("Initializing client...")
      client = LLMClient(model="llama2")

2. Detailed Logging:

   .. code-block:: python

      # Write timestamped records to a file in addition to the default output
      handler = logging.FileHandler("llm.log")
      handler.setFormatter(logging.Formatter(
          '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
      ))
      logger.addHandler(handler)

Performance Monitoring
~~~~~~~~~~~~~~~~~~~~~~

1. Response Times:

   .. code-block:: python

      import time

      # perf_counter is a monotonic clock intended for measuring durations
      start_time = time.perf_counter()
      response = client.chat.completions.create(...)
      duration = time.perf_counter() - start_time
      logger.info(f"Response time: {duration:.2f}s")

2. Resource Usage:

   .. code-block:: python

      import psutil

      def log_resource_usage():
          process = psutil.Process()
          logger.info(f"Memory usage: {process.memory_info().rss / 1024 / 1024:.2f} MB")
          # Without an interval, the first cpu_percent() call returns a meaningless 0.0
          logger.info(f"CPU usage: {process.cpu_percent(interval=0.1)}%")

Testing and Validation
----------------------

Unit Testing
~~~~~~~~~~~~

1. Basic Tests:

   .. code-block:: python

      import unittest

      from local_llm_kit import LLMClient

      class TestLLMClient(unittest.TestCase):
          def setUp(self):
              self.client = LLMClient(model="llama2")

          def test_completion(self):
              response = self.client.chat.completions.create(
                  model="llama2",
                  messages=[{"role": "user", "content": "Hello"}]
              )
              self.assertIsNotNone(response)

2. Mock Testing:

   .. code-block:: python

      from unittest.mock import MagicMock, patch

      @patch("local_llm_kit.LLMClient")
      def test_with_mock(mock_client):
          # Stand-in for the object a real completion call would return
          mock_response = MagicMock()
          mock_client.return_value.chat.completions.create.return_value = mock_response
          # Test implementation

Integration Testing
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def test_end_to_end():
       client = LLMClient(model="llama2")

       # Test chat completion
       response1 = client.chat.completions.create(...)

       # Test function calling (function_spec is defined by your application)
       response2 = client.chat.completions.create(
           functions=[function_spec],
           function_call="auto",
           # ... remaining arguments
       )

       # Test streaming
       response3 = client.chat.completions.create(
           stream=True,
           # ... remaining arguments
       )

Deployment Best Practices
-------------------------

Environment Setup
~~~~~~~~~~~~~~~~~

1. Dependencies:

   .. code-block:: bash

      # Install all optional dependencies (quoted so the shell does not expand the brackets)
      pip install "local-llm-kit[all]"

2. Environment Variables:

   .. code-block:: bash

      export LOCAL_LLM_KIT_MODEL_PATH="/path/to/models"
      export LOCAL_LLM_KIT_CACHE_DIR="/path/to/cache"

Production Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

1. Load Balancing:

   .. code-block:: python

      import torch

      # Create one client per visible GPU
      clients = [
          LLMClient(model="llama2", device=f"cuda:{i}")
          for i in range(torch.cuda.device_count())
      ]

2. Health Checks:

   .. code-block:: python

      def health_check():
          # Healthy if a minimal request completes without raising
          try:
              client.chat.completions.create(
                  model="llama2",
                  messages=[{"role": "user", "content": "test"}]
              )
              return True
          except Exception:
              return False

Review and update these practices regularly to match your specific use case and requirements.
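
The per-device client list in the Load Balancing example does not by itself spread requests across devices. One possible approach is a round-robin pool, sketched below. This is an assumption about how an application might wrap the clients, not part of ``local_llm_kit``: ``RoundRobinPool`` and ``next_client`` are hypothetical names, and plain strings stand in for ``LLMClient`` instances so the example runs without a model.

.. code-block:: python

   import itertools
   import threading


   class RoundRobinPool:
       """Hand out clients in rotating order (hypothetical helper)."""

       def __init__(self, clients):
           clients = list(clients)
           if not clients:
               raise ValueError("at least one client is required")
           self._cycle = itertools.cycle(clients)
           self._lock = threading.Lock()

       def next_client(self):
           # The lock keeps the rotation consistent when several
           # threads ask for a client at the same time.
           with self._lock:
               return next(self._cycle)


   # Stand-in objects; in practice these would be the per-device
   # LLMClient instances built in the Load Balancing example.
   pool = RoundRobinPool(["client-0", "client-1", "client-2"])
   picked = [pool.next_client() for _ in range(4)]
   print(picked)  # ['client-0', 'client-1', 'client-2', 'client-0']

Round-robin ignores how busy each device is. If request sizes vary widely, a strategy that tracks in-flight work per client may balance load better.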