LLM Gateway Configuration

Your compliance team requires that all AI model traffic stays within your cloud perimeter. No direct calls to api.anthropic.com — everything must go through AWS Bedrock or Google Vertex AI so you can audit usage, enforce data residency, and consolidate billing. Claude Code supports this out of the box.

This guide covers:

  • Claude Code routing through AWS Bedrock with IAM authentication
  • Google Vertex AI configuration with workload identity
  • LiteLLM proxy setup for cost tracking and multi-model routing
  • Custom API gateway patterns for advanced enterprise requirements

AWS Bedrock

To route Claude Code through Bedrock, you need:

  • An AWS account with Amazon Bedrock enabled
  • Claude models enabled in your Bedrock region
  • IAM credentials with bedrock:InvokeModel and bedrock:InvokeModelWithResponseStream permissions
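A minimal IAM policy granting those two actions might look like the following sketch; in production, scope the Resource down to the specific model ARNs you actually use rather than a wildcard:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:*::foundation-model/anthropic.*"
    }
  ]
}
```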
```shell
# Set Bedrock as the API provider
export CLAUDE_CODE_USE_BEDROCK=1

# Standard AWS credential chain applies
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key

# Or use AWS SSO / profiles
export AWS_PROFILE=bedrock-profile

claude
```

For persistent configuration, add this to your Claude Code settings file (e.g., .claude/settings.json in your project):

```json
{
  "env": {
    "CLAUDE_CODE_USE_BEDROCK": "1",
    "AWS_REGION": "us-east-1",
    "AWS_PROFILE": "bedrock-profile"
  }
}
```

If you need models from multiple regions:

```shell
export ANTHROPIC_MODEL="us.anthropic.claude-sonnet-4-5-20250929-v1:0"
export AWS_REGION=us-east-1
```

The us. prefix routes to the US cross-region inference profile, which balances requests across available US regions.
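To see which Claude models a given region actually offers, you can list Anthropic models with the AWS CLI (assumes AWS CLI v2 with Bedrock support and valid credentials):

```
# List Anthropic model IDs available in a region
aws bedrock list-foundation-models \
  --region us-east-1 \
  --by-provider anthropic \
  --query 'modelSummaries[].modelId' \
  --output table
```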

Google Vertex AI

To route Claude Code through Vertex AI, you need:

  • A Google Cloud project with Vertex AI enabled
  • Claude models enabled in your region
  • A service account (or user credentials) with the Vertex AI User role
```shell
# Set Vertex AI as the API provider
export CLAUDE_CODE_USE_VERTEX=1

# Google Cloud configuration
export CLOUD_ML_REGION=us-east5
export ANTHROPIC_VERTEX_PROJECT_ID=your-project-id

# Authenticate
gcloud auth application-default login

claude
```
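As with Bedrock, this can be made persistent in your Claude Code settings (the region and project ID here are placeholders for your own values):

```json
{
  "env": {
    "CLAUDE_CODE_USE_VERTEX": "1",
    "CLOUD_ML_REGION": "us-east5",
    "ANTHROPIC_VERTEX_PROJECT_ID": "your-project-id"
  }
}
```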

LiteLLM Proxy

LiteLLM is an open-source proxy that sits between Claude Code and any LLM provider. It adds cost tracking, rate limiting, and key management.

  • Cost tracking by API key: See spend per developer, per team, per project
  • Rate limiting: Enforce per-user token limits
  • Multi-model routing: Route different requests to different providers
  • Audit logging: Full request/response logging for compliance
```shell
# Install LiteLLM with proxy extras (quote the brackets for zsh)
pip install 'litellm[proxy]'

# Start the proxy with a single Claude model
litellm --model claude-sonnet-4-5-20250929 --port 4000
```

Configure Claude Code to use the proxy:

```shell
export ANTHROPIC_BASE_URL=http://localhost:4000
export ANTHROPIC_API_KEY=your-litellm-key
claude
```
litellm_config.yaml:

```yaml
model_list:
  - model_name: claude-sonnet-4-5-20250929
    litellm_params:
      model: claude-sonnet-4-5-20250929
      api_key: sk-ant-your-key
  - model_name: claude-opus-4-6
    litellm_params:
      model: claude-opus-4-6
      api_key: sk-ant-your-key

general_settings:
  master_key: sk-litellm-master-key
  database_url: postgresql://user:pass@localhost/litellm
```
```shell
litellm --config litellm_config.yaml --port 4000
```
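With the master key and database configured, per-developer virtual keys (the basis for per-key cost tracking) can be minted through LiteLLM's key-generation endpoint. A sketch, assuming the proxy runs locally on port 4000 and the email in metadata is illustrative:

```
# Create a virtual key with a spend budget for one developer
curl -s http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-litellm-master-key" \
  -H "Content-Type: application/json" \
  -d '{
        "models": ["claude-sonnet-4-5-20250929"],
        "max_budget": 100,
        "metadata": {"user": "dev@company.com"}
      }'
```

The returned key is what each developer sets as ANTHROPIC_API_KEY, so spend rolls up per key in LiteLLM's dashboard.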

Custom API Gateways

For organizations with existing API gateways (Kong, Apigee, AWS API Gateway), you can route Claude Code through them:

```shell
# Point Claude Code at your custom gateway
export ANTHROPIC_BASE_URL=https://ai-gateway.company.com/v1
export ANTHROPIC_API_KEY=your-gateway-key
claude
```

Your gateway needs to proxy requests to https://api.anthropic.com/v1/ with the appropriate authentication headers.
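As one illustration, an nginx-based gateway could proxy the API like this. This is a hedged sketch: the $anthropic_key variable (the real key, injected server-side) and the location path are assumptions for your environment:

```nginx
# Hypothetical reverse-proxy block for an internal AI gateway
location /v1/ {
    proxy_pass https://api.anthropic.com/v1/;
    proxy_ssl_server_name on;
    proxy_set_header Host api.anthropic.com;
    # Inject the real key server-side; clients only hold gateway keys
    proxy_set_header x-api-key $anthropic_key;
    proxy_set_header anthropic-version "2023-06-01";
}
```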

Dynamic API Keys

For environments where API keys rotate or are generated dynamically, Claude Code supports a helper script:

```json
{
  "apiKeyHelper": "/opt/scripts/get-claude-key.sh"
}
```

The script must output the API key to stdout. It runs in /bin/sh and the result is sent as both X-Api-Key and Authorization: Bearer headers.

/opt/scripts/get-claude-key.sh:

```shell
#!/bin/bash
# Example: fetch the key from AWS Secrets Manager
aws secretsmanager get-secret-value \
  --secret-id claude-api-key \
  --query SecretString \
  --output text
```
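The same stdout contract works with any secret source. A hypothetical variant that prefers an injected environment variable and falls back to a local file (both variable names are illustrative, not part of Claude Code):

```shell
# get_claude_key: print the API key to stdout, as apiKeyHelper expects.
get_claude_key() {
  if [ -n "${CLAUDE_API_KEY:-}" ]; then
    # Key injected via environment (e.g. by a CI secret store)
    printf '%s\n' "$CLAUDE_API_KEY"
  else
    # Fall back to a local secrets file
    cat "${CLAUDE_KEY_FILE:-$HOME/.claude-key}"
  fi
}
```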

Troubleshooting

Bedrock returns “model not found”: Check that Claude models are enabled in your Bedrock region. Not all regions have all models. Use the cross-region inference prefix (us.) if your region does not have the specific model version.

Vertex AI authentication fails in CI: Workload identity federation must be configured correctly; the GitHub OIDC token must map to a service account with Vertex AI permissions. Run gcloud auth application-default print-access-token to verify that credentials resolve.
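The GitHub Actions side of that mapping typically uses the google-github-actions/auth action. A sketch, where the workload identity provider path, service account, and prompt are placeholders for your setup:

```yaml
# Excerpt from a GitHub Actions workflow using workload identity federation
permissions:
  id-token: write   # required to mint the GitHub OIDC token
  contents: read

jobs:
  claude:
    runs-on: ubuntu-latest
    steps:
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/123456/locations/global/workloadIdentityPools/github/providers/github
          service_account: claude-ci@your-project-id.iam.gserviceaccount.com
      - run: |
          export CLAUDE_CODE_USE_VERTEX=1
          export CLOUD_ML_REGION=us-east5
          export ANTHROPIC_VERTEX_PROJECT_ID=your-project-id
          claude -p "your CI task here"
```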

LiteLLM proxy adds latency: LiteLLM adds a hop. For latency-sensitive workflows, consider running it on the same machine or network as your developers. Typical overhead is 50-100ms per request.

Custom gateway strips headers: Some API gateways modify or strip headers that Anthropic’s API requires. Ensure your gateway passes through anthropic-version, content-type, and x-api-key headers without modification.
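To see which headers actually survive the hop, a verbose request through the gateway shows what is sent and returned. A diagnostic sketch, reusing the gateway URL and key from the example above:

```
# Inspect request/response headers through the gateway
curl -sv https://ai-gateway.company.com/v1/messages \
  -H "x-api-key: your-gateway-key" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-5-20250929","max_tokens":16,"messages":[{"role":"user","content":"ping"}]}' \
  2>&1 | grep -i -E '^(>|<) (x-api-key|anthropic-version|content-type)'
```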