# RiteLLM

## Overview
RiteLLM is a high-performance LLM (Large Language Model) gateway that provides a unified interface for interacting with multiple LLM providers. Built with Rust and exposed through elegant Python bindings, RiteLLM combines the speed of compiled systems programming with the ease of Python development.
## ✨ Key Features
- 🚀 Unified LLM Gateway: Single, consistent API for multiple LLM providers
- 🔌 Provider Support: Currently supports OpenAI, with additional providers (Anthropic, Google, Cohere, and others) coming soon
- ⚡ Rust-Powered Performance: Core engine built in Rust for maximum speed and efficiency
- 📊 First-Class Observability: Built-in integration with Weights & Biases Weave for seamless tracing, monitoring, and debugging
- 🐍 Pythonic Interface: Clean, intuitive Python API that feels native to the ecosystem
- 🔒 Type-Safe: Full type hints for better IDE support and code quality
- 🌊 Streaming Support: Real-time streaming responses for better user experience
## 🚀 Installation

Install RiteLLM using pip:
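```bash
# Assumes the package is published on PyPI as "ritellm", matching the import name
pip install ritellm
```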
## 💻 Quick Start

### Basic Usage
```python
from ritellm import completion

# Define your messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what Rust is in one sentence."}
]

# Make a completion request
response = completion(
    model="openai/gpt-3.5-turbo",
    messages=messages,
    temperature=0.7,
    max_tokens=100
)

# Access the response
print(response["choices"][0]["message"]["content"])
print(f"Tokens used: {response['usage']['total_tokens']}")
```
### Streaming Responses

For real-time streaming of responses as they are generated:
```python
from ritellm import completion

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short poem about Rust."}
]

# Enable streaming
response = completion(
    model="openai/gpt-3.5-turbo",
    messages=messages,
    stream=True  # Enable streaming
)

# Stream the response
for chunk in response:
    if "choices" in chunk and len(chunk["choices"]) > 0:
        content = chunk["choices"][0]["delta"].get("content", "")
        if content:
            print(content, end="", flush=True)
print()  # New line after streaming completes
```
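If you need the full text after streaming finishes, one option is to accumulate the deltas as they arrive. A minimal sketch, reusing the chunk format shown above:

```python
from ritellm import completion

response = completion(
    model="openai/gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write a short poem about Rust."}],
    stream=True
)

# Collect the streamed deltas into a single string
full_text = ""
for chunk in response:
    if "choices" in chunk and len(chunk["choices"]) > 0:
        full_text += chunk["choices"][0]["delta"].get("content", "")

print(full_text)
```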
See the Streaming Guide for more details.
### Async Usage

For concurrent requests and non-blocking API calls, use the async `acompletion` function:
```python
import asyncio
from ritellm import acompletion

async def main():
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]

    # Non-blocking async call
    response = await acompletion(
        model="openai/gpt-3.5-turbo",
        messages=messages
    )
    print(response["choices"][0]["message"]["content"])

asyncio.run(main())
```
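Because `acompletion` is non-blocking, you can fan out several requests at once, for example with `asyncio.gather`. A minimal sketch (the helper `ask` is just for illustration):

```python
import asyncio
from ritellm import acompletion

async def ask(question: str) -> str:
    # Each call runs concurrently with the others
    response = await acompletion(
        model="openai/gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}]
    )
    return response["choices"][0]["message"]["content"]

async def main():
    # Issue several requests concurrently and wait for all of them
    answers = await asyncio.gather(
        ask("What is Rust?"),
        ask("What is Python?"),
        ask("What is an LLM gateway?")
    )
    for answer in answers:
        print(answer)

asyncio.run(main())
```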
See the Async Usage Guide for more details on async mode and concurrent requests.
### With Weave Tracing

RiteLLM has first-class support for Weave, enabling automatic tracing and monitoring of your LLM calls:
```python
import weave
from ritellm import completion

# Initialize Weave
weave.init(project_name="my-llm-project")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

# Wrap completion with Weave's op decorator for automatic tracing
response = weave.op(completion)(
    model="openai/gpt-3.5-turbo",
    messages=messages,
    temperature=0.7
)

# Your calls are now automatically traced in Weave!
```
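You can also wrap your own functions with Weave's `@weave.op()` decorator so that a whole pipeline step, including the `completion` call inside it, appears as a single trace. A minimal sketch (the function `answer_question` is just for illustration):

```python
import weave
from ritellm import completion

weave.init(project_name="my-llm-project")

@weave.op()
def answer_question(question: str) -> str:
    # The whole function, including the completion call, is traced as one op
    response = completion(
        model="openai/gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}]
    )
    return response["choices"][0]["message"]["content"]

print(answer_question("What is the capital of France?"))
```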
## 🙏 Gratitude

RiteLLM is heavily inspired by litellm and its simple API design.

Made with ❤️ and 🦀