Run AI agents entirely on your own hardware - no cloud, no API costs, complete privacy.
Ollama is the easiest way to run local models on Windows.
# Using winget
winget install Ollama.Ollama
# Or download from https://ollama.ai
# Start the Ollama service
ollama serve
# In another terminal, pull models
ollama pull codellama:13b
ollama pull deepseek-coder:6.7b
| Model | Pull Command | VRAM | Notes |
|---|---|---|---|
| CodeLlama 7B | ollama pull codellama:7b | 8GB | Good starter |
| CodeLlama 13B | ollama pull codellama:13b | 16GB | Better quality |
| CodeLlama 34B | ollama pull codellama:34b | 40GB | Best CodeLlama |
| DeepSeek Coder 6.7B | ollama pull deepseek-coder:6.7b | 8GB | Excellent for size |
| DeepSeek Coder 33B | ollama pull deepseek-coder:33b | 40GB | Top tier |
| Qwen2.5 Coder 7B | ollama pull qwen2.5-coder:7b | 8GB | Very good |
| Qwen2.5 Coder 14B | ollama pull qwen2.5-coder:14b | 16GB | Excellent |
| StarCoder2 7B | ollama pull starcoder2:7b | 8GB | Code-focused |
# Make sure Ollama is running
ollama serve
# Run Ralph with Ollama
.\ralph.bat ollama
# Specify model
.\ralph.bat ollama -Model deepseek-coder:33b
# List available models
.\ralph.bat models ollama
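If Ralph gets no response, it helps to hit Ollama's native chat endpoint directly and rule out the model itself. A minimal PowerShell sketch - the request shape is Ollama's documented /api/chat format, and the model name is one of the tags pulled above:
# Send a single chat request to Ollama's native API (default port 11434)
$body = @{
    model    = "codellama:13b"
    messages = @(@{ role = "user"; content = "Write hello world in Python." })
    stream   = $false
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:11434/api/chat" -Method Post -Body $body -ContentType "application/json"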
# See what models you have
ollama list
# Remove a model
ollama rm codellama:7b
# See model info
ollama show codellama:13b
# Run model interactively (test it)
ollama run codellama:13b
LM Studio is a GUI application for running local models - great for beginners.
Search for and download:
- deepseek-coder-33b-instruct
- codellama-34b-instruct
- qwen2.5-coder-14b-instruct

Look for GGUF format with a quantization level that matches your VRAM.
.\ralph.bat lmstudio
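LM Studio's built-in local server exposes an OpenAI-compatible API, usually on port 1234 (check the Server tab for the exact address - that port is LM Studio's usual default, not something Ralph controls). If the lmstudio preset doesn't match your setup, the generic local agent can be pointed at it directly:
.\ralph.bat local -Endpoint http://localhost:1234/v1/chat/completions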
LocalAI is a self-hosted, production-ready, OpenAI-compatible API.
# Using Docker (recommended)
docker run -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12
# Or download binary from https://localai.io
# Download a model (-L follows Hugging Face's redirect)
curl -L -O https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/main/codellama-13b-instruct.Q4_K_M.gguf
# Place in models directory
mkdir models
move codellama-13b-instruct.Q4_K_M.gguf models/
# Create a model config (YAML) pointing at the downloaded GGUF file
@"
name: codellama
backend: llama
parameters:
  model: codellama-13b-instruct.Q4_K_M.gguf
"@ | Set-Content models/codellama.yaml
.\ralph.bat local -Endpoint http://localhost:8080/v1/chat/completions
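Before wiring Ralph to it, confirm LocalAI answers on the OpenAI-compatible route. A minimal PowerShell sketch, assuming the "codellama" name from the config above:
# Query LocalAI's OpenAI-compatible chat endpoint
$body = @{
    model    = "codellama"
    messages = @(@{ role = "user"; content = "Say hello." })
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method Post -Body $body -ContentType "application/json"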
Text Generation WebUI is a feature-rich UI with many backends.
# Clone repo
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# Run installer
start_windows.bat
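# Note: enable the web UI's OpenAI-compatible API (e.g. the --api flag, or add it to CMD_FLAGS.txt), otherwise the endpoint below won't be served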
.\ralph.bat local -Endpoint http://localhost:5000/v1/chat/completions
vLLM is a high-performance inference server - best for serious local deployments.
# Requires WSL2 or Linux
pip install vllm
# Start server
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/deepseek-coder-33b-instruct \
--port 8000
.\ralph.bat local -Endpoint http://localhost:8000/v1/chat/completions
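vLLM runs inside WSL2 here, but WSL2 normally forwards localhost ports to Windows, so Ralph on the Windows side can usually reach it without extra networking. A quick PowerShell check (assumes the default localhost forwarding is enabled):
# List the models the vLLM server is serving via its OpenAI-compatible /v1/models route
Invoke-RestMethod -Uri "http://localhost:8000/v1/models"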
Run models on a more powerful machine and access from your dev machine.
# Using Ollama - set OLLAMA_HOST so the server listens on all interfaces
$env:OLLAMA_HOST = "0.0.0.0"
ollama serve
# Using LM Studio
# Start server and check "Allow remote connections"
# Using LocalAI
docker run -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12
# Find server IP
# On server: ipconfig | findstr IPv4
# Use with Ralph
.\ralph.bat network -Endpoint http://192.168.1.100:11434/api/chat -Model codellama:13b
# Or configure in ralph-config.json
Edit .ralph-scripts/ralph-config.json:
{
"agents": {
"network": {
"endpoint": "http://192.168.1.100:11434/api/chat",
"defaultModel": "deepseek-coder:33b",
"apiFormat": "ollama"
}
}
}
# Use smaller model
ollama pull codellama:7b
# Or use quantized version
ollama pull codellama:13b-q4_0
# Close other GPU apps
# Check GPU usage: nvidia-smi
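To see exactly how much VRAM is free before loading a model, nvidia-smi can report just the memory figures (standard nvidia-smi query options):
# Show per-GPU memory usage in a compact form
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv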
# Check server is running
curl http://localhost:11434/api/tags
# Check firewall
netsh advfirewall firewall add rule name="Ollama" dir=in action=allow protocol=TCP localport=11434
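When the server is on another machine, check reachability from the client before changing Ralph's config (the IP below is the example address used above):
# Verify the remote Ollama port is reachable from this machine
Test-NetConnection -ComputerName 192.168.1.100 -Port 11434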
Approximate tokens/second on RTX 4090:
| Model | Quantization | Tokens/sec |
|---|---|---|
| CodeLlama 7B | Q4 | 80-100 |
| CodeLlama 13B | Q4 | 50-70 |
| CodeLlama 34B | Q4 | 20-30 |
| DeepSeek 6.7B | Q4 | 70-90 |
| DeepSeek 33B | Q4 | 15-25 |
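As a rough rule of thumb, a Q4-quantized model takes a little over half a byte per parameter, so a 13B model is about 7-8 GB of weights; the KV cache and runtime overhead add a few more GB, which is why the table further up pairs 13B models with 16GB cards. Treat these as ballpark figures - actual usage depends on the quantization variant and context length.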
Use local models for iteration, cloud for complex tasks:
# Quick iterations with local model
.\ralph.bat ollama -Model codellama:13b
# Stuck? Switch to cloud for help
.\ralph.bat openai -Model gpt-4o
Or configure automatic fallback in ralph-config.json:
{
"fallbackAgent": "openai",
"fallbackOnError": true
}
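The fallback keys sit alongside the agents block in the same .ralph-scripts/ralph-config.json. A combined sketch, assuming both sections live at the top level as the snippets above suggest:
{
  "fallbackAgent": "openai",
  "fallbackOnError": true,
  "agents": {
    "network": {
      "endpoint": "http://192.168.1.100:11434/api/chat",
      "defaultModel": "deepseek-coder:33b",
      "apiFormat": "ollama"
    }
  }
}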