My Local LLM Stack for Omarchy
After hitting Claude Code's rate limits, I decided to set up a local LLM stack. The goal: a fast, self-hosted alternative that works with my existing coding tools. Here's what I built.
The Stack
I run Omarchy (DHH's Arch-based Linux distro) on a machine with:
- NVIDIA RTX 5050 (8GB VRAM)
- AMD Ryzen 5 5600T
- 64GB RAM
The setup lives at github.com/bitclaw/local-llm-stack.
Why Local?
Several reasons pushed me toward local:
- Cost: No per-token billing
- Privacy: Code stays on my machine
- No rate limits: Unlimited inference
- Latency: Local network is fast
The tradeoff: weaker models than GPT-4o or Claude Opus. For coding tasks, 7B models work surprisingly well.
The Hardware Reality
The RTX 5050 has only 8GB VRAM, which limits what I can run on GPU. The sweet spot is CPU inference with 7B models.
| Model | Size | Performance |
|---|---|---|
| Qwen 7B | 4.4GB | 4.4 tokens/sec (CPU) |
| Qwen 14B | 8.5GB | Needs 10GB+ VRAM |
| Qwen 32B | 19.6GB | Needs 22GB+ VRAM |
For my use case, Qwen 7B is fine. I mainly use it as a fallback when hitting API limits.
Quick Setup
git clone https://github.com/bitclaw/local-llm-stack.git
cd local-llm-stack
# Install dependencies
./distros/omarchy/packages.sh
./distros/omarchy/install.sh
# Build llama.cpp
./engines/llama-cpp/install.sh
# Download model
./models/download-qwen.sh
# Start server (CPU mode for 8GB VRAM)
./scripts/start.sh llama-cpp
Server runs at http://localhost:8000 with OpenAI-compatible API.
Using with Claude Code / OpenCode
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-local
Configure your editor:
- Provider: OpenAI-compatible
- Base URL:
http://localhost:8000/v1 - Model:
qwen2.5-coder
What Works
The basics are solid:
- llama.cpp builds with CUDA support
- Qwen models download and run
- Server works in CPU mode
- Background daemon mode for persistent operation
I can now run coding tasks locally when I want to save API credits or need privacy.
What's Missing
The stack is minimal. Some things on my roadmap:
- Systemd service for auto-start
- Port conflict detection
- Model validation at startup
- VRAM auto-detection
- Docker support
The hardware constraints (8GB VRAM) mean I'm running everything on CPU. Better GPUs would open up 14B+ models with GPU acceleration.
VRAM Guide for Others
If you're setting this up on different hardware:
| GPU | VRAM | Recommendation | Max Layers |
|---|---|---|---|
| RTX 5050 | 8GB | 7B, CPU mode | 0 |
| RTX 4060 | 8GB | 7B | 20-28 |
| RTX 4070 | 12GB | 7B/14B | 28-35 |
| RTX 4080 | 16GB | 14B | all |
| RTX 4090 | 24GB | 14B/32B | all |
Bottom Line
For casual local inference, this works. The 7B Qwen model handles basic coding tasks without hitting cloud API limits.
Not replacing Claude Code Pro for serious work, but it's a useful fallback when I want to experiment or conserve credits.