My Local LLM Stack for Omarchy

After hitting Claude Code's rate limits, I decided to set up a local LLM stack. The goal: a fast, self-hosted alternative that works with my existing coding tools. Here's what I built.

The Stack

I run Omarchy (DHH's Arch-based Linux distro) on a machine with:

NVIDIA RTX 5050 (8GB VRAM)
AMD Ryzen 5 5600T
64GB RAM

The setup lives at github.com/bitclaw/local-llm-stack.

Why Local?

Several reasons pushed me toward local:

Cost: No per-token billing
Privacy: Code stays on my machine
No rate limits: Unlimited inference
Latency: Local network is fast

The tradeoff: weaker models than GPT-4o or Claude Opus. For coding tasks, 7B models work surprisingly well.

The Hardware Reality

The RTX 5050 has only 8GB VRAM, which limits what I can run on GPU. The sweet spot is CPU inference with 7B models.

Model	Size	Performance
Qwen 7B	4.4GB	4.4 tokens/sec (CPU)
Qwen 14B	8.5GB	Needs 10GB+ VRAM
Qwen 32B	19.6GB	Needs 22GB+ VRAM

For my use case, Qwen 7B is fine. I mainly use it as a fallback when hitting API limits.

Quick Setup

git clone https://github.com/bitclaw/local-llm-stack.git
cd local-llm-stack

# Install dependencies
./distros/omarchy/packages.sh
./distros/omarchy/install.sh

# Build llama.cpp
./engines/llama-cpp/install.sh

# Download model
./models/download-qwen.sh

# Start server (CPU mode for 8GB VRAM)
./scripts/start.sh llama-cpp

Server runs at http://localhost:8000 with OpenAI-compatible API.

Using with Claude Code / OpenCode

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-local

Configure your editor:

Provider: OpenAI-compatible
Base URL: http://localhost:8000/v1
Model: qwen2.5-coder

What Works

The basics are solid:

llama.cpp builds with CUDA support
Qwen models download and run
Server works in CPU mode
Background daemon mode for persistent operation

I can now run coding tasks locally when I want to save API credits or need privacy.

What's Missing

The stack is minimal. Some things on my roadmap:

Systemd service for auto-start
Port conflict detection
Model validation at startup
VRAM auto-detection
Docker support

The hardware constraints (8GB VRAM) mean I'm running everything on CPU. Better GPUs would open up 14B+ models with GPU acceleration.

VRAM Guide for Others

If you're setting this up on different hardware:

GPU	VRAM	Recommendation	Max Layers
RTX 5050	8GB	7B, CPU mode	0
RTX 4060	8GB	7B	20-28
RTX 4070	12GB	7B/14B	28-35
RTX 4080	16GB	14B	all
RTX 4090	24GB	14B/32B	all

Bottom Line

For casual local inference, this works. The 7B Qwen model handles basic coding tasks without hitting cloud API limits.

Not replacing Claude Code Pro for serious work, but it's a useful fallback when I want to experiment or conserve credits.

Repo: github.com/bitclaw/local-llm-stack