My Local LLM Stack for Omarchy

After hitting Claude Code's rate limits, I decided to set up a local LLM stack. The goal: a fast, self-hosted alternative that works with my existing coding tools. Here's what I built.

The Stack

I run Omarchy (DHH's Arch-based Linux distro) on a machine with:

  • NVIDIA RTX 5050 (8GB VRAM)
  • AMD Ryzen 5 5600T
  • 64GB RAM

The setup lives at github.com/bitclaw/local-llm-stack.

Why Local?

Several reasons pushed me toward local:

  1. Cost: No per-token billing
  2. Privacy: Code stays on my machine
  3. No rate limits: Unlimited inference
  4. Latency: Local network is fast

The tradeoff: weaker models than GPT-4o or Claude Opus. For coding tasks, 7B models work surprisingly well.

The Hardware Reality

The RTX 5050 has only 8GB VRAM, which limits what I can run on GPU. The sweet spot is CPU inference with 7B models.

ModelSizePerformance
Qwen 7B4.4GB4.4 tokens/sec (CPU)
Qwen 14B8.5GBNeeds 10GB+ VRAM
Qwen 32B19.6GBNeeds 22GB+ VRAM

For my use case, Qwen 7B is fine. I mainly use it as a fallback when hitting API limits.

Quick Setup

git clone https://github.com/bitclaw/local-llm-stack.git
cd local-llm-stack

# Install dependencies
./distros/omarchy/packages.sh
./distros/omarchy/install.sh

# Build llama.cpp
./engines/llama-cpp/install.sh

# Download model
./models/download-qwen.sh

# Start server (CPU mode for 8GB VRAM)
./scripts/start.sh llama-cpp

Server runs at http://localhost:8000 with OpenAI-compatible API.

Using with Claude Code / OpenCode

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-local

Configure your editor:

  • Provider: OpenAI-compatible
  • Base URL: http://localhost:8000/v1
  • Model: qwen2.5-coder

What Works

The basics are solid:

  • llama.cpp builds with CUDA support
  • Qwen models download and run
  • Server works in CPU mode
  • Background daemon mode for persistent operation

I can now run coding tasks locally when I want to save API credits or need privacy.

What's Missing

The stack is minimal. Some things on my roadmap:

  • Systemd service for auto-start
  • Port conflict detection
  • Model validation at startup
  • VRAM auto-detection
  • Docker support

The hardware constraints (8GB VRAM) mean I'm running everything on CPU. Better GPUs would open up 14B+ models with GPU acceleration.

VRAM Guide for Others

If you're setting this up on different hardware:

GPUVRAMRecommendationMax Layers
RTX 50508GB7B, CPU mode0
RTX 40608GB7B20-28
RTX 407012GB7B/14B28-35
RTX 408016GB14Ball
RTX 409024GB14B/32Ball

Bottom Line

For casual local inference, this works. The 7B Qwen model handles basic coding tasks without hitting cloud API limits.

Not replacing Claude Code Pro for serious work, but it's a useful fallback when I want to experiment or conserve credits.

Repo: github.com/bitclaw/local-llm-stack