Hey HN! I built this because I was tired of waiting 10 seconds for Ollama's 680MB binary to start just to run a 4GB model locally.
Quick demo - working VSCode + local AI in 30 seconds:

  curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/late...
  ./shimmy serve
  # Point VSCode/Cursor to localhost:11435
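Under the hood your editor is just talking HTTP to that port, so you can sanity-check the server with curl before wiring anything up. A minimal request looks roughly like this - the endpoint path and model name below are illustrative, use whatever auto-discovery actually finds on your machine:

  # illustrative request - exact endpoint path and model name depend on your setup
  curl http://localhost:11435/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "your-local-model", "messages": [{"role": "user", "content": "Say hello"}]}'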
The technical achievement: Got it down to 5.1MB by stripping everything except pure inference. Written in Rust, uses llama.cpp's engine.
One feature I'm excited about: You can use LoRA adapters directly without converting them. Just point to your .gguf base model and .gguf LoRA - it handles the merge at runtime. Makes iterating on fine-tuned models much faster since there's no conversion step.
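In practice that just means two GGUF files sitting next to each other on disk - the paths and names here are only an example:

  ~/models/base-model-q4.gguf       # base model in GGUF
  ~/models/my-finetune-lora.gguf    # LoRA adapter, also GGUF - nothing to convert or merge ahead of time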
Your data never leaves your machine. No telemetry. No accounts. Just a tiny binary that makes GGUF models work with your AI coding tools.
Would love feedback on the auto-discovery feature - it finds your models automatically so you don't need any configuration.
What's your local LLM setup? Are you using LoRA adapters for anything specific?
You may have noticed already, but the link to the binary is throwing a 404.
This should be fixed now!
Windows Defender tripped on this for me, flagging it as the Bearfoos trojan. Most likely a false positive, but JFYI.
Try cargo install, or add an exclusion intentionally - unsigned Rust binaries will often trip this.
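Assuming the crates.io package uses the same name as the repo, that's a one-liner, and building from source locally often avoids the heuristics that flag unsigned downloaded binaries:

  # assumes the package name matches the repo name
  cargo install shimmy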
Nice, a Rust tool wrapping llama.cpp.
How does it differ from llama-server?
And from llama-swap?
Shimmy is designed to be "invisible infrastructure" - the simplest possible way to get local inference working with your existing AI tools. llama-server gives you more control, llama-swap gives you multi-model management.
Key differences:
- Architecture: llama-swap = proxy + multiple servers, Shimmy = single server
- Resource usage: llama-swap runs multiple processes, Shimmy = one 50MB process
- Use case: llama-swap for managing many models, Shimmy for simplicity
Shimmy is for when you want the absolute minimum footprint - CI/CD pipelines, quick local testing, or systems where you can't install 680MB of dependencies.
Looks cool, ty! Really great project - will try this out.