A lightweight, single-binary Rust inference server providing 100% OpenAI-API compatible endpoints for local GGUF models.
Shimmy addresses vendor lock-in and privacy concerns by letting developers run language models locally with their existing OpenAI SDKs and tools: a drop-in replacement that requires no code changes.
Developers and teams building AI applications who want privacy, cost control, and reliability by running language models locally instead of using cloud APIs. It's particularly valuable for those using tools like VSCode Copilot, Cursor, or Continue.dev with local models.
Developers choose Shimmy for its zero-dependency deployment, automatic GPU detection, and perfect OpenAI API compatibility that works with existing tools. Its unique selling point is being a single binary that's 142x smaller than alternatives like Ollama while offering advanced features like Mixture of Experts support for large models.
⚡ Python-free Rust inference server — OpenAI-API compatible. GGUF + SafeTensors, hot model swap, auto-discovery, single binary. FREE now, FREE forever.
Pre-built binaries bundle all GPU backends for automatic detection; as the quick start highlights, getting started takes only downloading the binary and running it, with no dependencies.
Provides 100% OpenAI-compatible endpoints, so tools like VSCode Copilot and Cursor work instantly after changing only the API base URL; the README includes code examples for the Python and Node.js SDKs.
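To sketch the drop-in idea, the snippet below builds an OpenAI-style `/v1/chat/completions` request using only the Python standard library. The host/port (`localhost:11435`) and model name are placeholder assumptions, not confirmed Shimmy defaults; in practice you would point your existing OpenAI SDK's base URL at your local Shimmy instance.

```python
import json
import urllib.request

# Placeholder address: substitute whatever host/port your Shimmy instance binds to.
BASE_URL = "http://localhost:11435/v1"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (without sending) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("my-local-model", "Hello!")
print(req.full_url)  # http://localhost:11435/v1/chat/completions
```

Because the request shape matches OpenAI's API, the same change of base URL is all an existing SDK needs.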
The single binary is 4.8 MB, starts in under a second, and uses about 50 MB of memory, making it 142x smaller than alternatives like Ollama, per the performance comparison table.
Supports Mixture-of-Experts (MoE) models, running 70B+ parameter models on consumer hardware through intelligent CPU/GPU hybrid processing, enabled with flags like --cpu-moe.
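The intuition behind the CPU/GPU hybrid can be shown with back-of-envelope arithmetic: in MoE models most parameters sit in expert layers, so if those stay in CPU RAM, VRAM only needs to hold the remaining dense layers. The expert fraction and 2-bytes-per-parameter figure below are illustrative assumptions, not Shimmy measurements.

```python
def hybrid_vram_estimate_gb(total_params_b: float,
                            expert_fraction: float,
                            bytes_per_param: float = 2.0) -> float:
    """Rough VRAM estimate (GB) when expert weights are offloaded to CPU RAM.

    total_params_b  : model size in billions of parameters
    expert_fraction : assumed share of parameters living in expert layers
    bytes_per_param : assumed storage per parameter (2.0 ~ fp16)
    """
    dense_params_b = total_params_b * (1.0 - expert_fraction)
    # Billions of params * bytes each = GB of VRAM for the dense layers.
    return dense_params_b * bytes_per_param

# Example: a 70B MoE model where ~90% of weights are expert layers
# would need VRAM for only ~7B dense parameters.
print(hybrid_vram_estimate_gb(70, 0.9))
```

Under these assumed numbers a 70B-class MoE model fits in roughly 14 GB of VRAM, which is how CPU offloading brings such models within reach of consumer GPUs.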
CPU-based vision processing is 5-10x slower than GPU, with the README warning of 15-45 seconds per image versus 2-8 seconds on GPU, limiting usability for vision tasks without acceleration.
Auto-discovery focuses on GGUF files from sources like Hugging Face, which may exclude newer or proprietary models not distributed in that open format.
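The auto-discovery behavior this bullet refers to can be pictured as a recursive scan for `.gguf` files. This is an illustrative sketch only; the search roots and matching logic are assumptions, not Shimmy's actual implementation.

```python
from pathlib import Path

def find_gguf_models(*roots: Path) -> list[Path]:
    """Recursively collect *.gguf files under the given directories.

    Illustrative only: Shimmy's real discovery roots and ordering may differ.
    Non-GGUF files (e.g. .safetensors) are simply not matched, which is why
    models outside the GGUF format are invisible to this kind of scan.
    """
    found: list[Path] = []
    for root in roots:
        if root.is_dir():
            found.extend(root.rglob("*.gguf"))
    return sorted(found)
```

For example, scanning a local models directory would surface `llama.gguf` but skip a sibling `llama.safetensors`.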
Building from source requires advanced setup, including C++ compilers and GPU SDKs; the README notes dependencies like LLVM on Windows and recommends the pre-built binaries to avoid these issues.