Official inference framework for 1-bit LLMs, enabling fast and lossless CPU/GPU inference with significant speed and energy efficiency gains.
bitnet.cpp is the official inference framework for 1-bit large language models (LLMs) such as BitNet b1.58. It provides a suite of optimized kernels for fast, lossless, and energy-efficient inference on CPUs and GPUs, making it practical to run large 1-bit models on local hardware. Compared with standard llama.cpp baselines, it substantially accelerates inference while sharply cutting energy consumption.
AI researchers, ML engineers, and developers working with or exploring 1-bit LLMs who need efficient inference for deployment on CPUs, GPUs, or edge devices.
Developers choose bitnet.cpp for its first-party support of 1-bit LLMs, delivering strong inference speed and energy efficiency through highly optimized kernels. Its ability to run a 100B-parameter model on a single CPU at practical speeds makes it uniquely valuable for edge and local AI deployment.
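A typical workflow, sketched from the scripts the repository documents (`setup_env.py`, `run_inference.py`). The model name, directory layout, and flags below follow the README's examples as I recall them and should be treated as assumptions to verify against the current repo, not a definitive recipe:

```shell
# Hedged sketch: commands and flags follow the bitnet.cpp README's examples;
# confirm against the repository before running.
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# The README recommends a conda environment (Python version is an assumption).
conda create -n bitnet-cpp python=3.9 -y
conda activate bitnet-cpp
pip install -r requirements.txt

# Download an official 1-bit model, then build the optimized kernels for it
# (-q selects the quantization kernel type, e.g. i2_s).
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
    --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

# Run lossless CPU inference on the quantized model (-cnv = chat/conversation mode).
python run_inference.py \
    -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
    -p "You are a helpful assistant" -cnv
```

Note that `setup_env.py` both fetches dependencies and compiles the model-specific kernels, which is why the prerequisites below (clang, cmake, conda) must be in place first.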
Achieves speedups of 2.37x to 6.17x on x86 CPUs and 1.37x to 5.07x on ARM CPUs, with larger models seeing greater benefits, as documented in the performance benchmarks.
Cuts energy consumption by 71.9% to 82.2% on x86 CPUs and 55.4% to 70.0% on ARM CPUs, making it well suited to edge and local deployment.
Enables running 100B parameter models on a single CPU at human-readable speeds (5-7 tokens/sec), per the technical report.
Latest updates add parallel kernels with configurable tiling and embedding quantization for an additional 1.15x to 2.1x speedup.
Supports official Microsoft BitNet models and other 1-bit LLMs from Hugging Face, including Falcon and Llama variants, as listed in the tables.
Requires specific tools like clang>=18, cmake, and conda, with Windows setup needing Visual Studio Developer Command Prompt, increasing setup overhead.
Only supports a handful of 1-bit LLMs, and the README notes it relies on existing community models to demonstrate the framework's capabilities, reflecting a still-nascent and restricted model selection.
NPU support is still listed as forthcoming, and kernel availability varies by model and CPU architecture, as the missing checkmarks in the support tables show.
Built on top of llama.cpp, so it inherits upstream build quirks (e.g., std::chrono compilation errors in log.cpp) that may require manual fixes, as noted in the FAQ.