A C/C++ library for efficient, cross-platform LLM inference with extensive hardware support and quantization.
llama.cpp is an open-source library for running large language model (LLM) inference locally using C/C++. It provides efficient, dependency-free execution of models like LLaMA, Mistral, and Gemma across diverse hardware, from consumer CPUs to enterprise GPUs. The project addresses the need for performant, portable LLM inference without cloud dependencies.
llama.cpp serves developers and researchers who need to deploy LLMs on local hardware, edge devices, or specialized infrastructure (e.g., Apple Silicon, embedded systems). It is also used by AI application builders creating offline-capable chat tools, coding assistants, or multimodal systems.
Developers choose llama.cpp for its strong performance per watt, extensive hardware support, and avoidance of Python/PyTorch runtime overhead. Its quantization support enables running billion-parameter models on consumer hardware, while the permissive MIT license allows integration into commercial products.
LLM inference in C/C++
Optimized for Apple Silicon (Metal), x86 (AVX/AMX), NVIDIA GPUs (CUDA), and more, as detailed in the Supported backends section, enabling state-of-the-art performance on diverse hardware.
Supports 1.5-bit to 8-bit integer quantization, reducing memory usage and accelerating inference, with tools for conversion and quantization documented in the README.
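To make the memory savings concrete, here is a rough sizing sketch. The bits-per-weight figures are approximations of common llama.cpp quantization types (k-quants and i-quants mix block metadata, so real GGUF files differ slightly), and the estimate covers weights only, not the KV cache or activation buffers.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size in bytes of the quantized weights alone
    (excludes KV cache and activation buffers)."""
    return n_params * bits_per_weight / 8

n_params = 7e9  # a 7B-parameter model

# Approximate bits per weight; real values vary by quantization scheme.
for label, bpw in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("IQ1_S", 1.56)]:
    gib = weight_bytes(n_params, bpw) / 2**30
    print(f"{label:7s} ~{gib:.1f} GiB")
```

Under these assumptions a 7B model drops from roughly 13 GiB at FP16 to around 4 GiB at 4-bit quantization, which is what puts it within reach of consumer GPUs and laptops.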
Plain C/C++ implementation with no external dependencies, making it highly portable and easy to embed in resource-constrained or edge devices.
Compatible with hundreds of text and multimodal models like LLaMA, Mistral, and Gemma in GGUF format, with a detailed list provided in the README.
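All of these models are distributed as GGUF files, which begin with a small fixed header. As a sketch of what that container looks like (based on the GGUF layout: a 4-byte magic, then little-endian version, tensor count, and metadata key/value count), the synthetic bytes below are for illustration only, not a real model file:

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes):
    """Parse the fixed-size GGUF header: 4-byte magic, then little-endian
    uint32 version, uint64 tensor count, uint64 metadata key/value count."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return version, n_tensors, n_kv

# Synthetic header for illustration (values are made up, not a real model):
fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(fake))  # (3, 291, 24)
```

A check like this is a cheap way for an application to reject non-GGUF files before handing a path to llama.cpp.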
Includes llama-server for an OpenAI-compatible HTTP API, llama-cli for interactive use, and benchmarking tools, facilitating deployment in real-world applications.
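Because llama-server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. The sketch below builds (but does not send) a request body; the model name is a placeholder, since llama-server serves whichever GGUF it was started with, and the localhost URL in the comment assumes the server's default port of 8080.

```python
import json

def chat_request(prompt: str, temperature: float = 0.7) -> bytes:
    """Build a chat-completion request body for an OpenAI-compatible
    endpoint such as llama-server's /v1/chat/completions."""
    body = {
        "model": "local-model",  # placeholder; llama-server serves its loaded GGUF
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(body).encode("utf-8")

payload = chat_request("Explain GGUF in one sentence.")
# To send it (assumes a running llama-server on the default port):
# urllib.request.Request("http://localhost:8080/v1/chat/completions",
#                        data=payload, headers={"Content-Type": "application/json"})
```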
The README highlights 'Recent API changes' with separate changelogs for libllama and llama-server, indicating frequent updates that can disrupt integrations.
Building from source requires C/C++ toolchains and multiple build guides, which can be daunting for teams accustomed to higher-level languages like Python.
Models must be converted to GGUF using Python scripts or online tools, adding an extra step and potential friction for those working with other formats.
While the README is extensive, the fast-paced development and scattered docs (e.g., separate guides for backends) can leave gaps for specific use cases.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
Ready-to-run cloud templates for RAG, AI pipelines, and enterprise search with live data. 🐳 Docker-friendly. ⚡ Always in sync with SharePoint, Google Drive, S3, Kafka, PostgreSQL, real-time data APIs, and more.
Official inference framework for 1-bit LLMs
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.