A C#/.NET library for efficient local inference of LLaMA and other large language models, based on llama.cpp.
LLamaSharp is a C#/.NET library that lets developers run large language models such as LLaMA and LLaVA locally on their own devices. It removes the dependency on cloud-based AI services by providing efficient, offline inference directly within .NET applications. The library is built on llama.cpp and supports both CPU and GPU acceleration.
.NET developers and engineers who want to integrate local LLM inference into their applications, particularly those focused on privacy, cost reduction, or offline functionality. It's also suitable for AI researchers and hobbyists exploring on-device AI in the .NET ecosystem.
Developers choose LLamaSharp because it brings the performance and flexibility of llama.cpp to the .NET world with a convenient managed API. Its cross-platform support, pre-compiled backends, and integrations with frameworks like Semantic Kernel lower the barrier to entry for local AI development in C#.
A C#/.NET library to run LLM (🦙LLaMA/LLaVA) on your local device efficiently.
Enables fully on-device LLM inference with no cloud calls, keeping data private and eliminating per-request API costs.
Works on Windows, Linux, and macOS with pre-compiled backends for CPU, CUDA, Metal, and Vulkan, allowing deployment across diverse hardware environments.
Offers higher-level APIs and integrations with frameworks such as Semantic Kernel and LangChain, making it straightforward to embed AI capabilities in existing C# applications.
Supports Retrieval Augmented Generation via kernel-memory and vision-language models like LLaVA, enabling advanced context-aware and image-based applications.
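To give a sense of the higher-level API mentioned above, here is a minimal chat sketch in the style of the LLamaSharp README. The model path is a placeholder, and the parameter values are illustrative assumptions rather than recommended defaults:

```csharp
using LLama;
using LLama.Common;

// Placeholder path to a GGUF model file (assumption: supply your own model).
string modelPath = "path/to/model.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 1024,   // illustrative context length
    GpuLayerCount = 5     // illustrative; layers offloaded to the GPU
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

// A ChatSession wraps the executor with conversation history.
var session = new ChatSession(executor, new ChatHistory());

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new List<string> { "User:" }
};

// Stream the model's reply token by token.
await foreach (var text in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, "Hello"), inferenceParams))
{
    Console.Write(text);
}
```

The same executor can also back the Semantic Kernel and kernel-memory integrations, which is what makes the RAG and framework scenarios above possible without changing the core inference code.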
Requires installing the correct backend package (e.g., a matching CUDA version) and managing native-library compatibility, which can lead to setup friction and crashes, as the FAQ on GPU issues acknowledges.
GGUF model files must be compatible with the specific LLamaSharp (and underlying llama.cpp) version in use; mismatched or outdated models can fail to load, which is why the README maintains a version map with publishing times.
Inference speed depends on manual configuration such as GpuLayerCount and can lag behind cloud services for large models, as the FAQ on slow performance notes.
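The GpuLayerCount tuning referred to above is a single property on ModelParams; a minimal sketch follows, with values that are illustrative assumptions, not recommendations:

```csharp
using LLama.Common;

// Illustrative configuration: GpuLayerCount controls how many transformer
// layers are offloaded to the GPU. 0 keeps everything on the CPU; larger
// values use more VRAM but run faster. The right number depends on the
// model size and available GPU memory (assumption: tune per machine).
var parameters = new ModelParams("path/to/model.gguf")
{
    GpuLayerCount = 20,   // illustrative; lower this if you run out of VRAM
    ContextSize = 4096
};
```

Because this value must be chosen per machine and per model, performance out of the box can vary widely, which is the trade-off this con describes.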