An LLM acceleration library for Intel XPU (GPU, NPU, CPU) to speed up local inference and finetuning of popular models.
IPEX-LLM is an LLM acceleration library developed by Intel to optimize the performance of large language models on Intel XPU hardware, including GPUs, NPUs, and CPUs. It solves the problem of slow and resource-intensive local LLM inference by providing low-bit quantization, hardware-specific optimizations, and seamless integration with popular AI frameworks. This enables users to run models like LLaMA and Mistral efficiently on consumer Intel devices.
Developers and researchers working with local LLM deployment on Intel hardware, including those using integrated GPUs (e.g., Intel Core Ultra), discrete Arc GPUs, or NPUs for inference and finetuning tasks.
Developers choose IPEX-LLM for its deep, Intel-specific optimization of LLMs, which yields better performance and lower memory usage on Intel hardware than generic solutions. Its standout capability is running very large models, such as DeepSeek V3 671B, on just one or two Intel Arc GPUs via techniques like FlashMoE, a capability few other acceleration libraries offer.
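Some back-of-envelope arithmetic shows why this is plausible. The 671B total and ~37B active-parameter figures are DeepSeek V3's published sizes; the rest is simple illustration, not IPEX-LLM's actual memory plan:

```python
# Back-of-envelope arithmetic: why low-bit quantization plus MoE sparsity
# matters at DeepSeek V3's scale. Parameter counts are the model's published
# figures; this is an illustration, not IPEX-LLM's actual memory plan.

def model_bytes(params: float, bits: int) -> float:
    """Raw weight storage for `params` parameters at `bits` bits each."""
    return params * bits / 8

GB = 1024**3
total_params = 671e9    # DeepSeek V3 total parameters
active_params = 37e9    # parameters activated per token (MoE routing)

print(f"FP16, full model:       {model_bytes(total_params, 16) / GB:,.0f} GB")
print(f"INT4, full model:       {model_bytes(total_params, 4) / GB:,.0f} GB")
print(f"INT4, active per token: {model_bytes(active_params, 4) / GB:,.0f} GB")
```

Even at INT4 the full weight set far exceeds any single GPU's VRAM; the point is that MoE routing touches only a small fraction of the weights per token, which is what MoE-aware execution schemes like FlashMoE exploit to make do with one or two GPUs.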
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.
Provides deep optimizations for Intel XPUs (iGPUs, dGPUs, NPUs), demonstrated by token generation speed benchmarks showing significant performance gains on devices like Intel Core Ultra and Arc GPUs.
Supports low-bit precisions including FP8, FP6, FP4, and INT4, enabling reduced memory usage and faster inference, with accuracy metrics provided for models like Llama-2-7B-chat.
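To make the low-bit idea concrete, here is a minimal sketch of symmetric per-tensor INT4 quantization, the kind of scheme such libraries apply to weights. It is illustrative only, not IPEX-LLM's actual kernels, and the function names are our own:

```python
# Minimal sketch of symmetric per-tensor INT4 weight quantization
# (illustrative only; real libraries use optimized per-group kernels).

def quantize_int4(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7.0 or 1.0  # 7 = max positive INT4
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.31, 0.07, -0.26, 0.44]
q, scale = quantize_int4(weights)

print("codes:", q)
print("approx:", [round(w, 3) for w in dequantize(q, scale)])
# Storage: 4 bits per weight vs 16 for FP16, a 4x memory reduction.
print("bytes:", len(weights) // 2, "vs FP16:", len(weights) * 2)
```

Real deployments quantize per-channel or per-group and handle outlier weights separately to preserve accuracy; the 4x memory reduction relative to FP16 is the essential point.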
Verified with over 70 LLMs and multimodal models, including popular architectures like LLaMA, Mistral, Qwen, and MiniCPM, ensuring broad applicability.
Integrates with popular frameworks like HuggingFace Transformers, LangChain, vLLM, and Ollama, allowing easy adoption through quickstart guides and examples.
Intel has archived the project: it will receive no further maintenance, bug fixes, updates, or security patches, which makes it risky for long-term use.
Security vulnerabilities have been identified in the project; with no active maintenance, they will remain unpatched, posing significant risk in deployment.
Optimizations are specific to Intel XPUs, offering no benefits on AMD or NVIDIA hardware, which limits flexibility in heterogeneous environments.