A portable mixed-precision math library with 2,000+ SIMD kernels for 15+ numeric types across x86, Arm, RISC-V, and WebAssembly.
NumKong is a portable mixed-precision math and linear algebra library optimized for modern CPUs. It provides SIMD-accelerated kernels for distances, dot products, and matrix operations across 15+ numeric types, preventing the numerical overflow and instability common in low-precision computation while delivering up to 100x speedups over traditional BLAS libraries.
Developers and researchers working on high-performance computing, AI inference, vector search, and scientific simulations who need fast, numerically stable operations across diverse hardware platforms.
NumKong offers superior performance and smaller binaries than alternatives like OpenBLAS and MKL, with cross-platform SIMD support, no hidden allocations, and explicit control over parallelism—making it ideal for embedded systems, real-time applications, and multi-language projects.
SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐
Handles 15+ numeric types from 4-bit integers to 128-bit complex numbers, with automatic promotion to wider accumulators to prevent overflow, as evidenced by benchmark tables showing zero error for Int8 and minimal error for Float16.
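The accumulator-promotion idea is easy to demonstrate outside the library. A minimal NumPy sketch (not NumKong's API) of why Int8 dot products need wider accumulators:

```python
import numpy as np

# Two int8 vectors whose true dot product (1,000,000) far exceeds
# the int8 range [-128, 127].
a = np.full(100, 100, dtype=np.int8)
b = np.full(100, 100, dtype=np.int8)

# Keeping everything in int8 wraps modulo 256 and produces garbage.
overflowed = np.dot(a, b)

# Promoting to a wider (int32) accumulator yields the exact result.
exact = np.dot(a.astype(np.int32), b.astype(np.int32))  # 1_000_000
```

Libraries that promote automatically, as NumKong claims to, spare callers from writing the widening casts by hand.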
Optimized kernels for x86, Arm, RISC-V, LoongArch, Power, and WebAssembly leverage advanced ISA extensions like AMX and SME, delivering up to 100x speedups over traditional BLAS in benchmarks.
Avoids hidden allocations and thread pools, leaving memory management and parallelism to the caller, which ensures compatibility with arbitrary allocators and threading models, as described in the design philosophy.
Ships as a 5 MB binary, 5-100x smaller than alternatives like PyTorch or OpenBLAS, reducing install size for multi-language bindings across Python, Rust, JavaScript, and more.
Validated against 118-bit extended-precision baselines with compensated summation and saturation arithmetic, minimizing errors—shown in benchmarks where NumKong achieves lower error rates than NumPy or PyTorch for types like Float32.
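Compensated summation is a standard technique (Kahan's algorithm). A plain-Python sketch of the idea, independent of NumKong's SIMD implementation:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: tracks the low-order bits
    lost when adding a small term to a large running total."""
    total = 0.0
    compensation = 0.0  # running estimate of the rounding error
    for v in values:
        y = v - compensation
        t = total + y                    # low-order bits of y are lost here...
        compensation = (t - total) - y   # ...and recovered here
        total = t
    return total

# One large value followed by many small ones: naive float64 summation
# rounds every 1.0 away, while compensated summation retains them.
values = [1e16] + [1.0] * 1000
naive = sum(values)              # stays at 1e16
compensated = kahan_sum(values)  # recovers 1e16 + 1000
```

The compensation variable carries the error that saturation-free floating-point addition would otherwise silently discard, which is why error-vs-baseline benchmarks favor compensated kernels.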
Unlike BLAS libraries with built-in thread pools, NumKong requires callers to manually partition work using row-range parameters, adding complexity for multi-threaded applications without automatic load balancing.
Focuses on low-level kernels rather than comprehensive linear algebra suites; users must implement higher-level operations themselves or integrate with other libraries, as it lacks drop-in APIs for common frameworks.
Some operations, like mesh alignment and sparse products, are not available in all language bindings—e.g., JavaScript and Swift have gaps—limiting cross-ecosystem consistency as shown in the feature matrix.
The philosophy of no hidden allocations and explicit control means developers must handle memory packing and threading models, increasing initial setup time compared to libraries with automatic management.
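The caller-managed partitioning pattern described above can be sketched generically. This example uses NumPy as a stand-in for the kernel and a hypothetical `dot_rows` helper with row-range parameters; NumKong's actual signatures may differ:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def dot_rows(matrix, vector, row_start, row_end, out):
    """Hypothetical row-range kernel: the caller names the slice and
    provides the output memory. (NumPy stands in for the real kernel.)"""
    out[row_start:row_end] = matrix[row_start:row_end] @ vector

matrix = np.random.rand(1024, 256).astype(np.float32)
vector = np.random.rand(256).astype(np.float32)
out = np.empty(1024, dtype=np.float32)  # caller owns the output buffer

# Caller-side partitioning: split the row space across a thread pool
# the application controls, instead of a pool hidden inside the library.
n_workers = 4
bounds = np.linspace(0, matrix.shape[0], n_workers + 1, dtype=int)
with ThreadPoolExecutor(n_workers) as pool:
    futures = [
        pool.submit(dot_rows, matrix, vector, lo, hi, out)
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    for f in futures:
        f.result()  # propagate any worker exceptions
```

The upside of this extra setup is that the same kernels slot into any scheduler, whether a Rayon pool in Rust, OpenMP in C++, or an asyncio executor in Python.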
NumKong is an open-source alternative to the following products:
Intel Math Kernel Library (MKL) is a library of optimized math routines for scientific, engineering, and financial applications, including highly vectorized and threaded linear algebra, FFT, and vector math functions.
OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library for high-performance mathematical operations on CPUs.
Apple Accelerate is a macOS and iOS framework providing high-performance vector-accelerated math and digital signal processing libraries for optimized numerical computing.