A portable mixed-precision math library with 2,000+ SIMD kernels for 15+ numeric types across x86, Arm, RISC-V, and WebAssembly.
NumKong is a portable mixed-precision math and linear algebra library optimized for modern CPUs. It provides SIMD-accelerated kernels for distances, dot products, and matrix operations across 15+ numeric types, preventing the numerical overflow and instability common in low-precision computation while delivering up to 100x speedups over traditional BLAS libraries.
Developers and researchers working on high-performance computing, AI inference, vector search, and scientific simulations who need fast, numerically stable operations across diverse hardware platforms.
NumKong offers superior performance and smaller binaries than alternatives like OpenBLAS and MKL, with cross-platform SIMD support, no hidden allocations, and explicit control over parallelism—making it ideal for embedded systems, real-time applications, and multi-language projects.
SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types — from 6-bit floats to 64-bit complex — across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐
Handles 15+ numeric types from 4-bit integers to 128-bit complex numbers, with automatic promotion to wider accumulators to prevent overflow, as evidenced by benchmark tables showing zero error for Int8 and minimal error for Float16.
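The accumulator-promotion idea is easy to demonstrate outside the library. A minimal NumPy sketch (not NumKong's API) of why Int8 dot products need wider accumulators:

```python
import numpy as np

# Two int8 vectors whose true dot product (1,000,000) far exceeds
# the int8 range [-128, 127].
a = np.full(100, 100, dtype=np.int8)
b = np.full(100, 100, dtype=np.int8)

# Keeping everything in int8 wraps modulo 256 and produces garbage.
overflowed = np.dot(a, b)

# Promoting to a wider (int32) accumulator yields the exact result.
exact = np.dot(a.astype(np.int32), b.astype(np.int32))  # 1_000_000
```

Libraries that promote automatically, as NumKong claims to, spare callers from writing the widening casts by hand.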
Optimized kernels for x86, Arm, RISC-V, LoongArch, Power, and WebAssembly leverage advanced ISA extensions like AMX and SME, delivering up to 100x speedups over traditional BLAS in benchmarks.
Avoids hidden allocations and thread pools, leaving memory management and parallelism to the caller, which ensures compatibility with arbitrary allocators and threading models, as described in the design philosophy.
Ships as a 5 MB binary, 5-100x smaller than alternatives like PyTorch or OpenBLAS, reducing install size for multi-language bindings across Python, Rust, JavaScript, and more.
Validated against 118-bit extended-precision baselines with compensated summation and saturation arithmetic, minimizing errors—shown in benchmarks where NumKong achieves lower error rates than NumPy or PyTorch for types like Float32.
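Compensated summation is a standard technique (Kahan's algorithm). A plain-Python sketch of the idea, independent of NumKong's SIMD implementation:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: tracks the low-order bits
    lost when adding a small term to a large running total."""
    total = 0.0
    compensation = 0.0  # running estimate of the rounding error
    for v in values:
        y = v - compensation
        t = total + y                    # low-order bits of y are lost here...
        compensation = (t - total) - y   # ...and recovered here
        total = t
    return total

# One large value followed by many small ones: naive float64 summation
# rounds every 1.0 away, while compensated summation retains them.
values = [1e16] + [1.0] * 1000
naive = sum(values)              # stays at 1e16
compensated = kahan_sum(values)  # recovers 1e16 + 1000
```

The compensation variable carries the error that saturation-free floating-point addition would otherwise silently discard, which is why error-vs-baseline benchmarks favor compensated kernels.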
Unlike BLAS libraries with built-in thread pools, NumKong requires callers to manually partition work using row-range parameters, adding complexity for multi-threaded applications without automatic load balancing.
Focuses on low-level kernels rather than comprehensive linear algebra suites; users must implement higher-level operations themselves or integrate with other libraries, as it lacks drop-in APIs for common frameworks.
Some operations, like mesh alignment and sparse products, are not available in all language bindings—e.g., JavaScript and Swift have gaps—limiting cross-ecosystem consistency as shown in the feature matrix.
The philosophy of no hidden allocations and explicit control means developers must handle memory packing and threading models, increasing initial setup time compared to libraries with automatic management.
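The caller-managed partitioning pattern described above can be sketched generically. This example uses NumPy as a stand-in for the kernel and a hypothetical `dot_rows` helper with row-range parameters; NumKong's actual signatures may differ:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def dot_rows(matrix, vector, row_start, row_end, out):
    """Hypothetical row-range kernel: the caller names the slice and
    provides the output memory. (NumPy stands in for the real kernel.)"""
    out[row_start:row_end] = matrix[row_start:row_end] @ vector

matrix = np.random.rand(1024, 256).astype(np.float32)
vector = np.random.rand(256).astype(np.float32)
out = np.empty(1024, dtype=np.float32)  # caller owns the output buffer

# Caller-side partitioning: split the row space across a thread pool
# the application controls, instead of a pool hidden inside the library.
n_workers = 4
bounds = np.linspace(0, matrix.shape[0], n_workers + 1, dtype=int)
with ThreadPoolExecutor(n_workers) as pool:
    futures = [
        pool.submit(dot_rows, matrix, vector, lo, hi, out)
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    for f in futures:
        f.result()  # propagate any worker exceptions
```

The upside of this extra setup is that the same kernels slot into any scheduler, whether a Rayon pool in Rust, OpenMP in C++, or an asyncio executor in Python.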
NumKong is an open-source alternative to the following products:
Intel Math Kernel Library (MKL) is a library of optimized math routines for scientific, engineering, and financial applications, including highly vectorized and threaded linear algebra, FFT, and vector math functions.
OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library for high-performance mathematical operations on CPUs.
Apple Accelerate is a macOS and iOS framework providing high-performance vector-accelerated math and digital signal processing libraries for optimized numerical computing.