A next-generation Kaldi-based toolkit for offline speech-to-text, text-to-speech, and audio processing across 12 languages and diverse hardware.
Sherpa-onnx is an open-source toolkit for offline speech and audio processing, built on next-generation Kaldi and ONNX Runtime. It provides speech-to-text, text-to-speech, speaker diarization, and other audio AI functions that run entirely locally without an internet connection. The project solves the need for privacy-preserving, low-latency audio processing on devices ranging from servers to embedded systems and mobile platforms.
Developers building voice-enabled applications for embedded systems, mobile apps (Android/iOS), IoT devices, or desktop software where offline operation, data privacy, or cross-platform support is critical. It's also suited for researchers and hobbyists working on multilingual speech projects.
Developers choose Sherpa-onnx for its unique combination of offline capability, extensive hardware support (including NPUs), and broad language/API coverage. Unlike cloud-dependent services, it offers complete data privacy and reduces latency, while its cross-platform nature and pre-built models accelerate development for diverse deployment targets.
Speech-to-text, text-to-speech, speaker diarization, speech enhancement, source separation, and voice activity detection (VAD) using next-generation Kaldi with ONNX Runtime, all without an internet connection. It supports embedded systems, Android, iOS, HarmonyOS, Raspberry Pi, RISC-V, RK NPU, Axera NPU, Ascend NPU, and x86_64 servers; includes a WebSocket server/client; and offers APIs in 12 programming languages.
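To make the offline workflow concrete, here is a minimal sketch of non-streaming speech-to-text with the Python bindings. The file names (`encoder.onnx`, `decoder.onnx`, `joiner.onnx`, `tokens.txt`) are illustrative placeholders for a downloaded pre-trained transducer model, and `transducer_files` is a helper introduced here, not part of sherpa-onnx.

```python
from pathlib import Path

def transducer_files(model_dir):
    """Map a model directory to the files OfflineRecognizer.from_transducer
    expects. The file names below are placeholders; actual downloaded
    model archives use their own naming."""
    d = Path(model_dir)
    return {
        "encoder": str(d / "encoder.onnx"),
        "decoder": str(d / "decoder.onnx"),
        "joiner": str(d / "joiner.onnx"),
        "tokens": str(d / "tokens.txt"),
    }

def transcribe(samples, sample_rate, model_dir):
    """Decode a mono float32 waveform entirely on-device (no network).
    Requires the sherpa-onnx package and a downloaded model."""
    import sherpa_onnx  # deferred so this module imports without the package
    recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
        **transducer_files(model_dir)
    )
    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    recognizer.decode_stream(stream)
    return stream.result.text
```

Everything happens in-process: model loading, feature extraction, and decoding, which is what makes the toolkit viable for air-gapped or privacy-sensitive deployments.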
All processing runs locally without an internet connection, ensuring data privacy and low latency. This offline-first design is core to the project's philosophy and makes it a fit for privacy-sensitive and edge-computing scenarios.
Deploys on x86_64, ARM, RISC-V, Android, iOS, HarmonyOS, Raspberry Pi, and embedded NPUs such as RK, Axera, and Ascend, with detailed tables listing more than 10 supported platforms and architectures.
Offers bindings for 12 languages including C++, Python, JavaScript, Java, Go, Rust, and Swift, plus WebAssembly, making it accessible for diverse development stacks as shown in the supported languages table.
Supports speech recognition (streaming/non-streaming), TTS, speaker diarization, VAD, keyword spotting, and more across 12+ functions, all listed in the functions table with linked documentation.
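To illustrate the streaming vs. non-streaming distinction mentioned above, the following hedged sketch shows streaming recognition with the Python bindings: audio arrives in chunks and decoding happens incrementally. The model paths are placeholder parameters, and `chunk_waveform` is a helper written for this example, not a sherpa-onnx API.

```python
def chunk_waveform(samples, chunk_size):
    """Split a waveform into fixed-size chunks to simulate live capture
    (e.g. a microphone callback delivering audio incrementally)."""
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]

def transcribe_streaming(chunks, sample_rate, encoder, decoder, joiner, tokens):
    """Feed audio chunk by chunk and decode as data arrives.
    Requires the sherpa-onnx package and a downloaded streaming
    (online) transducer model."""
    import sherpa_onnx  # deferred so this module imports without the package
    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        encoder=encoder, decoder=decoder, joiner=joiner, tokens=tokens,
    )
    stream = recognizer.create_stream()
    for chunk in chunks:
        stream.accept_waveform(sample_rate, chunk)
        # Decode whatever frames are ready so partial results stay current.
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

The non-streaming (offline) recognizer instead consumes the full utterance at once; streaming trades some accuracy for the ability to emit partial results with low latency.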
Requires downloading and maintaining pre-trained models separately from the project's releases, which can be cumbersome for applications that need multiple languages or frequent model updates.
Integration with various NPUs and cross-platform builds involves significant configuration, as indicated by the extensive platform lists and fragmented documentation across multiple external links.
Models are fixed once downloaded and do not update automatically, so improvements that cloud-based services deliver transparently require manual intervention and redeployment here.