A Swift SDK for fully local, low-latency audio AI on Apple devices, including transcription, text-to-speech, voice activity detection, and speaker diarization.
FluidAudio is a Swift SDK that provides fully local, low-latency audio AI capabilities for Apple devices. It enables developers to integrate text-to-speech, speech-to-text, voice activity detection, and speaker diarization directly into their macOS and iOS apps, with inference optimized for the Apple Neural Engine to ensure privacy and efficiency.
iOS and macOS developers building applications that require private, on-device audio processing, such as dictation apps, meeting assistants, voice-controlled tools, and accessibility features.
Developers choose FluidAudio because it integrates tightly with Apple hardware, running state-of-the-art audio models entirely on-device for maximum privacy, low latency, and minimal battery impact compared with cloud-based alternatives.
Frontier CoreML audio models in your apps — text-to-speech, speech-to-text, voice activity detection, and speaker diarization. In Swift, powered by SOTA open source.
Runs inference on the Apple Neural Engine for maximum speed and minimal power consumption, with benchmarks showing ~190x real-time factor on M4 Pro for ASR.
All models process audio locally, ensuring no data leaves the device, which is critical for privacy-focused apps like dictation tools and meeting assistants.
Uses permissively licensed models from Hugging Face (MIT/Apache 2.0), allowing transparency and customization without vendor lock-in.
Integrates ASR, TTS, VAD, and speaker diarization in one SDK, reducing dependency on multiple libraries for end-to-end audio processing.
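Since the entry does not show installation, here is a minimal Swift Package Manager sketch for pulling the SDK into a project. The repository URL, version, and platform minimums below are assumptions, not confirmed by this entry; check the FluidAudio README for the current values.

```swift
// swift-tools-version:5.9
// Hypothetical Package.swift fragment. The repository URL, "from" version,
// and platform minimums are assumptions; verify against the FluidAudio docs.
import PackageDescription

let package = Package(
    name: "MyAudioApp",
    platforms: [.macOS(.v14), .iOS(.v17)],  // assumed minimums
    dependencies: [
        // Assumed repository location for the FluidAudio package.
        .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyAudioApp",
            dependencies: [.product(name: "FluidAudio", package: "FluidAudio")]
        )
    ]
)
```

With the dependency resolved, the ASR, TTS, VAD, and diarization APIs are all imported from the single `FluidAudio` module rather than from separate libraries.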
Exclusively supports macOS and iOS, with no native options for other operating systems, limiting cross-platform development.
Text-to-speech is currently in beta and supports only American English, hindering use in multilingual applications despite plans for expansion.
In restricted networks, model downloads may require proxy configuration or a custom registry URL, adding setup overhead that the documentation notes.
Speaker diarization pipelines have varying speeds; for example, the Pyannote pipeline is slower than LS-EEND, and real-time performance depends heavily on ANE-capable hardware.