A pipeline that combines OpenAI Whisper for speech-to-text with speaker diarization to identify who said what in audio.
Whisper Diarization is an open-source tool that performs Automatic Speech Recognition (ASR) with speaker diarization. It uses OpenAI's Whisper model to transcribe audio and combines it with voice activity detection and speaker embedding models from Nvidia NeMo to label each part of the transcript with the correct speaker. It solves the problem of creating accurate, speaker-identified transcripts from multi-speaker audio files like meetings or interviews.
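The core idea — labeling each transcribed word with a speaker — can be sketched as follows. This is an illustrative simplification, not the project's actual API: it assumes Whisper-style word timestamps and diarization segments as plain dicts, and assigns each word to the speaker segment with the largest time overlap.

```python
# Hypothetical sketch: merge word timestamps with speaker segments by overlap.
# Data shapes are assumptions for illustration, not the repo's real structures.

def assign_speakers(words, speaker_segments):
    """words: [{'word': str, 'start': float, 'end': float}]
    speaker_segments: [{'speaker': str, 'start': float, 'end': float}]
    Returns words annotated with the best-overlapping speaker."""
    labeled = []
    for w in words:
        best, best_overlap = "unknown", 0.0
        for seg in speaker_segments:
            # Overlap of the two intervals; negative means no overlap.
            overlap = min(w["end"], seg["end"]) - max(w["start"], seg["start"])
            if overlap > best_overlap:
                best, best_overlap = seg["speaker"], overlap
        labeled.append({**w, "speaker": best})
    return labeled

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "hi", "start": 1.1, "end": 1.3},
]
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 2.0},
]
print(assign_speakers(words, segments))
# Each word is tagged with the speaker whose segment overlaps it most.
```

In practice the pipeline's VAD and embedding models produce the speaker segments, and the forced aligner refines the word timings before this kind of merge happens.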
It is aimed at developers, researchers, and data scientists working with audio data who need to transcribe conversations and identify participants, for example in computational linguistics, media analysis, or automated note-taking applications.
It provides a ready-to-use, integrated pipeline that combines best-in-class ASR with speaker identification, saving significant development time. Being open-source and self-hostable, it offers a free and customizable alternative to proprietary transcription services that include speaker diarization.
Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
Leverages OpenAI's Whisper for state-of-the-art speech recognition, providing reliable transcription as the foundation for diarization.
Combines NVIDIA NeMo's TitaNet for speaker embeddings and MarbleNet for voice activity detection (VAD), forming a complete speaker-identification stack.
Uses ctc-forced-aligner and punctuation-restoration models to align timestamps, minimizing diarization errors caused by time shifts between the transcript and the speaker segments.
Offers an experimental parallel processing script for systems with sufficient VRAM, speeding up inference by running models concurrently.
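The punctuation-based realignment mentioned above can be sketched roughly like this. It is a simplified illustration, not the repository's actual logic: after punctuation is restored, words are grouped into sentences and each sentence takes its majority speaker, smoothing spurious single-word speaker flips.

```python
# Hypothetical sketch of sentence-level speaker smoothing; the real repo's
# realignment logic is more involved. Data shapes are assumed for illustration.
from collections import Counter

SENTENCE_END = (".", "?", "!")

def realign_by_sentence(labeled_words):
    """labeled_words: [{'word': str, 'speaker': str}], punctuation attached.
    Splits on sentence-ending punctuation and applies the majority speaker."""
    sentences, current = [], []
    for w in labeled_words:
        current.append(w)
        if w["word"].endswith(SENTENCE_END):
            sentences.append(current)
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(current)
    out = []
    for sent in sentences:
        majority = Counter(w["speaker"] for w in sent).most_common(1)[0][0]
        out.extend({**w, "speaker": majority} for w in sent)
    return out

words = [
    {"word": "How", "speaker": "SPEAKER_00"},
    {"word": "are", "speaker": "SPEAKER_01"},  # spurious mid-sentence flip
    {"word": "you?", "speaker": "SPEAKER_00"},
]
print(realign_by_sentence(words))
# The mid-sentence flip is smoothed to the sentence's majority speaker.
```

This is why punctuation restoration matters for diarization quality: sentence boundaries give a natural unit over which speaker labels can be made consistent.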
Does not handle overlapping speech: the README explicitly states this limitation, making it unreliable for conversations with simultaneous speakers, a common real-world scenario.
Requires manual installation of FFmpeg and Cython and a compatible Python version, with dependencies pinned via constraints.txt, which can make setup error-prone.
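A typical setup sequence, reconstructed from the steps the README describes (exact commands may vary by platform and repo version, so verify against the project before use):

```shell
# Assumed setup sketch, not an authoritative install guide.
# 1. Install FFmpeg (platform-dependent; Ubuntu example shown):
sudo apt update && sudo apt install -y ffmpeg

# 2. Install Cython first, since some dependencies build against it:
pip install cython

# 3. Install pinned dependencies using the repo's constraints file:
pip install -c constraints.txt -r requirements.txt
```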
Optimal performance, especially with parallel processing, demands a GPU with at least 10 GB of VRAM, limiting accessibility on standard or resource-constrained systems.
The parallel processing script is marked as experimental with warnings of potential errors, indicating instability for production use.
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.