Fast automatic speech recognition with accurate word-level timestamps and speaker diarization, built on OpenAI's Whisper.
WhisperX is an enhanced automatic speech recognition system that builds upon OpenAI's Whisper model to provide fast, batched transcription with accurate word-level timestamps and speaker diarization. It solves the problem of imprecise timestamps and lack of speaker identification in long-form audio, making it ideal for creating subtitles, meeting transcripts, and other time-aligned text outputs. The system uses forced phoneme alignment and voice activity detection to improve accuracy and reduce hallucinations.
Developers and researchers working on audio transcription projects, such as subtitle generation, meeting analysis, podcast indexing, or any application requiring precise time-aligned speech-to-text with speaker identification. It's particularly useful for those needing faster-than-realtime processing of large audio datasets.
Developers choose WhisperX because it offers significantly faster inference (up to 70x realtime) compared to standard Whisper, along with word-level timestamp accuracy and built-in speaker diarization. Its open-source nature, efficient batching, and support for multiple languages provide a production-ready alternative to proprietary ASR services.
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
Uses batched inference with the faster-whisper backend to reach up to 70x realtime transcription speed, making large-scale audio processing practical.
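The batching idea can be sketched in pure Python: voice-activity segments are merged into chunks capped at Whisper's roughly 30-second input window, then grouped into fixed-size batches for parallel inference. The function names and merging policy below are illustrative assumptions, not WhisperX's actual code:

```python
def merge_vad_segments(segments, max_len=30.0):
    """Merge consecutive VAD speech segments (start, end) into chunks
    no longer than max_len seconds, so each chunk fits one Whisper window."""
    chunks = []
    cur_start, cur_end = segments[0]
    for start, end in segments[1:]:
        if end - cur_start <= max_len:
            cur_end = end  # extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    chunks.append((cur_start, cur_end))
    return chunks

def batched(chunks, batch_size=8):
    """Yield chunks in fixed-size batches for parallel inference."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]

speech = [(0.0, 4.2), (5.1, 12.8), (13.0, 29.5), (31.0, 40.0)]
print(merge_vad_segments(speech))  # two chunks: one ~30s, one ~9s
```

Because silence between speech segments is dropped before chunking, the model sees denser audio per forward pass, which is where much of the speedup comes from.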
Employs forced phoneme alignment with models such as wav2vec2 to produce accurate per-word timing, correcting the imprecision of Whisper's utterance-level timestamps for subtitling and other time-aligned outputs.
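Once a forced aligner has assigned a time span to each character of the transcript, per-word timestamps are simply the span of each word's characters. A toy sketch of that final step, with names and data layout assumed for illustration rather than taken from WhisperX's internals:

```python
def word_timestamps(text, char_times):
    """Given a transcript and one (start, end) time per character
    (e.g. from a CTC forced alignment), return per-word timestamps."""
    words, start_idx = [], None
    for i, ch in enumerate(text + " "):  # sentinel space flushes the last word
        if ch != " " and start_idx is None:
            start_idx = i            # first character of a new word
        elif ch == " " and start_idx is not None:
            words.append({
                "word": text[start_idx:i],
                "start": char_times[start_idx][0],  # first char's start
                "end": char_times[i - 1][1],        # last char's end
            })
            start_idx = None
    return words

times = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.4), (0.4, 0.5)]
print(word_timestamps("hi yo", times))
```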
Incorporates pyannote.audio to automatically label transcript segments and words with speaker IDs, useful for meeting analysis and multimedia indexing without external tools.
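A common way to combine diarization output with aligned words, and roughly what this step amounts to, is to give each word the speaker whose diarization turn overlaps it most. A minimal sketch; the function name and data shapes are hypothetical, not WhisperX's API:

```python
def assign_speaker(word, turns):
    """Pick the speaker whose turn has maximal temporal overlap with a word.
    word: dict with 'start'/'end' in seconds; turns: (start, end, speaker)."""
    best, best_overlap = None, 0.0
    for start, end, speaker in turns:
        overlap = min(word["end"], end) - max(word["start"], start)
        if overlap > best_overlap:  # negative overlap means no intersection
            best, best_overlap = speaker, overlap
    return best

turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 9.0, "SPEAKER_01")]
word = {"word": "hello", "start": 4.2, "end": 5.4}
print(assign_speaker(word, turns))  # prints SPEAKER_00 (0.8s vs 0.4s overlap)
```

Maximal-overlap assignment keeps words stable across small boundary errors, but it is also why diarization mistakes at turn boundaries propagate into the transcript.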
Requires less than 8GB GPU memory for the large-v2 model with default settings, making it accessible on consumer-grade hardware for cost-effective deployment.
The README itself concedes that speaker diarization is 'far from perfect' and points users to third-party services such as Recall.ai for accurate speaker identification, which limits reliability in critical applications.
Words containing symbols or numbers may not be aligned properly, as they fall outside the alignment model's dictionary, affecting timestamp accuracy for technical or financial content.
Enabling speaker diarization requires a Hugging Face token and acceptance of the pyannote model's user agreement, adding configuration and dependency overhead to deployment.
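For the alignment-dictionary limitation above (words containing symbols or numbers), a typical fallback is to interpolate timing for unalignable tokens from the nearest aligned neighbours. The sketch below is illustrative only, under assumed data shapes, and is not WhisperX's actual implementation:

```python
def interpolate_gaps(words):
    """Fill missing word timestamps (None) using the nearest aligned
    neighbours: previous word's end and next word's start."""
    out = [dict(w) for w in words]  # work on copies
    for i, w in enumerate(out):
        if w["start"] is not None:
            continue
        # nearest aligned neighbour on each side (defaults for edge cases)
        prev_end = next((out[j]["end"] for j in range(i - 1, -1, -1)
                         if out[j]["end"] is not None), 0.0)
        next_start = next((out[j]["start"] for j in range(i + 1, len(out))
                           if out[j]["start"] is not None), prev_end)
        w["start"], w["end"] = prev_end, next_start
    return out

words = [
    {"word": "costs", "start": 1.0, "end": 1.4},
    {"word": "$5,000", "start": None, "end": None},  # outside the dictionary
    {"word": "today", "start": 2.1, "end": 2.5},
]
print(interpolate_gaps(words))  # "$5,000" spans 1.4 to 2.1
```

Interpolated spans are approximate by construction, which is why timestamp accuracy degrades on symbol-heavy technical or financial content.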