A Python library that extends OpenAI's Whisper to provide accurate word-level timestamps and confidence scores for multilingual speech recognition.
whisper-timestamped is a Python library that enhances OpenAI's Whisper speech recognition model by adding precise word-level timestamps and confidence scores to transcriptions. It solves the problem of accurately aligning transcribed text to the audio timeline, which is essential for creating subtitles, analyzing speech patterns, or editing video content. The library applies Dynamic Time Warping to Whisper's cross-attention weights to infer word boundaries, without requiring additional language-specific models.
Developers and researchers working on applications that require precise audio-text alignment, such as subtitle generation, video editing tools, speech analytics platforms, or any project leveraging Whisper for transcription with timing metadata.
It provides a robust, efficient, and language-agnostic method for word-level timestamping directly integrated with Whisper, avoiding the overhead and limitations of alternative approaches like wav2vec models. The inclusion of confidence scores and optional Voice Activity Detection further enhances transcription reliability.
Multilingual Automatic Speech Recognition with word-level timestamps and confidence
Uses Dynamic Time Warping on Whisper's cross-attention weights to predict precise start and end times for each word; as the README's comparison with alternatives highlights, this avoids the need for separate language-specific alignment models.
Assigns a confidence score to every word and segment, and emits these reliability metrics in the JSON output alongside each word's timestamps, as shown in the README's example output.
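Word entries are nested under each segment in the JSON output. A sketch of pulling out low-confidence words from a result shaped like the README's example (the numeric values here are illustrative, and `low_confidence_words` is a hypothetical helper, not part of the library):

```python
# Excerpt shaped like whisper-timestamped's JSON output (values illustrative).
result = {
    "text": " Bonjour! Est-ce que vous allez bien?",
    "segments": [
        {
            "id": 0,
            "start": 0.29,
            "end": 1.28,
            "text": " Bonjour!",
            "confidence": 0.51,
            "words": [
                {"text": "Bonjour!", "start": 0.29, "end": 0.57, "confidence": 0.26},
            ],
        }
    ],
}

def low_confidence_words(result, threshold=0.5):
    """Flatten segments and return words whose confidence falls below threshold."""
    return [
        (w["text"], w["start"], w["end"], w["confidence"])
        for seg in result["segments"]
        for w in seg["words"]
        if w["confidence"] < threshold
    ]

print(low_confidence_words(result))  # [('Bonjour!', 0.29, 0.57, 0.26)]
```

Flagging such words for human review is a common way to use the scores in subtitle workflows.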
Optional VAD pre-processing removes non-speech segments to reduce hallucinations, with support for multiple methods like silero and auditok, demonstrated in the plot examples.
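The idea behind VAD pre-processing can be illustrated with a crude energy-based sketch; this is only the concept, not the library's silero or auditok backends, which use learned models rather than a fixed threshold (the function name and threshold are made up):

```python
import numpy as np

def crude_vad(samples, sr=16000, frame_ms=30, threshold=0.02):
    """Conceptual VAD: return (start, end) sample ranges whose frames
    exceed an RMS-energy threshold; everything else is treated as silence."""
    frame = int(sr * frame_ms / 1000)
    ranges, active_start = [], None
    for i in range(0, len(samples) - frame + 1, frame):
        rms = np.sqrt(np.mean(samples[i:i + frame] ** 2))
        if rms >= threshold and active_start is None:
            active_start = i
        elif rms < threshold and active_start is not None:
            ranges.append((active_start, i))
            active_start = None
    if active_start is not None:
        ranges.append((active_start, len(samples)))
    return ranges

# 1 s of silence, 1 s of noise standing in for speech, 1 s of silence.
rng = np.random.default_rng(0)
sr = 16000
sig = np.concatenate([np.zeros(sr), 0.1 * rng.standard_normal(sr), np.zeros(sr)])
print(crude_vad(sig, sr))  # one range near (16000, 32000)
```

Feeding only the detected ranges to the recognizer is what suppresses hallucinations on long silent stretches.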
Leverages Whisper's native multilingual capabilities without additional per-language models, ensuring broad support while maintaining memory efficiency for long audio files.
Supports JSON, CSV, SRT, VTT, and TSV with integrated word timestamps, making it easy to generate subtitles and other aligned outputs for diverse applications.
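The per-word timestamps map directly onto subtitle cues. A sketch of rendering one SRT cue per word (`srt_time` and `words_to_srt` are hypothetical helpers for illustration; the library can write SRT itself):

```python
def srt_time(t):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words):
    """Render one numbered SRT cue per word dict ({'text', 'start', 'end'})."""
    cues = []
    for i, w in enumerate(words, 1):
        cues.append(f"{i}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{w['text']}\n")
    return "\n".join(cues)

print(words_to_srt([{"text": "Bonjour!", "start": 0.29, "end": 0.57}]))
```

In practice one would group words into phrase-length cues rather than one cue per word, but the timestamp plumbing is the same.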
The README includes a disclaimer that the approach may 'significantly impact performance': the DTW post-processing adds computational cost on top of base Whisper inference, reducing transcription speed.
Requires managing several optional dependencies (e.g., matplotlib for plotting, torchaudio and onnxruntime for VAD), and CPU-only installation involves extra steps, which increases setup complexity.
Default decoding options favor efficiency over accuracy by disabling beam search and temperature sampling, so users must manually enable accurate mode for best transcription quality.
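As a sketch, the accurate preset roughly corresponds to passing beam-search and temperature-fallback options through to Whisper's decoder (the parameter names below are openai-whisper's `transcribe()` options; `ACCURATE` is just an illustrative name, and the exact preset should be checked against the CLI's `--accurate` flag):

```python
# Decoding options approximating the "accurate" preset: beam search
# plus a ladder of fallback temperatures instead of greedy decoding.
ACCURATE = {
    "beam_size": 5,
    "best_of": 5,
    "temperature": tuple(t / 10 for t in range(0, 11, 2)),  # 0.0 .. 1.0 in 0.2 steps
}

# Would be spliced into the call made earlier, e.g.:
# result = whisper.transcribe(model, audio, **ACCURATE)
print(ACCURATE["temperature"])  # (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
```

The trade-off is the usual one: beam search and retries at higher temperatures cost extra decoding passes in exchange for fewer transcription errors.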
Described as 'intended for experimental purposes,' which may imply less stability or more frequent breaking changes than mature speech recognition tools.