A joint audio tagging and speech recognition model that adds audio event detection to OpenAI Whisper with minimal computational overhead.
Whisper-AT is an extension of OpenAI's Whisper speech recognition model that adds audio event tagging capabilities. It can transcribe speech while simultaneously identifying non-speech audio events like music, animal sounds, or environmental noises from 527 AudioSet categories. The model solves the problem of needing separate systems for speech recognition and audio content analysis by providing both functions in a single efficient model.
It is aimed at researchers and developers building audio understanding applications, particularly those that need both speech transcription and environmental sound recognition in noisy real-world audio.
Developers choose Whisper-AT because it provides audio event detection with negligible computational overhead compared to standalone Whisper, uses the same familiar API, and offers insights into how robust ASR models represent non-speech sounds.
Code and Pretrained Models for Interspeech 2023 Paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers"
Adds audio tagging with less than 1% extra FLOPs compared to standalone Whisper, making dual-task processing highly efficient, as noted in the README's performance tables.
Uses the exact same load_model and transcribe functions as OpenAI Whisper, allowing seamless migration from existing Whisper implementations with just an added at_time_res parameter.
Supports adjustable time windows for audio tagging (e.g., every 0.4 or 10 seconds), enabling tailored analysis for different applications, though limited to integer multiples of 0.4s.
Available in English-only and multilingual variants across tiny to large model sizes, providing flexibility for diverse language needs and computational budgets.
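Builds on the paper's finding that Whisper, although robust to background noise in its transcriptions, does not learn noise-invariant representations: its intermediate features remain strongly correlated with non-speech sounds, which is what makes them useful for audio event detection. The finding is validated through peer-reviewed experiments detailed in the paper.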
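The drop-in usage described above can be sketched roughly as follows. This is a minimal illustration, not the project's reference code: it assumes the package is installed and importable as whisper_at, that "tiny.en" is one of the available model names, and that "audio.wav" is a placeholder path. The wrapper transcribe_with_tags is hypothetical; only load_model, transcribe, and at_time_res come from the description above. How the tagging output is read from the result is left to the project's README.

```python
def transcribe_with_tags(audio_path, model_name="tiny.en", at_time_res=10.0):
    """Run speech recognition and audio tagging in one pass.

    Hypothetical convenience wrapper, not part of Whisper-AT itself.
    """
    if at_time_res <= 0:
        raise ValueError("at_time_res must be a positive number of seconds")

    # Assumption: the package is installed and exposes the Whisper-style API
    # under this module name. Imported lazily so the check above runs first.
    import whisper_at as whisper

    model = whisper.load_model(model_name)
    # Same call as OpenAI Whisper, plus the audio tagging window in seconds.
    return model.transcribe(audio_path, at_time_res=at_time_res)


# Example call (requires the package and a real audio file):
#   result = transcribe_with_tags("audio.wav", at_time_res=10.0)
#   print(result["text"])  # transcript, as in standard Whisper
```

Existing Whisper code would only need the import swapped and the extra keyword argument added, which is the migration path the listing describes.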
Audio tagging windows must be integer multiples of 0.4 seconds, restricting flexibility for applications needing arbitrary or finer-grained timing control.
Mac and Windows users face a known bug requiring a manual workaround to install dependencies separately, adding unnecessary setup complexity compared to standard pip installs.
With mAP scores around 35-42% on AudioSet, it may underperform compared to state-of-the-art standalone audio tagging models, as admitted in the performance comparisons.
Relies on frozen Whisper parameters, so improvements in newer Whisper versions or custom ASR fine-tuning are not picked up automatically; the TL-TR (time- and layer-wise Transformer) tagging module has to be retrained on top of the new backbone.
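Because tagging windows must be integer multiples of 0.4 seconds, callers who start from an arbitrary desired window may want to check or snap the value before passing it as at_time_res. The helpers below are a hypothetical sketch of that arithmetic; the names are illustrative and not part of Whisper-AT, and no upper bound on the window is assumed here.

```python
STEP_MS = 400  # tagging resolution step (0.4 s), expressed in milliseconds


def is_valid_time_res(seconds):
    """Return True if `seconds` is a positive integer multiple of 0.4 s."""
    ms = round(seconds * 1000)  # integer milliseconds avoid float remainders
    return ms > 0 and ms % STEP_MS == 0


def nearest_valid_time_res(seconds):
    """Snap `seconds` to the closest positive multiple of 0.4 s."""
    ms = round(seconds * 1000)
    steps = max(1, round(ms / STEP_MS))  # never go below one step
    return steps * STEP_MS / 1000


print(is_valid_time_res(10))        # 10 s is 25 steps of 0.4 s
print(nearest_valid_time_res(1.1))  # snaps to 3 steps
```

Working in integer milliseconds sidesteps the floating-point remainder problems that a direct `seconds % 0.4` check would run into.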
Port of OpenAI's Whisper model in C/C++
Faster Whisper transcription with CTranslate2
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)