Whisper-AT

BSD-2-Clause · Python

A joint audio tagging and speech recognition model that adds audio event detection to OpenAI Whisper with minimal computational overhead.

418 stars · 36 forks · 0 contributors

What is Whisper-AT?

Whisper-AT is an extension of OpenAI's Whisper speech recognition model that adds audio event tagging capabilities. It can transcribe speech while simultaneously identifying non-speech audio events like music, animal sounds, or environmental noises from 527 AudioSet categories. The model solves the problem of needing separate systems for speech recognition and audio content analysis by providing both functions in a single efficient model.

Target Audience

Researchers and developers working on audio understanding applications, particularly those needing both speech transcription and environmental sound recognition in noisy real-world audio.

Value Proposition

Developers choose Whisper-AT because it provides audio event detection with negligible computational overhead compared to standalone Whisper, uses the same familiar API, and offers insights into how robust ASR models represent non-speech sounds.

Overview

Code and Pretrained Models for Interspeech 2023 Paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers"

Use Cases

Best For

  • Adding audio event detection to existing Whisper-based transcription pipelines
  • Analyzing audio content where both speech and environmental sounds matter
  • Research on noise-robust speech recognition representations
  • Building multimodal applications that need audio scene understanding
  • Educational projects exploring audio AI model architectures
  • Content moderation systems that need to detect both speech and sound events

Not Ideal For

  • Real-time or streaming audio processing applications requiring low-latency analysis
  • Projects needing audio event timing finer than 0.4 seconds or arbitrary window lengths (resolution is limited to integer multiples of 0.4 seconds)
  • Systems where audio tagging accuracy is the sole priority, as mAP scores (35-42%) may lag behind specialized models
  • Edge deployments with strict memory constraints, due to large model sizes even with low-compute variants

Pros & Cons

Pros

Minimal Computational Overhead

Adds audio tagging with less than 1% extra FLOPs compared to standalone Whisper, making dual-task processing highly efficient, as noted in the README's performance tables.

Drop-in API Compatibility

Uses the exact same load_model and transcribe functions as OpenAI Whisper, allowing seamless migration from existing Whisper implementations with just an added at_time_res parameter.
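As a rough illustration of the drop-in usage described above, the sketch below wraps the load_model and transcribe calls together with the at_time_res parameter. It assumes the package is importable as whisper_at, which this page does not state. The helper name transcribe_with_tags is hypothetical, and because the structure of the returned result is not described here, the sketch returns it unmodified.

```python
def transcribe_with_tags(audio_path, model_name, at_time_res=10):
    """Run speech recognition and audio tagging in one call.

    Hypothetical helper. Assumes the package is importable as `whisper_at`
    (an assumption) and exposes the Whisper-style `load_model` and
    `transcribe` functions named in the text above.
    """
    # Imported lazily so this sketch can be loaded without the package installed.
    import whisper_at

    model = whisper_at.load_model(model_name)
    # at_time_res is the audio tagging window in seconds; per the text above,
    # it should be an integer multiple of 0.4.
    return model.transcribe(audio_path, at_time_res=at_time_res)
```

The caller passes the model name explicitly, since the exact set of available checkpoint names is not listed on this page.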

Configurable Temporal Resolution

Supports adjustable time windows for audio tagging (e.g., every 0.4 or 10 seconds), enabling tailored analysis for different applications, though limited to integer multiples of 0.4s.
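The 0.4-second constraint can be checked before calling the model. The following is a minimal sketch, not part of the library: the function names and the BASE_MS constant are hypothetical, and it works in integer milliseconds to avoid floating-point remainders.

```python
BASE_MS = 400  # 0.4 s, the base tagging window stated in the text above


def _to_ms(seconds):
    # Convert to whole milliseconds so the divisibility check is exact.
    return round(seconds * 1000)


def is_valid_at_time_res(seconds):
    """Return True if `seconds` is a positive integer multiple of 0.4 s."""
    ms = _to_ms(seconds)
    return ms > 0 and ms % BASE_MS == 0


def nearest_valid_at_time_res(seconds):
    """Snap a requested window to the closest valid multiple of 0.4 s."""
    steps = max(1, round(_to_ms(seconds) / BASE_MS))
    return steps * BASE_MS / 1000
```

A caller could, for example, snap a requested 1.1-second window to the nearest valid value before passing it as at_time_res.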

Multilingual and Scalable

Available in English-only and multilingual variants across tiny to large model sizes, providing flexibility for diverse language needs and computational budgets.

Research-Backed Insights

Builds on the paper's finding that Whisper's noise-robust representations are not noise-invariant: they retain information about background sounds, which makes them useful for audio event detection. The experiments are detailed in the peer-reviewed Interspeech 2023 paper.

Cons

Fixed Time Resolution Limits

Audio tagging windows must be integer multiples of 0.4 seconds, restricting flexibility for applications needing arbitrary or finer-grained timing control.

Platform-Specific Installation Issues

Mac and Windows users face a known bug requiring a manual workaround to install dependencies separately, adding unnecessary setup complexity compared to standard pip installs.

Moderate Audio Tagging Performance

With mAP scores of roughly 35-42% on AudioSet, it may trail state-of-the-art standalone audio tagging models, as the project's own performance comparisons acknowledge.

Frozen Backbone Dependency

Relies on frozen Whisper parameters, so improvements in newer Whisper versions or custom ASR fine-tuning are not picked up automatically; the TL-TR (time- and layer-wise Transformer) tagging module must be retrained on top of the updated backbone.

Quick Stats

Stars: 418
Forks: 36
Contributors: 0
Open Issues: 26
Last commit: 2 years ago
Created: 2023

Tags

#transformer #python-library #audio-tagging #speech-recognition #audio-classification #audio-processing #multimodal-ai #machine-learning #audio

Built With

Python
Whisper
PyTorch

Included in

Whisper · 2.2k
Auto-fetched 1 day ago

Related Projects

whisper.cpp

Port of OpenAI's Whisper model in C/C++

Stars: 50,282
Forks: 5,592
Last commit: 2 days ago
faster-whisper

Faster Whisper transcription with CTranslate2

Stars: 23,264
Forks: 1,906
Last commit: 6 months ago
WhisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

Stars: 22,167
Forks: 2,284
Last commit: 6 days ago