Question 1

How do I install Whisper-AT on Windows?

Accepted Answer

The README describes a workaround: first install every dependency except triton with pip (e.g., numba, torch), then install whisper-at itself without its dependencies (pip's --no-deps flag). The workaround exists because of a known problem with the triton package on Mac and Windows.

Question 2

What sounds can Whisper-AT detect?

Accepted Answer

It detects 527 types of audio events from the AudioSet taxonomy, including music, speech, animal noises, and environmental sounds. A full label list is provided in the repository's audioset_label.csv file.
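To look up a label by its class index, a small sketch along these lines could work. It assumes the label file follows the usual AudioSet layout of index, mid, display_name columns; the actual columns in the repository's file may differ, and the two sample rows and the load_labels helper are only illustrative.

```python
import csv
import io

# Illustrative sample in the common AudioSet "index,mid,display_name" layout.
# Assumption: the repository's label file uses these columns; check the real file.
SAMPLE_CSV = '''index,mid,display_name
0,/m/09x0r,"Speech"
137,/m/04rlf,"Music"
'''


def load_labels(csv_text):
    """Map class index -> human-readable label name (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["index"]): row["display_name"] for row in reader}


labels = load_labels(SAMPLE_CSV)
print(labels[0])
```

With the real file, the same helper could be fed the file's text content instead of the inline sample.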

Question 3

Can I use Whisper-AT for live audio transcription?

Accepted Answer

No. It processes audio in fixed segments (30-second windows by default) and is not designed for real-time streaming. As the windowing approach in the transcribe function suggests, it is better suited to batch processing of recorded files.
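To illustrate why fixed windows fit batch work better than streaming, here is a simplified sketch of cutting a recording into 30-second segments. The fixed_windows helper is hypothetical and not part of Whisper-AT; the real transcribe function may seek through the audio differently.

```python
def fixed_windows(duration_s, window_s=30.0):
    """Return (start, end) pairs, in seconds, covering a recording.

    Simplified illustration: every window but the last is exactly
    window_s long, and the last one is cut at the end of the audio.
    """
    if duration_s < 0 or window_s <= 0:
        raise ValueError("duration must be >= 0 and window must be > 0")
    windows = []
    start = 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        start += window_s
    return windows


# A 75-second file yields two full windows and one shorter final window.
print(fixed_windows(75.0))
```

Because each window needs its full span of audio before it can be processed, output lags the input by up to one window, which is what makes this approach a poor fit for live use.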

Question 4

How does Whisper-AT compare with running a separate Whisper model and audio tagging model?

Accepted Answer

Whisper-AT is more computationally efficient: audio tagging adds minimal overhead on top of transcription. For maximum audio tagging accuracy, however, a specialized model might outperform it. The trade-off is efficiency versus peak performance in audio event detection.

Question 5

How does the at_time_res parameter affect results?

Accepted Answer

It sets the window size, in seconds, for audio tagging predictions. Smaller values give finer time resolution but may reduce accuracy, because the model was trained with 10-second windows (the default) for best performance on AudioSet.
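A rough sketch of how the setting plays out in practice. The num_tag_predictions helper is hypothetical and only counts how many tagging windows a recording would produce. The transcribe_with_tags wrapper assumes the package is imported as whisper_at, exposes load_model, and accepts at_time_res in transcribe; the model name is likewise an assumption, so check the README before relying on any of it.

```python
import math


def num_tag_predictions(duration_s, at_time_res=10.0):
    """Count the tagging windows needed to cover duration_s seconds."""
    if at_time_res <= 0:
        raise ValueError("at_time_res must be positive")
    return math.ceil(duration_s / at_time_res)


def transcribe_with_tags(audio_path, at_time_res=10.0, model_name="large-v1"):
    """Hypothetical wrapper; the import name, load_model, and model name are assumptions."""
    import whisper_at as whisper  # imported lazily so the helper above runs without it

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, at_time_res=at_time_res)


# One minute of audio: the default gives 6 predictions, a 4-second window gives 15.
print(num_tag_predictions(60, 10), num_tag_predictions(60, 4))
```

More predictions mean finer timing for each detected event, but each one is based on less audio than the 10-second windows the model was trained on.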

Question 6

Is Whisper-AT good for transcribing podcasts with background music?

Accepted Answer

Yes. It builds on Whisper's noise-robust ASR and tags the background sounds at the same time, which makes it well suited to content where both speech and audio events matter.
