Open-Awesome


whisper-diarization

BSD-2-Clause · Jupyter Notebook

A pipeline that combines OpenAI Whisper for speech-to-text with speaker diarization to identify who said what in audio.

GitHub
5.5k stars · 498 forks · 0 contributors

What is whisper-diarization?

Whisper Diarization is an open-source tool that performs Automatic Speech Recognition (ASR) with speaker diarization. It uses OpenAI's Whisper model to transcribe audio and combines it with voice activity detection and speaker embedding models from Nvidia NeMo to label each part of the transcript with the correct speaker. It solves the problem of creating accurate, speaker-identified transcripts from multi-speaker audio files like meetings or interviews.
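As a rough sketch of how the pipeline is invoked, the project's README describes a single command-line entry point. The exact flag names below (`-a`, `--whisper-model`) follow the README's conventions but should be verified against the current documentation before use:

```shell
# Hypothetical invocation of the diarization pipeline on a local audio file.
# Verify flag names against the project's current README before running.
python diarize.py -a meeting.wav --whisper-model medium.en
```

The script transcribes the audio with Whisper, runs NeMo's VAD and speaker-embedding models, and writes a speaker-labeled transcript alongside the input file.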

Target Audience

Developers, researchers, and data scientists working with audio data who need to transcribe conversations and identify participants, such as in computational linguistics, media analysis, or automated note-taking applications.

Value Proposition

It provides a ready-to-use, integrated pipeline that combines best-in-class ASR with speaker identification, saving significant development time. Being open-source and self-hostable, it offers a free and customizable alternative to proprietary transcription services that include speaker diarization.

Overview

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

Use Cases

Best For

  • Transcribing multi-speaker meetings and generating speaker-labeled minutes
  • Analyzing interview recordings to attribute quotes accurately
  • Processing podcast episodes to identify host and guest segments
  • Creating accessible transcripts for video content with multiple participants
  • Academic research involving conversational analysis and speaker turn-taking
  • Building custom audio processing applications that require speaker-aware transcription
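For the last use case, a downstream application typically consumes the speaker-labeled segments the pipeline produces. The snippet below is a minimal sketch of turning such segments into speaker-attributed minutes; the `(start, end, speaker, text)` tuple format and the sample data are illustrative assumptions, not the tool's actual output schema:

```python
# Sketch: merging consecutive same-speaker segments into labeled turns.
# The segment format (start, end, speaker, text) is an assumed example,
# not whisper-diarization's actual output schema.
from itertools import groupby

segments = [
    (0.0, 4.2, "SPEAKER_00", "Welcome everyone, let's get started."),
    (4.2, 7.9, "SPEAKER_01", "Thanks. First item is the release plan."),
    (7.9, 11.5, "SPEAKER_01", "We are targeting the end of the month."),
    (11.5, 14.0, "SPEAKER_00", "Sounds good, any blockers?"),
]

def to_minutes(segments):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(segments, key=lambda s: s[2]):
        group = list(group)
        start, end = group[0][0], group[-1][1]
        text = " ".join(s[3] for s in group)
        turns.append(f"[{start:06.1f}-{end:06.1f}] {speaker}: {text}")
    return turns

for line in to_minutes(segments):
    print(line)
```

Merging adjacent turns this way keeps the minutes readable even when the diarizer splits one utterance into several short segments.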

Not Ideal For

  • Real-time applications requiring live speaker diarization, as it processes audio files offline
  • Environments with limited GPU VRAM (under 10GB), especially for parallel processing
  • Audio recordings with significant speaker overlap, a limitation the README explicitly notes
  • Users needing a simple, hosted API without command-line setup and dependency management

Pros & Cons

Pros

Accurate Whisper Integration

Leverages OpenAI's Whisper for state-of-the-art speech recognition, providing reliable transcription as the foundation for diarization.

Robust Diarization Pipeline

Combines Nvidia NeMo's TitaNet for speaker embeddings with MarbleNet for voice activity detection, forming a complete speaker-identification pipeline.

Timestamp Correction

Uses ctc-forced-aligner and a punctuation-restoration model to correct timestamps, minimizing diarization errors caused by time shifts.

Performance Optimization

Offers an experimental parallel processing script for systems with sufficient VRAM, speeding up inference by running models concurrently.

Cons

Overlapping Speakers Unsupported

The README explicitly states this limitation, making it unreliable for conversations with simultaneous speech, a common real-world scenario.

Complex Installation Process

Requires manual installation of FFmpeg, Cython, and specific Python versions, with dependencies pinned via constraints.txt, which can be error-prone.

High Hardware Demands

Optimal performance, especially with parallel processing, demands at least 10 GB of VRAM, limiting accessibility on standard or resource-constrained systems.

Experimental Features

The parallel processing script is marked as experimental with warnings of potential errors, indicating instability for production use.

Quick Stats

Stars: 5,493
Forks: 498
Contributors: 0
Open issues: 33
Last commit: 2 months ago
Created: 2023

Tags

#automatic-speech-recognition #asr #speech-recognition #python #openai-whisper #speech-to-text #speaker-diarization #audio-processing #whisper #transcription #speech #voice-activity-detection

Built With

Python
FFmpeg

Included in

Whisper (2.2k)

Related Projects

whisper-standalone-win

Whisper & Faster-Whisper standalone executables for those who don't want to bother with Python.

Stars: 2,996
Forks: 161
Last commit: 5 months ago