OpenAI’s AI model automatically recognizes speech and translates it into English

Benj Edwards / Ars Technica

On Wednesday, OpenAI released a new open source artificial intelligence model called Whisper that recognizes and translates audio at a level approaching human-level recognition ability. It can transcribe interviews, podcasts, conversations, and more.

OpenAI trained Whisper on 680,000 hours of multilingual audio data and corresponding transcripts collected from the web. According to OpenAI, this large and varied training collection results in "improved robustness to accents, background noise and technical language." It can also detect the spoken language and translate it into English.
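For a sense of how simple the released tooling is to use, here is a minimal sketch built on the Python package from OpenAI's GitHub repository. The checkpoint name and the file audio.mp3 are placeholders, not something from OpenAI's announcement:

```python
import whisper  # pip install git+https://github.com/openai/whisper.git

# Load a pretrained checkpoint; larger models are more accurate
# but slower ("tiny", "base", "small", "medium", "large").
model = whisper.load_model("base")

# Transcribe an audio file; Whisper detects the spoken language
# automatically and returns text plus timestamped segments.
result = model.transcribe("audio.mp3")  # placeholder file name
print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # full transcription
```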

OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into the model's output. OpenAI presents this overview of Whisper's operation:

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
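The code on GitHub exposes that pipeline directly. The sketch below follows the usage example in the project's README (function names as of the initial release; again, audio.mp3 is a placeholder): it pads or trims the audio to a 30-second window, builds the log-Mel spectrogram, detects the spoken language, and decodes the text.

```python
import whisper

model = whisper.load_model("base")

# Load audio and pad/trim it to the 30-second window the model expects
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert to a log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the spectrogram
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the 30-second chunk into text
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```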

With Whisper’s open-sourcing, OpenAI hopes to introduce a new foundation model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record on this front. In January 2021, OpenAI released CLIP, an open source computer vision model that arguably kicked off the recent era of rapidly evolving image synthesis technology such as DALL-E 2 and Stable Diffusion.

At Ars Technica, we tested Whisper using the code available on GitHub and fed it several samples, including a podcast episode and a particularly hard-to-understand section of audio from a phone interview. While it took some time when running on a standard Intel desktop CPU (the technology doesn’t yet run in real time there), Whisper did a good job of transcribing the audio to text through the Python demonstration program, much better than some AI-powered audio transcription services we’ve tried in the past.
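For anyone reproducing that kind of test, load_model accepts a device argument, so you can pin inference to the CPU and pick a smaller checkpoint to keep runtimes tolerable. The timing code and file name below are our own illustration, not part of OpenAI's demo:

```python
import time
import whisper

# Force CPU inference and use a small checkpoint to keep runtime manageable
model = whisper.load_model("tiny", device="cpu")

start = time.time()
# fp16=False avoids half-precision math, which desktop CPUs handle poorly
result = model.transcribe("interview.mp3", fp16=False)  # placeholder file
elapsed = time.time() - start

print(f"Transcribed in {elapsed:.1f}s")
print(result["text"])
```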

Example of console output from OpenAI’s Whisper demo program while transcribing a podcast.

Benj Edwards / Ars Technica

With the right setup, Whisper could easily be used to transcribe interviews and podcasts, and potentially to translate podcasts produced in non-English languages into English, on your own computer, for free. That’s a potent combination that could eventually disrupt the transcription industry.
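Translation uses the same interface: passing task="translate" to transcribe makes Whisper emit English text regardless of the source language. A short sketch (the file name is a placeholder):

```python
import whisper

# The larger multilingual checkpoints translate noticeably better
model = whisper.load_model("medium")

# task="translate" outputs English text from non-English speech
result = model.transcribe("spanish_podcast.mp3", task="translate")
print(result["text"])
```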

As with almost every major new AI model these days, Whisper brings both benefits and the potential for misuse. On Whisper’s model card (in the “Broader Implications” section), OpenAI warns that Whisper could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes.”