What "audio format" actually means here
An audio format is a combination of a container (the file wrapper) and a codec (the algorithm that compresses or stores the audio inside). Most people use "format" loosely to mean both. WAV is a container that usually holds uncompressed PCM. MP3 is a codec whose files are typically just the raw frame stream, often with an ID3 metadata tag in front, rather than a separate container. M4A is an MP4 container with AAC inside. FLAC is both a codec and a container that share the same name. OGG is a container that usually holds Opus or Vorbis.
For transcription, what matters is that the audio data inside can be decoded to raw PCM samples that the Whisper model reads. We handle the container and codec details on our side; you just upload the file.
Why format choice almost never affects transcript quality
Whisper large-v3 resamples whatever audio it gets to 16 kHz mono before the first inference step. A 192 kHz 24-bit stereo studio recording gets crushed down to the same input shape as a 16 kHz mono phone call. The model never sees the "fancy" version of the audio.
What this means in practice: a 128 kbps MP3 of your podcast transcribes essentially the same as the original 24-bit WAV master. A 32 kbps Opus voice message from Telegram transcribes essentially the same as if you had recorded the same voice in WAV. The format only matters at the edges: very low bitrate compressed audio (under 32 kbps) starts to lose information Whisper needs.
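To make the 16 kHz mono normalization concrete, here is a minimal sketch of one common way to do it: asking ffmpeg to decode any input to 16 kHz, single-channel, 16-bit PCM. This is an illustration, not a description of our production pipeline; the function names are hypothetical, and it assumes ffmpeg is installed and on the PATH.

```python
import subprocess

WHISPER_SAMPLE_RATE = 16_000  # Hz; the input rate Whisper models expect


def build_decode_command(path: str, sample_rate: int = WHISPER_SAMPLE_RATE) -> list[str]:
    """Build an ffmpeg command that decodes a file to mono 16-bit PCM on stdout.

    Hypothetical helper for illustration only.
    """
    return [
        "ffmpeg",
        "-nostdin",              # never wait for keyboard input
        "-i", path,              # ffmpeg detects container and codec itself
        "-f", "s16le",           # raw output, no container
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian samples
        "-ac", "1",              # downmix to mono
        "-ar", str(sample_rate), # resample to the target rate
        "-",                     # write to stdout
    ]


def decode_to_pcm(path: str) -> bytes:
    """Run ffmpeg and return the raw PCM bytes (requires ffmpeg on PATH)."""
    result = subprocess.run(build_decode_command(path), capture_output=True, check=True)
    return result.stdout
```

Whatever the source was (a 192 kHz stereo WAV or a 32 kbps Opus voice message), the bytes returned by decode_to_pcm have the same shape: 16,000 samples per second, one channel, two bytes per sample.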
When format actually does matter
For upload size and storage. A one-hour recording is around 14 MB as 32 kbps mono AAC, about 29 MB as 64 kbps mono MP3, and over 300 MB as CD-quality (44.1 kHz, 16-bit) mono WAV, or roughly 635 MB in stereo. The transcript is essentially the same; the upload time and storage cost are very different. For everyday use, small lossy formats (MP3, M4A, OGG) are the practical choice. For archival or editing, keep the original lossless format (WAV, FLAC, ALAC) on your drive.
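The size figures above follow from simple arithmetic, sketched here for anyone who wants to estimate their own files. The function names are made up for this example, the results ignore container headers and metadata, and 1 MB is taken as 1,000,000 bytes.

```python
def compressed_size_mb(bitrate_kbps: float, seconds: float) -> float:
    """Approximate size of a constant-bitrate stream in MB (1 MB = 1,000,000 bytes)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * seconds / 1_000_000


def pcm_size_mb(sample_rate_hz: int, bit_depth: int, channels: int, seconds: float) -> float:
    """Approximate size of uncompressed PCM audio in MB, ignoring the file header."""
    bytes_per_second = sample_rate_hz * (bit_depth / 8) * channels
    return bytes_per_second * seconds / 1_000_000


ONE_HOUR = 3600  # seconds

print(round(compressed_size_mb(32, ONE_HOUR), 1))   # 14.4  (32 kbps)
print(round(compressed_size_mb(64, ONE_HOUR), 1))   # 28.8  (64 kbps)
print(round(pcm_size_mb(44_100, 16, 1, ONE_HOUR)))  # 318   (CD-quality mono)
print(round(pcm_size_mb(44_100, 16, 2, ONE_HOUR)))  # 635   (CD-quality stereo)
```

Real files with variable bitrate encoding or embedded artwork will differ somewhat from these estimates.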
How we handle format detection
On upload, we read the first few bytes of the file to identify the container and codec. Most audio formats have distinctive magic numbers in their headers: WAV starts with "RIFF" (followed by "WAVE" at byte 8), MP3 starts with "ID3" or an MPEG frame sync pattern, M4A/MP4 has "ftyp" at bytes 4 to 7 (after a 4-byte box size), FLAC starts with "fLaC", and OGG starts with "OggS". We use the header rather than the file extension because extensions can lie: the file may have been renamed, or a system along the way may have stripped the extension.
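As an illustration of header-based detection, the sketch below checks the magic numbers listed above. It is a simplified example rather than our actual detector: sniff_container and sniff_file are hypothetical names, it identifies only the container family, and a real implementation would need more checks (for example, skipping leading junk or inspecting the MP4 brand).

```python
def sniff_container(header: bytes) -> str:
    """Guess the container family from the first bytes of a file.

    Returns "unknown" when nothing matches. Hypothetical helper for illustration.
    """
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    # MP4-family files begin with a 4-byte box size, then the box type "ftyp".
    if header[4:8] == b"ftyp":
        return "mp4"
    if header[:3] == b"ID3":
        return "mp3"
    # Bare MPEG audio frame: 11 sync bits set, and a non-zero layer field.
    # The layer check rejects ADTS AAC, which uses a similar sync but layer 00.
    if (
        len(header) >= 2
        and header[0] == 0xFF
        and (header[1] & 0xE0) == 0xE0
        and (header[1] & 0x06) != 0
    ):
        return "mp3"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to run the header checks."""
    with open(path, "rb") as f:
        return sniff_container(f.read(12))
```

Because only the leading bytes are inspected, a file named recording.mp3 that is really an MP4 container is still reported as "mp4".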
What we do not currently accept
Proprietary or encrypted formats with DRM. Some old WMA files from the Microsoft DRM era cannot be decoded by any current open-source tool. FairPlay-protected AAC files in M4P containers (legacy iTunes Store purchases from before 2009) similarly cannot be decoded. In these cases, you need to find or generate an unprotected copy.