WAV files in plain language
A WAV file is, in the standard case, raw uncompressed PCM audio with a small header on top. There is no codec, no perceptual model, no compression. The bytes in the file are the recording. That simplicity is why every DAW and field recorder on the planet can export WAV without negotiation, and it is also why WAV files are so much larger than MP3 or M4A files of the same length.
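To make "raw PCM with a small header on top" concrete, here is a minimal sketch using Python's standard-library wave module. It writes one second of silence into memory and compares the total size with the size of the raw samples. The 16 kHz mono 16-bit settings are chosen only for illustration, and the 44-byte header applies to the plain PCM layout this module writes; files with extra metadata chunks have more overhead.

```python
import io
import wave

# One second of 16-bit mono silence at 16 kHz (illustrative settings).
SAMPLE_RATE = 16_000
pcm_payload = b"\x00\x00" * SAMPLE_RATE  # 2 bytes per sample

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)            # mono
    w.setsampwidth(2)            # 2 bytes = 16-bit
    w.setframerate(SAMPLE_RATE)
    w.writeframes(pcm_payload)

wav_bytes = buf.getvalue()
header_size = len(wav_bytes) - len(pcm_payload)

print(len(pcm_payload))  # 32000 bytes of samples
print(header_size)       # 44 bytes of header
# The samples are stored verbatim after the header.
print(wav_bytes[header_size:] == pcm_payload)  # True
```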
Why WAV is so large
File size is determined almost entirely by three numbers: sample rate (how many samples per second), bit depth (how many bits per sample), and channel count (mono, stereo, or more). Multiply them together with the duration in seconds and divide by eight to get bytes. A one-minute stereo CD-quality recording (44.1 kHz, 16-bit, two channels) is about 10.6 MB. A one-minute 24-bit 96 kHz stereo field recording is around 34.6 MB. A one-hour 32-bit float stereo master at 48 kHz lands near 1.38 GB. (These figures use 1 MB = 1,000,000 bytes; in binary units the first two read as roughly 10.1 and 33 MiB.) WAV does not compress, so those numbers scale linearly with duration.
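The arithmetic behind those sizes can be sketched in a few lines. The function name wav_payload_bytes is made up for this example, and it counts only the audio payload, not the header or any metadata chunks. Results are in bytes, so the MB figure you quote depends on whether you divide by 1,000,000 or by 1,048,576.

```python
def wav_payload_bytes(sample_rate_hz: int, bit_depth: int,
                      channels: int, seconds: int) -> int:
    """Size of the uncompressed PCM payload in bytes (header not included)."""
    bytes_per_sample = bit_depth // 8
    return sample_rate_hz * bytes_per_sample * channels * seconds

# The three recordings discussed above, all assumed to be stereo.
cd_minute = wav_payload_bytes(44_100, 16, 2, 60)
field_minute = wav_payload_bytes(96_000, 24, 2, 60)
master_hour = wav_payload_bytes(48_000, 32, 2, 3600)

print(cd_minute)     # 10584000
print(field_minute)  # 34560000
print(master_hour)   # 1382400000

# Doubling the duration doubles the size: there is no compression to soften it.
print(wav_payload_bytes(44_100, 16, 2, 120) == 2 * cd_minute)  # True
```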
What this means for speech recognition
Whisper large-v3 (the model we run) works on 16 kHz mono audio, so whatever you give it is resampled and downmixed to that format before the first inference step. A 192 kHz 32-bit float multi-channel WAV ends up shaped exactly the same as a 16 kHz mono phone call by the time the model sees it. In our testing, the transcript quality difference between a 16 kHz mono WAV and a 96 kHz 24-bit stereo WAV of the same speech is statistically indistinguishable from zero. What changes is your upload time and your file-size budget.
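If upload time is the concern, one option is to do the same 16 kHz mono conversion yourself before uploading. The sketch below only builds an ffmpeg command line; it assumes ffmpeg is installed on your machine, the file names are placeholders, and it is not a description of what our servers run internally. The flags used (-ac for channel count, -ar for sample rate, -c:a for the audio codec) are standard ffmpeg options.

```python
import shlex

def ffmpeg_16k_mono_args(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that writes a 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-nostdin",          # do not wait for keyboard input
        "-i", src,           # input file (any format ffmpeg can read)
        "-ac", "1",          # downmix to one channel
        "-ar", "16000",      # resample to 16 kHz
        "-c:a", "pcm_s16le", # 16-bit little-endian PCM, the usual WAV encoding
        dst,
    ]

# Hypothetical file names, for illustration only.
args = ffmpeg_16k_mono_args("interview_96k.wav", "interview_16k.wav")
print(shlex.join(args))

# To actually run the conversion:
#   import subprocess
#   subprocess.run(args, check=True)
```

At 16 kHz, mono, 16-bit, one minute of audio is 1,920,000 bytes, a fraction of the size of a 96 kHz 24-bit stereo original.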
When uncompressed actually helps
There is one situation where WAV beats a low-bitrate MP3 for transcription: marginal audio. Very quiet voices, heavy ambient noise, dropouts from a flaky lavalier. MP3 encoders at low bitrates throw away exactly the high-frequency tail Whisper sometimes uses to disambiguate fricatives (s, f, sh sounds). If you already have a recording that transcribes poorly as MP3, the WAV version sometimes recovers words the compressed copy missed. For clean studio audio at any reasonable bitrate, you will not see the difference.
The Broadcast Wave (BWF) variant
Professional field recorders (Sound Devices, Zaxcom, recent Tascam and Zoom pro models) write Broadcast Wave, which is a regular WAV with extra metadata chunks. The bext chunk holds timecode and originator info, iXML carries scene and take numbers, and some files also include a chna chunk for multi-channel naming. Mictoo reads BWF files the same as any other WAV. The metadata is ignored for transcription purposes, the audio is transcribed, and your original file on your drive is never touched or rewritten.
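To see which chunks a particular file carries, you can walk the RIFF structure directly: after the 12-byte RIFF/WAVE preamble, each chunk is a 4-byte ID, a 4-byte little-endian size, and the payload, padded to an even length. The sketch below is a minimal reader under those assumptions. The helper names are invented for this example, the demo file is synthetic with a zero-filled placeholder where real bext data would go, and large-file variants such as RF64 are not handled.

```python
import struct

def list_riff_chunks(data: bytes) -> list[tuple[str, int]]:
    """Return (chunk_id, payload_size) for each top-level chunk in a WAV."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunks = []
    pos = 12  # skip 'RIFF', the overall size field, and 'WAVE'
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        chunks.append((chunk_id.decode("ascii", errors="replace"), size))
        pos += 8 + size + (size & 1)  # odd-sized payloads get one pad byte
    return chunks

def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Serialize one chunk, adding the pad byte when the payload is odd-sized."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad

# Synthetic BWF-like file: placeholder bext, a PCM fmt chunk, four silent samples.
fmt_payload = struct.pack("<HHIIHH", 1, 1, 16000, 32000, 2, 16)  # PCM mono 16 kHz 16-bit
body = (b"WAVE"
        + make_chunk(b"bext", b"\x00" * 602)  # placeholder, not real metadata
        + make_chunk(b"fmt ", fmt_payload)
        + make_chunk(b"data", b"\x00\x00" * 4))
demo = b"RIFF" + struct.pack("<I", len(body)) + body

print(list_riff_chunks(demo))  # [('bext', 602), ('fmt ', 16), ('data', 8)]
```

For a file on disk, pass the result of reading it in binary mode, for example list_riff_chunks(open(path, "rb").read()), bearing in mind that this loads the whole file into memory.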