MP3 vs M4A vs WAV vs FLAC for Transcription: Which Format Should You Use?

A practical comparison of the four most common audio formats from the perspective of what actually matters for AI transcription. Quality, file size, accuracy, and when each one is worth using.

Almost every transcription article online says "use the highest quality format you have." That advice is technically true and practically useless. What you actually want to know is whether you should bother re-encoding your iPhone voice memo to WAV before uploading (no), whether transcribing a 64 kbps phone-call MP3 will work (yes, mostly), and how much accuracy you lose by picking the smaller file (almost none, in most cases).

This article answers those questions with specifics. On Mictoo we use OpenAI's Whisper large-v3 across the board, which is the strongest open speech model available right now. The numbers and rules of thumb below assume Whisper-class accuracy. They translate well to other modern speech models, but the exact percentages will vary.

TL;DR

| Format | Best for | File size (10 min voice) | Transcription accuracy |
| --- | --- | --- | --- |
| MP3 | Default for podcasts, phone recordings, anything where you want maximum compatibility | ~5 MB at 64 kbps mono | Same as lossless at 64 kbps and above |
| M4A | iPhone Voice Memos, GarageBand, anything from the Apple ecosystem | ~5 MB at default voice quality | Same as lossless at 64 kbps and above |
| WAV | Studio sessions, field recorders, cases where you also need the audio for editing | ~100 MB at 44.1 kHz stereo 16-bit | Marginally better only when the alternative is an extremely low-quality lossy file |
| FLAC | Archival, lossless without the WAV file size, CD rips | ~50 MB at 44.1 kHz stereo 16-bit | Identical to WAV |

The headline answer: for transcription alone, MP3 and M4A at typical quality settings give you the same transcript as WAV and FLAC, at roughly a tenth of the file size or less. Lossless formats are worth using only if you also need the audio for editing, mastering, or archival.
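The file sizes in the table follow from simple arithmetic. Here is a minimal sketch of that arithmetic, assuming constant bitrate for the lossy file and ignoring container headers and metadata. The function names are illustrative, not part of any library.

```python
def compressed_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size of a constant-bitrate file in megabytes (10^6 bytes)."""
    total_bits = bitrate_kbps * 1000 * minutes * 60
    return total_bits / 8 / 1_000_000


def pcm_size_mb(sample_rate_hz: int, bit_depth: int, channels: int, minutes: float) -> float:
    """Approximate size of uncompressed PCM audio (the WAV payload) in megabytes."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * minutes * 60 / 1_000_000


# Ten minutes of voice, as in the table above.
mp3_mb = compressed_size_mb(64, 10)        # 64 kbps mono MP3
wav_mb = pcm_size_mb(44_100, 16, 2, 10)    # 44.1 kHz, 16-bit, stereo WAV
print(f"MP3: {mp3_mb:.1f} MB, WAV: {wav_mb:.1f} MB")
```

Under these assumptions the lossy file comes out near a twentieth of the stereo WAV and about a tenth of a typical FLAC.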

Why format matters less than you think

Whisper does not actually care about your file format. As soon as the audio reaches the model, it gets resampled to 16 kHz mono, regardless of what you uploaded. Your 24-bit 96 kHz studio WAV and your 32 kbps phone-call MP3 both get the same treatment internally.
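To make that concrete, here is a rough sketch of the kind of ffmpeg command that decodes any input to 16 kHz mono 16-bit PCM. It is modeled on how the open-source Whisper package loads audio, but the function name is hypothetical and a hosted service's pipeline may differ in detail.

```python
def decode_to_16k_mono_cmd(path: str, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg command that writes raw 16-bit mono PCM to stdout.

    Illustrative only: whatever the input format is, the output is the
    same 16 kHz mono stream.
    """
    return [
        "ffmpeg", "-nostdin",
        "-i", path,              # any format ffmpeg can read
        "-f", "s16le",           # raw signed 16-bit little-endian samples
        "-ac", "1",              # downmix to mono
        "-acodec", "pcm_s16le",
        "-ar", str(sample_rate), # resample to 16 kHz
        "-",                     # write to stdout
    ]


print(" ".join(decode_to_16k_mono_cmd("interview.m4a")))
```

The same command shape applies to a studio WAV or a phone-call MP3, which is why the source format stops mattering once the audio is decoded.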

What matters is whether your file preserves the speech well enough that the resampled version is still intelligible. For voice, that is a much lower bar than people assume. Speech is mostly between 80 Hz and 8 kHz. A 16 kHz sample rate captures everything up to 8 kHz cleanly, which is more than enough. A 64 kbps mono MP3 preserves that range without audible artifacts on speech.

So the practical question becomes: at what point does compression start to actively damage the speech? The answer for modern codecs is "much lower than you would think." We routinely see usable transcripts from 24 kbps phone-recorded MP3s. Below that, accuracy starts to drop noticeably.

MP3 deep dive

MP3 is the universal format. Every player, every editor, every transcription tool accepts it. It is also the most likely format you already have, especially for podcast downloads and phone recordings.

Bitrate recommendation for transcription: 64 kbps mono is the sweet spot. Higher bitrates do not improve transcript accuracy. Going lower, intelligibility for the model starts to slip around 32 kbps mono, and accuracy drops noticeably below 24 kbps.

Variable bitrate (VBR) vs constant bitrate (CBR): Both work. CBR is slightly more predictable for transcription pipelines because it has consistent frame sizes. If you control the export and have a choice, pick CBR at 64 or 96 kbps mono.

Where MP3 shines: When you want the smallest possible file that still transcribes well. A 60-minute interview at 64 kbps mono MP3 is around 28 MB. That fits comfortably under the 60 MB upload cap most free transcription services have, including Mictoo.
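The same arithmetic can be turned around to ask how long a recording can be before it hits an upload cap. This is a small sketch assuming constant bitrate and decimal megabytes. The helper name is made up for illustration.

```python
def max_minutes_under_cap(cap_mb: float, bitrate_kbps: float) -> float:
    """Longest constant-bitrate recording (in minutes) that fits under a size cap."""
    cap_bits = cap_mb * 1_000_000 * 8
    seconds = cap_bits / (bitrate_kbps * 1000)
    return seconds / 60


# A 60 MB cap at 64 kbps mono leaves room for roughly two hours of audio.
print(f"{max_minutes_under_cap(60, 64):.0f} minutes")
```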

Where MP3 hurts: Two cases. First, very low bitrate (under 24 kbps) phone-recorded MP3s lose enough information that names, numbers, and rapid speech get garbled. Second, re-encoding from MP3 to MP3 (lossy to lossy) compounds quality loss. If you have the original WAV or M4A, do not save out as MP3 just to re-encode again.

When to convert TO MP3

If your original file is over your transcription service's upload cap, re-encoding to 64 kbps mono MP3 usually solves the problem with no accuracy loss. A 600 MB stereo WAV (about an hour of 44.1 kHz 16-bit audio) becomes a roughly 28 MB mono MP3 with the same transcribable content. We have a short guide on how to compress audio if you need it.
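One common way to do that re-encode is with ffmpeg. The sketch below only builds the command. It assumes ffmpeg is installed with the libmp3lame encoder, and the function name and file names are placeholders.

```python
def to_mono_mp3_cmd(src: str, dst: str, bitrate_kbps: int = 64) -> list[str]:
    """Build an ffmpeg command that re-encodes src to a mono MP3 at the given bitrate."""
    return [
        "ffmpeg",
        "-i", src,
        "-vn",                       # drop any video stream
        "-ac", "1",                  # downmix to mono
        "-c:a", "libmp3lame",        # MP3 encoder (must be present in the ffmpeg build)
        "-b:a", f"{bitrate_kbps}k",  # constant target bitrate
        dst,
    ]


cmd = to_mono_mp3_cmd("lecture.wav", "lecture.mp3")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Start from the original lossless or highest-quality file whenever possible, so that the lossy step happens only once.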

M4A deep dive

M4A is the Apple ecosystem default. iPhone Voice Memos, GarageBand, QuickTime audio recording, AirDrop exports of audio. All M4A. Under the hood, M4A is AAC audio inside an MP4 container.

AAC is a better codec than MP3, especially at low bitrates. A 64 kbps AAC sounds noticeably better than a 64 kbps MP3. For transcription, the difference is minimal at normal bitrates (around 64 kbps and up).

iPhone Voice Memos quality settings: The default is "Compressed" which is lossy AAC at a reasonable bitrate for voice. There is also a "Lossless" setting that uses ALAC. For transcription, "Compressed" is the right choice. Lossless ALAC files are 5x larger with zero transcription benefit.

Where M4A shines: The iPhone Voice Memos workflow. Open the app, record, AirDrop or email to your computer, upload to transcription, done. No format conversion. Most 30-minute interviews fit comfortably under 25 MB.

Where M4A hurts: Compatibility with older tools. Some legacy audio software does not handle M4A and asks for MP3 or WAV. For transcription specifically, this is not a real problem because all modern transcription services accept M4A directly, including Mictoo.

iPhone-specific tip

If your Voice Memos are stored as ALAC (lossless M4A) and the file is too big to upload, you can convert to lossy AAC in QuickTime: File, Export As, Audio Only, M4A Audio. The result is roughly one-fifth the size with identical transcription quality.

WAV deep dive

WAV is uncompressed PCM audio. Every sample is stored exactly as recorded. No compression, no quality loss, no codec to argue about. The trade-off is file size: a 30-minute WAV at typical studio settings (44.1 kHz, 16-bit, stereo) is around 300 MB.

Where WAV makes sense for transcription: When you already have the WAV and the file size fits. There is no reason to convert MP3 to WAV before uploading. There is also no reason to convert WAV to MP3 if the WAV is small enough to upload directly.

Where WAV hurts: File size. A 60-minute lecture recording at 44.1 kHz stereo 16-bit is 600 MB or more. That blows past every free transcription service's upload cap. You either need to convert (re-encode to MP3 or AAC) or split the file.

Bit depth and sample rate trivia

Studio recordings often use 24-bit or 32-bit float for editing headroom. For transcription, Whisper resamples to 16-bit 16 kHz mono anyway. So a 32-bit 96 kHz stereo WAV gives you the exact same transcript as a 16-bit 16 kHz mono WAV. Pick the smaller one if you have a choice. We have detail on our WAV to text page.
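If you are not sure what a given WAV contains, Python's standard `wave` module can read the header. This is a minimal sketch that assumes a plain integer PCM file. The module may reject other encodings such as 32-bit float, and `describe_wav` is an illustrative name.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read channel count, bit depth, sample rate, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        bit_depth = wav.getsampwidth() * 8   # sample width is reported in bytes
        sample_rate = wav.getframerate()
        frames = wav.getnframes()
    return {
        "channels": channels,
        "bit_depth": bit_depth,
        "sample_rate_hz": sample_rate,
        "duration_s": frames / sample_rate,
    }
```

Calling `describe_wav("session.wav")` on a typical studio export would report 2 channels at 44100 Hz. None of those parameters changes the transcript, so they matter only for file size.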

FLAC deep dive

FLAC is lossless compressed audio. It is bit-for-bit identical to the original WAV when decoded, but the file is typically 50 to 60 percent of the WAV size. So FLAC gets you the same transcription accuracy as WAV with half the upload time.

FLAC is uncommon in casual transcription workflows. Most people who use FLAC are in archival, audiophile, or CD-ripping contexts. If your audio is FLAC, you probably already know why and you do not need this article to tell you to use it.

Where FLAC shines: Archival projects, oral history transcription, library digitization, anything where you want to keep the master in lossless form forever. Transcribe from the FLAC directly so you do not have to maintain a parallel MP3 copy just for transcription.

Where FLAC hurts: Compatibility, marginally. Some older Windows audio software does not handle FLAC out of the box. For transcription specifically, Mictoo accepts FLAC directly, as do most modern services.

Formats we did not cover

A few formats you might encounter that are worth a brief note.

OGG (Vorbis or Opus): Common in WhatsApp voice notes (Opus), Telegram voice messages (Opus, with .oga extension), and Audacity exports (Vorbis). All work for transcription. OGG Opus is actually the most efficient codec for speech, narrowly beating AAC. See our OGG to text page for details.

WEBM: The web-native format. Discord call recordings, browser screen recorders, MediaRecorder API output all save as WEBM (usually with Opus audio inside). Same story: transcribes well, no conversion needed. Details on our WEBM to text page.

AIFF: Apple's uncompressed format. Equivalent to WAV in audio terms. Some Apple-native tools default to AIFF. For transcription, most services accept it, but if not, converting to WAV or M4A is trivial.

ALAC (Apple Lossless): Lossless audio inside an M4A container. Same use case as FLAC. iPhone Voice Memos can save in ALAC if you flip the "Lossless" setting. For transcription, the lossy AAC version is the more practical choice (much smaller, identical transcript quality).

Practical decision rules

Three rules cover 95 percent of cases.

Rule 1: If you already have the file, use it. Do not re-encode just to standardize on one format. Re-encoding MP3 to MP3 makes things worse. Re-encoding WAV to WAV is pointless. Drop what you have into a transcription service that accepts it.

Rule 2: If you are recording fresh and you only need a transcript, pick the lossy default. iPhone Voice Memos, default Android voice recorder, any podcast hosting service's default export. All produce M4A or MP3 at sensible quality for transcription. No need to crank up to lossless.

Rule 3: If you are recording fresh and you also need to edit later, pick WAV or FLAC. Editing benefits from lossless headroom. Transcription does not. So the choice between lossy and lossless is really a choice between "transcript only" and "transcript plus edit-ready audio."
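The three rules can be written down as a tiny decision function. This is only a restatement of the rules above in code, with hypothetical parameter names, not a tool we ship.

```python
def recommend_format(have_file: bool, need_editing: bool) -> str:
    """Apply the three rules in order and return a short recommendation."""
    if have_file:
        # Rule 1: never re-encode just to standardize.
        return "use the existing file as-is"
    if need_editing:
        # Rule 3: editing benefits from lossless headroom.
        return "record WAV or FLAC"
    # Rule 2: transcript only, so the lossy default is enough.
    return "record with the lossy default (M4A or MP3)"


print(recommend_format(have_file=False, need_editing=True))
```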

What actually affects transcription accuracy (besides format)

Format is the most-talked-about variable and the least important one. Here is what actually moves the needle on transcript accuracy, ranked.

  1. Microphone quality and position. A laptop built-in mic at 50 cm gives much worse results than a 20-dollar USB headset at 10 cm. This is the single biggest factor.
  2. Background noise. Quiet room beats noisy room. Outdoor wind noise is worse than almost any indoor noise. HVAC hum, traffic, music in another room all hurt accuracy.
  3. Speaker accent and language coverage of the model. Whisper handles most major accents well. Heavy regional dialects (strong Glaswegian, heavy Swiss German, deep Quebec French) lose a few percentage points of accuracy.
  4. Speaking rate and clarity. Slower, clearer speech transcribes more accurately than rapid-fire delivery, mumbling, or overlapping voices.
  5. Domain vocabulary. Common words come through almost perfectly. Specialized terms (medical, legal, technical jargon, proper nouns) often need manual cleanup.
  6. Format and bitrate. Only matters at the extremes. Below 24 kbps starts to cost accuracy. Anything above 64 kbps mono is essentially identical.

If you want to improve your transcripts, fix items 1 through 5 before you start worrying about format.
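To check whether a file falls into the low-bitrate extreme from item 6, you can estimate its average bitrate from file size and duration. This is a rough sketch: the thresholds are the mono rules of thumb from this article, the band labels are our own wording, and container overhead is ignored.

```python
def average_bitrate_kbps(file_size_bytes: int, duration_seconds: float) -> float:
    """Average bitrate implied by a file's size and duration."""
    return file_size_bytes * 8 / duration_seconds / 1000


def bitrate_band(kbps: float) -> str:
    """Map a mono bitrate onto the rough bands used in this article."""
    if kbps < 24:
        return "likely to cost accuracy"
    if kbps < 64:
        return "usually fine for speech"
    return "no further gain expected from a higher bitrate"


# A 28.8 MB file that runs for one hour averages 64 kbps.
kbps = average_bitrate_kbps(28_800_000, 3600)
print(f"{kbps:.0f} kbps: {bitrate_band(kbps)}")
```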

Bottom line

For pure transcription, use whatever format you have. MP3 and M4A at typical quality settings are indistinguishable from WAV and FLAC in transcript output. The only times format matters: when your file is too big to upload (re-encode to 64 kbps mono MP3 or AAC), when you also need the audio for editing (use WAV or FLAC), or when bitrate drops below 24 kbps (which is rare outside old phone recordings).

Stop overthinking this. Drop your file into a good transcription tool and ship the text. If you want to test how your specific audio comes through, you can try Mictoo free, no signup. Whisper large-v3 across every format.
