How video files work for transcription
A video file is a container (MP4, MOV, WebM, MKV, AVI) holding at least one video track and at least one audio track, plus optional subtitle tracks and metadata. The container specifies how the tracks are interleaved on disk and how they are timestamped; the actual video and audio data are encoded separately, each with its own codec.
For transcription, the video frames are irrelevant. We demux the file (extract individual tracks from the container), discard the video track, take the audio track, decode it to PCM, and feed it to Whisper. The video frames never reach the transcription model.
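The decode step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the service's actual pipeline: it assumes ffmpeg is installed and on the PATH, and the names decode_command, pcm16_to_floats, and load_audio are hypothetical. ffmpeg demuxes the container, drops the video track, and writes raw 16-bit mono PCM at 16 kHz to stdout; the bytes are then scaled to floating-point samples.

```python
import struct
import subprocess

SAMPLE_RATE = 16000  # Hz; the rate Whisper works at


def decode_command(path):
    """Build an ffmpeg command that writes raw mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", path,
        "-vn",                    # discard the video track
        "-ac", "1",               # downmix to mono
        "-ar", str(SAMPLE_RATE),  # resample to 16 kHz
        "-f", "s16le",            # raw signed 16-bit little-endian samples
        "-",                      # write to stdout instead of a file
    ]


def pcm16_to_floats(raw):
    """Convert raw s16le bytes to floats in the range [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack("<%dh" % count, raw[: count * 2])
    return [s / 32768.0 for s in samples]


def load_audio(path):
    """Run ffmpeg and return the decoded samples (requires ffmpeg installed)."""
    result = subprocess.run(decode_command(path), capture_output=True, check=True)
    return pcm16_to_floats(result.stdout)
```

Calling load_audio("video.mp4") would return a list of samples ready to hand to a speech model; at no point are video frames decoded.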
The common video containers and their audio codecs
MP4 (also called .m4v in some contexts) is the H.264 era's universal container. Audio inside is almost always AAC, occasionally MP3 or AC-3. Screen recorders on most platforms, YouTube downloads, and exported video editor projects all default to MP4 with AAC audio, and iPhone clips are often converted to MP4 when shared (the camera itself records MOV). This is the format you will encounter most often.
MOV is Apple's QuickTime container, very similar to MP4 internally (both are derived from the ISO Base Media File Format). MOV files often hold higher-bitrate or higher-resolution content from professional editing pipelines. Audio is usually AAC, sometimes ALAC for lossless, occasionally PCM for uncompressed.
WebM is an open, royalty-free container based on Matroska, originally pushed by Google for HTML5 video. Audio inside is Opus or Vorbis; video is VP8 or VP9. Browser-recorded video (the MediaRecorder API output) is typically WebM. Some screen recording tools also default to WebM.
MKV (Matroska) is a flexible container that can hold almost any codec, common for downloaded media and high-quality archives. Audio inside varies: AAC, AC-3, DTS, FLAC, Opus are all common.
AVI is the older container from Microsoft, less common today but still present in older video collections. Audio inside is usually MP3 or AC-3.
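Because the file extension only hints at the audio codec, it can help to check what a given file actually carries. ffprobe (shipped with ffmpeg) can list the audio streams as JSON. The sketch below builds that command and parses its output; the helper names probe_command and audio_codecs are hypothetical.

```python
import json


def probe_command(path):
    """Build an ffprobe command that lists the audio streams as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",                      # audio streams only
        "-show_entries", "stream=index,codec_name",  # fields to print
        "-of", "json",
        path,
    ]


def audio_codecs(ffprobe_json):
    """Return the codec name of each audio stream, in container order."""
    data = json.loads(ffprobe_json)
    return [s.get("codec_name", "unknown") for s in data.get("streams", [])]
```

Running the command with subprocess and passing its stdout to audio_codecs gives one entry per audio track, for example "aac" for a typical MP4 or "opus" for a browser-recorded WebM.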
Why video files are so much bigger than audio-only equivalents
A 30-minute 1080p video at typical YouTube quality is 300-500 MB. The audio track inside is 10-30 MB. The video track is roughly 95% of the file size and 0% of what transcription needs.
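The arithmetic behind those figures is just bitrate times duration. The sketch below uses assumed, illustrative bitrates (2 Mbps for video, 128 kbps for audio), which are not measurements of any particular file; the function name track_size_mb is hypothetical.

```python
def track_size_mb(bitrate_kbps, minutes):
    """Size in megabytes (10^6 bytes) of a track at a constant bitrate."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


video_mb = track_size_mb(2000, 30)  # assumed 2 Mbps video track
audio_mb = track_size_mb(128, 30)   # assumed 128 kbps audio track
video_share = video_mb / (video_mb + audio_mb)
```

With these assumed bitrates the video track comes to 450 MB and the audio track to 28.8 MB, so the video accounts for about 94% of the total.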
If your video file is over the 60 MB free tier limit and you have ffmpeg installed, extracting the audio is one command: ffmpeg -i video.mp4 -vn -ac 1 -ar 16000 audio.m4a. The -vn flag drops video, -ac 1 converts to mono, and -ar 16000 resamples to 16 kHz (the same rate Whisper uses internally). The resulting M4A is typically 5-10 MB for a 30-minute video.
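If you handle many files, the size check and the extraction command can be wrapped in a small script. This is a sketch, assuming the 60 MB limit means 60 x 10^6 bytes (the exact definition is not stated here) and using hypothetical helper names.

```python
import os

FREE_TIER_LIMIT_MB = 60  # from the text; assumed to mean 60 * 10^6 bytes


def exceeds_limit(size_bytes, limit_mb=FREE_TIER_LIMIT_MB):
    """True if a file of this size is over the upload limit."""
    return size_bytes > limit_mb * 1_000_000


def extraction_command(src, dst="audio.m4a"):
    """The ffmpeg command from the text: drop video, mono, 16 kHz."""
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]


def command_if_too_big(src):
    """Return the extraction command if src is over the limit, else None."""
    if exceeds_limit(os.path.getsize(src)):
        return extraction_command(src)
    return None
```

Passing the returned list to subprocess.run(..., check=True) performs the extraction; a result of None means the file can be uploaded as it is.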
What about multi-track audio in video files
Some video files have multiple audio tracks: original language plus dubbed versions, dialogue plus separate music/effects tracks, or commentary tracks. We currently transcribe the default audio track (track 0 in container terminology). If your video has a non-default track you want transcribed, extract that track first with ffmpeg -i video.mp4 -map 0:a:1 -vn audio.m4a (where 0:a:1 selects the second audio track).
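The track selector generalizes to any audio track: 0:a:N means input file 0, audio streams only, zero-based index N. A minimal sketch that builds the command for a chosen index follows; the function name track_command is hypothetical.

```python
def track_command(src, audio_index, dst="audio.m4a"):
    """Build an ffmpeg command that extracts one audio track by index.

    audio_index counts audio streams only, starting at 0, so 1 is the
    second audio track regardless of how many video streams precede it.
    """
    if audio_index < 0:
        raise ValueError("audio_index must be 0 or greater")
    return ["ffmpeg", "-i", src, "-map", "0:a:%d" % audio_index, "-vn", dst]
```

For example, track_command("video.mp4", 1) reproduces the command shown above for the second audio track.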
Subtitle tracks vs transcription
If your video already has burned-in subtitles (rendered into the video frames as visible text), they do not help transcription. We never look at video frames. If your video has a separate subtitle track inside the container (sometimes called soft subs, common in MKV), we still ignore it during transcription; we transcribe from the audio. To extract existing soft subtitles, use a tool like MKVToolNix instead of Mictoo.