What WebM actually is
WebM is a container format Google released in 2010 as an explicitly open, royalty-free alternative to MP4. It is built on a subset of the Matroska (MKV) container, restricted to specific video codecs (VP8, VP9, more recently AV1) and audio codecs (Vorbis originally, Opus mostly today). The point was to give browsers a format they could play natively without licensing MPEG patents.
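Because WebM is a Matroska subset, a file identifies itself in its first few bytes: an EBML header followed by a DocType string such as "webm" or "matroska". The sketch below shows one way to check this in Python. The function names are hypothetical, and it is a heuristic (a plain byte search for the DocType element, assuming a one-byte length field), not a full EBML parser.

```python
from typing import Optional

EBML_MAGIC = bytes([0x1A, 0x45, 0xDF, 0xA3])  # ID of the EBML header element
DOCTYPE_ID = bytes([0x42, 0x82])              # ID of the DocType element


def sniff_doctype(header: bytes) -> Optional[str]:
    """Return the EBML DocType ("webm", "matroska", ...) or None."""
    if not header.startswith(EBML_MAGIC):
        return None  # not an EBML file at all (an MP4, for example)
    # Heuristic: search for the DocType ID instead of walking every element.
    pos = header.find(DOCTYPE_ID, len(EBML_MAGIC))
    if pos == -1 or pos + 3 > len(header):
        return None
    size_byte = header[pos + 2]
    # Assumption: the length is a one-byte EBML size (top bit set), which
    # covers short strings such as "webm" and "matroska".
    if not size_byte & 0x80:
        return None
    length = size_byte & 0x7F
    value = header[pos + 3 : pos + 3 + length]
    if len(value) < length:
        return None  # header slice was too short to hold the whole string
    return value.decode("ascii", errors="replace").rstrip("\x00")


def sniff_file(path: str) -> Optional[str]:
    """Read the first 64 bytes of a file and sniff its DocType."""
    with open(path, "rb") as f:
        return sniff_doctype(f.read(64))
```

Calling sniff_file on a browser recording should return "webm", a general .mkv file should return "matroska", and an MP4 should return None.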
It worked. Every major browser now plays WebM natively, and the format is the default output of the recording APIs built into most browsers. When a web app records audio or video with MediaRecorder in Chrome or Firefox, WebM is usually what comes out (Safari defaults to MP4 instead).
Why audio-only WebM is rare in the wild
Most people get a WebM file from a screen recording, a video call, or a video upload. Audio-only WebM exists (some podcast tools use it) but is much less common. So in practice, when you arrive at this page with a WebM in hand, it almost always has video inside, which is why we lead with "we strip the video".
The audio codec inside is usually Opus, the same codec Telegram uses for voice messages: small, clear, voice-friendly. Whisper handles Opus at 64 to 96 kbps stereo (typical for screen recordings) very well. At those bitrates, transcript quality depends on microphone placement and room noise rather than on the choice of Opus over another codec.
Stripping the video saves a lot of bandwidth
A typical Loom recording at 720p spends 80 to 95% of its bytes on video and only 5 to 20% on audio, so a 200 MB Loom screen recording usually holds just 10 to 40 MB of actual audio. The ffmpeg one-liner in the Pro tips section extracts that audio without re-encoding, in seconds, on any laptop. Drop the extracted audio file here and the upload finishes much faster than it would with the original 200 MB video.
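For readers who would rather script the extraction, here is a minimal Python sketch that builds an ffmpeg command to copy the audio stream without re-encoding. It is not necessarily the same command as the Pro tips one-liner. The function names and the codec-to-extension mapping are assumptions for illustration, and ffmpeg must already be installed and on your PATH.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed mapping from audio codec to a container that can hold it unchanged.
AUDIO_EXT = {"opus": ".opus", "vorbis": ".ogg", "aac": ".m4a"}


def extract_audio_cmd(src: str, codec: str = "opus") -> List[str]:
    """Build an ffmpeg command that drops video and stream-copies audio."""
    out = Path(src).with_suffix(AUDIO_EXT[codec])
    return [
        "ffmpeg",
        "-i", src,       # input recording
        "-vn",           # ignore the video stream
        "-c:a", "copy",  # copy audio packets as-is, no re-encoding
        str(out),
    ]


def extract_audio(src: str, codec: str = "opus") -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(extract_audio_cmd(src, codec), check=True)
```

For a WebM recording the default codec argument is usually right. For an MP4 with AAC audio, pass codec="aac" so the output lands in an .m4a file.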
WebM vs MP4 for the same recording
Both work in Mictoo. WebM typically carries Opus audio (slightly more efficient at the same bitrate), while MP4 typically carries AAC audio (better supported by legacy software). Transcript quality is identical between the two if the source recording quality is the same, so the choice comes down to what your recording tool exports by default.