Video files · MP4 · MOV · WebM · Free

Video to Text
For screen recordings, tutorials, and vlogs

Upload an MP4, MOV, WebM, MKV, or AVI video file. We extract the audio track, transcribe with Whisper large-v3, and return a transcript with timestamps, AI summary, and SRT subtitles ready to drop onto the video.

AI summary · Translate, 28 langs · OpenAI Whisper

Or paste a YouTube URL. We'll fetch the video's captions instantly. Free.

or upload a file

MP3 · MP4 · WAV · M4A · OGG · WEBM · FLAC · Max 25MB · Max 30 min (60 min · Sign in)

Got a bigger file? See how to compress.

Got a longer recording? See how to split.

Video files combine a visual track and an audio track in a single container (MP4, MOV, WebM, etc). For transcription, only the audio track matters. We extract it on our side, so you do not need to convert your video to audio-only first.

This is the right page for screen recordings (Loom, Zoom, OBS), tutorials, vlogs, webinars, conference talks, and downloaded videos. For audio-only files (MP3, M4A, WAV, etc), the audio to text page is the right entry point. For YouTube URLs (no upload needed), see YouTube to text.

Note: video files are much larger than the audio inside them. A 30-minute MP4 can easily be 200-500 MB depending on resolution; the audio track is typically 10-30 MB. Tips on handling large video files appear below.

How it works

🎬

Upload your video file

MP4 (most commonly H.264 or H.265 video with AAC audio), MOV (QuickTime), WebM (VP8/VP9 video, Opus or Vorbis audio), MKV, AVI. Free tier accepts files up to 60 MB. For larger videos, the tips below cover extracting the audio first.

🔊

We extract the audio track

On our side, we demux the audio track from the video container, ignore the visual frames, and pass the audio to Whisper large-v3. You do not need to do anything with the video itself.

📝

Transcript, SRT, and summary

Transcript with timestamps appears alongside an AI summary. Download as TXT, SRT (for video editor captions), VTT (for HTML5 video and YouTube), or DOCX. Translate to 50+ languages with one click.

Why Mictoo for video file transcription

No "extract audio first" detour

Other tools ask you to convert the video to MP3 before upload. Mictoo accepts the video directly and demuxes the audio track on our side. One step instead of two, no separate ffmpeg or Audacity session.

SRT and VTT ready for video editors

After transcription, download SRT for Premiere/Final Cut/DaVinci Resolve, or VTT for HTML5 video and YouTube. Timestamps align to the original video timeline, so subtitles drop in correctly without re-syncing.

AI summary for video descriptions

The AI summary appears alongside the transcript automatically. Useful as the starting point for YouTube descriptions, blog post versions of video tutorials, and meeting recap emails.

Works with most video formats

MP4 is the universal default. MOV from iPhone/QuickTime. WebM from screen recorders and browser-recorded video. MKV from downloaded media. AVI from older sources. We handle each container without requiring a specific format.

No download or signup required

Files up to 60 MB on the free tier, no account creation, no watermark on the transcript, no time limit, no daily counter. The first transcription works the same as the hundredth.

Video transcription scenarios

Screen recording / Loom transcripts

You recorded a product demo, tutorial, or async update in Loom or QuickTime. Drop the MP4 here to turn the spoken commentary into searchable text and SRT captions for the embedded version.

Zoom or webinar recordings

Recorded a webinar, panel, or meeting locally as MP4. Drop the file here, get the transcript with timestamps for show notes, post-event summary, and accessibility captions on the replay.

YouTube video preparation

About to upload a video to YouTube and want SRT captions ready. Drop the video here, download SRT, upload alongside the video. The auto-generated YouTube captions are usually noticeably worse than Whisper.

Tutorial repurposing as blog post

You have a video tutorial and want a text article version for SEO and accessibility. Transcript gives you the spoken content; AI summary gives you the structural outline. Use both as the article draft.

Conference talk transcription

Downloaded video of a talk or presentation. Transcribe to quote in articles, share quotable excerpts on social, or create a written companion to the talk recording.

Vlog or interview video captioning

Vlog episodes or recorded interviews destined for YouTube, Instagram Reels, or TikTok. Transcribe for the captions track that boosts watch time on muted-by-default mobile feeds.

Video transcription tips

For videos over 60 MB, extract audio first

A 30-minute 1080p MP4 can easily be 300 MB. The audio inside is typically 10-30 MB. Run ffmpeg -i video.mp4 -vn -ac 1 -ar 16000 audio.m4a to strip the video, then upload the audio file. Same transcript, much smaller upload.

Screen recordings often have low-quality audio

Microphone audio captured during screen recording is often quieter and noisier than purpose-recorded podcast audio. Whisper handles it, but if accuracy is poor, consider re-recording the audio with a closer mic for important content.

SRT vs VTT: pick by destination

SRT is the universal choice for video editors (Premiere, Final Cut, DaVinci Resolve). VTT (WebVTT) is the W3C format for HTML5 video and is also accepted by YouTube. The cue text and timings are identical; only the file syntax differs. We let you download either.
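To make the "same content, different syntax" point concrete, here is a minimal Python sketch that converts SRT text to WebVTT. It is an illustration only, not how Mictoo generates its downloads: the function name srt_to_vtt is hypothetical, and the sketch assumes well-formed SRT with two-digit hours and no styling tags.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
_TIMING = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT (hypothetical helper, simple cases only).

    The two changes are the "WEBVTT" header line and a period instead of
    a comma before the milliseconds. Cue numbers are kept, because WebVTT
    allows them as cue identifiers.
    """
    out = ["WEBVTT", ""]
    for line in srt_text.strip().splitlines():
        # Only timing lines are rewritten; commas in caption text stay.
        out.append(_TIMING.sub(r"\1.\2 --> \3.\4", line))
    return "\n".join(out) + "\n"
```

The cue text passes through untouched, so a comma inside a caption such as "Hello, world" is preserved; only the timing lines change.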

For YouTube uploads, Whisper beats auto-captions

YouTube auto-captions are usable but noticeably worse than Whisper, especially for technical terms, proper names, and accented speakers. Transcribe with Mictoo, download SRT, upload manually for cleaner captions.

How video files work for transcription

A video file is a container (MP4, MOV, WebM, MKV, AVI) holding at least one video track and at least one audio track, plus optional subtitle tracks and metadata. The container specifies how the tracks are interleaved on disk and stamped with timestamps; the actual video and audio data are encoded separately with their own codecs.

For transcription, the video frames are irrelevant. We demux the file (extract individual tracks from the container), discard the video track, take the audio track, decode it to PCM, and feed it to Whisper. The video frames never reach the transcription model.
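The demux-and-decode step can be sketched in Python as follows. This is an assumption-laden illustration, not Mictoo's actual pipeline: the function names are hypothetical, ffmpeg is assumed to be on the PATH, and 16 kHz mono 16-bit PCM is chosen because that is the input rate Whisper works with.

```python
import array
import subprocess
import sys


def ffmpeg_pcm_command(video_path: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg argv that drops video and writes raw PCM to stdout."""
    return [
        "ffmpeg", "-nostdin",
        "-i", video_path,
        "-vn",                    # discard the video track
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        "-f", "s16le",            # raw signed 16-bit little-endian output
        "-acodec", "pcm_s16le",
        "-",                      # write to stdout instead of a file
    ]


def pcm_s16le_to_floats(raw: bytes) -> list[float]:
    """Convert raw 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the data is little-endian regardless of host
    return [s / 32768.0 for s in samples]


def decode_audio(video_path: str) -> list[float]:
    """Run ffmpeg and return the decoded samples (requires ffmpeg installed)."""
    result = subprocess.run(
        ffmpeg_pcm_command(video_path), capture_output=True, check=True
    )
    return pcm_s16le_to_floats(result.stdout)
```

Calling decode_audio("video.mp4") would return the mono samples that a speech model consumes; at no point are video frames decoded.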

The common video containers and their audio codecs

MP4 (also called .m4v in some contexts) is the H.264 era's universal container. Audio inside is almost always AAC, occasionally MP3 or AC-3. iPhone video, screen recorders on most platforms, YouTube downloads, exported video editor projects all default to MP4 with AAC audio. This is the format you will encounter most often.

MOV is Apple's QuickTime container, very similar to MP4 internally (both are derived from the ISO Base Media File Format). MOV files often hold higher-bitrate or higher-resolution content from professional editing pipelines. Audio is usually AAC, sometimes ALAC for lossless, occasionally PCM for uncompressed.

WebM is the open-source video container, originally pushed by Google for HTML5 video. Audio inside is Opus or Vorbis; video is VP8 or VP9. Browser-recorded video (the MediaRecorder API output) is typically WebM. Some screen recording tools also default to WebM.

MKV (Matroska) is a flexible container that can hold almost any codec, common for downloaded media and high-quality archives. Audio inside varies: AAC, AC-3, DTS, FLAC, Opus are all common.

AVI is the older container from Microsoft, less common today but still present in older video collections. Audio inside is usually MP3 or AC-3.

Why video files are so much bigger than audio-only equivalents

A 1080p 30-minute video at typical YouTube quality is 300-500 MB. The audio track inside is 10-30 MB. The video track is roughly 95% of the file size and none of what transcription needs.
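The arithmetic behind these figures is bitrate times duration. The Python sketch below is a back-of-the-envelope check, assuming constant bitrate and 1 MB = 10**6 bytes; the helper names and the example bitrates are illustrative choices, not measured values.

```python
def stream_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size in MB (10**6 bytes) of a constant-bitrate stream."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


def video_share(total_mb: float, audio_mb: float) -> float:
    """Fraction of the file taken up by everything except the audio track."""
    return (total_mb - audio_mb) / total_mb


# 30 minutes of audio at two illustrative bitrates:
low = stream_size_mb(48, 30)    # about 10.8 MB
high = stream_size_mb(128, 30)  # about 28.8 MB

# A 300 MB file with a 15 MB audio track is about 95% video:
share = video_share(300, 15)
```

Under these assumptions, audio encoded at roughly 48 to 128 kbps lands in the 10-30 MB range quoted above for a 30-minute file.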

If your video file is over the 60 MB free tier limit and you have ffmpeg installed, extracting the audio is one command: ffmpeg -i video.mp4 -vn -ac 1 -ar 16000 audio.m4a. The -vn flag drops video, -ac 1 converts to mono, -ar 16000 resamples to 16 kHz (the same rate Whisper uses internally). The resulting M4A is typically 5-10 MB for a 30-minute video.
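If you prepare uploads from a script, the same decision can be automated. The Python sketch below is a hypothetical helper, not part of Mictoo: it assumes the 60 MB cap stated above, treats 1 MB as 10**6 bytes (the exact definition the site uses is not stated), and expects ffmpeg on the PATH.

```python
import os
import subprocess

FREE_TIER_CAP_MB = 60  # cap as stated in the text; MB = 10**6 bytes is assumed


def exceeds_cap(size_bytes: int, cap_mb: int = FREE_TIER_CAP_MB) -> bool:
    """Return True if a file of this size is over the upload cap."""
    return size_bytes > cap_mb * 1_000_000


def extract_audio_command(video_path: str, audio_path: str = "audio.m4a") -> list[str]:
    """Build the argv for the one-line extraction command from the text."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",           # drop the video track
        "-ac", "1",      # mono
        "-ar", "16000",  # 16 kHz
        audio_path,
    ]


def prepare_upload(video_path: str) -> str:
    """Return the path to upload: the video itself, or extracted audio."""
    if not exceeds_cap(os.path.getsize(video_path)):
        return video_path
    audio_path = os.path.splitext(video_path)[0] + ".m4a"
    subprocess.run(extract_audio_command(video_path, audio_path), check=True)
    return audio_path
```

Small videos are returned unchanged, so the extraction step only runs when the upload would otherwise be rejected.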

What about multi-track audio in video files

Some video files have multiple audio tracks: original language plus dubbed versions, dialogue plus separate music/effects tracks, or commentary tracks. We currently transcribe the default audio track (normally the first one, 0:a:0 in ffmpeg's stream notation). If your video has a non-default track you want transcribed, extract that track first with ffmpeg -i video.mp4 -map 0:a:1 -vn audio.m4a (where 0:a:1 selects the second audio track).
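To find out which audio tracks a file contains before picking one, ffprobe (shipped with ffmpeg) can list them as JSON. The Python sketch below builds that ffprobe command, parses its output, and builds the matching extraction command. The function names and the sample JSON are illustrative assumptions; real output depends on your file.

```python
import json


def ffprobe_audio_command(video_path: str) -> list[str]:
    """Build an ffprobe argv that lists only the audio streams as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",
        "-show_entries", "stream=index,codec_name:stream_tags=language",
        "-of", "json",
        video_path,
    ]


def list_audio_tracks(ffprobe_json: str) -> list[dict]:
    """Parse ffprobe JSON into tracks numbered the way -map 0:a:N expects."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return [
        {
            "audio_index": n,  # position among audio streams, used by 0:a:N
            "codec": stream.get("codec_name"),
            "language": stream.get("tags", {}).get("language"),
        }
        for n, stream in enumerate(streams)
    ]


def extract_track_command(video_path: str, audio_index: int,
                          audio_path: str = "audio.m4a") -> list[str]:
    """Build the ffmpeg argv that extracts one audio track by its index."""
    return ["ffmpeg", "-i", video_path,
            "-map", f"0:a:{audio_index}", "-vn", audio_path]


# Made-up sample of what ffprobe might print for a two-track file:
sample = ('{"streams": [{"index": 1, "codec_name": "aac", '
          '"tags": {"language": "eng"}}, '
          '{"index": 2, "codec_name": "aac", "tags": {"language": "spa"}}]}')
tracks = list_audio_tracks(sample)
```

In this made-up sample the Spanish track has audio_index 1, so extract_track_command("video.mp4", 1) reproduces the command shown in the text.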

Subtitle tracks vs transcription

If your video already has burned-in subtitles (rendered into the video frames as visible text), they do not help transcription. We never look at video frames. If your video has a separate subtitle track inside the container (sometimes called soft subs, common in MKV), we still ignore it during transcription; we transcribe from the audio. To extract existing soft subtitles, use a tool like MKVToolNix instead of Mictoo.

Frequently asked questions

What video formats does Mictoo accept?

MP4, MOV (QuickTime), WebM, MKV (Matroska), AVI. We accept the most common container formats holding standard codecs (H.264, H.265, VP8, VP9 video; AAC, MP3, Opus, Vorbis, AC-3 audio). Free tier accepts files up to 60 MB.

Do I need to extract audio from the video first?

No. We extract the audio track from the video container on our side and discard the video frames. Upload the video directly. The only reason to extract audio first is if the video file exceeds the 60 MB upload cap.

My 30-minute video is 300 MB, over the cap. What do I do?

Extract just the audio with ffmpeg -i video.mp4 -vn -ac 1 -ar 16000 audio.m4a. The audio file will be 5-15 MB. Upload the audio file instead. Same transcript, smaller upload. Or use the YouTube URL field if the video is on YouTube.

Can I get SRT subtitles back?

Yes. Download as SRT for video editor use (Premiere, Final Cut, DaVinci Resolve), or VTT for HTML5 video and YouTube. Timestamps align to the original video timeline, so subtitles sync correctly without re-alignment.

Are timestamps exact enough for closed captions?

Yes. Whisper produces word-level timestamps that group into segment timestamps suitable for caption files. Cuts typically land within 100-300 ms of the actual sentence boundaries, which is comfortable for caption display timing.
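For readers curious how segment timestamps become a caption file, here is a minimal Python sketch that formats (start, end, text) segments as SRT. It is not Mictoo's exporter: the function names and the tuple layout are assumptions made for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Turn (start_seconds, end_seconds, text) segments into SRT text."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

Because the times are measured from the start of the audio track, cues produced this way line up with the original video timeline as long as the audio was not trimmed before transcription.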

Does Mictoo transcribe video in languages other than English?

Yes. Whisper large-v3 supports 50+ languages with auto-detection. For short videos or non-English content, set the language explicitly in the dropdown before upload for cleaner first-pass detection.

Can I get the video transcribed and then translated?

Yes. After transcription, pick a target language and click Translate. The translated text appears alongside the original. SRT and VTT downloads also work for the translated version, useful for subtitle versions in multiple languages.

What about screen recordings with multiple speakers?

Whisper does not currently distinguish speakers in the transcript ("speaker 1: ... speaker 2: ..."). Speaker diarisation is on our Pro tier roadmap. For now, the transcript is continuous text without speaker labels.

Does Mictoo handle vertical video (TikTok, Reels) the same as horizontal?

Yes. We only look at the audio track; the video aspect ratio does not matter. Vertical MP4 from TikTok exports, Reels recordings, and Stories transcribe identically to horizontal video.

Will my video file be saved on your servers?

No. The video streams to our processor for audio extraction, the extracted audio goes to the transcription provider, and both are dropped from memory after processing. The video itself is never written to disk on our side.