What an SRT file actually contains
SubRip (.srt) is one of the simplest subtitle formats: a plain text file with numbered segments, each containing a start timestamp, an end timestamp, and one or two lines of caption text, with a blank line between segments. Two consecutive segments look like:
1
00:00:01,000 --> 00:00:03,500
Welcome to the show.

2
00:00:03,600 --> 00:00:06,200
Today we talk about
the subtitle generator workflow.
That is the whole format. No styling, no positioning, no font specification. The simplicity is why it works everywhere: parsing SRT is trivial enough that even indie video tools implement it without fuss.
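To illustrate how little parsing is involved, here is a minimal sketch of an SRT reader. The names Cue and parse_srt are made up for this example, and the parser is deliberately lenient: it skips blocks it cannot read instead of reporting errors, which a production parser would probably not do.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start_ms: int
    end_ms: int
    text: str


# Timing line, e.g. "00:00:01,000 --> 00:00:03,500"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; malformed blocks are skipped (an assumption)."""
    # Drop a leading byte order mark and normalise line endings to LF.
    content = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    # Segments are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = TIMING.match(lines[1].strip())
        if match is None:
            continue
        fields = match.groups()
        cues.append(
            Cue(
                index=int(lines[0].strip()),
                start_ms=_to_ms(*fields[:4]),
                end_ms=_to_ms(*fields[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return cues
```

Calling parse_srt on the two-segment example above would yield two Cue objects, the second one carrying both caption lines joined by a newline.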
SRT vs VTT: what is the actual difference
VTT (WebVTT) is the W3C caption format for HTML5 video. On top of timed text it adds optional styling (positioning, colors, classes) and metadata. For the basic case (text with timestamps), VTT is almost the same as SRT: the file starts with a WEBVTT header line, and timestamps use a period instead of a comma before the milliseconds.
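For the basic case, the conversion can be sketched in a few lines. The function name srt_to_vtt is hypothetical, and the sketch assumes plain, well-formed SRT input: it does not handle caption text that itself contains "-->", and it carries over no styling.

```python
import re

# Matches an SRT timestamp such as 00:00:01,000 and captures both halves.
TIMESTAMP_COMMA = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert basic SRT text to WebVTT (sketch; assumes well-formed input)."""
    body = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    converted = []
    for line in body.split("\n"):
        # Only rewrite timing lines, so commas in caption text are untouched.
        if "-->" in line:
            line = TIMESTAMP_COMMA.sub(r"\1.\2", line)
        converted.append(line)
    # WebVTT requires the WEBVTT header followed by a blank line.
    return "WEBVTT\n\n" + "\n".join(converted) + "\n"
```

The segment numbers are left in place, since WebVTT accepts them as optional cue identifiers.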
Practical choice: use SRT if your target is a video editor or YouTube. Use VTT if your target is HTML5 video on your own website (the standard HTML <track> element expects VTT). Mictoo offers both downloads from the same source transcription.
How Whisper produces caption timing
Whisper outputs word-level timestamps for the whole transcription. We group consecutive words into caption segments using a few rules: keep segments under ~84 characters (so they fit on two lines of a typical caption display), break at sentence and clause boundaries where possible, and keep individual segments between 2 and 7 seconds long. The resulting segments read naturally on screen rather than ending mid-clause.
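The grouping rules can be approximated with a simple greedy pass. This is an illustrative sketch, not Mictoo's actual implementation: the Word type, the group_words function, and the exact thresholds are assumptions, and the sketch treats sentence and clause punctuation alike. A forced break on length or duration can still produce a segment shorter than the 2-second minimum.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


MAX_CHARS = 84      # roughly two caption lines
MIN_SECONDS = 2.0   # do not break at punctuation before this
MAX_SECONDS = 7.0   # force a break beyond this
BREAK_CHARS = (".", "!", "?", ",", ";", ":")


def group_words(words):
    """Greedily group timed words into (start, end, text) caption segments."""
    segments = []
    current = []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current) + " " + word.text
            too_long = len(candidate) > MAX_CHARS
            too_slow = word.end - current[0].start > MAX_SECONDS
            # Close the running segment before it exceeds either limit.
            if too_long or too_slow:
                segments.append(current)
                current = []
        current.append(word)
        duration = current[-1].end - current[0].start
        # Prefer to end a segment at punctuation once it is long enough.
        if word.text.endswith(BREAK_CHARS) and duration >= MIN_SECONDS:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return [
        (seg[0].start, seg[-1].end, " ".join(w.text for w in seg))
        for seg in segments
    ]
```

A more careful version might look ahead for the best break point within the allowed window instead of breaking at the first qualifying punctuation mark.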
Timestamp accuracy is typically within 100-300 ms of the actual word boundaries, which is comfortable for caption display (viewers tolerate small drift, especially when captions appear slightly before the speech).
Why "burned-in" captions are different
SRT files are external captions: the .srt file lives alongside the video, and the player or editor renders the text on top. Burned-in captions are pixels baked into the video frames during render. Burned-in captions cannot be turned off, cannot be translated, cannot be re-edited. External captions (SRT or VTT) can be toggled, replaced with translated versions, or edited without re-rendering.
For most use cases (YouTube, web video, NLE projects), external SRT captions are preferred for the flexibility. For platforms that do not support uploadable captions (some social platforms, downloaded video for offline viewing), burn the captions in during the video editor export, using the SRT as the source for caption text.
Common SRT pitfalls and how to avoid them
Missing blank line between segments: SRT requires a single blank line between numbered segments. Some tools omit it and the file silently fails to parse in strict players. Mictoo emits properly formatted SRT with blank lines.
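If you generate SRT yourself, one way to guarantee the separator is to build each segment as a block and join the blocks with a blank line. A minimal sketch, with hypothetical helper names format_timestamp and write_srt and cues given as (start_ms, end_ms, text) tuples:

```python
def format_timestamp(ms: int) -> str:
    """Render milliseconds as an SRT timestamp, e.g. 00:00:03,500."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def write_srt(cues) -> str:
    """Serialise (start_ms, end_ms, text) tuples to SRT text."""
    blocks = []
    for number, (start_ms, end_ms, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
        blocks.append(f"{number}\n{timing}\n{text.strip()}")
    # Exactly one blank line between segments, one newline at the end.
    return "\n\n".join(blocks) + "\n"
```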
Wrong line ending convention (CRLF vs LF): SRT has no formal specification, and most parsers accept either. YouTube and most NLEs handle both. Some older Windows-only tools require CRLF. Mictoo emits LF by default; convert with a text editor if your target tool needs CRLF.
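The conversion can also be scripted. A small sketch, where to_crlf is a hypothetical helper name:

```python
def to_crlf(text: str) -> str:
    """Return text with every line ending converted to CRLF."""
    # Collapse existing CRLF and bare CR to LF first, so no \r gets doubled.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.replace("\n", "\r\n")
```

When saving the result from Python, open the file with newline="" (or write encoded bytes) so that the runtime does not translate the line endings a second time on Windows.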
Encoding: SRT files should be UTF-8 for non-ASCII characters (accented letters, non-Latin scripts, emoji). Mictoo emits UTF-8. If you see "garbled accents" in your destination tool, it is reading the file as Latin-1 or Windows-1252 instead of UTF-8.
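The symptom is easy to reproduce, which helps in recognising it. The sketch below decodes UTF-8 bytes as Windows-1252 to show what a misconfigured tool would display, and includes a hypothetical looks_like_utf8 helper for checking a file before blaming the destination tool.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "Café".encode("utf-8")

# What a tool shows when it wrongly reads UTF-8 bytes as Windows-1252:
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # CafÃ©

# The file itself is fine; only the reader's assumption was wrong.
print(looks_like_utf8(utf8_bytes))  # True
```

A doubled character such as "Ã©" where "é" was expected is the typical sign that the file is valid UTF-8 and the fix belongs in the destination tool's import settings.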