The short answer: ChatGPT cannot transcribe a long audio file in the way you probably mean. You cannot upload a one-hour podcast MP3 to chat.openai.com and get back a clean timestamped transcript. The longer answer is more interesting, because it's also wrong to say ChatGPT has no audio capability at all. As of mid-2026, there are three distinct things that get called "ChatGPT transcription", and only one of them is what people actually want.
The three things people mean by "ChatGPT transcribe audio"
1. Upload an MP3 to ChatGPT and get back a transcript
This is what most people are asking about. The honest answer: no, this does not work as a standard ChatGPT feature. If you upload an audio file to the chat input, ChatGPT will recognise it as an attachment but won't transcribe the contents into text. The Files tool, the Code Interpreter, and the various plugin attempts at workarounds all have severe length limits (typically 25 MB and a few minutes) and produce inconsistent results.
OpenAI has not added an "upload audio, get transcript" flow to the consumer ChatGPT product. They did add it to the API (the Whisper endpoint), but that's a developer tool, not a chat feature.
2. ChatGPT Voice Mode (the microphone button)
The little microphone icon in the ChatGPT app does use audio. But the goal is conversation, not transcription. ChatGPT listens to what you say, transcribes it internally with Whisper, sends the text to the language model, and reads the answer back to you with TTS. The transcription is real but it's used as input to a conversation. You don't see the transcript as a deliverable.
You can sort of trick it by saying "please write down exactly what I just said, word for word" — but it caps at the maximum voice-input length (roughly a few minutes), and ChatGPT often paraphrases or cleans up filler words rather than giving a strict transcript.
3. Audio inside a multimodal prompt (GPT-4o input)
GPT-4o (the multimodal model behind ChatGPT) technically accepts audio input through the API. In the ChatGPT product, this is exposed only through voice mode. Through the API, a developer can send an audio file and ask the model to transcribe it. But again: 25 MB cap, a few minutes per request, and not the same accuracy guarantees as the dedicated Whisper endpoint.
What ChatGPT uses under the hood: Whisper
Every time ChatGPT processes audio (voice mode, audio in GPT-4o), it runs the input through Whisper, OpenAI's dedicated speech recognition model. Whisper is also what powers almost every modern AI transcription tool, including Mictoo. So when people ask "can ChatGPT transcribe audio?", what they're really asking is whether the consumer ChatGPT app exposes Whisper directly. It does not. ChatGPT is a chat product. Whisper is a speech-to-text model. The interesting bit is that you can use Whisper without going through ChatGPT at all.
Three practical paths in 2026
Path 1: Use a dedicated transcription tool
The simplest option, and the right one for almost everyone. Tools like Mictoo wrap Whisper (or fast equivalents like Deepgram, AssemblyAI) in a normal web interface. Drop an MP3, get text in seconds. Returns timestamps, SRT subtitles, AI summaries. No upload limits in the punishing range — Mictoo handles up to 25 MB anonymously, 60 MB after a free signup, and the underlying Whisper engine accepts files up to several gigabytes.
Cost: zero on the free tier for the typical use case. You get the same Whisper model that ChatGPT's voice mode uses, just exposed as a transcription deliverable instead of a conversational input.
Path 2: Use the OpenAI Whisper API directly
If you have OpenAI API credits and a few minutes to write code, you can call the Whisper endpoint directly. The pricing is $0.006 per minute as of 2026 — about 36 cents an hour. For a developer transcribing one or two files, that's cheaper than any subscription.
Trade-off: you do the file chunking, the language detection, the SRT formatting, the speaker labelling, the summarisation, and the error handling yourself. For a one-off script, that's a weekend. For ongoing use, it's a lot of maintenance.
Path 3: Use ChatGPT voice mode for very short clips
If you have a 30-second voice memo and you specifically want ChatGPT to do something with the words (summarise, translate, rewrite), voice mode does the job in one step. Open the app, hit the microphone, ask: "Listen to this and give me a summary." This is the only case where ChatGPT's built-in audio handling is actually faster than using a dedicated tool, because you skip the "transcribe then paste then ask" loop.
Hard limits: maximum a few minutes per recording, transcript is not stored separately from the conversation, no SRT output, no timestamps, no speaker labels.
Comparison: ChatGPT vs dedicated transcription
| Capability | ChatGPT (consumer) | Whisper API | Mictoo (free) |
|---|---|---|---|
| Upload an audio file | No | Yes (via API) | Yes |
| Get text deliverable | No (voice replies only) | Yes (JSON) | Yes (TXT, SRT, VTT, JSON) |
| Timestamps | No | Yes (with verbose_json) | Yes, every line |
| SRT subtitles | No | Yes (response_format: srt) | Yes, one click |
| Long files (1 hour+) | No | Yes (with chunking) | Yes, with the queue |
| AI summary | Yes (paste text first) | DIY | Built in, free |
| Cost | $20/mo (Plus) | $0.006/min | Free |
| Technical setup needed | No | Yes (write code) | No |
What about ChatGPT and video files?
Same answer as audio, with one extra layer: ChatGPT doesn't accept video file uploads at all in the consumer product. The voice mode is voice-only. The GPT-4o API supports images frame by frame, but not full video streams.
The practical path for video: extract the audio track first (any of these methods — MP4 to MP3, audio extraction guide, ffmpeg one-liner — does it in seconds), then transcribe the audio. Same for YouTube videos: download the video first, extract audio, transcribe.
Specific common questions
Why can't ChatGPT transcribe audio?
It can, technically, with Whisper. But OpenAI hasn't exposed that capability as a first-class feature in the consumer app. The product team has chosen to keep ChatGPT focused on chat and to keep the audio APIs separate. They'd rather sell Whisper API access to developers than build a transcription product that competes with their own paid ChatGPT subscription value proposition.
What is the file upload limit for ChatGPT?
For text and code files, ChatGPT Plus accepts up to 512 MB and 2 million tokens per file, with a daily limit of around 80 files. For audio, the limit isn't officially documented because audio upload isn't officially supported. Workarounds typically cap around 25 MB before timing out.
Can ChatGPT listen to audio in voice mode?
Yes, voice mode listens to your microphone input in real time and transcribes it internally with Whisper. The transcription stays inside the conversation — you don't get it back as a separate text deliverable. Maximum recording length is roughly a few minutes per turn.
Is the GPT-4o multimodal model better at transcription?
For accuracy, no. GPT-4o uses the same Whisper model for speech-to-text. The multimodal part is about handling combined image plus text plus audio in a single context — not about transcription quality. For pure transcription, dedicated Whisper API access gives identical or slightly better results because the request is formatted for that one task.
What about plugins that claim to transcribe inside ChatGPT?
A few exist. They typically work by sending the file to a third-party Whisper-wrapping service and returning the text into the chat. Two problems: (a) the round-trip is slow compared to going directly to a transcription tool, and (b) you're sharing audio with a third party of unknown privacy standards. For non-sensitive recordings, fine. For anything you care about, skip the plugin and use a dedicated tool with a clear privacy policy.
Bottom line
If you want ChatGPT to transcribe an audio file the way Microsoft Word handles a paste, that feature doesn't exist. The model behind ChatGPT (Whisper) does transcription well, but OpenAI exposes it as a separate API endpoint, not as a chat feature. For practical use:
- Have a recording, want a transcript: use a dedicated transcription tool. Mictoo does it free in seconds with the same engine ChatGPT voice mode uses.
- Developer with API credits: Whisper API directly. Six tenths of a cent per minute. Bring your own code.
- Short clip, want the model to do something else with it: ChatGPT voice mode. One step.
ChatGPT is a great chat product. It's not a transcription product. Use the right tool for the job and the friction disappears.