Audio to Text
Transcribe MP3, WAV, M4A, voice memos, Zoom recordings, and podcasts in your browser. Speaker labels included. Export TXT, SRT, VTT. No upload, no signup, no cost.
Your audio stays inside this browser tab: it never reaches a server, and nothing is saved to your computer. Your first visit downloads the AI models to the browser cache (about 80 MB, one time). After that, transcription starts right away.
Running in compatibility mode. Transcription will work but may take longer. For the fastest experience, try Chrome or Edge on a recent computer.
Drop or pick an audio file to transcribe.
Tip
Record from your mic, capture another browser tab, or upload an audio file. When you press Stop, you get a full transcript with speakers labeled. Click any "Speaker 1" label to rename it to the actual person's name before exporting.
What is Audio to Text?
Audio transcription converts spoken audio into a written text document. This tool runs OpenAI's Whisper-base.en model and a speaker-diarization pipeline directly in your browser via WebGPU or WebAssembly, so the audio file or microphone stream is never sent to a server. Each speaker is labeled (Speaker 1, Speaker 2, and so on) and the output can be exported as plain text, SRT subtitles, or WebVTT captions.
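Pairing transcript segments with diarization turns comes down to overlap matching: each segment gets the speaker whose turn covers it the most. A minimal sketch in JavaScript; the `{ start, end }` shapes here are illustrative, not this tool's actual internals:

```javascript
// Assign each transcript segment the speaker whose diarization turn
// overlaps it the most. Segments and turns use seconds for start/end.
function labelSegments(segments, turns) {
  return segments.map((seg) => {
    // Default label if no turn overlaps the segment at all.
    let best = { speaker: "Speaker 1", overlap: 0 };
    for (const turn of turns) {
      const overlap =
        Math.min(seg.end, turn.end) - Math.max(seg.start, turn.start);
      if (overlap > best.overlap) best = { speaker: turn.speaker, overlap };
    }
    return { ...seg, speaker: best.speaker };
  });
}
```

This is why renaming a label propagates cleanly: the speaker identity lives on the turn, and every segment merely points at one.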
How to use Audio to Text
1. Pick an input mode
Choose Microphone to record from your device, Tab Capture to transcribe audio playing in a Chrome or Edge tab, or File Upload to drop an MP3, WAV, M4A, OGG, or WebM file.
2. Wait for the model to load
On first use, Whisper-base.en and the speaker-segmentation model download from Hugging Face (about 80 MB total). This takes 30-60 seconds depending on your connection; return visits load the cached models in a few seconds.
3. Start and stop recording
Click Start. For microphone and tab capture, speak or play the audio, then click Stop. For file uploads, processing begins automatically after the model is ready.
4. Review and rename speakers
The transcript appears with Speaker 1, Speaker 2 labels. Click any label to rename it (for example, to the person's actual name) before exporting.
5. Export the transcript
Download as TXT for plain text, SRT for video subtitle tracks, or VTT for web captions. All three formats include timestamps.
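The three export formats differ mostly in timestamp syntax: SRT puts a comma before the milliseconds, VTT a period. A hedged sketch of the conversion, assuming labeled segments shaped like `{ start, end, speaker, text }` (an illustrative shape, not this tool's exact output):

```javascript
// Format a time in seconds as an SRT ("HH:MM:SS,mmm") or
// VTT ("HH:MM:SS.mmm") timestamp.
function formatTimestamp(seconds, { vtt = false } = {}) {
  const ms = Math.round(seconds * 1000);
  const pad = (n, width) => String(n).padStart(width, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const sep = vtt ? "." : ",";
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}${sep}${pad(ms % 1000, 3)}`;
}

// Build an SRT subtitle track: numbered cues, one per segment.
function toSrt(segments) {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${formatTimestamp(seg.start)} --> ` +
        `${formatTimestamp(seg.end)}\n${seg.speaker}: ${seg.text}\n`
    )
    .join("\n");
}
```

For example, `formatTimestamp(3661.5)` yields `01:01:01,500`, and passing `{ vtt: true }` swaps the comma for a period as WebVTT requires.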
Works with
Meeting & call recordings
- Zoom (cloud and local recordings)
- Google Meet recordings
- Microsoft Teams meeting recordings
- Webex recordings
- Discord call recordings
- Slack huddle recordings
- GoToMeeting and BlueJeans archives
Video platforms (via Tab Capture)
- YouTube videos
- Vimeo videos
- Loom screen recordings
- Twitch stream replays
- TikTok and Instagram Reels (when played in browser)
- Any video playing in a Chrome or Edge tab
Audio file formats
- MP3
- WAV
- M4A
- OGG
- WebM
- FLAC
- AAC
- Opus
Mobile recordings
- iPhone Voice Memos (.m4a)
- Android voice recorder files
- WhatsApp voice notes
- Telegram voice messages
- Field recordings from journalism apps
Content types
- Podcasts and podcast interviews
- University lectures and webinars
- Journalist interviews and source recordings
- Conference talks and panel discussions
- Sermons and religious talks
- Court depositions and legal recordings
- Oral history and ethnographic research
- Customer support call recordings
- Sales call recordings for note-taking
Frequently Asked Questions
Is my audio uploaded to a server?
No. The Whisper model runs inside your browser using WebGPU (or WebAssembly as a fallback). Your microphone stream, tab audio, and uploaded files are processed locally and never sent anywhere. You can verify this by opening DevTools and checking the Network tab during transcription.
What audio file formats are supported?
MP3, WAV, M4A, OGG, WebM, FLAC, AAC, and Opus. File size is limited by your device's available RAM rather than a server cap; most recordings under 2 hours process without issue on a modern laptop.
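The RAM limit is straightforward to estimate. Whisper's standard input format is 16 kHz mono 32-bit floats (an assumption about this tool's decoding pipeline, but the model's usual preprocessing), so the raw samples for a recording scale linearly with its duration:

```javascript
// Rough memory footprint of a recording once decoded for Whisper:
// 16,000 samples per second, 4 bytes per Float32 sample, mono.
const SAMPLE_RATE = 16000;
const BYTES_PER_SAMPLE = 4;

function decodedSizeBytes(durationSeconds) {
  return durationSeconds * SAMPLE_RATE * BYTES_PER_SAMPLE;
}

// A 2-hour recording needs roughly 460 MB for raw samples alone,
// before the model's own working memory is counted.
const twoHourBytes = decodedSizeBytes(2 * 60 * 60); // 460,800,000 bytes
```

That works out to about 230 MB per hour of audio, which is why 2-hour files fit comfortably on most laptops but much longer ones may not.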
What languages does it support?
The default model is Whisper-base.en, optimized for English. It will produce output for other languages but accuracy drops significantly. A multilingual Whisper variant may be added in a future update.
Why is the first transcription slow?
The first run downloads Whisper-base.en and pyannote-segmentation-3.0 from Hugging Face CDN (~80 MB combined). After that, both models are cached in the browser's IndexedDB and load in a few seconds on return visits.
How accurate is speaker diarization?
Speaker detection works best when speakers take distinct turns and there is minimal crosstalk. Accuracy degrades with heavy background noise, overlapping speech, or more than four speakers. The labels are a starting point; rename them to reflect the actual speakers before publishing.
Can I transcribe a Zoom, Google Meet, or Microsoft Teams recording?
Yes, in two ways. If you already downloaded the recording (typical Zoom .m4a or .mp4 file), drop it into File Upload mode. If the meeting is currently playing back in a browser tab, switch to Tab Capture mode (Chrome or Edge), share that tab, and tick "Share tab audio." Speaker labels make it easy to attribute who said what afterward.
Can I transcribe a voice memo from my iPhone or Android?
Yes. iPhone Voice Memos export as .m4a, Android phones typically save .m4a or .amr. AirDrop or share the file to a desktop browser, then drop it into File Upload mode. Most voice memos under 30 minutes process in under a minute on a recent laptop.
Can I transcribe a YouTube video?
Yes, using Tab Capture mode in Chrome or Edge. Open the YouTube video in another tab, select that tab in the Tab Capture picker, then start transcription. The tool captures the tab's audio output directly. No browser extension required.
Is this a free alternative to Otter.ai, Rev, or Descript?
Yes. Otter, Rev, and Descript all send audio to their servers, require an account, and charge after a free quota. This tool runs everything locally in your browser, is free with no usage caps, and requires no signup. Speaker labels and SRT/VTT export are included.
How does the transcription quality compare to Otter.ai or Rev?
Whisper-base.en is a smaller model than the ones Otter and Rev run on their servers, so on accented or noisy audio its word error rate is somewhat higher. On clean, studio-quality audio (podcasts, recorded meetings), accuracy is comparable. The trade-off: privacy and zero cost in exchange for a few percentage points of accuracy on difficult audio.
What if I need real-time captions instead of a finished transcript?
Use the Live Transcription tool instead. It shows captions as you speak with around 500ms latency. Note that Live Transcription does not include speaker labels.