Live Transcription

Real-time captions for your mic or any browser tab. Captions appear as you speak. Use it for dictation, live meeting captions, accessibility, or voice typing. Free, no signup, no server.


Captions appear as you speak. Your audio stays inside this browser tab, never reaches a server, and nothing is saved to your computer. The first visit prepares the AI tools in your browser cache (about 155 MB, one-time). After that everything starts instantly.

Running in compatibility mode. Captions may take a moment longer to appear. Try Chrome or Edge on a recent computer for the fastest experience.


Your browser will set up the speech recognition tools the first time you use this. After that, every session starts instantly.

About

This tool is tuned for live captions and dictation. It does not tag speakers. For multi-speaker meeting transcripts with Speaker 1 / 2 / 3 labels, use the Audio to Text tool instead.

What is Live Transcription?

Live transcription is the process of converting speech into text with near-zero delay, so captions appear on screen while someone is still talking. This tool runs UsefulSensors' Moonshine ASR model and Silero VAD (voice activity detection) in your browser via WebGPU, producing captions with roughly 500ms latency. Nothing is sent to a server. For recordings that need speaker labels or SRT/VTT export, use the Audio to Text tool instead.
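The caption loop described above can be pictured as a VAD-gated pipeline: incoming audio frames are buffered, a voice activity detector decides where a phrase ends, and only completed phrases are handed to the ASR model. A minimal sketch of that gating logic in TypeScript; the RMS-energy threshold here is a stand-in for illustration, not Silero VAD's actual algorithm, and the frame sizes and parameter names are assumptions:

```typescript
// Toy phrase segmenter: shows how a VAD gates audio before ASR.
// NOT Silero VAD -- a plain RMS-energy threshold for illustration only.
type Frame = Float32Array;

function rms(frame: Frame): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

/** Collect frames into phrases; a phrase ends after `hangover` silent frames. */
function segmentPhrases(frames: Frame[], threshold = 0.01, hangover = 3): Frame[][] {
  const phrases: Frame[][] = [];
  let current: Frame[] = [];
  let silent = 0;
  for (const frame of frames) {
    if (rms(frame) >= threshold) {
      current.push(frame);
      silent = 0;
    } else if (current.length > 0) {
      silent += 1;
      if (silent >= hangover) {
        // Phrase boundary: hand the buffered frames to the ASR model here.
        phrases.push(current);
        current = [];
        silent = 0;
      }
    }
  }
  if (current.length > 0) phrases.push(current); // flush the trailing phrase
  return phrases;
}
```

Gating on phrase boundaries is what keeps latency near 500ms: the ASR model transcribes short completed phrases instead of waiting for the whole recording.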

How to use Live Transcription

  1. Pick an input mode

    Choose Microphone to caption your own speech, or Tab Capture (Chrome and Edge only) to caption audio playing in another browser tab.

  2. Wait for the model to load

    Moonshine ASR and Silero VAD download from Hugging Face on first use (~155 MB). This takes 1-2 minutes on a typical connection. The models are cached after the first load.

  3. Start speaking

    Click Start. Captions appear sentence by sentence as the voice activity detector finds phrase boundaries. Expect roughly 500ms of lag on WebGPU-enabled devices; slightly more on the WebAssembly fallback.

  4. Stop and review

    Click Stop to end the session. The full transcript is shown in the text area and can be edited before export.

  5. Export as TXT

    Click Download to save the session transcript as a plain text file.
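The export step amounts to concatenating the finalized captions into one text file. A sketch of the kind of formatter a download button might use; the `Caption` shape and the `[mm:ss]` timestamp style are assumptions for illustration, not this tool's actual internals:

```typescript
// Turn finalized caption segments into the plain-text file a user downloads.
// The Caption shape is illustrative; the real tool's data model may differ.
interface Caption {
  startMs: number; // when the phrase began, relative to session start
  text: string;
}

function formatTimestamp(ms: number): string {
  const totalSec = Math.floor(ms / 1000);
  const m = Math.floor(totalSec / 60).toString().padStart(2, "0");
  const s = (totalSec % 60).toString().padStart(2, "0");
  return `[${m}:${s}]`;
}

function toTxt(captions: Caption[]): string {
  return captions
    .map((c) => `${formatTimestamp(c.startMs)} ${c.text.trim()}`)
    .join("\n");
}
```

In the browser, the resulting string would be wrapped in a `Blob` and offered through a temporary object URL; that plumbing is omitted here.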

Works with

Live meeting captions (via Tab Capture)

  • Zoom Web Client
  • Google Meet
  • Microsoft Teams web
  • Webex
  • Discord browser calls
  • Slack huddles in browser
  • Whereby and Jitsi calls

Video and stream captions (via Tab Capture)

  • YouTube live streams and on-demand videos
  • Vimeo videos without built-in captions
  • Twitch live streams
  • Loom recordings
  • Any video playing in a browser tab

Dictation and voice typing

  • Email and document drafting
  • Blog post and article writing
  • Voice notes and quick capture
  • Journaling and stream-of-consciousness writing
  • Search query dictation
  • Code comments and pseudocode

Accessibility

  • Personal live captions for hearing impairment
  • Real-time captions for foreign-language video viewing
  • Caption assistance during phone calls played through speaker
  • Reading along with podcasts and audiobooks

Live event coverage

  • Conference talks streamed live
  • Webinar follow-along notes
  • Lecture and class participation notes
  • Live sports commentary capture
  • Live news broadcast captioning

Frequently Asked Questions

Is my audio uploaded anywhere?

No. Moonshine ASR and Silero VAD run inside your browser using WebGPU or WebAssembly. Your microphone stream and tab audio stay on your device. No data leaves your browser during or after a session.

What is Moonshine and how does it compare to Whisper?

Moonshine is a speech recognition model from UsefulSensors built for on-device real-time inference. On short audio segments it achieves lower word error rate than Whisper-tiny while running faster, which is why it suits live captioning. Whisper-base.en (used in the Audio to Text tool) is stronger on longer recordings and supports speaker diarization.

Why is the first session slow to start?

The Moonshine ASR model and Silero VAD checkpoint total approximately 155 MB and are downloaded from the Hugging Face CDN on first use. After that they are cached in the browser's IndexedDB, so loading takes only a few seconds on return visits.

Does it work offline?

After the initial model download, yes. The transcription itself requires no internet connection. You need connectivity only for the first-ever load to pull the model files.

What languages does Moonshine support?

The current build uses Moonshine's English model. Non-English speech will produce garbled output. Multilingual support depends on future model releases from UsefulSensors.

Can I get speaker labels in the output?

No. This tool identifies speech boundaries but does not attribute them to individual speakers. If you need Speaker 1 / Speaker 2 labels and SRT or VTT export, use the Audio to Text tool.

Which browsers are supported?

Chrome and Edge have the best experience because they support both WebGPU and Tab Capture. Firefox can run the tool using WebAssembly but will be slower and does not support Tab Capture. Safari does not yet support WebGPU and has limited WebAssembly SIMD support, so it is not recommended.
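The browser matrix above boils down to a feature check: prefer WebGPU where it exists, fall back to WebAssembly otherwise. A minimal sketch of that detection, written against a small `navigator`-like shape so it can be exercised outside a browser; the real tool's logic may differ:

```typescript
// Pick the faster available inference backend, mirroring the support
// matrix above: WebGPU where present, WebAssembly otherwise.
type Backend = "webgpu" | "wasm";

interface NavigatorLike {
  gpu?: unknown; // navigator.gpu exists only in WebGPU-capable browsers
}

function pickBackend(nav: NavigatorLike): Backend {
  return nav.gpu !== undefined ? "webgpu" : "wasm";
}
```

In the page itself this would be called as `pickBackend(navigator)`. A thorough check would also await `navigator.gpu.requestAdapter()` and fall back if it resolves to null, since `navigator.gpu` can be present while no usable adapter is available.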

How is this different from Chrome's built-in live captions?

Chrome's built-in captions are OS-level and cannot be exported or edited. This tool produces a full session transcript you can download as TXT, and it works across any Chromium-based browser on any OS without enabling any system-level setting.

Can I use this for dictation or voice typing?

Yes. Click Start, speak normally, and your words appear on screen sentence by sentence. When you're done, copy or download the text and paste it wherever you need (email, document, blog editor). It's a privacy-respecting alternative to dictation features that send audio to Google or Apple servers.

Is this useful for accessibility or hearing impairment?

It works as a personal live captioning tool. Open Live Transcription in one tab, then play a video, podcast, or video call audio in another tab using Tab Capture. Captions appear with about 500ms delay. Note that this is a self-serve tool, not a certified ADA assistive technology, so do not rely on it as a replacement for professional captioning in legal or medical contexts.

Can I use this to caption a Zoom or Google Meet call?

Yes, for the audio side. Use Tab Capture to share the meeting tab and tick "Share tab audio" in the picker. Captions will appear for whoever is speaking through your speakers. Note that Live Transcription does not label speakers; if you need a transcript with Speaker 1 / Speaker 2 labels, record the meeting and use the Audio to Text tool afterward.
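Under the hood, tab capture means calling `getDisplayMedia` and then checking that the user actually ticked "Share tab audio", because the picker lets a tab be shared with video only. A hedged sketch of that guard; the browser-only call is shown in comments, and the surrounding names are assumptions rather than this tool's actual code:

```typescript
// Guard used after getDisplayMedia: the user can share a tab without
// ticking "Share tab audio", leaving the stream with no audio track.
interface StreamLike {
  getAudioTracks(): unknown[];
}

function hasTabAudio(stream: StreamLike): boolean {
  return stream.getAudioTracks().length > 0;
}

// In the browser (sketch only):
// const stream = await navigator.mediaDevices.getDisplayMedia({
//   video: true, // required by the API even though only audio is used
//   audio: true, // asks the picker to offer "Share tab audio"
// });
// if (!hasTabAudio(stream)) {
//   // Prompt the user to re-share and tick "Share tab audio".
// }
```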

Related Tools