When you build a dictation app, the model is the product. Everything else — the hotkey, the menu bar icon, the dictionary — is packaging around one question: how accurately and how fast does the thing turn your voice into text? So when a new speech model gets called a "Whisper killer," I don't get excited. I get a test file ready. We've shipped EmberType on OpenAI's Whisper since day one, and I've watched enough "revolutionary" ASR models turn out to be a demo reel and a leaderboard screenshot to be wary.
Parakeet is different, and I'll say that up front. NVIDIA's Parakeet-TDT-0.6b-v2 was already good enough that we added it to EmberType as a selectable model. When v3 landed on Hugging Face in August 2025, it added the one thing v2 was missing — languages other than English — without giving up the speed. So we added that too. Both are in the app right now, both flagged experimental, both a click away from becoming your default. This piece is the evaluation behind those decisions.
What Parakeet v3 actually is
Strip the marketing and here's the spec sheet, straight from NVIDIA's model card. Parakeet-TDT-0.6b-v3 is a 600-million-parameter automatic speech recognition model built on a FastConformer encoder with a TDT (Token-and-Duration Transducer) decoder. It's released under a permissive CC-BY-4.0 license, which is why an indie app can ship it at all. It does automatic punctuation and capitalization, emits word- and segment-level timestamps, auto-detects the spoken language, and handles audio up to about 24 minutes in a single pass.
The headline number is its reported average word error rate of 6.34% on the Hugging Face Open ASR Leaderboard — on LibriSpeech test-clean it's a startling 1.93%. Those are strong numbers. But a WER figure on a curated benchmark is the beginning of an evaluation, not the end of one. The interesting parts are what the spec sheet doesn't put in bold.
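For anyone who hasn't computed one by hand, a word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. Here's a minimal sketch in Python. The function name `word_error_rate` and the sample sentences are mine, purely for illustration, and the leaderboard normalizes text before scoring, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: two substituted words out of nine.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Two wrong words in a nine-word sentence is already a WER above 22%, which gives a sense of how few mistakes a model averaging around 6% is making.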
The speed number everyone quotes — and what it means on a Mac
The stat that gets Parakeet its "Whisper killer" headlines is throughput. On the same leaderboard, Parakeet v3 records an RTFx of about 3,332, meaning that on the reference hardware it transcribes roughly 3,332 seconds of audio per second of compute. Whisper large-v3 sits near 68.56 on that same board. That's not 10× faster. It's about 48.6× faster (3,332 ÷ 68.56).
Here's the honest asterisk a lot of blog posts skip: those RTFx figures come from a data-center GPU running large batches, not from your MacBook. You will not see 3,332× on an M-series chip transcribing one sentence at a time. Batch-128 throughput on an A100 tells you almost nothing about the latency you feel when you finish a sentence and wait for text to appear.
So why does the number still matter? Because the architecture underneath it is what makes Parakeet feel instant on a Mac. A transducer decoder is dramatically cheaper per token than Whisper's autoregressive attention decoder, and that efficiency survives the trip from an A100 down to Apple Silicon. In practice, for the short bursts most dictation actually is, both Parakeet and our quantized Whisper feel immediate. Where you notice Parakeet pulling ahead is on longer takes, such as dictating three paragraphs or transcribing a voice memo, where Whisper's decoder has to grind through every token and Parakeet doesn't.
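If RTFx is new to you, the arithmetic is simple enough to check yourself. The sketch below uses only the two leaderboard figures quoted above. The helper name `compute_seconds_for` is mine, and the results describe the leaderboard's reference hardware, not a Mac.

```python
# Leaderboard figures quoted in this article (reference GPU, batched).
PARAKEET_V3_RTFX = 3332.0
WHISPER_LARGE_V3_RTFX = 68.56


def compute_seconds_for(audio_seconds: float, rtfx: float) -> float:
    """RTFx is audio seconds per compute second, so invert it."""
    return audio_seconds / rtfx


speedup = PARAKEET_V3_RTFX / WHISPER_LARGE_V3_RTFX
one_hour = 3600.0

print(f"Throughput ratio: {speedup:.1f}x")
print(f"Parakeet v3, one hour of audio: "
      f"{compute_seconds_for(one_hour, PARAKEET_V3_RTFX):.2f} s")
print(f"Whisper large-v3, one hour of audio: "
      f"{compute_seconds_for(one_hour, WHISPER_LARGE_V3_RTFX):.1f} s")
```

On that reference hardware, an hour of audio works out to roughly one second of compute for Parakeet v3 and a bit under a minute for Whisper large-v3. Treat that as a ratio between architectures, not as a latency prediction for a single sentence on a laptop.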
Parakeet v2 vs v3: what actually changed
This is the comparison people search for, and the answer is more interesting than "v3 is the new one, use v3." Both models are the same 600M-parameter FastConformer-TDT architecture. The difference is scope.
| | Parakeet v2 | Parakeet v3 |
|---|---|---|
| Languages | English only | 25 European languages |
| Avg WER (Open ASR) | ~6.05% | ~6.34% |
| Parameters | 600M | 600M |
| Released | May 2025 | Aug 2025 |
| License | CC-BY-4.0 | CC-BY-4.0 |
Figures from NVIDIA's Hugging Face model cards for v2 and v3.
Look closely at the WER row. Parakeet v2, the English specialist, is actually the slightly more accurate model on English — 6.05% versus 6.34%. That's not a bug; it's the classic multilingual trade-off, and the Open ASR Leaderboard authors call it out directly: broadening a model to 25 languages costs a little English precision. v3 didn't beat v2 at v2's own game. It kept the speed, kept nearly all the English accuracy, and threw in 24 languages — Spanish, French, German, Italian, Dutch, Polish, Ukrainian, and more — with automatic detection so you don't have to tell it which one you're speaking.
The practical read: if you dictate only in English and want the last fraction of a percent of accuracy, v2 is a defensible pick. If you ever touch a second language, or you just don't want to think about it, v3 is the obvious choice, and it costs you only a rounding error of accuracy.
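To put "rounding error" in concrete terms, here's the gap between the two averages worked out per thousand words. This is a back-of-the-envelope sketch: the function name `extra_errors` is mine, and benchmark averages won't map exactly onto your own voice and microphone.

```python
# Average WER figures from the table above, in percent.
V2_AVG_WER = 6.05
V3_AVG_WER = 6.34


def extra_errors(words: int, wer_low: float, wer_high: float) -> float:
    """Expected additional word errors over `words` words of dictation."""
    return words * (wer_high - wer_low) / 100.0


gap = extra_errors(1000, V2_AVG_WER, V3_AVG_WER)
print(f"{gap:.1f} extra word errors per 1,000 words")
```

That comes to about three additional word errors per thousand words on the benchmark average.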
Where Whisper still wins (and why we still ship it)
Here's the part a Parakeet fan post won't tell you: Parakeet does not replace Whisper for everyone, and we didn't drop Whisper from EmberType when Parakeet arrived. Two reasons.
Language breadth. Twenty-five European languages is a lot. One hundred-plus is more. If you dictate in Japanese, Hindi, Arabic, Mandarin, Turkish, or most of the languages spoken outside Europe, Parakeet v3 simply can't help you and Whisper can. That's why our recommended models list still includes a quantized Whisper Large v3 Turbo — it's the option we point multilingual users to, full stop.
Translation. Whisper doesn't just transcribe; it can translate speech in another language directly into English text. Parakeet transcribes the language you spoke, and that's it. For a lot of people that translate-to-English trick is the whole reason they reach for a speech model, and Parakeet doesn't have it.
So the honest hierarchy inside EmberType looks like this: Parakeet for fast English and European dictation, Whisper when your languages run past Parakeet's list or you need translation. Naming one "the best model" flattens a decision that genuinely depends on what comes out of your mouth. If you want the full menu and the reasoning behind each pick, that's exactly what the models page is for, and it pairs well with the broader guide to offline AI tools for Mac.
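That hierarchy is really just a small decision rule, and it can be sketched in a few lines of Python. Everything named here is hypothetical: the model labels are not EmberType's internal identifiers, and the language set contains only the languages this article names (plus English), not the full list of 25, so pass in the complete set from NVIDIA's model card if you adapt it.

```python
# Partial, illustrative set: only languages named in this article.
# The real Parakeet v3 list has 25 entries.
PARAKEET_V3_LANGUAGES_PARTIAL = frozenset(
    {"en", "es", "fr", "de", "it", "nl", "pl", "uk"}
)


def pick_model(
    languages,
    needs_translation=False,
    prefer_english_specialist=False,
    parakeet_languages=PARAKEET_V3_LANGUAGES_PARTIAL,
):
    """Return a hypothetical model label for the given dictation needs."""
    langs = {code.lower() for code in languages}
    if not langs:
        raise ValueError("at least one language code is required")

    # Parakeet only transcribes, so translation always means Whisper.
    if needs_translation:
        return "whisper-large-v3-turbo"

    # English-only users chasing the last fraction of a percent.
    if langs == {"en"} and prefer_english_specialist:
        return "parakeet-v2"

    # Every requested language is on Parakeet v3's list.
    if langs <= parakeet_languages:
        return "parakeet-v3"

    # Anything outside that list falls back to Whisper.
    return "whisper-large-v3-turbo"


print(pick_model(["en", "de"]))
print(pick_model(["en", "ja"]))
print(pick_model(["fr"], needs_translation=True))
```

The order of the checks matters: translation is tested first because no choice of languages can make Parakeet translate.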
The hallucination difference nobody puts on a spec sheet
This is the one that changed my mind about Parakeet, and it never shows up as a number. Whisper has a well-documented failure mode: because it decodes autoregressively — predicting the next word from the previous words, like a language model — it will happily invent fluent text during silence or background noise. Leave a Whisper transcription running over a quiet gap and you can get a hallucinated "thank you for watching" or a repeated phrase that was never spoken. Anyone who's transcribed a long, pause-heavy recording has seen it.
Parakeet's transducer architecture is structurally more resistant to that. A TDT model emits tokens frame by frame in step with the audio; when there's no speech, there's nothing to emit, so it tends to stay quiet rather than confabulate. It's not magic — heavy background noise still degrades it, and NVIDIA's own card shows accuracy falling as the signal-to-noise ratio drops — but the specific "confident nonsense in the gaps" behavior is far less common. For dictation, where you pause to think mid-sentence constantly, that's not a benchmark curiosity. It's the difference between trusting the output and re-reading every line.
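Here's a toy illustration of why that is, in Python. It is not NVIDIA's decoder. The per-frame predictions are hard-coded stand-ins for what the model's networks would produce, and the real decoder also conditions on the tokens it has already emitted, which this sketch ignores. What it does show is the shape of the loop: the decoder walks the audio frames, each step yields either a token or a "blank" plus a number of frames to skip, and blanks add nothing to the transcript.

```python
BLANK = None  # stand-in for the transducer's "emit nothing" symbol


def greedy_decode(frame_predictions):
    """Toy frame-synchronous decode over (token, duration) pairs.

    `frame_predictions[t]` is a hard-coded stand-in for the model's
    best guess at frame t: a token (or BLANK) and how many frames
    to advance afterwards.
    """
    tokens = []
    t = 0
    while t < len(frame_predictions):
        token, duration = frame_predictions[t]
        if token is not BLANK:
            tokens.append(token)
        # Always move forward at least one frame so this toy loop ends.
        t += max(1, duration)
    return tokens


# Speech, a pause, more speech, then trailing silence.
speech_with_pauses = [
    ("hello", 2), (BLANK, 1), (BLANK, 1), ("world", 1),
    (BLANK, 3), (BLANK, 1), (BLANK, 1), (BLANK, 1),
]
# Pure silence: every frame predicts BLANK.
silence = [(BLANK, 1)] * 50

print(greedy_decode(speech_with_pauses))
print(greedy_decode(silence))
```

In this toy, output can only be produced while the loop is stepping through audio frames, so fifty silent frames yield an empty transcript. An autoregressive decoder has no such tie to the frame count: it keeps generating until it predicts an end token, which is where the invented text in quiet gaps comes from.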
How to run Parakeet v3 in EmberType
If you want to try it yourself, it takes about two clicks. Open EmberType, go to the AI Models tab, and you'll see the local models listed with their size, speed, and accuracy. Parakeet V3 is there — a 494 MB multilingual model. Hit Set as Default and every dictation from then on runs through it, fully offline. No account, no upload, nothing leaving your Mac.
You'll notice both Parakeet models still carry an Experimental tag, and that's deliberate. We added them fast because they're that good, but "good on a leaderboard" and "good as the silent default under a hundred thousand dictations a day" are different bars. The tag is me being honest that they're newer in our pipeline than our battle-tested Whisper builds — not a warning that they're broken. If you're the kind of developer who wants the fastest local model and doesn't mind living slightly ahead of the default, flip it on; that's a big part of why developers became one of our largest user groups in the first place.
So is Parakeet v3 our new default?
Not yet — and I want to be straight about why, because "we're being careful" is the truthful answer, not a hedge. Parakeet v3 is faster than Whisper, hallucinates less, and matches it on accuracy for the languages it covers. On the merits it's a strong candidate to become the default English-and-European model in EmberType, and it may well get there. What I'm still watching before I flip that switch for everyone is the boring, unglamorous stuff that only shows up at scale: behavior on messy real-world audio, edge cases in punctuation on technical dictation, and how it holds up across the range of Macs our users actually run.
That's the whole point of shipping it as a selectable, experimental model instead of quietly swapping your default overnight: you get to test NVIDIA's Whisper rival on your own voice, today, and decide for yourself — while we do the slow work of earning it the default slot. A model this good deserves that, and so do you.
Try Parakeet v3 on your own voice.
EmberType ships Parakeet V3, V2, and Whisper — all running 100% offline on your Mac. Pick your model, dictate anywhere, and nothing ever leaves the machine.
Download EmberType Free
7-day trial. $49 one-time after. macOS 14+, Apple Silicon. No account required to transcribe.
FAQ
Is Parakeet v3 better than Whisper?
For raw English and European-language speed, yes — Parakeet v3 posts an RTFx around 3,300 on the Open ASR Leaderboard versus roughly 69 for Whisper large-v3, at a comparable ~6.3% word error rate. But Whisper covers 100+ languages and can translate to English, while Parakeet v3 covers 25 European languages and only transcribes. Which is "better" depends entirely on your languages.
What is the difference between Parakeet v2 and v3?
Parakeet v2 is English-only (average WER about 6.05%). Parakeet v3 keeps the same 600M-parameter FastConformer-TDT architecture and adds 24 more European languages for 25 total, at an average WER of about 6.34% with automatic language detection. Broadening to 25 languages costs a small amount of English accuracy versus the English-specialized v2.
Can I run Parakeet v3 on a Mac?
Yes. It's a 600M-parameter model that runs locally on Apple Silicon. EmberType ships it as a selectable model in its AI Models tab — a 494 MB download that runs entirely offline, with no account or connection required to transcribe.
Does Parakeet hallucinate like Whisper?
It's less prone to it. Whisper's autoregressive decoder can invent fluent text during silence or noise. Parakeet is a transducer model that emits tokens frame by frame, so it tends to stay quiet when there's nothing to transcribe rather than filling gaps with confident nonsense — a meaningful difference for real dictation.
