# Voice sounds garbled
The Speak button plays something but it's wrong: random voice each generation, mid-sentence cutoffs, repeated words, made-up content, or just unintelligible noise.
## "Voice changes every generation"
The engine isn't seeing a reference clip. Codec engines (OmniVoice, XTTS, F5-TTS) clone from a sample on every call; without one they pick a fresh voice from the model's latent space each time.
Fix:
- Settings → Voice → Voice library. Upload a 5-10 second clip.
- Back in Voice settings, pick the library entry from the dropdown (it shows under "Your samples (voice cloning)" with a `lib:` prefix).
- Save.
If your engine is Kokoro / Kokoro-FastAPI (named voices), this doesn't apply — you should be picking a voice from "Upstream voices" instead.
See Voice cloning for the full flow.
## "Voice sounds garbled / unintelligible"
The reference clip's quality is poor, or the engine's loader struggled with it.
Try, in order:

1. Use a clean clip. One speaker, no background music, no sound effects. Codec engines clone whatever's in the reference.
2. Keep it short. Under 10 seconds. Longer clips trigger OmniVoice's silence-gap auto-trim, which can cut in awkward places.
3. Switch to WAV. mlx-audio's MP3 reader (via miniaudio) is sometimes flaky on specific MP3 files. Convert the clip to 24 kHz mono WAV and upload that instead.
4. Pre-trim. If the clip has long silences or a music intro, cut them out before upload.
5. Try a different reference voice. Some speakers' vocal characteristics are outside the model's training distribution. If a clip doesn't work, a different one might.
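The WAV conversion and pre-trim steps above can be done with ffmpeg, assuming it's installed; the filenames and timestamps here are placeholders for your own clip:

```shell
# Convert an MP3 reference to 24 kHz mono WAV:
ffmpeg -i reference.mp3 -ar 24000 -ac 1 reference.wav

# Pre-trim: keep roughly seconds 1.5 through 9.5, dropping a music intro
# and trailing silence (adjust the timestamps to your clip):
ffmpeg -i reference.wav -ss 1.5 -to 9.5 reference_trimmed.wav
```

Any audio editor works equally well; the important properties are one channel (mono), a WAV container, and a final length under 10 seconds.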
## "Repeats words / makes up content / wrong order"
Autoregressive drift in the codec model. Pluma's defaults try to suppress it: temperature = 0.4, repetition_penalty = 1.15, messages chunked at ~120 chars. If you're still seeing it:
- Make sure markdown's getting stripped. The chat path runs every message through `cleanForTTS()` first. If you see asterisks or backticks being vocalised, something's bypassing it (open an issue).
- Try shorter chunks. The 120-char default is a balance; for very long messages, smaller chunks (60-80) help. Not user-tunable yet (smelt-53o).
- Lower temperature further. The same issue tracks exposing the sampling knobs in Settings → Voice; for now the value is hardcoded.
## "First word gets clipped"
Codec warm-up. OmniVoice drops the first 100-300 ms while the codec settles. The Settings → Voice Test voice path prepends "One. Two. Three." as a throwaway phrase that absorbs the warm-up; the chat path doesn't (the prefix is too audible mid-conversation).
For the test path: nothing to do; the throwaway phrase already absorbs the warm-up.
For chat: live with the clipping; the first sentence occasionally loses its leading consonant. Workaround: prefix the assistant's actual response with throwaway content; character cards can do this in `first_mes`.
## "Mid-stream cutoff (played briefly, then stopped)"
The upstream's stream terminated abnormally. Check Pluma's log (`/tmp/pluma.log` by default).
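One way to spot the failure, assuming the default log location mentioned under "Still stuck" (`/tmp/pluma.log`):

```shell
# Show recent stream errors around a failed Speak attempt.
# copy_err lines mark abnormal stream terminations; bytes= shows payload size.
tail -n 200 /tmp/pluma.log | grep -E 'copy_err|bytes='
```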
If you see `copy_err=unexpected EOF` or `copy_err=context deadline exceeded`, the TTS upstream isn't producing audio reliably. Common causes:
- mlx-audio + bad model path. Either the model id is wrong, or the model loaded but generation crashed. Check mlx-audio's own log.
- Reference clip too long for the engine to load. mlx-audio sometimes silently fails on >30s references. Trim per the steps above.
- Out of memory. Codec models can OOM on long generations. Restart the TTS upstream.
## "Empty response from the upstream"
`bytes=0` in the Pluma log: the upstream returned 200 OK with an empty body. Usually this means the model hasn't finished loading (the first request after startup sometimes gets a silent empty body). Hit Speak again; the second try usually works.
If it stays empty across multiple tries, the TTS upstream is broken; restart it.
## Still stuck
Open an issue with:
- The TTS upstream + version (Kokoro-FastAPI vX, mlx-audio vY, OpenAI hosted, …)
- The voice you picked
- A 30-second excerpt of `/tmp/pluma.log` around a failed Speak attempt
- If using a library voice: was it the same clip that worked on previous Speak attempts?