Voice sounds garbled

The Speak button plays something, but the output is wrong: a random voice each generation, mid-sentence cutoffs, repeated words, made-up content, or plain unintelligible noise.

"Voice changes every generation"

The engine isn't seeing a reference clip. Codec engines (OmniVoice, XTTS, F5-TTS) clone from a sample on every call; without one they pick a fresh voice from the model's latent space each time.

Fix:

  1. Settings → Voice → Voice library. Upload a 5-10 second clip.
  2. Back in Voice settings, pick the library entry from the dropdown (it shows under "Your samples (voice cloning)" with a lib: prefix).
  3. Save.

If your engine is Kokoro / Kokoro-FastAPI (named voices), this doesn't apply — you should be picking a voice from "Upstream voices" instead.

See Voice cloning for the full flow.

"Voice sounds garbled / unintelligible"

The reference clip's quality is poor, or the engine's loader struggled with it.

Try, in order:

  1. Use a clean clip. One speaker, no background music, no sound effects. Codec engines clone whatever's in the reference.
  2. Keep it short. Under 10 seconds. Longer clips trigger OmniVoice's silence-gap auto-trim, which can cut in awkward places.
  3. Switch to WAV. mlx-audio's MP3 reader (via miniaudio) is sometimes flaky on specific MP3 files. Convert to 24kHz mono WAV:

    ffmpeg -i source.mp3 -ar 24000 -ac 1 -c:a pcm_s16le source.wav
    

    Then upload the WAV.

  4. Pre-trim. If the clip has long silences or music intros, cut them out before upload:

    ffmpeg -ss 0:10 -t 8 -i source.wav -c copy trimmed.wav
    

  5. Try a different reference voice. Some speakers' vocal characteristics are outside the model's training distribution. If a clip doesn't work, a different one might.
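Steps 1-4 above can be checked mechanically before you upload. A minimal sketch: `clip_warnings` is a hypothetical helper (not part of Pluma), its thresholds mirror the guidance above, and the commented ffprobe invocations are an assumption (ffprobe installed, `clip.wav` standing in for your file):

```shell
# Sanity-check a candidate reference clip before upload.
# clip_warnings is illustrative; thresholds follow the steps above.
clip_warnings() {  # args: duration_seconds sample_rate channels
  dur=$1; sr=$2; ch=$3
  [ "${dur%.*}" -ge 10 ] && echo "warn: ${dur}s long; keep it under 10 seconds"
  [ "$sr" -ne 24000 ]    && echo "warn: sample rate ${sr}; convert to 24000 Hz"
  [ "$ch" -ne 1 ]        && echo "warn: ${ch} channels; convert to mono"
  return 0
}

# Feeding it from ffprobe (assumed installed) might look like:
#   eval "$(ffprobe -v error -select_streams a:0 \
#       -show_entries stream=sample_rate,channels -of flat=s=_ clip.wav)"
#   clip_warnings "$(ffprobe -v error -show_entries format=duration \
#       -of csv=p=0 clip.wav)" \
#     "$streams_stream_0_sample_rate" "$streams_stream_0_channels"
```

If anything prints, re-run the ffmpeg conversion above before uploading.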

"Repeats words / makes up content / wrong order"

Autoregressive drift in the codec model. Pluma's defaults try to suppress it: temperature = 0.4, repetition_penalty = 1.15, messages chunked at ~120 chars. If you're still seeing it:

  • Make sure markdown's getting stripped. The chat path runs every message through cleanForTTS() first. If you see asterisks or backticks being vocalised, something's bypassing it (open an issue).
  • Try shorter chunks. The 120-char default is a balance; for very long messages, smaller chunks (60-80) help. Not user-tunable yet (smelt-53o).
  • Lower temperature further. Same issue tracks exposing the sampling knobs in Settings → Voice. For now the value is hardcoded.
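For intuition on why the chunk size matters, here is a sentence-boundary chunker sketched in shell. Illustrative only: this is not Pluma's actual splitter, and the `chunk_text` name, the boundary rule, and the GNU sed call (`\n` in the replacement) are all assumptions for the sketch:

```shell
# Illustrative: split stdin into chunks of at most max_chars,
# breaking at sentence ends (. ! ?), like the ~120-char default.
chunk_text() {  # args: max_chars; text on stdin
  max=$1
  sed 's/\([.!?]\) /\1\n/g' | {   # GNU sed: one sentence per line
    chunk=""
    while IFS= read -r sent || [ -n "$sent" ]; do
      if [ -n "$chunk" ] && [ $(( ${#chunk} + ${#sent} + 1 )) -gt "$max" ]; then
        printf '%s\n' "$chunk"   # emit the full chunk, start a new one
        chunk=""
      fi
      if [ -n "$chunk" ]; then chunk="$chunk $sent"; else chunk=$sent; fi
    done
    if [ -n "$chunk" ]; then printf '%s\n' "$chunk"; fi
  }
}
```

`printf '%s' "$message" | chunk_text 80` prints one chunk per line; lowering the limit from 120 to 80 yields more, shorter chunks, which is the knob the tracked issue would expose.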

"First word gets clipped"

Codec warm-up. OmniVoice drops the first 100-300 ms while the codec settles. The Test voice path in Settings → Voice prepends "One. Two. Three." as a throwaway phrase that absorbs the warm-up; the chat path doesn't (the prefix would be too audible mid-conversation).

For the test path: nothing to do; the throwaway phrase has already absorbed the warm-up.

For chat: live with it; the first sentence occasionally loses its leading consonant. Workaround: have the assistant open with throwaway content (character cards can do this in first_mes).

"Mid-stream cutoff (played briefly, then stopped)"

The stream from the TTS upstream terminated abnormally. Check Pluma's log:

tail -20 /tmp/pluma.log | grep tts

If you see copy_err=unexpected EOF or copy_err=context deadline exceeded, the TTS upstream isn't producing audio reliably. Common causes:

  • mlx-audio + bad model path. Either the model id is wrong, or the model loaded but generation crashed. Check mlx-audio's own log.
  • Reference clip too long for the engine to load. mlx-audio sometimes silently fails on >30s references. Trim per the steps above.
  • Out of memory. Codec models can OOM on long generations. Restart the TTS upstream.
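To separate Pluma's copy path from the upstream itself, you can probe the upstream directly. A sketch under assumptions: an OpenAI-compatible /v1/audio/speech endpoint on port 8880 (the shape Kokoro-FastAPI exposes; mlx-audio setups differ), and a `diagnose` helper plus request payload invented for illustration:

```shell
# Classify a direct probe of the TTS upstream, bypassing Pluma.
# diagnose is illustrative; it maps (HTTP status, bytes received)
# to the failure modes described above.
diagnose() {  # args: http_status bytes_received
  case "$1:$2" in
    200:0)  echo "empty body: model still warming up, or generation failed" ;;
    200:*)  echo "upstream produced $2 bytes; suspect Pluma's copy path" ;;
    000:*)  echo "no connection: upstream not running on that port" ;;
    *)      echo "HTTP $1: check the upstream's own log" ;;
  esac
}

# A probe might look like (endpoint, port, and payload are assumptions):
#   status=$(curl -s -o /tmp/probe.wav -w '%{http_code}' \
#     http://localhost:8880/v1/audio/speech \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"tts-1","input":"probe","voice":"af_bella"}')
#   diagnose "$status" "$(wc -c < /tmp/probe.wav)"
```

If the direct probe yields healthy audio while Speak still cuts off, the problem is between Pluma and the upstream rather than in generation itself.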

"Empty response from the upstream"

bytes=0 in the Pluma log: the upstream returned 200 OK with an empty body. Usually the model hasn't finished loading; the first request after startup sometimes returns a silent empty body. Hit Speak again; the second try usually works.

If it stays empty across multiple tries, the TTS upstream is broken — restart it.
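The hit-Speak-again advice can be automated while a model warms up. A hedged sketch: `retry_nonempty` and its one-second backoff are assumptions, and the command you pass in stands for whatever produces audio in your setup (for example, a curl call against your upstream):

```shell
# Retry a command until it writes a non-empty file, per the
# warm-up theory above. retry_nonempty is illustrative.
retry_nonempty() {  # args: out_file max_tries cmd...
  out=$1; tries=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" > "$out" 2>/dev/null || true   # ignore per-try failures
    if [ -s "$out" ]; then return 0; fi # non-empty body: done
    i=$((i + 1))
    sleep 1                             # give the model time to warm up
  done
  echo "still empty after $tries tries: restart the TTS upstream" >&2
  return 1
}
```

If this still fails after several tries, that matches the restart-the-upstream case above.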

Still stuck

Open an issue with:

  • The TTS upstream + version (Kokoro-FastAPI vX, mlx-audio vY, OpenAI hosted, …)
  • The voice you picked
  • A 30-second excerpt of /tmp/pluma.log around a failed Speak attempt
  • If using a library voice: was it the same clip that worked on previous Speak attempts?