Voice sounds garbled

The Speak button plays something, but the output is wrong: a random voice each generation, mid-sentence cutoffs, repeated words, made-up content, or plain unintelligible noise.

"Voice changes every generation"

The engine isn't seeing a reference clip. Codec engines (OmniVoice, XTTS, F5-TTS) clone from a sample on every call; without one they pick a fresh voice from the model's latent space each time.

Fix:

  1. Settings → Voice → Voice library. Upload a 5-10 second clip.
  2. Back in Voice settings, pick the library entry from the dropdown (it shows under "Your samples (voice cloning)" with a lib: prefix).
  3. Save.

If your engine is Kokoro / Kokoro-FastAPI (named voices), this doesn't apply — you should be picking a voice from "Upstream voices" instead.

See Voice cloning for the full flow.

"Voice sounds garbled / unintelligible"

The reference clip's quality is poor, or the engine's loader struggled with it.

Try, in order:

  1. Use a clean clip. One speaker, no background music, no sound effects. Codec engines clone whatever's in the reference.
  2. Keep it short. Under 10 seconds. Longer clips trigger OmniVoice's silence-gap auto-trim, which can cut in awkward places.
  3. Switch to WAV. mlx-audio's MP3 reader (via miniaudio) is sometimes flaky on specific MP3 files. Convert to 24kHz mono WAV:

    ffmpeg -i source.mp3 -ar 24000 -ac 1 -c:a pcm_s16le source.wav
    

    Then upload the WAV.

  4. Pre-trim. If the clip has long silences or music intros, cut them out before upload:

    ffmpeg -ss 0:10 -t 8 -i source.wav -c copy trimmed.wav
    

  5. Try a different reference voice. Some speakers' vocal characteristics are outside the model's training distribution. If a clip doesn't work, a different one might.
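Steps 1-4 above can be checked mechanically before you upload. A minimal sketch: `clip_warnings` is a hypothetical helper (not part of Pluma), its thresholds mirror the guidance above, and the commented ffprobe invocations are an assumption (ffprobe installed, `clip.wav` standing in for your file):

```shell
# Sanity-check a candidate reference clip before upload.
# clip_warnings is illustrative; thresholds follow the steps above.
clip_warnings() {  # args: duration_seconds sample_rate channels
  dur=$1; sr=$2; ch=$3
  [ "${dur%.*}" -ge 10 ] && echo "warn: ${dur}s long; keep it under 10 seconds"
  [ "$sr" -ne 24000 ]    && echo "warn: sample rate ${sr}; convert to 24000 Hz"
  [ "$ch" -ne 1 ]        && echo "warn: ${ch} channels; convert to mono"
  return 0
}

# Feeding it from ffprobe (assumed installed) might look like:
#   eval "$(ffprobe -v error -select_streams a:0 \
#       -show_entries stream=sample_rate,channels -of flat=s=_ clip.wav)"
#   clip_warnings "$(ffprobe -v error -show_entries format=duration \
#       -of csv=p=0 clip.wav)" \
#     "$streams_stream_0_sample_rate" "$streams_stream_0_channels"
```

If anything prints, re-run the ffmpeg conversion above before uploading.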

"Repeats words / makes up content / wrong order"

Autoregressive drift in the codec model. Pluma's defaults try to suppress it: temperature = 0.4, repetition_penalty = 1.15, messages chunked at ~120 chars. If you're still seeing it:

  • Make sure markdown's getting stripped. The chat path runs every message through cleanForTTS() first. If you see asterisks or backticks being vocalised, something's bypassing it (open an issue).
  • Try shorter chunks. The 120-char default is a balance; for very long messages, smaller chunks (60-80) help. Not user-tunable yet (smelt-53o).
  • Lower temperature further. Same issue tracks exposing the sampling knobs in Settings → Voice. For now the value is hardcoded.
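For intuition on why the chunk size matters, here is a sentence-boundary chunker sketched in shell. Illustrative only: this is not Pluma's actual splitter, and the `chunk_text` name, the boundary rule, and the GNU sed call (`\n` in the replacement) are all assumptions for the sketch:

```shell
# Illustrative: split stdin into chunks of at most max_chars,
# breaking at sentence ends (. ! ?), like the ~120-char default.
chunk_text() {  # args: max_chars; text on stdin
  max=$1
  sed 's/\([.!?]\) /\1\n/g' | {   # GNU sed: one sentence per line
    chunk=""
    while IFS= read -r sent || [ -n "$sent" ]; do
      if [ -n "$chunk" ] && [ $(( ${#chunk} + ${#sent} + 1 )) -gt "$max" ]; then
        printf '%s\n' "$chunk"   # emit the full chunk, start a new one
        chunk=""
      fi
      if [ -n "$chunk" ]; then chunk="$chunk $sent"; else chunk=$sent; fi
    done
    if [ -n "$chunk" ]; then printf '%s\n' "$chunk"; fi
  }
}
```

`printf '%s' "$message" | chunk_text 80` prints one chunk per line; lowering the limit from 120 to 80 yields more, shorter chunks, which is the knob the tracked issue would expose.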

"First word gets clipped"

Codec warm-up. OmniVoice drops the first 100-300 ms while the codec settles. The Test voice path in Settings → Voice prepends "One. Two. Three." as a throwaway phrase that absorbs the warm-up; the chat path doesn't (the prefix would be too audible mid-conversation).

For the test path: nothing to do; the throwaway phrase has already absorbed the warm-up.

For chat: live with it; the first sentence occasionally loses its leading consonant. Workaround: have the assistant open with throwaway content (character cards can do this in first_mes).

"Mid-stream cutoff (played briefly, then stopped)"

The stream from the TTS upstream terminated abnormally. Check Pluma's log:

tail -20 /tmp/pluma.log | grep tts

If you see copy_err=unexpected EOF or copy_err=context deadline exceeded, the TTS upstream isn't producing audio reliably. Common causes:

  • mlx-audio + bad model path. Either the model id is wrong, or the model loaded but generation crashed. Check mlx-audio's own log.
  • Reference clip too long for the engine to load. mlx-audio sometimes silently fails on >30s references. Trim per the steps above.
  • Out of memory. Codec models can OOM on long generations. Restart the TTS upstream.
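To separate Pluma's copy path from the upstream itself, you can probe the upstream directly. A sketch under assumptions: an OpenAI-compatible /v1/audio/speech endpoint on port 8880 (the shape Kokoro-FastAPI exposes; mlx-audio setups differ), and a `diagnose` helper plus request payload invented for illustration:

```shell
# Classify a direct probe of the TTS upstream, bypassing Pluma.
# diagnose is illustrative; it maps (HTTP status, bytes received)
# to the failure modes described above.
diagnose() {  # args: http_status bytes_received
  case "$1:$2" in
    200:0)  echo "empty body: model still warming up, or generation failed" ;;
    200:*)  echo "upstream produced $2 bytes; suspect Pluma's copy path" ;;
    000:*)  echo "no connection: upstream not running on that port" ;;
    *)      echo "HTTP $1: check the upstream's own log" ;;
  esac
}

# A probe might look like (endpoint, port, and payload are assumptions):
#   status=$(curl -s -o /tmp/probe.wav -w '%{http_code}' \
#     http://localhost:8880/v1/audio/speech \
#     -H 'Content-Type: application/json' \
#     -d '{"model":"tts-1","input":"probe","voice":"af_bella"}')
#   diagnose "$status" "$(wc -c < /tmp/probe.wav)"
```

If the direct probe yields healthy audio while Speak still cuts off, the problem is between Pluma and the upstream rather than in generation itself.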

"Empty response from the upstream"

bytes=0 in the Pluma log: the upstream returned 200 OK with an empty body. Usually the model hasn't finished loading; the first request after startup sometimes returns a silent empty body. Hit Speak again; the second try usually works.

If it stays empty across multiple tries, the TTS upstream is broken — restart it.
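The hit-Speak-again advice can be automated while a model warms up. A hedged sketch: `retry_nonempty` and its one-second backoff are assumptions, and the command you pass in stands for whatever produces audio in your setup (for example, a curl call against your upstream):

```shell
# Retry a command until it writes a non-empty file, per the
# warm-up theory above. retry_nonempty is illustrative.
retry_nonempty() {  # args: out_file max_tries cmd...
  out=$1; tries=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" > "$out" 2>/dev/null || true   # ignore per-try failures
    if [ -s "$out" ]; then return 0; fi # non-empty body: done
    i=$((i + 1))
    sleep 1                             # give the model time to warm up
  done
  echo "still empty after $tries tries: restart the TTS upstream" >&2
  return 1
}
```

If this still fails after several tries, that matches the restart-the-upstream case above.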

Still stuck

Open an issue with:

  • The TTS upstream + version (Kokoro-FastAPI vX, mlx-audio vY, OpenAI hosted, …)
  • The voice you picked
  • A 30-second excerpt of /tmp/pluma.log around a failed Speak attempt
  • If using a library voice: was it the same clip that worked on previous Speak attempts?