Voice & TTS
Pluma talks to any OpenAI-compatible /v1/audio/speech endpoint. Three common backends:
| Backend | When to pick it |
|---|---|
| Kokoro-FastAPI | Local TTS with a fixed catalogue of ~49 named voices (English, Japanese, Mandarin, Spanish, French, …). Apache 2.0. The default. |
| mlx-audio + OmniVoice | Apple Silicon, MIT-licensed, voice-cloning from a reference clip. Use the Voice cloning guide. |
| OpenAI hosted TTS | Cloud, sounds great, 6 voices (alloy/echo/fable/onyx/nova/shimmer). Costs money. |
Anything else that speaks /v1/audio/speech plugs in the same way.
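As a rough illustration of what "speaks /v1/audio/speech" means, the sketch below builds the kind of request an OpenAI-compatible TTS server expects. The `model`, `voice`, and `input` JSON fields follow OpenAI's speech API. The function names `build_speech_request` and `synthesize` are hypothetical helpers for this example, not part of Pluma, and the base URL is assumed to already end in `/v1`.

```python
import json
import urllib.request


def build_speech_request(base_url, model, voice, text, api_key=None):
    """Build (but do not send) a POST request for an OpenAI-compatible TTS endpoint."""
    # Assumption: base_url already includes the /v1 prefix, as in the settings on this page.
    url = base_url.rstrip("/") + "/audio/speech"
    payload = {"model": model, "voice": voice, "input": text}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Local servers typically need no key; hosted ones expect a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def synthesize(base_url, model, voice, text, api_key=None, timeout=60):
    """Send the request and return the raw audio bytes from the response body."""
    req = build_speech_request(base_url, model, voice, text, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

With a server running, something like `synthesize("https://api.openai.com/v1", "tts-1", "alloy", "Hello", api_key=...)` would return audio bytes that can be written to a file. Any backend that accepts the same request shape should work the same way.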
Set up Kokoro-FastAPI
The fastest local path. Install:
In Pluma: Settings → Voice → Use Kokoro defaults (fills in the endpoint http://127.0.0.1:8880/v1 and the model name kokoro). Pick a voice from the dropdown, then click Test voice to audition it without saving.
Set up OpenAI hosted
Settings → Voice:
- Endpoint URL: https://api.openai.com/v1
- Model: tts-1 (or tts-1-hd)
- Voice: one of alloy, echo, fable, onyx, nova, shimmer
The active LLM connection's API key gets reused for the TTS request when the upstream host matches OpenAI's.
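A minimal sketch of that key-reuse rule, assuming "matches" means the two URLs share the same hostname. The exact comparison Pluma performs is not documented here, and `same_host` and `tts_api_key` are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit


def same_host(url_a, url_b):
    """True when both URLs parse to the same (case-insensitive) hostname."""
    host_a = urlsplit(url_a).hostname
    host_b = urlsplit(url_b).hostname
    return host_a is not None and host_a == host_b


def tts_api_key(llm_base_url, llm_api_key, tts_base_url):
    """Return the LLM key only when the TTS endpoint is on the same host.

    Assumption: a key is never forwarded to a different host, so a local
    TTS server does not receive the cloud credential.
    """
    return llm_api_key if same_host(llm_base_url, tts_base_url) else None
```

The point of gating on the host is that a key for one provider should not leak to an unrelated endpoint, such as a local Kokoro server or a third-party proxy.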
Set up mlx-audio + OmniVoice
mlx-audio is the Apple Silicon TTS runtime; OmniVoice is the codec-based voice-cloning model it runs.
Then in Pluma, open Settings → Voice and set the endpoint to http://127.0.0.1:8880/v1 and the model to a model path (e.g. /Users/you/mlx-models/omnivoice). Voice selection does not apply to OmniVoice: it clones from a reference clip you upload in the Voice library. See Voice cloning.
Use it in chat
Each assistant message has a Speak entry in its action menu (three-dot icon). Click it to synthesize and play the message. Pluma:
- Strips markdown and bracketed RP stage directions so the codec doesn't try to vocalise asterisks.
- Normalises curly quotes / em dashes / ellipses to ASCII.
- Chunks long messages into ~120-char sentence-aware pieces (codec engines drift mid-generation on long input).
- Fetches chunks in parallel; plays them back-to-back through one audio element.
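The steps above can be sketched roughly as follows. This is an illustrative approximation, not Pluma's actual code: the regular expressions, the replacement table, and the names `clean_for_tts`, `chunk_sentences`, and `fetch_in_order` are all assumptions made for this example. The 120-character limit is treated as a soft target, so a single sentence longer than the limit is kept whole.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Typographic characters mapped to ASCII equivalents.
_REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2014": "-", "\u2013": "-",   # em and en dashes
    "\u2026": "...",                # ellipsis
}


def clean_for_tts(text):
    """Drop bracketed stage directions and common markdown markers, then normalise."""
    text = re.sub(r"\[[^\]]*\]", " ", text)   # [stage directions]
    text = re.sub(r"[*_`#~]", "", text)       # emphasis, code, and heading markers
    for src, dst in _REPLACEMENTS.items():
        text = text.replace(src, dst)
    return re.sub(r"\s+", " ", text).strip()


def chunk_sentences(text, limit=120):
    """Greedily pack whole sentences into chunks of roughly `limit` characters."""
    chunks, current = [], ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def fetch_in_order(chunks, fetch, max_workers=4):
    """Call fetch(chunk) concurrently; results are returned in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, chunks))
```

Here `fetch` would be whatever function turns one chunk of text into audio bytes. Because `Executor.map` yields results in input order, the audio segments can be queued for playback in sequence even when later chunks finish synthesizing first.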
Sampling defaults are tuned for stability: temperature = 0.4, repetition_penalty = 1.15. OmniVoice's stock temperature of 0.7 produced audible drift; the lower values trade a bit of expressiveness for far fewer out-of-order or repeated words.
Per-character voice (planned)
Today the Voice setting is global. Per-character voice override is queued under smelt-5cn — character cards will gain a voice field that overrides the global default for any chat with that character.
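Since this feature is not implemented yet, the following is only a guess at how the override could resolve. It assumes a character card is represented as a dictionary with an optional `voice` key; `resolve_voice` is a hypothetical name.

```python
def resolve_voice(global_voice, character_card=None):
    """Prefer the card's voice when it is set and non-empty, else the global voice."""
    # Assumption: the card is a plain dict; a missing or empty "voice" means "no override".
    card_voice = (character_card or {}).get("voice")
    return card_voice if card_voice else global_voice
```

Treating an empty string the same as a missing field would let a card clear its override without removing the field entirely.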