Voice & TTS

Pluma talks to any OpenAI-compatible /v1/audio/speech endpoint. Three common backends:

Which backend to pick:

  • Kokoro-FastAPI: local TTS with a fixed catalogue of ~49 named voices (English, Japanese, Mandarin, Spanish, French, …). Apache 2.0. The default.
  • mlx-audio + OmniVoice: Apple Silicon, MIT-licensed, voice cloning from a reference clip. Use the Voice cloning guide.
  • OpenAI hosted TTS: cloud-hosted, sounds great, six voices (alloy/echo/fable/onyx/nova/shimmer). Costs money.

Anything else that speaks /v1/audio/speech plugs in the same way.
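The request shape is the standard OpenAI one: POST a JSON body with model, voice, and input to /v1/audio/speech. A minimal sketch of building such a request (the base URL and voice name below are illustrative, not fixed by Pluma):

```python
import json
import urllib.request

def speech_request(base_url: str, model: str, voice: str, text: str) -> urllib.request.Request:
    """Build a POST /v1/audio/speech request in the standard OpenAI shape."""
    payload = {
        "model": model,           # e.g. "kokoro" or "tts-1"
        "voice": voice,           # backend-specific voice name
        "input": text,            # the text to synthesize
        "response_format": "mp3", # most backends also accept wav/opus/flac
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = speech_request("http://127.0.0.1:8880/v1", "kokoro", "af_bella", "Hello!")
```

Swapping backends is just a matter of changing the base URL, model, and voice; the body stays the same.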

Set up Kokoro-FastAPI

The fastest local path. Run the prebuilt Docker image:

docker run -d --name kokoro -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:latest

Or install and launch it via pip:

pip install kokoro-fastapi
kokoro-fastapi

In Pluma: Settings → Voice → Use Kokoro defaults (fills http://127.0.0.1:8880/v1 + kokoro). Pick a voice from the dropdown, click Test voice to audition without saving.

Set up OpenAI hosted

Settings → Voice:

  • Endpoint URL: https://api.openai.com/v1
  • Model: tts-1 (or tts-1-hd)
  • Voice: one of alloy, echo, fable, onyx, nova, shimmer

The active LLM connection's API key is reused for the TTS request when the TTS endpoint's host matches OpenAI's.

Set up mlx-audio + OmniVoice

mlx-audio is the Apple Silicon TTS runtime; OmniVoice is the codec-based voice-cloning model it runs.

pip install mlx-audio
mlx_audio.server --host 127.0.0.1 --port 8880

Then in Pluma: Settings → Voice with http://127.0.0.1:8880/v1 + a model path (e.g. /Users/you/mlx-models/omnivoice). Voice selection is moot for OmniVoice — it clones from a reference clip you upload in the Voice library. See Voice cloning.

Use it in chat

Each assistant message has a Speak entry in its action menu (three-dot icon). Click it to synthesize and play the audio. Under the hood, Pluma:

  1. Strips markdown and bracketed RP stage directions so the codec doesn't try to vocalise asterisks.
  2. Normalises curly quotes / em dashes / ellipses to ASCII.
  3. Chunks long messages into ~120-char sentence-aware pieces (codec engines drift mid-generation on long input).
  4. Fetches chunks in parallel; plays them back-to-back through one audio element.
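Steps 1–3 can be sketched like this (the regexes and the 120-char budget are illustrative, not Pluma's actual code):

```python
import re

MAX_CHUNK = 120  # approximate per-request character budget

def clean(text: str) -> str:
    """Strip markdown markers and bracketed stage directions, ASCII-fold punctuation."""
    text = re.sub(r"[*_`#>]+", "", text)    # markdown emphasis/heading markers
    text = re.sub(r"\[[^\]]*\]", "", text)  # bracketed RP stage directions
    text = text.replace("\u2018", "'").replace("\u2019", "'")   # curly single quotes
    text = text.replace("\u201c", '"').replace("\u201d", '"')   # curly double quotes
    text = text.replace("\u2014", "-").replace("\u2026", "...") # em dash, ellipsis
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, limit: int = MAX_CHUNK) -> list[str]:
    """Greedily pack whole sentences into roughly limit-sized chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + 1 + len(s) > limit:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks
```

Splitting on sentence boundaries rather than at a hard character offset is what keeps each chunk prosodically complete, so the back-to-back playback in step 4 doesn't click mid-word.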

Sampling defaults are tuned for stability: temperature = 0.4, repetition_penalty = 1.15. OmniVoice's stock 0.7 produced audible drift; the lower values trade a bit of expressiveness for far fewer out-of-order or repeated words.
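Those defaults ride along as extra fields in the request body. A sketch (note that repetition_penalty is a backend extension, not part of the OpenAI schema, so it should only be sent to backends that accept it):

```python
# Stability-tuned sampling defaults for codec TTS backends such as OmniVoice.
SAMPLING_DEFAULTS = {
    "temperature": 0.4,         # down from OmniVoice's stock 0.7
    "repetition_penalty": 1.15, # discourages repeated/out-of-order words
}

def with_sampling(payload: dict) -> dict:
    """Merge the stability defaults into a /v1/audio/speech request body."""
    return {**payload, **SAMPLING_DEFAULTS}
```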

Per-character voice (planned)

Today the Voice setting is global. Per-character voice override is queued under smelt-5cn — character cards will gain a voice field that overrides the global default for any chat with that character.