Voice cloning

Codec TTS engines (OmniVoice, XTTS, F5-TTS) don't have a named-voice catalogue — they clone the voice from a reference audio clip. Pluma's Voice library is where you store and pick those clips.

Engines that don't clone (Kokoro, OpenAI hosted) ignore the reference and stick with their named voices.

The flow

  1. Upload (or import from a URL) a 5-10 second clip of someone speaking.
  2. Pluma stores it under <datadir>/voices/library/<id>.<ext>.
  3. The library voice shows up in the Voice dropdown under Your samples (voice cloning).
  4. Pick it as your default voice.
  5. At synthesis time, Pluma sends ref_audio: <path-to-clip> to the upstream; the engine clones the voice from the clip.
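
The steps above boil down to attaching the stored clip's path to the TTS request when a library voice is selected. This is an illustrative sketch, not Pluma's actual API: the function name, payload fields other than ref_audio, and the .wav extension are assumptions.

```python
def build_tts_payload(text: str, voice: str, library_dir: str) -> dict:
    """Hypothetical sketch: add ref_audio when a lib:<id> library voice is active.

    Field names besides ref_audio, and the fixed .wav extension,
    are assumptions for illustration.
    """
    payload = {"input": text, "voice": voice}
    if voice.startswith("lib:"):
        clip_id = voice.removeprefix("lib:")
        # Cloning engines read the reference clip from this path;
        # non-cloning engines (Kokoro, OpenAI hosted) ignore it.
        payload["ref_audio"] = f"{library_dir}/{clip_id}.wav"
    return payload
```

Engines that don't clone simply never see a meaningful reference, which is why they stick with their named voices.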

Adding a sample

Settings → Voice → Voice library.

Two tabs:

Upload file. Pick an audio file (.wav, .mp3, .m4a, .ogg, .flac, .aac, .opus) or a video (.mp4, .mov, .webm, .mkv). Video uploads work because mlx-audio reads the audio track directly. Add a name + optional description. Save.

From URL. Paste a YouTube / Vimeo / SoundCloud / Twitch URL or a direct media URL.

  • Streaming-site URLs route through yt-dlp.
  • Direct media URLs use the SSRF-restricted client.
  • Optional Start offset + Duration trim the clip (YouTube path only; uses yt-dlp's --download-sections).
  • Default duration = 10 seconds (OmniVoice's built-in reference cap).
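
The yt-dlp path with trimming can be sketched as a command line. --download-sections is the flag the list above names; everything else here (audio extraction via -x, the WAV output format) is an assumption about flags, not a record of what Pluma actually passes.

```python
def ytdlp_command(url: str, start: int, duration: int, out: str) -> list[str]:
    """Sketch a yt-dlp invocation that downloads only a trimmed section.

    --download-sections takes a "*start-end" time range (seconds here).
    The -x / --audio-format flags are illustrative assumptions.
    """
    end = start + duration
    return [
        "yt-dlp",
        "--download-sections", f"*{start}-{end}",
        "-x", "--audio-format", "wav",  # keep only the audio track
        "-o", out,
        url,
    ]
```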

Click Preview on any library row to audition the stored clip without running TTS.

Picking a good reference

Codec engines clone whatever's in the reference — including background music, multiple speakers, or sound effects. Best results come from:

  • One speaker, no music, minimal background noise.
  • 5-10 seconds of normal-pace speech.
  • Clear pronunciation; the engine will mimic accent + cadence.

Reference clips longer than ~15 seconds trigger OmniVoice's silence-gap auto-trim, which can cut in awkward places. Use the URL-import trim controls (or pre-trim the clip elsewhere) to stay under that.
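
If you prefer to pre-trim a local clip yourself, a plain ffmpeg cut does the job. This is a sketch of a command you'd run by hand, not something Pluma invokes:

```python
def ffmpeg_trim_command(src: str, dst: str, start: int, duration: int) -> list[str]:
    """Build an ffmpeg command that keeps `duration` seconds from `start`."""
    return [
        "ffmpeg",
        "-ss", str(start),    # seek to the start offset (input seeking)
        "-t", str(duration),  # keep this many seconds
        "-i", src,
        dst,
    ]
```

Keeping the result under ~15 seconds avoids the silence-gap auto-trim entirely.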

What if the output sounds wrong

  • Voice changes every generation — the engine isn't seeing a reference. Make sure your selected voice is lib:<id> (a library entry), not blank or an upstream-catalogue voice.
  • Voice sounds garbled — the reference quality is poor, or the engine's MP3 reader is struggling. Convert to WAV (24kHz mono) and re-upload. See Troubleshooting → Voice sounds garbled.
  • First word gets clipped — codec warm-up. The chat path adds a brief warmup-absorbing prefix; per-message variance is normal.
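
For the garbled-voice case, the WAV conversion mentioned above can be done with ffmpeg. A sketch of the command (run it yourself; Pluma doesn't do this for you):

```python
def ffmpeg_to_wav_command(src: str, dst: str) -> list[str]:
    """Re-encode any input to 24 kHz mono PCM WAV, sidestepping MP3 decode quirks."""
    return [
        "ffmpeg",
        "-i", src,
        "-ar", "24000",  # 24 kHz sample rate
        "-ac", "1",      # mono
        dst,
    ]
```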

On-disk layout

<datadir>/voices/library/
├── pluma-narrator-d1f9ef3a.wav
├── pluma-narrator-d1f9ef3a.json
├── abby-aba1b288.mp3
└── abby-aba1b288.json

<id>.<format> is the audio; <id>.json is the metadata ({ id, name, description, format, created_at }). Delete both to remove a sample; the Delete button in the library does that for you.
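
Deleting a sample by hand means removing the audio/metadata pair together, as described above. A minimal sketch (the helper name is hypothetical; the pairing convention is from the layout shown):

```python
from pathlib import Path

def delete_sample(library_dir: str, sample_id: str) -> None:
    """Remove a library entry: the audio file (any extension) plus its .json sidecar."""
    for f in Path(library_dir).glob(f"{sample_id}.*"):
        f.unlink()  # e.g. abby-aba1b288.mp3 and abby-aba1b288.json
```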

Future: per-character voice

Today the active voice is global. Per-character override (so Pluma's Pluma always speaks with one voice and your custom character speaks with another) is queued under smelt-5cn.