Show HN: Real-time local TTS (31M params, 5.6x CPU, voice cloning, ONNX)

github.com

4 points by ZDisket 13 hours ago

Hi guys and gals, I made a TTS model based on my heavily upgraded VITS base, conditioned on external speaker embeddings (Resemble AI's Resemblyzer).

The model has ~31M parameters (ONNX), is tuned for latency and local inference, and comes already exported. I was trying to push the limits of what I could do with small, fast models. It runs at 5.6x realtime on a server CPU.
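For context, "5.6x realtime" means the model synthesizes audio 5.6 times faster than the audio's playback duration. A minimal sketch of how a real-time factor like that is typically measured (the function and numbers here are illustrative, not from the repo):

```python
def realtime_factor(num_samples: int, sample_rate: int, synth_seconds: float) -> float:
    """Speedup over realtime: audio duration divided by wall-clock synthesis time."""
    audio_seconds = num_samples / sample_rate
    return audio_seconds / synth_seconds

# Illustrative: 5 s of 22.05 kHz audio synthesized in ~0.893 s of CPU time
# works out to roughly 5.6x realtime.
speedup = realtime_factor(num_samples=5 * 22050, sample_rate=22050, synth_seconds=0.893)
print(f"{speedup:.1f}x realtime")
```

In practice you'd time the actual inference call with a wall clock and average over several utterances, since short inputs are dominated by fixed per-call overhead.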

It supports voice cloning and voice blending (mixing two or more speakers to make a new voice). The license is Apache 2.0, and it uses DeepPhonemizer (MIT) for phonemization, so there are no license issues.
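Because the model is conditioned on external speaker embeddings, voice blending usually comes down to simple arithmetic in embedding space. A hedged sketch of that idea, assuming the model takes a single fixed-size, unit-norm speaker embedding (Resemblyzer d-vectors are 256-dimensional and L2-normalized; the `blend_voices` helper is hypothetical, not the repo's API):

```python
import numpy as np

def blend_voices(embeddings, weights=None):
    """Weighted average of speaker embeddings, re-normalized to unit length.

    embeddings: list of 1-D numpy arrays, one per speaker.
    weights: optional per-speaker mix weights; defaults to a uniform blend.
    """
    embeddings = np.stack(embeddings)
    if weights is None:
        weights = np.ones(len(embeddings))
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()          # normalize so weights sum to 1
    mixed = (weights[:, None] * embeddings).sum(axis=0)
    return mixed / np.linalg.norm(mixed)       # back to unit norm, like the inputs

# Example: mix two random unit-norm 256-dim embeddings 70/30 into a new voice.
rng = np.random.default_rng(0)
a, b = (v / np.linalg.norm(v) for v in rng.normal(size=(2, 256)))
new_voice = blend_voices([a, b], weights=[0.7, 0.3])
print(new_voice.shape)
```

The resulting vector is then fed to the synthesizer in place of a real speaker's embedding; the exact conditioning mechanism inside the model is whatever the repo implements.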

The repo contains the checkpoint, instructions for running it, and links to Colab and HuggingFace demos.

Now, because it's tiny, audio quality isn't the best, and since it was trained only on LibriTTS-R + VCTK (both fully open datasets), speaker similarity isn't as strong either.

Regardless, I hope it's useful.

chenglin97 2 hours ago

Does it support languages other than English? I'm a Chinese speaker.

popalchemist 10 hours ago

Given the architecture, is there a way to force the use of specific phonemes for hard-to-pronounce words? If so, that's big.
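Since the post says phonemization happens in a separate DeepPhonemizer pass before synthesis, forcing a pronunciation would typically mean overriding the phonemizer's output for specific words with a user lexicon. A minimal sketch of that pattern (pure Python; the lexicon format, ARPAbet entry, and helper names are illustrative, not the repo's actual API):

```python
# Hypothetical pre-phonemization override: words found in a user lexicon
# bypass the learned grapheme-to-phoneme model entirely.
LEXICON = {"zdisket": "Z D IH1 S K EH0 T"}  # illustrative ARPAbet entry

def g2p_fallback(word: str) -> str:
    """Stand-in for a learned G2P model (e.g. DeepPhonemizer);
    here it just spells the word letter by letter."""
    return " ".join(word.upper())

def phonemize(text: str, lexicon: dict) -> list:
    """Per-word phonemization, with lexicon overrides taking priority."""
    return [lexicon.get(w.lower(), g2p_fallback(w)) for w in text.split()]

print(phonemize("hello ZDisket", LEXICON))
```

Whether the released pipeline exposes such an override hook is up to the repo; the point is only that a decoupled phonemizer makes per-word pronunciation control architecturally straightforward.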