Show HN: Three new Kitten TTS models – smallest less than 25MB (github.com)

194 points by rohan_joshi 5 hours ago

65 comments:

by daneel_w 20 minutes ago

A very clear improvement from the first set of models you released some time ago. I'm really impressed. Thanks for sharing it all.

by rohan_joshi 15 minutes ago

thanks a lot. yeah these models are way better than our previous launch. our 15M model is now better than our previous 80M model, and we expect to continue seeing this rate of improvement.

by kevin42 4 hours ago

What I love about OpenClaw is that I was able to send it a message on Discord with just this github URL and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.

I'm impressed with the quality given the size. I don't love the voices, but they're not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.
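("1.5x realtime" means the model produces 1.5 seconds of audio per second of compute.) A quick, model-agnostic way to measure this yourself; the `fake_tts` stand-in below is hypothetical, so swap in any real synthesis call:

```python
import time

def realtime_factor(synthesize, text, sample_rate):
    """Return the real-time factor: audio seconds produced per wall-clock second.

    `synthesize` is any function returning raw audio samples for `text`;
    RTF > 1 means generation is faster than playback.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / elapsed

# Dummy stand-in for a real TTS call: 2 seconds of silence at 24 kHz.
def fake_tts(text):
    time.sleep(0.01)
    return [0.0] * 48000

rtf = realtime_factor(fake_tts, "hello world", sample_rate=24000)
print(rtf > 1)  # expected True: 2 s of audio in ~0.01 s of compute
```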

by rohan_joshi 4 hours ago

yeah we'll add some more professional-sounding voices and also support for diy custom voices. we tried to add more anime/cartoon-ish voices to showcase the expressivity.

Regarding running on the 3080 gpu, can you share more details on github issues, discord or email? it should be blazing fast on that. i'll add an example to run the model on gpu too.

by boutell 20 minutes ago

Great stuff. Is your team interested in the STT problem?

by rohan_joshi 14 minutes ago

Yes, we've started working on it and will have a range of STT models very soon. Let me know if you have a prod use-case in mind.

by ks2048 4 hours ago

You should put examples comparing the 4 models you released - same text spoken by each.

by rohan_joshi 3 hours ago

great idea, let me add this. meanwhile, you can try the models on our huggingface spaces demo here: https://huggingface.co/spaces/KittenML/KittenTTS-Demo

by vezycash 2 hours ago

Would an Android app of this be able to replace the built in tts?

by rohan_joshi 2 hours ago

yes, our mobile sdk is coming soon (eta 2 weeks), so you should be able to replace the built-in tts. can you share what tts use-case you're thinking of?

by satvikpendem 2 hours ago

I use an epub reader like Moon+ with the built-in TTS to turn epubs into audiobooks. I tried Kokoro TTS, but there was too much lag between sentences, plus it doesn't preprocess the next sentence while it reads out the current one.

by rohan_joshi an hour ago

okay this seems pretty doable. i think i know someone who is working on an epub reader using kittentts. if they don't post about it, i'll do it once it's done.
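The lag described above is typically hidden by pipelining: synthesize sentence N+1 in a background thread while sentence N plays. A minimal sketch, with `synthesize` and `play` as hypothetical stand-ins for a real TTS call and audio sink:

```python
import queue
import threading

def read_aloud(sentences, synthesize, play, prefetch=2):
    """Pipeline synthesis and playback so the next sentence is ready
    before the current one finishes playing."""
    buf = queue.Queue(maxsize=prefetch)

    def producer():
        for s in sentences:
            buf.put(synthesize(s))  # blocks once `prefetch` clips are buffered
        buf.put(None)  # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()
    while (clip := buf.get()) is not None:
        play(clip)  # producer keeps synthesizing while this plays

# Stand-ins for a real TTS and audio sink (illustration only):
played = []
read_aloud(["First sentence.", "Second sentence."],
           synthesize=str.upper, play=played.append)
print(played)  # ['FIRST SENTENCE.', 'SECOND SENTENCE.']
```

The bounded queue is the key design choice: it keeps synthesis a fixed number of sentences ahead without buffering the whole book.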

by gabrielcsapo an hour ago

Working on a reader and server that use pockettts to turn epubs into audiobooks: https://github.com/gabrielcsapo/compendus shows a virtual scroller for the text and audio.

by magicalhippo 3 hours ago

A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.

Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.

Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?

by rohan_joshi 2 hours ago

small models struggle with prosody due to limited capacity. this version does much better than the previous one and is the best among other <25MB models. Kokoro is a really good model for its size; it's competitive on Artificial Analysis too. i think by the next release we should have something kokoro quality but a fifth of the size. Adding control for rhythm seems to be quite important too, and we should start looking at that for other languages.

by soco 3 hours ago

That, and also using English words in the middle of another language phrase confuses them a lot.

by rohan_joshi 2 hours ago

yes. the current release of our model is english-only, so other languages are not expected to perform well. we'll try to account for this in our multilingual release.

by armcat 2 hours ago

This is awesome, well done. I've been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS-level voice cloning in this small form factor, you will be absolute legends!

by rohan_joshi 2 hours ago

thanks a lot, our voice cloning model will be out by May. we're experimenting with some very cool ways of doing voice cloning at 15M, but we'll have a range of models going up to 500M.

by armcat 31 minutes ago

That's sick, looking forward to it! You have my email in the profile, please let me know when you do!

by pumanoir 2 hours ago

The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."

I couldn't locate how to run it on a GPU anywhere in the repo.

by rohan_joshi 2 hours ago

thanks for the feedback. i'll add an example of running it on gpu.
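Since the released models are ONNX files, GPU execution with onnxruntime usually comes down to the execution-provider list. A hedged sketch (the model filename below is hypothetical, and `CUDAExecutionProvider` requires the `onnxruntime-gpu` package):

```python
def pick_providers(use_gpu):
    """Execution-provider order for onnxruntime: try CUDA first,
    keep CPU as a fallback so the same code runs anywhere."""
    providers = ["CPUExecutionProvider"]
    if use_gpu:
        providers.insert(0, "CUDAExecutionProvider")
    return providers

print(pick_providers(use_gpu=True))
# ['CUDAExecutionProvider', 'CPUExecutionProvider']

# With onnxruntime-gpu installed, the session would then be created as
# (model filename here is hypothetical):
#
#   import onnxruntime as ort
#   sess = ort.InferenceSession("kitten_tts.onnx",
#                               providers=pick_providers(use_gpu=True))
#   print(sess.get_providers())   # shows which providers were actually used
```

onnxruntime silently falls back to CPU when CUDA isn't available, which would explain a "no faster on GPU" observation; checking `get_providers()` after session creation confirms what actually ran.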

by altruios 4 hours ago

One of the core features I look for is expressive control.

Either in the form of API pitch/speed/volume parameters, for more deterministic control.

Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

the 25MB model is amazingly good for being 25MB. How does it handle expressive tags?

by rohan_joshi 3 hours ago

thank you so much. Right now, it cannot handle expressive tags. what kind of tags would be most helpful, in your opinion?

by daneel_w 15 minutes ago

Intonation (frequency rise/fall) would offer a lot of versatility.

by altruios 3 hours ago

Narrowing it down, emotion-based tag control would be the most helpful. Tags like [sarcastically] [happily] [joyfully] [fearfully]: so a subset of adverbs.

A stretch goal is 'arbitrary tags' from [singing] [sung to the tune of {x}] [pausing for emphasis] [slowly decreasing speed for emphasis] [emphasizing the object of this sentence] [clapping] [car crash in the distance] [laser's pew pew].

But yeah: instruction/control via [tags] is the deciding feature for me, provided prompt adherence is strong enough.

Also: a thought...

Everyone is using [] for different kinds of tags in this space, which is very simple. Maybe it makes sense to differentiate kinds of tags? i.e. [tags for modifying how text is spoken] vs {tags for creating sounds that aren't specifically speech: not modifying anything, but instead their own 'sound/word'}.
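The []-vs-{} distinction proposed above is easy to prototype on the parsing side, independent of any model support. A toy sketch (the tag names and event format are invented for illustration):

```python
import re

# [square] tags modify how the following span is spoken; {curly} tags stand
# for non-speech sounds, per the distinction proposed above.
TAG_RE = re.compile(r"\[([^\]]+)\]|\{([^}]+)\}|([^\[\]{}]+)")

def parse_tagged(text):
    """Split text into (kind, value) events: 'modifier', 'sound', or 'speech'."""
    events = []
    for mod, sound, speech in TAG_RE.findall(text):
        if mod:
            events.append(("modifier", mod))
        elif sound:
            events.append(("sound", sound))
        elif speech.strip():
            events.append(("speech", speech.strip()))
    return events

print(parse_tagged("[sarcastically] Great job. {door slams}"))
# [('modifier', 'sarcastically'), ('speech', 'Great job.'), ('sound', 'door slams')]
```

A TTS front-end could then route 'modifier' events into conditioning and 'sound' events to a separate effects model, which is the practical payoff of keeping the two syntaxes distinct.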

by rohan_joshi 2 hours ago

yeah i think narrowing it down to a few tags would be most helpful, and we'll probably start with that. Thanks a lot!

by ks2048 4 hours ago

There are a number of recent, good-quality, small TTS models.

If the author doesn't describe some detail about the data, training, a novel architecture, etc., I can only assume they just took another one, did a little finetuning, and repackaged it as a new product.

by the_duke 4 hours ago

Any recommendations?

by Joel_Mckay an hour ago

Depends how small or complex you want the TTS; flite + flitevox voice packages worked just fine on Pi or Zynq ARM CPUs. =3

Also:

https://github.com/sparkaudio/spark-tts

by Remi_Etien 3 hours ago

25MB is impressive. What's the tradeoff vs the 80M model: is it mainly voice quality, or does it also affect pronunciation accuracy on less common words?

by rohan_joshi 2 hours ago

the 80M model is the highest quality while also being quite efficient. it is superior in terms of pronunciation accuracy for less common words, and is also more stable in terms of speed. it's my fav model. i think the 40M is quite similar to the 80M for most use cases. the 15M is for resource-constrained CPUs, loading into a browser, etc.

The new 15M is way better than the previous 80M model (v0.1). So we're able to predictably improve quality, which is very encouraging.

by DavidTompkins 3 hours ago

This would be great as a JS package: 25MB is small enough that I think it'd be worth it (in-browser TTS is still pretty bad and varies by browser).

by rohan_joshi 2 hours ago

great idea, we're on it. we're also working on a mobile sdk. a browser sdk would be really cool too.

by schopra909 2 hours ago

Really cool to see innovation in terms of quality of tiny models. Great work!

by rohan_joshi 2 hours ago

thanks a lot. small model quality is improving exponentially. This 15M is way better than the 80M model from our previous launch (V0.1).

by devinprater 3 hours ago

A lot of these models struggle with small text strings, like "next button" that screen readers are going to speak a lot.

by soco 3 hours ago

I think I tried everything I could on my Android, and: 1. outside of webpage reading, not many options; 2. as browser extensions, also not many (I don't like copying URLs into your app); 3. they all insist on reading every little shit, not only buttons but also "wave arrow pointing directly right", which some people use in their texts. So basically reading text aloud offers a bunch of shitty options. Anyone jumping on this market opening?

by rohan_joshi 2 hours ago

we'd love to serve this use-case. i'll make a demo for this next week and comment here with it.
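The "wave arrow" problem is usually handled by filtering decorative symbol characters out of the text before it reaches the TTS engine. A minimal sketch using Unicode categories (a heuristic only; real screen readers apply more nuanced verbosity rules):

```python
import re
import unicodedata

def strip_decorative(text):
    """Drop symbol characters (Unicode category S*) that a TTS pipeline
    would otherwise verbalize, e.g. arrows and dingbats, then collapse
    the leftover whitespace."""
    kept = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("S")
    )
    return re.sub(r"\s+", " ", kept).strip()

print(strip_decorative("Click next \u21dd to continue"))
# Click next to continue
```

Category "S" covers math symbols, currency signs, and modifier symbols too, so a production filter would likely whitelist characters like $ and % rather than dropping all of them.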

by sschueller 2 hours ago

I'm still looking for the "perfect" setup in order to clone my voice and use it locally to send voice replies in telegram via openclaw. Does anyone have such a setup?

I want to be my own personal assistant...

EDIT: I can provide it an RTX 3080 Ti.

by nicpottier 40 minutes ago

Try training a model with Piper; you will need to record a lot of utterances, but the results are pretty great and the output is a fast TTS model.

by ilaksh 2 hours ago

You need to provide info on your hardware. Pocket-TTS does cloning on CPU, but for me it randomly outputs something pretty weird-sounding mixed in with ~90% good outputs. So it hasn't been quite stable enough to run without checking the output. But maybe it depends on your voice sample.

Qwen 3 TTS is good for voice cloning but requires GPU of some sort.

by justanotherunit 2 hours ago

Isn't it just a matter of training a model on your voice recordings and using that to generate audio clips from text?

by janice1999 2 hours ago

What's the actual install size for a working example? Like similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?

by deathanatos an hour ago

Running the example is 3 MiB for the repo, +667 MiB of Python dependencies, +86 MiB of models that will get downloaded from HuggingFace. =756 MiB.

(That's using the example as-is. If you switch it to the smaller model, modify the above with +57 MiB of models from HuggingFace, or =727 MiB.)

So I toyed with this a bit + the Rust library "ort", and ort is only 224M in release (non-debug) mode, and it was pretty simple to run this model with it. (I did not know ort before just now.) I didn't replicate the preprocessing the Python does before running the model, though. (You have to turn the text into an array of floats, essentially; the library is doing text -> phonemes -> tokens; the latter step is straightforward.)
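The "tokens" step mentioned above can be as small as a dictionary lookup. A toy illustration (this vocabulary is invented; the real model ships its own mapping, and the final input dtype depends on the model's signature):

```python
# Toy phoneme-to-id lookup illustrating the final preprocessing step.
# This vocabulary is invented for illustration; a real model ships its own.
VOCAB = {"_": 0, "h": 1, "@": 2, "l": 3, "oU": 4}

def phonemes_to_tokens(phonemes, vocab=VOCAB, pad="_"):
    """Map a phoneme sequence to the integer ids the model consumes,
    padding the sequence on both sides as many TTS front-ends do."""
    padded = [pad, *phonemes, pad]
    return [vocab[p] for p in padded]

# "hello" as a (simplified) phoneme sequence:
print(phonemes_to_tokens(["h", "@", "l", "oU"]))
# [0, 1, 2, 3, 4, 0]
```

The hard part of the pipeline is the text -> phonemes stage (typically a G2P library such as espeak-ng or a pretrained phonemizer), not this lookup.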

by wedowhatwedo an hour ago

My quick test showed ~670MB of Python libraries required on top of the model.

by gabrielcsapo an hour ago

are there plans to output text alignment?

by rohan_joshi an hour ago

yes, we just started working on this yesterday haha, great that you mentioned it. once we have it working it'll be out soon.

by gabrielcsapo an hour ago

that would be awesome. I was using pockettts, but then I had to run it through whisper to get accurate alignment. Not super productive for realtime work.

by fwsgonzo 4 hours ago

How much work would it be to use the C++ ONNX runtime with this instead of Python? Is it a Claudeable amount of work?

The iOS version is Swift-based.

by rohan_joshi 4 hours ago

shouldn't be hard. what backend/hardware are you interested in running this with? i'll add an example using the C++ onnx runtime. btw, check out the roadmap; our inference engine will be out in 1-2 weeks and is expected to be faster than onnx.

by fwsgonzo an hour ago

desktop CPUs running inference on a single background thread would be the ideal case for what I'm considering.

by great_psy 4 hours ago

Thanks for working on this!

Is there any way to get these running on iPhone? I would love to have it read articles to me like a podcast.

by rohan_joshi 4 hours ago

yes, we're releasing an official mobile sdk and inference engine very soon. if you want something until then, some folks from the oss community have built ways to run kitten on ios; if you search kittentts ios on github you should find a few. if you can't find one, feel free to ping me and i can help you set it up. thanks a lot for your support and feedback!

by whitepaper27 2 hours ago

This is great. Demo looks awesome.

by rohan_joshi an hour ago

thanks, glad you liked it

by ilaksh 4 hours ago

Thanks for open sourcing this.

Is there any way to do a custom voice as a DIY? Or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.

by rohan_joshi 4 hours ago

thanks a lot for the feedback. yes, we're working on a diy way to add custom voices and will also be releasing a model with more professional voices in the next 2-3 weeks. as of now, we're providing commercial support for custom voices, languages and deployment through the support form on our github. can you share more about your business use-case? if possible, i'd like to ensure the next release can serve that.

by ilaksh 2 hours ago

Right now it's outgoing calls for a small business client that checks information. Although if they call back they don't mind an automated system, on outgoing calls the person answering will often hang up if they detect AI right away, so we use a realistic custom voice with an accent.

This is a mind numbing task that requires workers to make hundreds of calls each day with only minor variations, sometimes navigating phone trees, half the time leaving almost the exact same message.

Anyway, I believe almost all such businesses will be automated within months. Human labour just cannot compete on cost.

by Tacite 4 hours ago

Is it English only?

by rohan_joshi 4 hours ago

as of now it's english only. training for the multilingual model is underway and it should be out in April! what languages are you most interested in? right now, we are providing deployments for custom languages + voices through the support form on our github.

by ivm 43 minutes ago

Spanish would be great; there's a serious lack of Spanish TTS on Android compared to iOS, and the quality is not the best.

by Zopieux 2 hours ago

French, Spanish, German would go a long way.

by wiradikusuma 3 hours ago

I'm thinking of giving "voice" to my virtual pets (think Pokemon, but fewer than a dozen). The pets are made-up animals but based on real animals, like Mouseier from Mouse (something like that). Is this possible?

Tl;dr: generate a human-like voice based on an animal's sound. Anyway, maybe it doesn't make sense.

by rohan_joshi 2 hours ago

it'd be an interesting experiment to see what kind of information gets extracted from samples of the pet sounds. it'd be so cool if it could just capture the features of the audio and still reproduce it in english lol. we would need a really good "speaker" encoder i think.

Data from: Hacker News, provided by Hacker News (unofficial) API