Building a speech-to-speech assistant with only on-device AI
This video shows how we built a speech-to-speech assistant that runs entirely on on-device AI.
A sentence transformer takes your question and generates an embedding: a high-dimensional vector that represents the meaning of the question. This enables semantic search for queries with a similar meaning, so it doesn't matter if the user doesn't use the exact phrase.
The encoder-only sentence transformer is faster to run than a full LLM and, because it does not generate text, it avoids hallucinations.
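As a rough illustration of the lookup described above, here is a minimal sketch. It assumes the assistant maps each question to a pre-written answer by comparing embeddings with cosine similarity. The names `best_answer`, `toy_embed` and `TOY_VECTORS`, the 0.7 threshold, and the coffee-machine questions are all hypothetical, and the 3-D vectors are hand-made stand-ins for what a real sentence transformer would produce.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_answer(
    question: str,
    embed: Callable[[str], Vector],
    faq: List[Tuple[Vector, str]],
    threshold: float = 0.7,  # assumed cut-off, would need tuning per model
) -> Optional[str]:
    """Return the stored answer whose question embedding is closest.

    Returns None when no entry scores at or above `threshold`, so the
    caller can reply "Sorry, I don't know" instead of guessing.
    """
    query = embed(question)
    best_score, best_text = -1.0, None
    for vector, text in faq:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_score, best_text = score, text
    return best_text if best_score >= threshold else None


# Hypothetical hand-made "embeddings"; a real model returns hundreds of dimensions.
TOY_VECTORS = {
    "How do I descale the machine?": [0.9, 0.1, 0.0],
    "How do I remove limescale?": [0.8, 0.2, 0.1],
    "What is the warranty period?": [0.0, 0.1, 0.9],
    "What's the weather tomorrow?": [0.1, 0.9, 0.1],
}


def toy_embed(text: str) -> Vector:
    """Stand-in for a sentence transformer's encode step."""
    return TOY_VECTORS[text]


faq = [
    (toy_embed("How do I descale the machine?"), "Run the descaling programme."),
    (toy_embed("What is the warranty period?"), "The warranty is two years."),
]

in_domain = best_answer("How do I remove limescale?", toy_embed, faq)
out_of_domain = best_answer("What's the weather tomorrow?", toy_embed, faq)
```

The differently worded limescale question still lands on the descaling answer, while the unrelated weather question falls below the threshold and returns `None`. In a real build, `toy_embed` would be replaced by the sentence transformer's encoding call.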
# The pipeline
Any appliance, in the home, retail, or industry, that would benefit from a voice UI can now run one fully on-device:
- Speech-to-text: Moonshine from Useful Sensors, around 5x faster than Whisper, with better accuracy.
- Response: context-specific understanding with an encoder-only language model (Hugging Face SmolLM2).
- Text-to-speech: Piper, from the Open Home Foundation.
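The three stages above chain together naturally. The sketch below shows only that wiring, with each stage injected as a plain callable. `VoicePipeline` and the `fake_*` functions are hypothetical names, and the stubs stand in for the real models; none of this reflects the actual Moonshine, SmolLM2, or Piper APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three on-device stages; each one is just a callable."""

    speech_to_text: Callable[[bytes], str]  # e.g. a wrapper around the STT model
    respond: Callable[[str], str]           # e.g. an embedding-based answer lookup
    text_to_speech: Callable[[str], bytes]  # e.g. a wrapper around the TTS model

    def handle(self, audio_in: bytes) -> bytes:
        """Run one utterance through all three stages and return reply audio."""
        text = self.speech_to_text(audio_in)
        reply = self.respond(text)
        return self.text_to_speech(reply)


# Stub stages so the sketch runs without any models installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


def fake_respond(text: str) -> str:
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the text bytes are audio


pipeline = VoicePipeline(fake_stt, fake_respond, fake_tts)
audio_out = pipeline.handle(b"how do I descale the machine?")
```

Keeping each stage behind a simple callable means any one model can be swapped or optimised without touching the other two.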
Note: this demo is written in Python with off-the-shelf models, so there is still plenty of room for optimisation.