2024-12-16

Building a speech-to-speech assistant with only on-device AI

How did we build a speech-to-speech assistant using only on-device AI? This video shows how.

A sentence transformer takes your question and generates an embedding, a high-dimensional vector that represents the meaning of the question. This enables semantic search for queries with a similar meaning, so it doesn't matter if the user doesn't use the exact phrase.

The encoder-only sentence transformer is faster to run than a full LLM, and avoids hallucinations.
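The matching step can be sketched roughly as below. This is a minimal illustration, not the demo's actual code: the `INTENTS` table, the `best_match` helper, the 0.6 threshold, and the toy embedder are all assumptions made up for the example. The toy embedder only counts shared words so that the sketch runs with the standard library alone; a real sentence transformer (for example the `encode` method of a `SentenceTransformer` model from the `sentence-transformers` package) could be passed in as `embed` instead, and would also match paraphrases that share no words.

```python
import math
import re
from typing import Callable, Dict, List, Optional, Tuple

Embedder = Callable[[str], List[float]]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def make_toy_embedder(corpus: List[str]) -> Embedder:
    """Stand-in for a sentence transformer (assumption for this sketch).

    Builds word-count vectors over the corpus vocabulary. It is NOT
    semantic; it only keeps the example self-contained.
    """
    vocab = sorted({word for text in corpus for word in tokenize(text)})
    index = {word: i for i, word in enumerate(vocab)}

    def embed(text: str) -> List[float]:
        vec = [0.0] * len(vocab)
        for word in tokenize(text):
            if word in index:  # words outside the vocabulary are ignored
                vec[index[word]] += 1.0
        return vec

    return embed


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(
    question: str,
    intents: Dict[str, str],
    embed: Embedder,
    threshold: float = 0.6,  # hypothetical cut-off, tune per application
) -> Tuple[Optional[str], float]:
    """Return the stored response whose known question is closest in meaning."""
    q = embed(question)
    scores = {known: cosine(q, embed(known)) for known in intents}
    known = max(scores, key=scores.get)
    if scores[known] < threshold:
        return None, scores[known]  # nothing close enough: no canned answer
    return intents[known], scores[known]


# Hypothetical known questions and fixed responses for an appliance.
INTENTS = {
    "start the wash cycle": "Starting the wash cycle.",
    "what is the remaining time": "About 20 minutes remain.",
}

embed = make_toy_embedder(list(INTENTS))
reply, score = best_match("please start the wash cycle now", INTENTS, embed)
print(reply, round(score, 2))
```

Because the response is looked up from a fixed table rather than generated, the assistant can only say what it has been given, and a question that falls below the threshold gets no canned answer at all.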

The pipeline

Any appliance, in the home, retail, or industry, that would benefit from a voice UI can now run one fully on-device:

Speech-to-text: Moonshine from Useful Sensors, around 5x faster than Whisper, with better accuracy.
Response: context-specific understanding with an encoder-only language model (Hugging Face SmolLM2).
Text-to-speech: Piper, from the Open Home Foundation.
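The three stages above chain together in a straight line. The sketch below shows one way to wire them, assuming each stage is wrapped in a plain Python callable. The `VoicePipeline` class and the `fake_*` stubs are hypothetical names for illustration; they do not reflect the actual Moonshine, SmolLM2, or Piper APIs, which would sit behind these callables.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Speech in, speech out, with every stage running locally."""

    transcribe: Callable[[bytes], str]   # speech-to-text stage
    respond: Callable[[str], str]        # question text -> response text
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def run(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        reply = self.respond(text)
        return self.synthesize(reply)


# Stub stages (assumptions) so the sketch runs without any models installed.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are the transcript


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are audio samples


pipeline = VoicePipeline(fake_transcribe, fake_respond, fake_synthesize)
audio_out = pipeline.run(b"lights on")
```

Keeping each stage behind a simple callable means any one of them can be swapped or optimised without touching the other two.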

Note: this demo is written in Python with off-the-shelf models, so there's still plenty of optimisation potential.

#edge-ai #llm
