Dominic Pajak · Blog

2024-12-11

A multi-modal on-device assistant that can see

This multi-modal AI assistant runs 100% on-device, with no cloud. It understands loosely phrased commands and can even see the world around it.

Under the hood it's the on-device assistant I built at Synaptics: voice activity detection, Moonshine speech-to-text (~5× faster than Whisper), sentence-transformer embeddings matched by semantic search for responses, and Piper text-to-speech. It is fast (as low as ~500 ms) and grounded, since answers come from semantic Q&A matching rather than open-ended generation. When a query depends on the world around the device, the agent tool-calls the NPU-accelerated vision model and feeds the result back into the loop.
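To make the "grounded" part concrete, here is a minimal sketch of semantic Q&A matching with an optional tool-call. It is not the code from the repository: the `QA_SET` entries, the `respond` function, the `0.5` threshold, and the `describe_scene` stub are all hypothetical, and `embed` is a toy bag-of-words stand-in for a real sentence-transformer, so unlike the real thing it only matches on shared words and will not handle true paraphrases.

```python
import math
import re
from collections import Counter

# Hypothetical grounded Q&A set. Entries with a "tool" key defer to a tool-call
# instead of returning a canned answer.
QA_SET = [
    {"question": "what is your name", "answer": "I am an on-device assistant."},
    {"question": "do you need the cloud", "answer": "No, everything runs locally."},
    {"question": "what can you see in front of you", "tool": "vision"},
]


def embed(text):
    """Toy stand-in for a sentence-transformer: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def describe_scene():
    """Placeholder for the NPU vision model; returns a fixed description."""
    return "a coffee mug"


# Hypothetical tool registry keyed by the "tool" field in QA_SET.
TOOLS = {"vision": lambda: f"I can see {describe_scene()}."}


def respond(query, threshold=0.5):
    """Return the answer of the closest stored question, or a fallback."""
    q = embed(query)
    scored = [(cosine(q, embed(e["question"])), e) for e in QA_SET]
    score, best = max(scored, key=lambda pair: pair[0])
    if score < threshold:
        # Nothing close enough: refuse rather than generate an answer.
        return "Sorry, I don't know that one."
    if "tool" in best:
        # The matched entry needs live data, so call the tool and use its result.
        return TOOLS[best["tool"]]()
    return best["answer"]
```

Calling `respond("What is your name?")` returns the stored answer, `respond("what can you see")` goes through the vision stub, and an unrelated query falls below the threshold and gets the fallback. The threshold is what keeps answers grounded: a query that matches nothing gets a refusal instead of a guess.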

[Figure: pipeline of the on-device assistant, from speech in through VAD, Moonshine speech-to-text, sentence-transformer embeddings and semantic search, a grounded response with an optional NPU vision tool-call, and Piper text-to-speech, to speech out]
Everything runs on the Astra board: speech in, semantic match against a grounded Q&A set, with an optional vision tool-call on the NPU, then speech out in roughly 500 to 600 ms.
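A latency figure like this is easiest to reason about stage by stage. The sketch below chains stub stages and records how long each one takes. The stage names mirror the pipeline above, but `run_pipeline`, `STAGES`, and every stage body are hypothetical placeholders, not the real VAD, Moonshine, or Piper calls.

```python
import time


def run_pipeline(audio, stages):
    """Run stages in order, feeding each output to the next.

    Returns the final output and a dict of per-stage wall-clock times in ms.
    """
    timings = {}
    data = audio
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return data, timings


# Placeholder stages: each one only mimics the input and output types of the
# real component so the chain can run end to end.
STAGES = [
    ("vad", lambda audio: audio),                            # would trim silence
    ("stt", lambda speech: "what is your name"),             # would transcribe
    ("match", lambda text: "I am an on-device assistant."),  # would do semantic search
    ("tts", lambda reply: reply.encode("utf-8")),            # would synthesize audio
]

if __name__ == "__main__":
    speech_out, timings = run_pipeline(b"\x00\x01", STAGES)
    for name, ms in timings.items():
        print(f"{name:>5}: {ms:.3f} ms")
    print(f"total: {sum(timings.values()):.3f} ms")
```

With real models swapped in for the stubs, the per-stage breakdown shows where the end-to-end time is going.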

The prototype is all Python and off-the-shelf models, so there is plenty of untapped potential. The full code is on GitHub: on-device-ai-assistant.

#edge-ai #llm #arm