2024-12-11

A multi-modal on-device assistant that can see

This multi-modal AI assistant runs 100% on-device, with no cloud involved. It understands loosely phrased commands and can even see the world around it.

Synaptics Astra processors are designed to support multi-modal AI in fanless IoT applications.
In the demo, a vision model is accelerated with an on-chip NPU (the Synap compiler allows easy import of models like YOLO11), and the agent accesses it via tool calling.
The assistant itself runs on four Arm Cortex-A73 cores. In the demo, it semantically matched "bring the room into darkness" to "light off".
The demo is literally a Synaptics Astra SL1680 board connected to a USB webcam and speakerphone. A more compact form factor is clearly possible.

Under the hood it's the on-device assistant I built at Synaptics: voice activity detection, Moonshine speech-to-text (~5× faster than Whisper), sentence-transformer embeddings matched by semantic search for responses, and Piper text-to-speech. It is fast (as low as ~500 ms) and grounded, since answers come from semantic Q&A matching rather than open-ended generation. When a query depends on what the camera sees, the agent tool-calls the NPU-accelerated vision model and feeds the result back into the loop.
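
To make the "semantic Q&A matching" idea concrete, here is a minimal sketch of the matching step. It is not the code from the repository. The Q&A entries, the `match` function and the 0.5 threshold are all hypothetical, and `embed` is a toy bag-of-words stand-in for the sentence-transformer model so the example runs without any downloads. A word-count vector cannot link "bring the room into darkness" to "light off"; that paraphrase needs a real embedding model. The lookup logic around it (embed the query, take the nearest stored question by cosine similarity, reject weak matches) stays the same.

```python
import math
import re
from collections import Counter

# Hypothetical grounded Q&A set: (example phrasing, fixed response).
QA_SET = [
    ("turn the light off", "light off"),
    ("turn the light on", "light on"),
    ("what is the temperature", "It is 21 degrees."),
]


def embed(text):
    """Toy stand-in for a sentence-transformer: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Embed the stored questions once, up front, so each query only costs
# one embedding plus a handful of similarity computations.
INDEX = [(embed(question), answer) for question, answer in QA_SET]


def match(query, threshold=0.5):
    """Return (answer, score) for the closest stored question.

    The answer is None when the best score falls below the threshold,
    which is what keeps the assistant from answering off-topic queries.
    """
    q = embed(query)
    score, answer = max(
        ((cosine(q, vec), ans) for vec, ans in INDEX), key=lambda pair: pair[0]
    )
    return (answer, score) if score >= threshold else (None, score)
```

With this toy embedding, `match("please turn the light off")` resolves to "light off", while an unrelated query such as "sing me a song" falls below the threshold and returns no answer. Because every response comes from the stored set, the assistant can only say things that were written in advance, which is the sense in which it is grounded.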

Everything runs on the Astra board: speech in, a semantic match against a grounded Q&A set, an optional vision tool call on the NPU, then speech out, in roughly 500 to 600 ms.
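
One turn of that loop can be sketched as follows, again as an illustration under stated assumptions rather than the actual implementation. `Entry`, `QA`, `TOOLS`, `respond` and `run_turn` are hypothetical names, `fake_vision_tool` stands in for the NPU-accelerated vision model, and the speech-to-text, matcher and text-to-speech stages are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Entry:
    """A grounded response, optionally backed by a tool call."""
    reply: str                   # may contain a {result} placeholder
    tool: Optional[str] = None   # name of the tool to call, if any


def fake_vision_tool() -> str:
    """Hypothetical stand-in for the vision model running on the NPU."""
    return "a cup and a keyboard"


# Hypothetical tool registry and Q&A set, keyed by the matched intent.
TOOLS: Dict[str, Callable[[], str]] = {"vision": fake_vision_tool}

QA: Dict[str, Entry] = {
    "light off": Entry("Turning the light off."),
    "what do you see": Entry("I can see {result}.", tool="vision"),
}


def respond(matched_key: Optional[str]) -> str:
    """Turn a matched Q&A key into reply text, calling a tool if needed."""
    if matched_key is None or matched_key not in QA:
        return "Sorry, I did not catch that."
    entry = QA[matched_key]
    if entry.tool is None:
        return entry.reply
    # Tool call: run the tool and feed its result back into the reply.
    result = TOOLS[entry.tool]()
    return entry.reply.format(result=result)


def run_turn(audio, stt, match, tts):
    """One turn: speech in -> semantic match -> optional tool -> speech out."""
    text = stt(audio)
    key = match(text)
    return tts(respond(key))
```

Only the entries that name a tool pay for the vision call. Everything else goes straight from match to reply, which fits the vision step being optional.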

The prototype is all Python and off-the-shelf models, so there is still a lot of untapped potential. The full code is on GitHub: on-device-ai-assistant.

#edge-ai #llm #arm