A multi-modal on-device assistant that can see
This multi-modal AI assistant runs 100% on-device, with no cloud. It understands loosely phrased commands and can even see the world around it.
- Synaptics Astra processors are designed to support multi-modal AI in fanless IoT applications.
- In the demo, a vision model is accelerated with an on-chip NPU (the Synap compiler allows easy import of models like YOLO11), and the agent accesses it via tool calling.
- The assistant runs on 4x Arm Cortex-A73 cores. In the demo, it semantically matched "bring the room into darkness" to "light off".
- The demo is literally a Synaptics Astra SL1680 board connected to a USB webcam and speakerphone. A more compact form factor is clearly possible.
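To make the "bring the room into darkness" to "light off" match concrete, here is a minimal sketch of embedding-based command matching. It is not the code from the repo: the 3-dimensional vectors are made-up stand-ins for real sentence-transformer embeddings, and `COMMANDS`, `match_command`, and the `threshold` value are hypothetical names and numbers chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# ASSUMPTION: toy 3-d vectors standing in for real sentence embeddings.
# A real system would compute these once with an embedding model.
COMMANDS = {
    "light on": [0.9, 0.1, 0.0],
    "light off": [-0.9, 0.1, 0.0],
    "play music": [0.0, 0.1, 0.9],
}

def match_command(query_vec, commands=COMMANDS, threshold=0.5):
    """Return (best_command, score), or (None, score) if nothing is similar enough."""
    best, score = max(
        ((name, cosine(query_vec, vec)) for name, vec in commands.items()),
        key=lambda pair: pair[1],
    )
    return (best, score) if score >= threshold else (None, score)

# ASSUMPTION: pretend this is the embedding of "bring the room into darkness".
query = [-0.8, 0.2, 0.1]
command, score = match_command(query)
print(command)
```

The threshold is what keeps the assistant from acting on an utterance that is not close to any known command; the right value would need tuning against real embeddings.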
Under the hood it's the on-device assistant I built at Synaptics: voice activity detection, Moonshine speech-to-text (~5× faster than Whisper), sentence-transformer embeddings matched by semantic search for responses, and Piper text-to-speech. It is fast (as low as ~500 ms) and grounded, since answers come from semantic Q&A matching rather than open-ended generation. When a query needs information about the world around it, the agent tool-calls the NPU-accelerated vision model and feeds the result back into the loop.
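The tool-calling step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the repo's implementation: `describe_scene` is a stub standing in for the NPU-accelerated vision model, and `TOOLS`, `RESPONSES`, and `respond` are hypothetical names. It assumes the spoken query has already been transcribed and matched to a known question.

```python
from typing import Callable, Dict

def describe_scene() -> str:
    """Stub for the vision tool. A real version would run a detector on a camera frame."""
    return "person, cup"

# Registry of tools the agent may call, keyed by name (hypothetical).
TOOLS: Dict[str, Callable[[], str]] = {"vision": describe_scene}

# Matched Q&A entries: either a fixed answer, or a template filled from a tool result.
RESPONSES = {
    "what is your name": {"answer": "I am an on-device assistant."},
    "what do you see": {"tool": "vision", "answer": "I can see: {result}."},
}

def respond(matched_question: str) -> str:
    """Build the reply for a question that semantic search has already matched."""
    entry = RESPONSES.get(matched_question)
    if entry is None:
        return "Sorry, I don't know that one."
    if "tool" in entry:
        # Call the tool and feed its result back into the answer.
        result = TOOLS[entry["tool"]]()
        return entry["answer"].format(result=result)
    return entry["answer"]

print(respond("what do you see"))
```

Because every reply comes from a stored entry or a tool result, the output stays grounded; the text-to-speech stage would then speak whatever `respond` returns.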
The prototype is all Python and off-the-shelf models, so there is a lot more potential here. The full code is on GitHub: on-device-ai-assistant.