Design · Rev 1.7 · June 2026
A fully private voice assistant and chatbot. The GrapheneOS phone is the microphone and the hands; a Mac mini at home is the brain; a direct, end-to-end-encrypted link joins them. No cloud AI provider sees your audio, your text, or your intent, and the whole stack is self-hostable in one command.
01 · At a glance
This is the settled design. Everything here is open weights or open source, runs on hardware you own, and degrades to an on-device-only fallback when the network is gone.
docker compose up. Open source.02 · Goals & constraints
The hardware, the encrypted transport, the models, and how a spoken sentence becomes an action and a reply.
03 · System architecture
The phone runs the SOTTO app, the always-available voice client and the hands. The Mac mini is a stateful inference server on the home LAN. iroh makes the Mac reachable from anywhere by dialing its public key — a direct encrypted link with relay fallback; the next section covers the transport in full.
04 · Networking
The phone reaches the Mac through iroh: a direct, end-to-end-encrypted peer-to-peer link where you dial the Mac by its public key rather than an IP address. It runs inside the app, not as a system VPN, so it never touches Android's single VPN slot — which leaves SOTTO free of any particular network setup.
iroh needs no always-on service brokering your keys. Peers authenticate each other directly by public key, and the only optional infrastructure is a relay — used when two peers cannot punch a direct path — which forwards ciphertext and can be a free community relay or one you self-host. For a self-hosted setup that means no third party sits astride your identity the way a hosted control plane would: you hold the keys, and the Mac still exposes no public endpoint. Anonymizing your general internet traffic, if you want it, is a separate and independent choice rather than part of the link.
05 · Request lifecycle
A turn-based loop. The app wakes, captures until you stop talking, and ships compressed audio through the tunnel. The Mac transcribes, reasons, optionally calls back to run an action, then streams spoken audio home. Figures are rough first-token and first-audio estimates over LAN.
plain speech only, no markdown so nothing pollutes the spoken output.
06 · The compute
Unified memory lets the GPU address the whole RAM pool, so a model's footprint is bounded by total memory, not a separate VRAM budget. For an always-on server, low idle power and a silent, palm-sized chassis matter as much as raw speed.
| Config | Unified RAM | ~Price | Verdict |
|---|---|---|---|
| M4 (base) | up to 32GB | $599–999 | Runs 26B MoE + speech. Tight but workable. |
| M4 Pro ◂ chosen | 48GB (max) | $1,999 / $2,199 | The platform. Holds 31B Dense + STT + TTS + macOS with margin. |
| M5 / M5 Pro | TBA | TBA | Slipped past WWDC; expected fall 2026. A documented later upgrade. |
The compute platform is a Mac mini M4 Pro with 48GB of unified memory and a 512GB SSD, about $1,999. It holds the 31B Dense model (~19GB at Q4) alongside the full speech pipeline and macOS with comfortable margin. Because Gemma 4 caps at 31B, more memory would only matter for a larger model family such as a 70B.
07 · The brain
Gemma 4 (April 2026, Apache 2.0) is the standout bang-for-size open family. The choice between sizes is a latency-versus-quality dial, and SOTTO uses three points on it.
| Variant | Profile | Role |
|---|---|---|
| E2B / E4B | 2B / 4B edge · native audio in | On-device fallback. Lives on the phone for no-network use; audio input can skip a separate STT step. |
| 26B MoE ◂ primary | 3.8B active params | Conversational default. Low active-parameter count means fast tokens for real-time dialogue. |
| 31B Dense | frontier-for-size | Quality tier. For harder reasoning or tool planning, at a bit more latency. |
The 26B MoE is the everyday model. The orchestrator routes to 31B Dense for complex requests, and E4B runs on the phone as the offline safety net.
ollama run gemma4, OpenAI-compatible) and graduates to MLX / mlx-lm on the hot path for higher tokens-per-second on Apple Silicon. Both run natively on the Mac for Metal acceleration, a constraint the deployment section turns on.
08 · The ears & the voice
NVIDIA Parakeet V3 is the speed leader on Apple Silicon: it predicts the whole utterance at once for ~80ms-class latency and runs on the Neural Engine via FluidAudio or a parakeet-mlx server. whisper.cpp (Large V3 Turbo, Metal) is the multilingual fallback, and Moonshine is the path to true streaming later.
Kokoro is an 82M-parameter model that sounds close to far larger systems and runs faster than real-time. It is served through Kokoro-FastAPI, which exposes a streaming OpenAI-compatible /v1/audio/speech endpoint so playback starts on the first sentence.
The native app you touch, the permissions you control, the way conversations are organized, and a stack anyone can self-host.
09 · The app stack
SOTTO lives on deep OS integration: the assistant role, a foreground wake-word service, raw audio capture, Shizuku, the notification listener, telephony. That is exactly where cross-platform frameworks pay a bridge tax for no benefit, especially for an Android-only app on GrapheneOS.
| Option | Fit | Verdict |
|---|---|---|
| Native Kotlin + Compose ◂ chosen | Direct access to every Android API; Google's first-class stack with long-term support. | The client. Right for a deeply OS-integrated, Android-only assistant. |
| Flutter | Own rendering engine; deep integrations go through platform channels, a known pain point. Dart is a separate language. | Wrong fit for a system-level assistant. |
| Kotlin Multiplatform | Shares logic while keeping native UI. Only pays off alongside an iOS build. | Overkill now; revisit only for a second platform. |
10 · App architecture
The app is a thin, smart client. The Mac holds the models and the canonical chat, project, and memory store; the app owns capture, playback, the UI, and turning tool calls into device actions. At a glance it is four layers:
▲ renders state, sends user intent down ▼
▲ orchestrates, persists, calls platform + server ▼
▲ talks to the device · talks to the Mac over the tunnel ▼
Each screen follows unidirectional data flow (MVI). A ViewModel exposes a single immutable UiState as a StateFlow; Compose renders it and sends user actions back up as intents; one-shot effects (navigation, errors, a "permission needed" prompt) travel on a separate event channel. For a stateful voice loop this keeps every transition explicit and testable, the UI is a pure function of state.
One Gradle module to start, packaged so feature modules can split out later without touching the domain.
app/src/main/java/…/sotto/ ├── ui/ # Compose screens + ViewModels (MVI) │ ├── voice/ chat/ tools/ setup/ │ └── theme/ ├── domain/ # framework-free logic, pure Kotlin │ ├── VoiceLoopController # the state machine │ ├── ConversationRepository# chats/projects, offline-first │ ├── ToolExecutor + ToolRegistry │ └── SyncEngine ├── data/ │ ├── api/ # iroh client, OpenAI-compatible DTOs, streams │ ├── db/ # Room entities + DAOs │ └── settings/ # DataStore: tool toggles, server, prefs ├── audio/ # AudioEngine: AudioRecord, VAD, AudioTrack ├── platform/ # assistant role, WakeWordService, ShizukuBridge, intents └── di/ # Hilt modules
The heart of the app is a VoiceLoopController driving an explicit state machine. A turn moves Idle → Listening → Transcribe → Thinking → Speaking and back, with a side excursion to Executing whenever the model calls a tool. Every active state owns a coroutine scope, so a barge-in, you start talking again or tap cancel, tears down the scope and snaps back to Idle from any state. If the tunnel is down, Thinking is served by the on-device E4B model instead of the Mac, so the loop still completes.
The ApiClient dials the Mac's iroh endpoint by public key and speaks the orchestrator's OpenAI-compatible protocol over the encrypted QUIC connection. Replies stream back token-by-token so Speaking can start on the first sentence; captured audio uploads as a chunked stream to STT; synthesized speech returns as a streamed response played incrementally through AudioTrack. Each leg is a Kotlin Flow, so cancellation propagates end-to-end the instant a turn is abandoned.
Every tool implements a small interface, an id, a one-line description, a risk tier, the Android permission or Shizuku capability it needs, and a suspending execute(args). The ToolRegistry holds them; ToolExecutor resolves a call against the per-chat enabled set (with project overrides) read from DataStore, prompts for confirmation when the tier requires it, then dispatches to a native API or the ShizukuBridge. Only enabled tools are serialized into the request as callable functions; disabled ones travel as name-and-description only, per the two-tier exposure in FIG 6.
ConversationRepository is the single source the UI reads, and it reads offline-first from Room. Entities mirror the Mac's store, projects, chats, messages, files, and writes apply locally first, then reconcile through SyncEngine (a WorkManager job plus a foreground coroutine while a chat is open) on a last-write-wins basis. Tool toggles, the server address, and preferences live in DataStore rather than Room, since they are small and read on every request.
Everything asynchronous is coroutines and Flow under structured concurrency. Audio capture and playback run on a dedicated high-priority dispatcher, network and disk on IO, and the active turn lives in a session scope the controller cancels on barge-in or teardown, no orphaned streams, no leaked recorders.
| Concern | Library |
|---|---|
| UI | Jetpack Compose · Material 3 |
| Dependency injection | Hilt |
| Async & streams | Kotlin Coroutines · Flow |
| Transport | iroh (P2P QUIC) · token streaming |
| Local data | Room (conversations) · DataStore (settings & toggles) |
| Background sync | WorkManager |
| Audio I/O | AudioRecord + AudioTrack (Oboe if latency demands) |
| Wake word | ONNX Runtime Mobile / LiteRT (openWakeWord) |
| Elevated actions | Shizuku API |
| Serialization | kotlinx.serialization |
11 · Wireframes
The voice overlay invoked by gesture, the chat-and-projects workspace, the tool-and-permission control panel, and first-run setup. Color is meaningful: teal for connected and safe, amber for the active model and primary actions, coral for sensitive permissions.
12 · Tools & the permission model
The assistant is only ever as capable as the switches you flip. Every tool maps to a concrete Android permission or Shizuku grant, carries a risk level, and is off by default. The model is aware of every tool by name, but only the ones you enable are handed to it as callable functions; the rest it can see but never invoke.
Only enabled tools are passed to the model as callable function schemas, which keeps the callable surface small, the context lean, and tool selection accurate. Disabled tools are not callable at all: the model receives only their name and a one-line description, flagged off. So when you ask for something it cannot do, it says so plainly, for example "I don't have access to Send SMS right now; you can enable it in the SOTTO app's Tools screen," rather than inventing a call it cannot make.
| Tool | Backed by | Risk |
|---|---|---|
| Send SMS · read notifications · place calls | SMS / notification listener / phone permissions | med–high |
| Alarms & timers · calendar read/write | AlarmManager · calendar permission | low–med |
| Open app · navigation · share · web search | Android intents · server search | low |
| Clipboard · battery · media · flashlight | Native APIs | low |
| Toggle settings · location · app control | Shizuku (ADB-level, no root) | high |
| Knowledge / RAG over your docs | Server-side, on the Mac | low |
13 · Chats, projects & memory
Conversations are organized like a workspace, and what the model knows in any turn is assembled from nested scopes. The Mac holds the canonical store; the app caches it for offline reading and syncs over the tunnel.
A SQLite database (Postgres in the container stack) on the Mac, owned by the orchestrator, is the source of truth. The app mirrors what it needs into Room for offline reading and queues writes back on reconnect, last-write-wins.
The orchestrator composes the prompt from the inside out: system prompt, the enabled tools as callable functions plus the disabled ones as a visible-only catalog, then project instructions if the chat belongs to one, then relevant global memories (future), then retrieved RAG chunks from project files or past chats, then the recent message window. Everything stays on your hardware; "memory" is a row in your database, not a vendor feature.
14 · The deployable stack
The server ships two ways: a native Mac app for the best experience on Apple Silicon, and a Docker Compose stack for Linux and the portable reference. The split exists because of one hard constraint.
| Target | What runs where | For whom |
|---|---|---|
| Mac app ◂ best on Apple Silicon | A SwiftUI menu-bar supervisor bundles and launches native Ollama (Metal) and native Parakeet (Neural Engine), and runs Kokoro, the orchestrator, and the tunnel client alongside. One .dmg, open, done. | Non-technical self-hosters; Mac mini owners who want top speed. |
| Docker Compose | On Linux + NVIDIA: everything containerized with in-container GPU. On Mac: containers run orchestrator + Kokoro and reach host-native Ollama via host.docker.internal. | Linux hosts, tinkerers, the portable reference build. |
# one repo, both install paths sotto/ ├── app/ # Android · Kotlin + Compose ├── server/ │ ├── orchestrator/ # Python FastAPI: agent loop + chat/project/memory store │ ├── tts/ # Kokoro-FastAPI │ ├── stt/ # whisper.cpp server (Parakeet runs native on Mac) │ └── docker-compose.yml ├── mac-app/ # SwiftUI supervisor; bundles native Ollama + Parakeet ├── deploy/ │ └── compose.linux-nvidia.yml # full in-container GPU └── README.md
services: orchestrator: # agent loop + OpenAI-compatible front door tts: # kokoro-fastapi stt: # whisper.cpp (Linux GPU) — Parakeet/ANE on Mac instead # ollama: in-container on Linux/NVIDIA; on Mac point to # OLLAMA_HOST=http://host.docker.internal:11434 (native, Metal)
A Linux user with an NVIDIA card runs docker compose up and the whole thing comes alive, GPU included. A Mac user opens the app, or runs ollama serve natively with docker compose up for the rest. The orchestrator speaks the OpenAI API, so contributors swap any piece without touching the app.
15 · Build order
The iroh link between phone and Mac, dialing by key, the phone reaching the Mac directly from cellular with relay fallback. Nothing else works until this does.
Ollama plus Gemma 4 and the Python orchestrator with the chat/project store and an OpenAI-compatible endpoint, hit from the phone with a one-line script.
The Compose UI, chats and projects synced to the Mac, text in and out. A clean chatbot before it ever hears you.
Parakeet and Kokoro on the Mac; the app captures and plays audio. Push-to-talk first, then openWakeWord in a foreground service.
The assistant role for gesture invocation, the tool registry, and the permissions screen, wiring native actions and the Shizuku bridge behind their toggles.
The server wrapped as the Mac app and the compose files, with quickstarts, pinned versions, and defaults. The public release.
The future release: the memories table, retrieval, and the management UI, scoped per project, fully user-controlled.
16 · Security & privacy posture
17 · Bill of materials