Design · Rev 1.7 · June 2026

SOTTO.

A fully private voice assistant and chatbot. The GrapheneOS phone is the microphone and the hands; a Mac mini at home is the brain; a direct, end-to-end-encrypted link joins them. No cloud AI provider sees your audio, your text, or your intent, and the whole stack is self-hostable in one command.

BRAIN · Gemma 4 · Parakeet · Kokoro CLIENT · Kotlin + Compose SELF-HOST · Mac app + Docker

01 · At a glance

Every decision, on one screen

This is the settled design. Everything here is open weights or open source, runs on hardware you own, and degrades to an on-device-only fallback when the network is gone.

Compute
Mac mini M4 Pro
48GB unified memory, 512GB SSD. ~$1,999, about $29/year to run.
Brain
Gemma 4 · 26B MoE
3.8B active params for fast turns. 31B Dense is the quality tier.
Serving
Ollama → MLX
Native on Apple Silicon for Metal acceleration. OpenAI-compatible.
Speech → Text
Parakeet V3
~80ms on the Neural Engine. Whisper on Metal is the fallback.
Text → Speech
Kokoro 82M
High quality, streams faster than real-time.
Transport
iroh (P2P)
Direct, end-to-end-encrypted link to the Mac, dialed by public key. No VPN required.
Client
Native SOTTO app
Kotlin + Compose. Claims the Android assistant role.
Hands (phone)
Native APIs + Shizuku
In-app actions, Shizuku for elevated. No root.
Fallback
Gemma 4 E4B on-device
4B edge model with native audio in. Works with zero network.
Self-host
Mac app + Docker
One-click .dmg, or docker compose up. Open source.

02 · Goals & constraints

What this is optimizing for

Part I
The System

The hardware, the encrypted transport, the models, and how a spoken sentence becomes an action and a reply.

03 · System architecture

Topology: two trusted devices, one tunnel

The phone runs the SOTTO app, the always-available voice client and the hands. The Mac mini is a stateful inference server on the home LAN. iroh makes the Mac reachable from anywhere by dialing its public key — a direct encrypted link with relay fallback; the next section covers the transport in full.

FIG 1 Device topology & trust boundary
SOTTO app · GrapheneOS Assistant role + wake openWakeWord · foreground Capture + VAD ▸ AudioRecord · stream to Mac Tool executor SMS · alarm · intent · Shizuku ◂ Playback · speaker Encrypted link iroh public-key dial · QUIC direct P2P · relay fallback Mac mini · inference server Orchestrator / gateway Python · agent loop + store STT · Parakeet V3 FluidAudio · Neural Engine LLM · Gemma 4 26B MoE Ollama / MLX · OpenAI API TTS · Kokoro Kokoro-FastAPI · streaming Chats · projects · memory audio ▸ ◂ audio + tool calls
SOTTO app (phone) Encrypted transport Speech models Mac compute

04 · Networking

Dial the Mac by key, not IP

The phone reaches the Mac through iroh: a direct, end-to-end-encrypted peer-to-peer link where you dial the Mac by its public key rather than an IP address. It runs inside the app, not as a system VPN, so it never touches Android's single VPN slot — which leaves SOTTO free of any particular network setup.

FIG 2 Transport & traffic split
Phone two independent paths Your VPN any provider · or none anonymized egress Public internet sees the VPN's IP Mac mini · home iroh peer · dial by key no public endpoint iroh relay · fallback self-host · ciphertext other apps ▸ SOTTO ▸ iroh direct
iroh link · encrypted P2P Your hardware General traffic · your VPN
Decision · transport The app links to the Mac with iroh, dialing its public key over a direct encrypted QUIC connection that punches through NAT, with a self-hostable relay as fallback. Because iroh lives in the app rather than the OS, the phone's one VPN slot stays free: run whatever commercial VPN you like for general browsing, or none, and exclude the SOTTO app from it via split tunneling so the link stays direct. Reaching your own home needs no anonymizing.

No coordinator, no shared secret

iroh needs no always-on service brokering your keys. Peers authenticate each other directly by public key, and the only optional infrastructure is a relay — used when two peers cannot punch a direct path — which forwards ciphertext and can be a free community relay or one you self-host. For a self-hosted setup that means no third party sits astride your identity the way a hosted control plane would: you hold the keys, and the Mac still exposes no public endpoint. Anonymizing your general internet traffic, if you want it, is a separate and independent choice rather than part of the link.

05 · Request lifecycle

One utterance, end to end

A turn-based loop. The app wakes, captures until you stop talking, and ships compressed audio through the tunnel. The Mac transcribes, reasons, optionally calls back to run an action, then streams spoken audio home. Figures are rough first-token and first-audio estimates over LAN.

FIG 3 Request lifecycle & latency budget
1 · WAKEapp wake word~0 2 · CAPTURE+VADendpointing+300ms tail 3 · LINK ▸iroh5–30ms LAN 4 · STTParakeet V3~100–200ms 5 · LLM (+tools)Gemma 4 26B MoETTFT ~400ms 6 · TTS ▸ streamKokorochunk ~150ms 7 · PLAYBACKapp speaks + actsaudible Perceived first audio ≈ 1.5 – 2.5s for a short command · streaming hides generation time
The single most important trick The TTS streams. The system does not wait for the full LLM response before speaking; tokens pipe into Kokoro sentence-by-sentence so the first words are audible while the model is still generating. The model is instructed with plain speech only, no markdown so nothing pollutes the spoken output.

06 · The compute

Compute: the Mac mini M4 Pro

Unified memory lets the GPU address the whole RAM pool, so a model's footprint is bounded by total memory, not a separate VRAM budget. For an always-on server, low idle power and a silent, palm-sized chassis matter as much as raw speed.

ConfigUnified RAM~PriceVerdict
M4 (base)up to 32GB$599–999Runs 26B MoE + speech. Tight but workable.
M4 Pro ◂ chosen48GB (max)$1,999 / $2,199The platform. Holds 31B Dense + STT + TTS + macOS with margin.
M5 / M5 ProTBATBASlipped past WWDC; expected fall 2026. A documented later upgrade.

The compute platform is a Mac mini M4 Pro with 48GB of unified memory and a 512GB SSD, about $1,999. It holds the 31B Dense model (~19GB at Q4) alongside the full speech pipeline and macOS with comfortable margin. Because Gemma 4 caps at 31B, more memory would only matter for a larger model family such as a 70B.

Why the Mac, not a GPU tower A single used RTX 3090 build is faster and slightly cheaper to buy (~$1,700) but idles near 80W. At San Diego's ~45¢/kWh that is roughly $346/year to run, versus about $29/year for the Mac mini, which pays back the higher sticker in about a year. The 128GB Strix Halo route (room for a 70B) only makes sense alongside models larger than Gemma 4, and at ~$4,400 it is hard to justify here.
Decision · buy now, upgrade later SOTTO runs on the 48GB M4 Pro, available now. The M5 mini slipped past WWDC and is expected in fall 2026; its real LLM uplift is moderate (~15 to 25 percent, mostly wider memory bandwidth), and the "4x AI" headline refers to the Neural Engine, not token throughput. The M5 Pro is therefore a documented future upgrade path, not a blocker: Apple Silicon holds value, so a later swap costs only modest depreciation. The 64GB SKU was pulled in the DRAM shortage, fixing 48GB as the ceiling.

07 · The brain

Gemma 4, and which variant

Gemma 4 (April 2026, Apache 2.0) is the standout bang-for-size open family. The choice between sizes is a latency-versus-quality dial, and SOTTO uses three points on it.

VariantProfileRole
E2B / E4B2B / 4B edge · native audio inOn-device fallback. Lives on the phone for no-network use; audio input can skip a separate STT step.
26B MoE ◂ primary3.8B active paramsConversational default. Low active-parameter count means fast tokens for real-time dialogue.
31B Densefrontier-for-sizeQuality tier. For harder reasoning or tool planning, at a bit more latency.

The 26B MoE is the everyday model. The orchestrator routes to 31B Dense for complex requests, and E4B runs on the phone as the offline safety net.

Serving Serving runs on Ollama (ollama run gemma4, OpenAI-compatible) and graduates to MLX / mlx-lm on the hot path for higher tokens-per-second on Apple Silicon. Both run natively on the Mac for Metal acceleration, a constraint the deployment section turns on.

08 · The ears & the voice

Speech-to-text and text-to-speech

STT · Parakeet V3, with a Whisper fallback

NVIDIA Parakeet V3 is the speed leader on Apple Silicon: it predicts the whole utterance at once for ~80ms-class latency and runs on the Neural Engine via FluidAudio or a parakeet-mlx server. whisper.cpp (Large V3 Turbo, Metal) is the multilingual fallback, and Moonshine is the path to true streaming later.

TTS · Kokoro

Kokoro is an 82M-parameter model that sounds close to far larger systems and runs faster than real-time. It is served through Kokoro-FastAPI, which exposes a streaming OpenAI-compatible /v1/audio/speech endpoint so playback starts on the first sentence.

Proven reference A documented March 2026 build (whisper.cpp + Ollama + Kokoro on Apple Silicon) reports sub-3-second conversational turns fully offline. SOTTO upgrades each stage and adds the encrypted phone link, so the latency budget is realistic rather than aspirational.
Part II
The Client & Stack

The native app you touch, the permissions you control, the way conversations are organized, and a stack anyone can self-host.

09 · The app stack

Native Kotlin, not a cross-platform framework

SOTTO lives on deep OS integration: the assistant role, a foreground wake-word service, raw audio capture, Shizuku, the notification listener, telephony. That is exactly where cross-platform frameworks pay a bridge tax for no benefit, especially for an Android-only app on GrapheneOS.

OptionFitVerdict
Native Kotlin + Compose ◂ chosenDirect access to every Android API; Google's first-class stack with long-term support.The client. Right for a deeply OS-integrated, Android-only assistant.
FlutterOwn rendering engine; deep integrations go through platform channels, a known pain point. Dart is a separate language.Wrong fit for a system-level assistant.
Kotlin MultiplatformShares logic while keeping native UI. Only pays off alongside an iOS build.Overkill now; revisit only for a second platform.
Decision · client The client is Kotlin + Jetpack Compose with Material 3 theming. Business logic (chat store, tool registry, sync) stays in clean Kotlin so a future Compose Multiplatform port would be UI-only. No Flutter, no React Native.

10 · App architecture

Software architecture

The app is a thin, smart client. The Mac holds the models and the canonical chat, project, and memory store; the app owns capture, playback, the UI, and turning tool calls into device actions. At a glance it is four layers:

UI · Jetpack Compose
Voice overlayChat + projectsTools & permissionsSetup / connect

▲ renders state, sends user intent down ▼

Domain · Kotlin
Voice loop controllerChat / project repoTool registry + consentSync engine

▲ orchestrates, persists, calls platform + server ▼

Platform services · Android
Assistant roleWake-word serviceAudioRecord + VADPlaybackShizuku bridgeSMS · alarms · intents

▲ talks to the device · talks to the Mac over the tunnel ▼

Data · local cache + server
Room (offline cache)Transport (iroh)Mac API: STT · LLM · TTS · store
One model, two modes The voice assistant and the text chatbot are the same conversation engine with different front ends. A voice turn is a chat turn whose input arrived as audio and whose reply is spoken. Both read and write the same chats, projects, and memory.

Pattern: unidirectional state

Each screen follows unidirectional data flow (MVI). A ViewModel exposes a single immutable UiState as a StateFlow; Compose renders it and sends user actions back up as intents; one-shot effects (navigation, errors, a "permission needed" prompt) travel on a separate event channel. For a stateful voice loop this keeps every transition explicit and testable, the UI is a pure function of state.

Module layout

One Gradle module to start, packaged so feature modules can split out later without touching the domain.

app/src/main/java/…/sotto/
├── ui/            # Compose screens + ViewModels (MVI)
│   ├── voice/  chat/  tools/  setup/
│   └── theme/
├── domain/        # framework-free logic, pure Kotlin
│   ├── VoiceLoopController   # the state machine
│   ├── ConversationRepository# chats/projects, offline-first
│   ├── ToolExecutor + ToolRegistry
│   └── SyncEngine
├── data/
│   ├── api/       # iroh client, OpenAI-compatible DTOs, streams
│   ├── db/        # Room entities + DAOs
│   └── settings/  # DataStore: tool toggles, server, prefs
├── audio/         # AudioEngine: AudioRecord, VAD, AudioTrack
├── platform/      # assistant role, WakeWordService, ShizukuBridge, intents
└── di/            # Hilt modules
FIG 4 App component architecture
UI · Compose Screens voice·chat·tools·setup ViewModels UiState · intents Domain VoiceLoopController ConversationRepository ToolExecutor SyncEngine Data & platform AudioEngine · I/O ApiClient · iroh Room + DataStore ShizukuBridge + intents Mac orchestrator over iroh Android OS / device mic · speaker · system calls

The voice loop is a state machine

The heart of the app is a VoiceLoopController driving an explicit state machine. A turn moves Idle → Listening → Transcribe → Thinking → Speaking and back, with a side excursion to Executing whenever the model calls a tool. Every active state owns a coroutine scope, so a barge-in, you start talking again or tap cancel, tears down the scope and snaps back to Idle from any state. If the tunnel is down, Thinking is served by the on-device E4B model instead of the Mac, so the loop still completes.

FIG 5 Voice loop state machine
IDLE LISTENING capture + VAD TRANSCRIBE stream ▸ STT THINKING LLM tokens SPEAKING stream ▸ TTS wake/PTT silence text tokens EXECUTING tool + confirm tool call result done ▸ IDLE barge-in / cancel from any active state ▸ IDLE

Talking to the Mac: streaming both ways

The ApiClient dials the Mac's iroh endpoint by public key and speaks the orchestrator's OpenAI-compatible protocol over the encrypted QUIC connection. Replies stream back token-by-token so Speaking can start on the first sentence; captured audio uploads as a chunked stream to STT; synthesized speech returns as a streamed response played incrementally through AudioTrack. Each leg is a Kotlin Flow, so cancellation propagates end-to-end the instant a turn is abandoned.

Tool execution

Every tool implements a small interface, an id, a one-line description, a risk tier, the Android permission or Shizuku capability it needs, and a suspending execute(args). The ToolRegistry holds them; ToolExecutor resolves a call against the per-chat enabled set (with project overrides) read from DataStore, prompts for confirmation when the tier requires it, then dispatches to a native API or the ShizukuBridge. Only enabled tools are serialized into the request as callable functions; disabled ones travel as name-and-description only, per the two-tier exposure in FIG 6.

Data & sync

ConversationRepository is the single source the UI reads, and it reads offline-first from Room. Entities mirror the Mac's store, projects, chats, messages, files, and writes apply locally first, then reconcile through SyncEngine (a WorkManager job plus a foreground coroutine while a chat is open) on a last-write-wins basis. Tool toggles, the server address, and preferences live in DataStore rather than Room, since they are small and read on every request.

Concurrency

Everything asynchronous is coroutines and Flow under structured concurrency. Audio capture and playback run on a dedicated high-priority dispatcher, network and disk on IO, and the active turn lives in a session scope the controller cancels on barge-in or teardown, no orphaned streams, no leaked recorders.

Key libraries

ConcernLibrary
UIJetpack Compose · Material 3
Dependency injectionHilt
Async & streamsKotlin Coroutines · Flow
Transportiroh (P2P QUIC) · token streaming
Local dataRoom (conversations) · DataStore (settings & toggles)
Background syncWorkManager
Audio I/OAudioRecord + AudioTrack (Oboe if latency demands)
Wake wordONNX Runtime Mobile / LiteRT (openWakeWord)
Elevated actionsShizuku API
Serializationkotlinx.serialization

11 · Wireframes

Four core screens

The voice overlay invoked by gesture, the chat-and-projects workspace, the tool-and-permission control panel, and first-run setup. Color is meaningful: teal for connected and safe, amber for the active model and primary actions, coral for sensitive permissions.

12 · Tools & the permission model

You decide what it can touch

The assistant is only ever as capable as the switches you flip. Every tool maps to a concrete Android permission or Shizuku grant, carries a risk level, and is off by default. The model is aware of every tool by name, but only the ones you enable are handed to it as callable functions; the rest it can see but never invoke.

FIG 6 Tool exposure & consent flow
Per-request tool context Enabled tools full schema · callable Disabled tools name + blurb · not callable MODEL sees both callable ▸ awareness only ▸ When the model calls an enabled tool 1 · MODELcalls enabled tool 2 · EXECUTORapp receives 3 · CONFIRMhigh-risk → ask you 4 · EXECUTEnative / Shizuku 5 · RESULT▸ back to model A disabled capability is never callable — the model tells you to enable it in the SOTTO app.

Two-tier exposure

Only enabled tools are passed to the model as callable function schemas, which keeps the callable surface small, the context lean, and tool selection accurate. Disabled tools are not callable at all: the model receives only their name and a one-line description, flagged off. So when you ask for something it cannot do, it says so plainly, for example "I don't have access to Send SMS right now; you can enable it in the SOTTO app's Tools screen," rather than inventing a call it cannot make.

ToolBacked byRisk
Send SMS · read notifications · place callsSMS / notification listener / phone permissionsmed–high
Alarms & timers · calendar read/writeAlarmManager · calendar permissionlow–med
Open app · navigation · share · web searchAndroid intents · server searchlow
Clipboard · battery · media · flashlightNative APIslow
Toggle settings · location · app controlShizuku (ADB-level, no root)high
Knowledge / RAG over your docsServer-side, on the Maclow
Safety posture The tool list is a contract you author, not a vendor default. It pairs with running the app under tight Storage and Contact Scopes, and with keeping UI-driving accessibility automation out of the design, since GrapheneOS recommends against third-party accessibility services. A misbehaving tool call cannot reach what you did not grant.

13 · Chats, projects & memory

Three scopes of context

Conversations are organized like a workspace, and what the model knows in any turn is assembled from nested scopes. The Mac holds the canonical store; the app caches it for offline reading and syncs over the tunnel.

Scope 1 · Chat session
message historysliding context windowattached files
Scope 2 · Project
project instructionsshared files / RAGgrouped chatstool overrides
Scope 3 · Global memory · future release
durable facts & preferencescross-chat recalluser-managed: view / edit / delete

The store

A SQLite database (Postgres in the container stack) on the Mac, owned by the orchestrator, is the source of truth. The app mirrors what it needs into Room for offline reading and queues writes back on reconnect, last-write-wins.

FIG 7 Data model
projects id · pk name instructions tool_overrides chats id · pk project_id · fk? title created_at messages id · pk chat_id · fk role · content ts files id · scope (proj / chat) path · embedding memories · future id · scope (global / proj) content · embedding 1 ─ N 1 ─ N 1 ─ N scoped

How a turn is assembled

The orchestrator composes the prompt from the inside out: system prompt, the enabled tools as callable functions plus the disabled ones as a visible-only catalog, then project instructions if the chat belongs to one, then relevant global memories (future), then retrieved RAG chunks from project files or past chats, then the recent message window. Everything stays on your hardware; "memory" is a row in your database, not a vendor feature.

Why memory is a later release Cross-chat memory is the feature most likely to leak context between contexts, so it ships after the scoping and permission model are solid. When it lands it is fully user-managed: every item visible, editable, deletable, and scopable per project so Work facts never surface in a Home chat.

14 · The deployable stack

One command for everyone else

The server ships two ways: a native Mac app for the best experience on Apple Silicon, and a Docker Compose stack for Linux and the portable reference. The split exists because of one hard constraint.

The constraint that shapes everything Docker on macOS cannot access the Apple GPU. It is an Apple Hypervisor.framework limitation, not an Ollama bug, and it holds in 2026 even on M5: an LLM inside a Mac container falls back to CPU and runs 5 to 6 times slower. The resolution: run the model engine natively, containerize only the non-GPU services.
TargetWhat runs whereFor whom
Mac app ◂ best on Apple SiliconA SwiftUI menu-bar supervisor bundles and launches native Ollama (Metal) and native Parakeet (Neural Engine), and runs Kokoro, the orchestrator, and the tunnel client alongside. One .dmg, open, done.Non-technical self-hosters; Mac mini owners who want top speed.
Docker ComposeOn Linux + NVIDIA: everything containerized with in-container GPU. On Mac: containers run orchestrator + Kokoro and reach host-native Ollama via host.docker.internal.Linux hosts, tinkerers, the portable reference build.

Repository layout

# one repo, both install paths
sotto/
├── app/                 # Android · Kotlin + Compose
├── server/
│   ├── orchestrator/    # Python FastAPI: agent loop + chat/project/memory store
│   ├── tts/             # Kokoro-FastAPI
│   ├── stt/             # whisper.cpp server (Parakeet runs native on Mac)
│   └── docker-compose.yml
├── mac-app/             # SwiftUI supervisor; bundles native Ollama + Parakeet
├── deploy/
│   └── compose.linux-nvidia.yml   # full in-container GPU
└── README.md

The compose shape

services:
  orchestrator:   # agent loop + OpenAI-compatible front door
  tts:            # kokoro-fastapi
  stt:            # whisper.cpp (Linux GPU) — Parakeet/ANE on Mac instead
  # ollama: in-container on Linux/NVIDIA; on Mac point to
  # OLLAMA_HOST=http://host.docker.internal:11434 (native, Metal)

A Linux user with an NVIDIA card runs docker compose up and the whole thing comes alive, GPU included. A Mac user opens the app, or runs ollama serve natively with docker compose up for the rest. The orchestrator speaks the OpenAI API, so contributors swap any piece without touching the app.

15 · Build order

A roadmap that always leaves something working

0

Link first

The iroh link between phone and Mac, dialing by key, the phone reaching the Mac directly from cellular with relay fallback. Nothing else works until this does.

1

Server text loop

Ollama plus Gemma 4 and the Python orchestrator with the chat/project store and an OpenAI-compatible endpoint, hit from the phone with a one-line script.

2

Chatbot app

The Compose UI, chats and projects synced to the Mac, text in and out. A clean chatbot before it ever hears you.

3

Voice loop

Parakeet and Kokoro on the Mac; the app captures and plays audio. Push-to-talk first, then openWakeWord in a foreground service.

4

Assistant role + tools

The assistant role for gesture invocation, the tool registry, and the permissions screen, wiring native actions and the Shizuku bridge behind their toggles.

5

Package & open-source

The server wrapped as the Mac app and the compose files, with quickstarts, pinned versions, and defaults. The public release.

6

Global memory

The future release: the memories table, retrieval, and the management UI, scoped per project, fully user-controlled.

16 · Security & privacy posture

Keeping the threat surface small

17 · Bill of materials

Everything, with links

Compute
512GB SSD, ~$1,999.
LLM
Apache 2.0 open weights.
Serving
Native, Metal-accelerated.
STT
Neural Engine + fallback.
TTS
Streaming synthesis.
Transport
Direct encrypted P2P link, self-hostable relay. Free.
App
Native Android client.
Elevated access
ADB-level APIs, no root.
Wake word
On-device activation.
Reference
Profiles, scopes, posture.