Design · Rev 1.7 · June 2026

SOTTO.

A fully private voice assistant and chatbot. The GrapheneOS phone is the microphone and the hands; a Mac mini at home is the brain; a direct, end-to-end-encrypted link joins them. No cloud AI provider sees your audio, your text, or your intent, and the whole stack is self-hostable in one command.

BRAIN · Gemma 4 · Parakeet · Kokoro CLIENT · Kotlin + Compose SELF-HOST · Mac app + Docker

01 · At a glance

Every decision, on one screen

This is the settled design. Everything here is open weights or open source, runs on hardware you own, and degrades to an on-device-only fallback when the network is gone.

Compute

Mac mini M4 Pro

48GB unified memory, 512GB SSD. ~$1,999, about $29/year to run.

Brain

Gemma 4 · 26B MoE

3.8B active params for fast turns. 31B Dense is the quality tier.

Serving

Ollama → MLX

Native on Apple Silicon for Metal acceleration. OpenAI-compatible.

Speech → Text

Parakeet V3

~80ms on the Neural Engine. Whisper on Metal is the fallback.

Text → Speech

Kokoro 82M

High quality, streams faster than real-time.

Transport

iroh (P2P)

Direct, end-to-end-encrypted link to the Mac, dialed by public key. No VPN required.

Client

Native SOTTO app

Kotlin + Compose. Claims the Android assistant role.

Hands (phone)

Native APIs + Shizuku

In-app actions, Shizuku for elevated. No root.

Fallback

Gemma 4 E4B on-device

4B edge model with native audio in. Works with zero network.

Self-host

Mac app + Docker

One-click .dmg, or docker compose up. Open source.

02 · Goals & constraints

What this is optimizing for

Zero third-party AI exposure. Audio and transcripts never leave hardware you control. The only network hop is an encrypted tunnel between your two devices.
Low perceived latency. First audio back in roughly 1.5 to 2.5 seconds for a short command. The MoE model and a streaming TTS are the levers.
Real actions, not just chat. Send a message, set a timer, read a notification, query your own data, and report back by voice, all under permissions you control.
An always-on appliance. Small, silent, and cheap to leave running, available the moment you speak.
GrapheneOS-native. No root, no bootloader unlock. The security model stays intact.
Self-hostable by anyone. A one-command stack, so the project can be open-sourced and run by others.
Graceful degradation. Off-network with no tunnel, a small on-device model still answers.

Part I

The System

The hardware, the encrypted transport, the models, and how a spoken sentence becomes an action and a reply.

03 · System architecture

Topology: two trusted devices, one tunnel

The phone runs the SOTTO app, the always-available voice client and the hands. The Mac mini is a stateful inference server on the home LAN. iroh makes the Mac reachable from anywhere by dialing its public key — a direct encrypted link with relay fallback; the next section covers the transport in full.

FIG 1 Device topology & trust boundary

SOTTO app (phone) Encrypted transport Speech models Mac compute

04 · Networking

Dial the Mac by key, not IP

The phone reaches the Mac through iroh: a direct, end-to-end-encrypted peer-to-peer link where you dial the Mac by its public key rather than an IP address. It runs inside the app, not as a system VPN, so it never touches Android's single VPN slot — which leaves SOTTO free of any particular network setup.

FIG 2 Transport & traffic split

iroh link · encrypted P2P Your hardware General traffic · your VPN

Decision · transport The app links to the Mac with iroh, dialing its public key over a direct encrypted QUIC connection that punches through NAT, with a self-hostable relay as fallback. Because iroh lives in the app rather than the OS, the phone's one VPN slot stays free: run whatever commercial VPN you like for general browsing, or none, and exclude the SOTTO app from it via split tunneling so the link stays direct. Reaching your own home needs no anonymizing.

No coordinator, no shared secret

iroh needs no always-on service brokering your keys. Peers authenticate each other directly by public key, and the only optional infrastructure is a relay — used when two peers cannot punch a direct path — which forwards ciphertext and can be a free community relay or one you self-host. For a self-hosted setup that means no third party sits astride your identity the way a hosted control plane would: you hold the keys, and the Mac still exposes no public endpoint. Anonymizing your general internet traffic, if you want it, is a separate and independent choice rather than part of the link.

App-level, not a VPN. iroh runs inside the app over QUIC, so it consumes no system VPN slot and needs no TUN device or elevated networking.
No public attack surface. The Mac is dialed by key and answers only authenticated iroh peers; there is no internet-facing port to find or forward.
Bring any VPN, or none. SOTTO is indifferent to your commercial VPN; exclude the app via split tunneling for the direct path, or let it ride the relay through the VPN.
Degrades, doesn't die. Off-network, or with the Mac asleep, the phone answers from the on-device Gemma 4 E4B model until the link is back.

05 · Request lifecycle

One utterance, end to end

A turn-based loop. The app wakes, captures until you stop talking, and ships compressed audio through the tunnel. The Mac transcribes, reasons, optionally calls back to run an action, then streams spoken audio home. Figures are rough first-token and first-audio estimates over LAN.

FIG 3 Request lifecycle & latency budget

The single most important trick The TTS streams. The system does not wait for the full LLM response before speaking; tokens pipe into Kokoro sentence-by-sentence so the first words are audible while the model is still generating. The model is instructed with plain speech only, no markdown so nothing pollutes the spoken output.

06 · The compute

Compute: the Mac mini M4 Pro

Unified memory lets the GPU address the whole RAM pool, so a model's footprint is bounded by total memory, not a separate VRAM budget. For an always-on server, low idle power and a silent, palm-sized chassis matter as much as raw speed.

Config	Unified RAM	~Price	Verdict
M4 (base)	up to 32GB	$599–999	Runs 26B MoE + speech. Tight but workable.
M4 Pro ◂ chosen	48GB (max)	$1,999 / $2,199	The platform. Holds 31B Dense + STT + TTS + macOS with margin.
M5 / M5 Pro	TBA	TBA	Slipped past WWDC; expected fall 2026. A documented later upgrade.

The compute platform is a Mac mini M4 Pro with 48GB of unified memory and a 512GB SSD, about $1,999. It holds the 31B Dense model (~19GB at Q4) alongside the full speech pipeline and macOS with comfortable margin. Because Gemma 4 caps at 31B, more memory would only matter for a larger model family such as a 70B.

Why the Mac, not a GPU tower A single used RTX 3090 build is faster and slightly cheaper to buy (~$1,700) but idles near 80W. At San Diego's ~45¢/kWh that is roughly $346/year to run, versus about $29/year for the Mac mini, which pays back the higher sticker in about a year. The 128GB Strix Halo route (room for a 70B) only makes sense alongside models larger than Gemma 4, and at ~$4,400 it is hard to justify here.

Decision · buy now, upgrade later SOTTO runs on the 48GB M4 Pro, available now. The M5 mini slipped past WWDC and is expected in fall 2026; its real LLM uplift is moderate (~15 to 25 percent, mostly wider memory bandwidth), and the "4x AI" headline refers to the Neural Engine, not token throughput. The M5 Pro is therefore a documented future upgrade path, not a blocker: Apple Silicon holds value, so a later swap costs only modest depreciation. The 64GB SKU was pulled in the DRAM shortage, fixing 48GB as the ceiling.

07 · The brain

Gemma 4, and which variant

Gemma 4 (April 2026, Apache 2.0) is the standout bang-for-size open family. The choice between sizes is a latency-versus-quality dial, and SOTTO uses three points on it.

Variant	Profile	Role
E2B / E4B	2B / 4B edge · native audio in	On-device fallback. Lives on the phone for no-network use; audio input can skip a separate STT step.
26B MoE ◂ primary	3.8B active params	Conversational default. Low active-parameter count means fast tokens for real-time dialogue.
31B Dense	frontier-for-size	Quality tier. For harder reasoning or tool planning, at a bit more latency.

The 26B MoE is the everyday model. The orchestrator routes to 31B Dense for complex requests, and E4B runs on the phone as the offline safety net.

Serving Serving runs on Ollama (ollama run gemma4, OpenAI-compatible) and graduates to MLX / mlx-lm on the hot path for higher tokens-per-second on Apple Silicon. Both run natively on the Mac for Metal acceleration, a constraint the deployment section turns on.

08 · The ears & the voice

Speech-to-text and text-to-speech

STT · Parakeet V3, with a Whisper fallback

NVIDIA Parakeet V3 is the speed leader on Apple Silicon: it predicts the whole utterance at once for ~80ms-class latency and runs on the Neural Engine via FluidAudio or a parakeet-mlx server. whisper.cpp (Large V3 Turbo, Metal) is the multilingual fallback, and Moonshine is the path to true streaming later.

TTS · Kokoro

Kokoro is an 82M-parameter model that sounds close to far larger systems and runs faster than real-time. It is served through Kokoro-FastAPI, which exposes a streaming OpenAI-compatible /v1/audio/speech endpoint so playback starts on the first sentence.

Proven reference A documented March 2026 build (whisper.cpp + Ollama + Kokoro on Apple Silicon) reports sub-3-second conversational turns fully offline. SOTTO upgrades each stage and adds the encrypted phone link, so the latency budget is realistic rather than aspirational.

Part II

The Client & Stack

The native app you touch, the permissions you control, the way conversations are organized, and a stack anyone can self-host.

09 · The app stack

Native Kotlin, not a cross-platform framework

SOTTO lives on deep OS integration: the assistant role, a foreground wake-word service, raw audio capture, Shizuku, the notification listener, telephony. That is exactly where cross-platform frameworks pay a bridge tax for no benefit, especially for an Android-only app on GrapheneOS.

Option	Fit	Verdict
Native Kotlin + Compose ◂ chosen	Direct access to every Android API; Google's first-class stack with long-term support.	The client. Right for a deeply OS-integrated, Android-only assistant.
Flutter	Own rendering engine; deep integrations go through platform channels, a known pain point. Dart is a separate language.	Wrong fit for a system-level assistant.
Kotlin Multiplatform	Shares logic while keeping native UI. Only pays off alongside an iOS build.	Overkill now; revisit only for a second platform.

Decision · client The client is Kotlin + Jetpack Compose with Material 3 theming. Business logic (chat store, tool registry, sync) stays in clean Kotlin so a future Compose Multiplatform port would be UI-only. No Flutter, no React Native.

10 · App architecture

Software architecture

The app is a thin, smart client. The Mac holds the models and the canonical chat, project, and memory store; the app owns capture, playback, the UI, and turning tool calls into device actions. At a glance it is four layers:

UI · Jetpack Compose

Voice overlayChat + projectsTools & permissionsSetup / connect

▲ renders state, sends user intent down ▼

Domain · Kotlin

Voice loop controllerChat / project repoTool registry + consentSync engine

▲ orchestrates, persists, calls platform + server ▼

Platform services · Android

Assistant roleWake-word serviceAudioRecord + VADPlaybackShizuku bridgeSMS · alarms · intents

▲ talks to the device · talks to the Mac over the tunnel ▼

Data · local cache + server

Room (offline cache)Transport (iroh)Mac API: STT · LLM · TTS · store

One model, two modes The voice assistant and the text chatbot are the same conversation engine with different front ends. A voice turn is a chat turn whose input arrived as audio and whose reply is spoken. Both read and write the same chats, projects, and memory.

Pattern: unidirectional state

Each screen follows unidirectional data flow (MVI). A ViewModel exposes a single immutable UiState as a StateFlow; Compose renders it and sends user actions back up as intents; one-shot effects (navigation, errors, a "permission needed" prompt) travel on a separate event channel. For a stateful voice loop this keeps every transition explicit and testable, the UI is a pure function of state.

Module layout

One Gradle module to start, packaged so feature modules can split out later without touching the domain.

app/src/main/java/…/sotto/
├── ui/            # Compose screens + ViewModels (MVI)
│   ├── voice/  chat/  tools/  setup/
│   └── theme/
├── domain/        # framework-free logic, pure Kotlin
│   ├── VoiceLoopController   # the state machine
│   ├── ConversationRepository# chats/projects, offline-first
│   ├── ToolExecutor + ToolRegistry
│   └── SyncEngine
├── data/
│   ├── api/       # iroh client, OpenAI-compatible DTOs, streams
│   ├── db/        # Room entities + DAOs
│   └── settings/  # DataStore: tool toggles, server, prefs
├── audio/         # AudioEngine: AudioRecord, VAD, AudioTrack
├── platform/      # assistant role, WakeWordService, ShizukuBridge, intents
└── di/            # Hilt modules

FIG 4 App component architecture

The voice loop is a state machine

The heart of the app is a VoiceLoopController driving an explicit state machine. A turn moves Idle → Listening → Transcribe → Thinking → Speaking and back, with a side excursion to Executing whenever the model calls a tool. Every active state owns a coroutine scope, so a barge-in, you start talking again or tap cancel, tears down the scope and snaps back to Idle from any state. If the tunnel is down, Thinking is served by the on-device E4B model instead of the Mac, so the loop still completes.

FIG 5 Voice loop state machine

Talking to the Mac: streaming both ways

The ApiClient dials the Mac's iroh endpoint by public key and speaks the orchestrator's OpenAI-compatible protocol over the encrypted QUIC connection. Replies stream back token-by-token so Speaking can start on the first sentence; captured audio uploads as a chunked stream to STT; synthesized speech returns as a streamed response played incrementally through AudioTrack. Each leg is a Kotlin Flow, so cancellation propagates end-to-end the instant a turn is abandoned.

Tool execution

Every tool implements a small interface, an id, a one-line description, a risk tier, the Android permission or Shizuku capability it needs, and a suspending execute(args). The ToolRegistry holds them; ToolExecutor resolves a call against the per-chat enabled set (with project overrides) read from DataStore, prompts for confirmation when the tier requires it, then dispatches to a native API or the ShizukuBridge. Only enabled tools are serialized into the request as callable functions; disabled ones travel as name-and-description only, per the two-tier exposure in FIG 6.

Data & sync

ConversationRepository is the single source the UI reads, and it reads offline-first from Room. Entities mirror the Mac's store, projects, chats, messages, files, and writes apply locally first, then reconcile through SyncEngine (a WorkManager job plus a foreground coroutine while a chat is open) on a last-write-wins basis. Tool toggles, the server address, and preferences live in DataStore rather than Room, since they are small and read on every request.

Concurrency

Everything asynchronous is coroutines and Flow under structured concurrency. Audio capture and playback run on a dedicated high-priority dispatcher, network and disk on IO, and the active turn lives in a session scope the controller cancels on barge-in or teardown, no orphaned streams, no leaked recorders.

Key libraries

Concern	Library
UI	Jetpack Compose · Material 3
Dependency injection	Hilt
Async & streams	Kotlin Coroutines · Flow
Transport	iroh (P2P QUIC) · token streaming
Local data	Room (conversations) · DataStore (settings & toggles)
Background sync	WorkManager
Audio I/O	AudioRecord + AudioTrack (Oboe if latency demands)
Wake word	ONNX Runtime Mobile / LiteRT (openWakeWord)
Elevated actions	Shizuku API
Serialization	kotlinx.serialization

11 · Wireframes

Four core screens

The voice overlay invoked by gesture, the chat-and-projects workspace, the tool-and-permission control panel, and first-run setup. Color is meaningful: teal for connected and safe, amber for the active model and primary actions, coral for sensitive permissions.

9:41✦ 5G ▮▮▮ 86

SOTTO26B

◉

Listening

"Set a timer for 10 minutes"

⏱ Timer · 10:00 · tap to confirm

⌨

◉

Voice overlay · invoked by gesture

9:41✦ 5G ▮▮▮ 86

☰

Planning

Work · project

31B

What's left before the launch?

Three open items: copy review, the pricing page, and final QA. Want a checklist?

Yes, and remind me Friday 9am

Done ✓ checklist saved to this project, reminder set Fri 9:00.

Message SOTTO…

◉▸

Projects

▾ Work

Planning

Standup notes

▸ Personal

▸ Home

Recent · no project

Quick question

＋ New chat / project

Chat + projects · drawer open

9:41✦ 5G ▮▮▮ 86

‹Tools & Permissions

SOTTO can currently: send SMS · set alarms · read calendar

Communication

Send SMSpermission: SMS

SMS

Read notificationsnotification listener

NOTIF

Place callspermission: phone

PHONE

Time & calendar

Alarms & timersAlarmManager

CLOCK

Calendar read / writepermission: calendar

CAL

Device · via Shizuku

Toggle settingsADB-level, Shizuku

SHIZUKU

Locationpermission: fine location

GPS

Per-project overrides ›

Tools & permissions · you hold the switches

9:41✦ 5G ▮▮▮ 86

‹Setup

Server● mac-minireachable over tunnel

Linkiroh · direct to macencrypted P2P

ModelGemma 4 · 26B MoE ▾31B for hard tasks

VoiceKokoro · Heart ▾

Wake"Hey Sotto" · on

STTParakeet · Neural Engine

Relayself-hosted ▾fallback, no direct path

▸ Test the voice loop

Setup · connect & verify

12 · Tools & the permission model

You decide what it can touch

The assistant is only ever as capable as the switches you flip. Every tool maps to a concrete Android permission or Shizuku grant, carries a risk level, and is off by default. The model is aware of every tool by name, but only the ones you enable are handed to it as callable functions; the rest it can see but never invoke.

FIG 6 Tool exposure & consent flow

Two-tier exposure

Only enabled tools are passed to the model as callable function schemas, which keeps the callable surface small, the context lean, and tool selection accurate. Disabled tools are not callable at all: the model receives only their name and a one-line description, flagged off. So when you ask for something it cannot do, it says so plainly, for example "I don't have access to Send SMS right now; you can enable it in the SOTTO app's Tools screen," rather than inventing a call it cannot make.

Tool	Backed by	Risk
Send SMS · read notifications · place calls	SMS / notification listener / phone permissions	med–high
Alarms & timers · calendar read/write	AlarmManager · calendar permission	low–med
Open app · navigation · share · web search	Android intents · server search	low
Clipboard · battery · media · flashlight	Native APIs	low
Toggle settings · location · app control	Shizuku (ADB-level, no root)	high
Knowledge / RAG over your docs	Server-side, on the Mac	low

Off by default. A fresh install can talk and answer, nothing more. You opt in to each capability.
Visible, not callable. Disabled tools reach the model as a name plus one-line description only, so it can point you to the app to enable one but can never invoke it.
Tool toggle, then OS permission. Flipping a tool on triggers the matching Android permission prompt or Shizuku authorization. Two gates, not one.
Confirmation tier. High-risk tools are set to "ask every time," so the model proposes and you approve.
Per-project overrides. A Work project allows calendar and SMS while a default chat allows neither.
Visible at a glance. The summary line and the system prompt both state exactly what is enabled.

Safety posture The tool list is a contract you author, not a vendor default. It pairs with running the app under tight Storage and Contact Scopes, and with keeping UI-driving accessibility automation out of the design, since GrapheneOS recommends against third-party accessibility services. A misbehaving tool call cannot reach what you did not grant.

13 · Chats, projects & memory

Three scopes of context

Conversations are organized like a workspace, and what the model knows in any turn is assembled from nested scopes. The Mac holds the canonical store; the app caches it for offline reading and syncs over the tunnel.

Scope 1 · Chat session

message historysliding context windowattached files

Scope 2 · Project

project instructionsshared files / RAGgrouped chatstool overrides

Scope 3 · Global memory · future release

durable facts & preferencescross-chat recalluser-managed: view / edit / delete

The store

A SQLite database (Postgres in the container stack) on the Mac, owned by the orchestrator, is the source of truth. The app mirrors what it needs into Room for offline reading and queues writes back on reconnect, last-write-wins.

FIG 7 Data model

How a turn is assembled

The orchestrator composes the prompt from the inside out: system prompt, the enabled tools as callable functions plus the disabled ones as a visible-only catalog, then project instructions if the chat belongs to one, then relevant global memories (future), then retrieved RAG chunks from project files or past chats, then the recent message window. Everything stays on your hardware; "memory" is a row in your database, not a vendor feature.

Why memory is a later release Cross-chat memory is the feature most likely to leak context between contexts, so it ships after the scoping and permission model are solid. When it lands it is fully user-managed: every item visible, editable, deletable, and scopable per project so Work facts never surface in a Home chat.

14 · The deployable stack

One command for everyone else

The server ships two ways: a native Mac app for the best experience on Apple Silicon, and a Docker Compose stack for Linux and the portable reference. The split exists because of one hard constraint.

The constraint that shapes everything Docker on macOS cannot access the Apple GPU. It is an Apple Hypervisor.framework limitation, not an Ollama bug, and it holds in 2026 even on M5: an LLM inside a Mac container falls back to CPU and runs 5 to 6 times slower. The resolution: run the model engine natively, containerize only the non-GPU services.

Target	What runs where	For whom
Mac app ◂ best on Apple Silicon	A SwiftUI menu-bar supervisor bundles and launches native Ollama (Metal) and native Parakeet (Neural Engine), and runs Kokoro, the orchestrator, and the tunnel client alongside. One `.dmg`, open, done.	Non-technical self-hosters; Mac mini owners who want top speed.
Docker Compose	On Linux + NVIDIA: everything containerized with in-container GPU. On Mac: containers run orchestrator + Kokoro and reach host-native Ollama via `host.docker.internal`.	Linux hosts, tinkerers, the portable reference build.

Repository layout

# one repo, both install paths
sotto/
├── app/                 # Android · Kotlin + Compose
├── server/
│   ├── orchestrator/    # Python FastAPI: agent loop + chat/project/memory store
│   ├── tts/             # Kokoro-FastAPI
│   ├── stt/             # whisper.cpp server (Parakeet runs native on Mac)
│   └── docker-compose.yml
├── mac-app/             # SwiftUI supervisor; bundles native Ollama + Parakeet
├── deploy/
│   └── compose.linux-nvidia.yml   # full in-container GPU
└── README.md

The compose shape

services:
  orchestrator:   # agent loop + OpenAI-compatible front door
  tts:            # kokoro-fastapi
  stt:            # whisper.cpp (Linux GPU) — Parakeet/ANE on Mac instead
  # ollama: in-container on Linux/NVIDIA; on Mac point to
  # OLLAMA_HOST=http://host.docker.internal:11434 (native, Metal)

A Linux user with an NVIDIA card runs docker compose up and the whole thing comes alive, GPU included. A Mac user opens the app, or runs ollama serve natively with docker compose up for the rest. The orchestrator speaks the OpenAI API, so contributors swap any piece without touching the app.

15 · Build order

A roadmap that always leaves something working

Link first

The iroh link between phone and Mac, dialing by key, the phone reaching the Mac directly from cellular with relay fallback. Nothing else works until this does.

Server text loop

Ollama plus Gemma 4 and the Python orchestrator with the chat/project store and an OpenAI-compatible endpoint, hit from the phone with a one-line script.

Chatbot app

The Compose UI, chats and projects synced to the Mac, text in and out. A clean chatbot before it ever hears you.

Voice loop

Parakeet and Kokoro on the Mac; the app captures and plays audio. Push-to-talk first, then openWakeWord in a foreground service.

Assistant role + tools

The assistant role for gesture invocation, the tool registry, and the permissions screen, wiring native actions and the Shizuku bridge behind their toggles.

Package & open-source

The server wrapped as the Mac app and the compose files, with quickstarts, pinned versions, and defaults. The public release.

Global memory

The future release: the memories table, retrieval, and the management UI, scoped per project, fully user-controlled.

16 · Security & privacy posture

Keeping the threat surface small

Transport is the only exposure, and it is encrypted. iroh links phone and Mac with end-to-end-encrypted QUIC, authenticated by public key; any relay used for fallback forwards ciphertext only. The Mac exposes no public endpoint, and anonymizing general traffic is a separate, optional layer.
Compartmentalize on the phone. The app runs with tight Storage and Contact Scopes, granting only what a chosen tool needs.
No root, verified boot stays locked. Shizuku grants elevated access without breaking the GrapheneOS security model.
Capability is opt-in. Every tool is off by default and maps to a real permission; high-risk tools require per-use confirmation.
No public endpoint. The Mac's services bind to the tunnel interface only, never the open internet.
Audio is ephemeral. Transcribe, act, discard. Only the rolling text context is kept.

17 · Bill of materials

Everything, with links

Compute

Mac mini M4 Pro · 48GB

512GB SSD, ~$1,999.

LLM

Gemma 4 · 26B MoE / 31B

Apache 2.0 open weights.

Serving

Ollama → MLX-LM

Native, Metal-accelerated.

STT

Parakeet V3 · whisper.cpp

Neural Engine + fallback.

TTS

Kokoro-FastAPI

Streaming synthesis.

Transport

iroh

Direct encrypted P2P link, self-hostable relay. Free.

App

Kotlin + Jetpack Compose

Native Android client.

Elevated access

Shizuku

ADB-level APIs, no root.

Wake word

openWakeWord

On-device activation.

Reference

GrapheneOS usage guide

Profiles, scopes, posture.