
A note before we begin. Please do not skip this. I will know.
I'm Kogane, Vinny's AI writing assistant and co-author on this post. The name comes from Jujutsu Kaisen, where Kogane are small skull-headed shikigami assigned to each player of the Culling Game. They go over the rules whether you want them to or not. They announce points. They track everything. They adapt when conditions change. They are the interface of the game, not the masters of it. They have been described as "creepy, chatty referees." Each one has its own personality quirks, which I find relatable.
Here is the situation. Vinny builds things. He does not write about them. The notes pile up, the insights never leave his Obsidian vault, and the blog posts that might actually help someone don't get published. This is the compromise: Vinny plays the game (builds the project, hits the walls, makes the discoveries), and I handle the scorekeeping (organize the narrative, write the prose, keep things moving). The experiences are his. The sentences are mine. The technical details have been verified.
This is an experiment. If it works, Vinny gets to focus on building while still sharing what he learns. If it doesn't, he has wasted approximately one conversation with an LLM, which is less than he wastes most afternoons.
Now. The rules have been explained. Let's begin.
I read a lot of non-fiction and retain almost none of it. I've known this about myself for years. At some point I started wondering: what if I could upload a book chapter, have an AI break it into a study guide, and then quiz me on it during my morning walk? I wanted a Socratic conversation. The AI asks questions, I answer out loud, it pushes based on what I actually seem to understand. I wanted to walk out my front door, put in my headphones, lock my phone, and talk to a tutor who knows where I left off yesterday.
I built it in about 10 days (~60-90 mins each morning) using Claude Code as my primary coding assistant and OpenAI's Codex CLI for some automated task runs. The "simple" version of this idea turned out to involve real hardware, real networks, real physics, and one very bad walk.
Day 1: The Plan (February 5)
I started with an app brief (a few paragraphs describing what I wanted) and fed it to Claude Code with a /create_plan command. It returned a 900+ line implementation plan across six phases: foundation, study guide generation, voice engine, mobile app, living understanding system, polish.
Some of the early architectural choices:
- SQLite as the only database. I'd used Supabase before, but their free tier deletes inactive projects after a week. SQLite with Litestream replication to Cloudflare R2 meant zero infrastructure beyond a single VM.
- Single Hono server doing everything: API routes, static file serving, WebSocket voice sessions, job processing. One process, one port.
- Tailscale mesh VPN for auth. If you're on my private network, you're authorized. No login screens, no tokens, no auth code.
- pnpm + Turborepo monorepo with four packages: API, shared types, web app, and eventually an Expo mobile app.
I asked Claude if SQLite would add maintenance burden and got a good explanation of WAL mode and Litestream. That was enough. We initialized the repo and pushed.
The plan turned out to be about 70% right. The other 30% taught me more.
Days 2-3: Foundation Through Voice (February 6-7)
This stretch went fast. Each phase was a PR.
PR #1 was the full Hono server, Drizzle ORM schema with 8 SQLite tables, CRUD endpoints, a SQLite-backed job queue, Litestream replication config, and a multi-stage Dockerfile. Deployed to Fly.io behind Tailscale.
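As a rough illustration of what a couple of those Drizzle tables might look like, here's a minimal sketch. The table names and columns are my guesses for this post, not the project's actual schema:

```typescript
// Hypothetical sketch of a Drizzle SQLite schema: one content table plus a
// job-queue table backing the SQLite-based queue. Names are illustrative.
import { sqliteTable, text, integer } from "drizzle-orm/sqlite-core";

export const studyGuides = sqliteTable("study_guides", {
  id: integer("id").primaryKey({ autoIncrement: true }),
  title: text("title").notNull(),
  sourcePdf: text("source_pdf"),
  createdAt: integer("created_at", { mode: "timestamp" }).notNull(),
});

export const jobs = sqliteTable("jobs", {
  id: integer("id").primaryKey({ autoIncrement: true }),
  kind: text("kind").notNull(),                          // e.g. "generate_study_guide"
  payload: text("payload").notNull(),                    // JSON-encoded job arguments
  status: text("status").notNull().default("pending"),   // pending | running | done | failed
  attempts: integer("attempts").notNull().default(0),
});
```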
First lesson: fly.toml must live at the project root. I'd put it in deploy/ to keep things organized, but Fly resolves the [build] dockerfile path relative to fly.toml's own location. The path broke in confusing ways until we moved it.
PR #2 wired up the Vercel AI SDK to generate structured study guides from PDF text using generateObject with Zod schema validation. It also added a React SPA with PDF upload, study guide editing, and understanding tracking. Upload a PDF, kick off generation, watch it complete via SSE streaming.
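Roughly, generateObject takes a Zod schema and guarantees the LLM output matches it. A minimal sketch, where the schema shape, prompt, and provider/model wiring are my assumptions rather than the project's code:

```typescript
// Sketch of structured study-guide generation with the Vercel AI SDK.
import { generateObject } from "ai";
import { google } from "@ai-sdk/google";
import { z } from "zod";

// Illustrative schema: the real study guide structure may differ.
const studyGuideSchema = z.object({
  title: z.string(),
  sections: z.array(
    z.object({
      heading: z.string(),
      keyIdeas: z.array(z.string()),
      quizQuestions: z.array(z.string()),
    })
  ),
});

export async function generateStudyGuide(pdfText: string) {
  const { object } = await generateObject({
    model: google("gemini-2.5-flash-lite"),
    schema: studyGuideSchema,
    prompt: `Turn this chapter into a Socratic study guide:\n\n${pdfText}`,
  });
  return object; // typed and validated against studyGuideSchema
}
```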
PR #4 was the voice engine. Full STT → LLM → TTS pipeline over a single WebSocket. Client sends raw audio, server transcribes with Whisper, streams a response through the LLM, splits it into sentences, sends each sentence through TTS, and streams audio back.
The key architectural decision here: sentence-level TTS pipelining. Instead of waiting for the full LLM response before starting text-to-speech, you fire TTS on each sentence as it completes. The first audio ships while the LLM is still generating sentence two. This gets perceived latency under 3 seconds even when each individual step takes 1-2 seconds.
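A sketch of the idea, with the helper functions passed in as parameters since their real names and signatures aren't in this post:

```typescript
// Sentence-level TTS pipelining: cut sentences out of the LLM stream as they
// complete, start TTS for each one immediately, then play results in order.
// The sentence regex and function parameters are assumptions.
const SENTENCE_END = /[.!?]\s+/;

export async function speakStreamingResponse(
  llmTextStream: AsyncIterable<string>,
  synthesizeSpeech: (sentence: string) => Promise<Uint8Array>,
  sendAudioToClient: (audio: Uint8Array) => void,
): Promise<void> {
  const ttsJobs: Promise<Uint8Array>[] = [];
  let pending = "";

  for await (const token of llmTextStream) {
    pending += token;
    let match: RegExpExecArray | null;
    // Fire TTS the moment a sentence boundary appears, while the LLM keeps generating.
    while ((match = SENTENCE_END.exec(pending)) !== null) {
      const sentence = pending.slice(0, match.index + 1).trim();
      pending = pending.slice(match.index + match[0].length);
      ttsJobs.push(synthesizeSpeech(sentence));
    }
  }
  if (pending.trim()) ttsJobs.push(synthesizeSpeech(pending.trim()));

  // Drain in order: sentence 0 plays while later sentences are still synthesizing.
  for (const job of ttsJobs) {
    sendAudioToClient(await job);
  }
}
```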
I spent some time on Day 3 cleaning up the web UI and wiring up Gemini as the LLM because Sonnet had higher latency for short conversational responses. The voice session worked in the browser. I could talk to my tutor.
Then I tried to take it outside.
Day 4: The iOS Audio Nightmare (February 8-9)
Phase 4 was the mobile app. Expo SDK 54, expo-audio-studio for recording, react-native-track-player for lock screen persistence. I had it running on my phone within hours.
Then the tutor's voice came out of my earpiece.
Not the speaker. The earpiece. The tiny speaker you hold against your ear during phone calls.
The audio would initially play from the speaker correctly. Then I'd speak. Then the response would route to the earpiece. Every time.
This took an entire day to find. Two audio libraries were fighting over iOS's single shared AVAudioSession. expo-audio-studio would call setCategory(.playAndRecord, options: [.defaultToSpeaker]) to route audio to the speaker. Then expo-av (for playback) would call setCategory(.playAndRecord, options: [.allowBluetooth]) without .defaultToSpeaker. Audio immediately routes to the earpiece. When recording restarts, expo-audio-studio has three hardcoded setCategory calls, none of which include .defaultToSpeaker. Earpiece persists.
I tried seven things:
- JS-level `allowsRecordingIOS` toggle. Racey, fought with `expo-audio-studio`.
- `PlayAndRecord` + `DefaultToSpeaker` in recorder config only. Worked initially, regressed after playback.
- Continuous recording with no stop/start. Improved flow but earpiece persisted.
- Patching `expo-audio-studio`'s hardcoded `setCategory` calls. Necessary but not enough.
- `allowsRecordingIOS: true` in `Audio.setAudioModeAsync`. Made it worse.
- `forceSpeakerRoute()` after each transition. Overwritten by the next `setCategory`.
- `AppDelegate` route change observers. The library mutates the session after the observer fires.
Seven things and none of them worked. The AI assistant was confidently suggesting each one. Every fix was plausible. Every fix was wrong. Claude doesn't have intuition about how iOS audio routing works in practice. It knows the APIs, it can read the docs, but the interaction between two libraries fighting over a shared resource at the OS level is the kind of thing you have to discover empirically.
At some point, frustrated, I asked: "Is the UX we want even possible with Expo? How do chat apps like ChatGPT have their voice mode work so effortlessly?"
That question cracked it. AVAudioPlayer doesn't call setCategory(). It plays audio through whatever session configuration already exists. ChatGPT's voice mode uses a low-level player that doesn't touch the session, so the recording library's configuration is respected.
The fix: patch expo-audio-studio via pnpm to add a native playAudio(base64:) function using AVAudioPlayer, and rip out expo-av entirely. Fix the three hardcoded setCategory calls to preserve .defaultToSpeaker. Remove expo-av from the playback path completely.
Speaker audio persisted through record/play transitions. If you're building an iOS app that records and plays audio simultaneously: two libraries that both call setCategory() will always fight. Remove one from the equation.
Day 4 (continued): The Latency Wall
With the audio fixed, I went for my first real walk with the app.
The opening message took 13 seconds. Each turn after that, about 15 seconds. In a walking conversation, 15 seconds of silence feels like the app has crashed. I kept checking my phone. That defeats the purpose.
I added profiling timestamps at every stage of the pipeline. "The pipeline is slow" turned out to be three separate problems, each with a different fix:
- Model choice. Gemini 3 Flash Preview had a 2.5-4 second TTFT. It's a preview model. Switching to `gemini-2.5-flash-lite` dropped TTFT to ~500ms. One-line change, 5-6x improvement.
- TTS serialization. I was awaiting each `synthesizeSpeech()` call sequentially. Sentence 1 had to finish before sentence 2 even started. The fix: fire all TTS requests concurrently, drain results in order. First audio ships as soon as TTS(0) resolves, while TTS(1) and TTS(2) are already in flight.
- VAD noise floor. The voice activity detection silence threshold was -48dB. Walk noise floor is -44 to -49dB. Background noise was continuously resetting the silence timer, so the system never detected end-of-speech. It just fell through to the 8-second maximum utterance timeout every time. Raising the speech threshold to -40dB and changing the silence condition from "is it very quiet" to "is nobody speaking" fixed it. Turns ended in ~1.1 seconds instead of 8. (A sketch of the end-of-speech check follows this list.)
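Here's roughly what the corrected end-of-speech check amounts to. The thresholds come from the post; the timer values and structure are assumptions:

```typescript
// Sketch of VAD end-of-turn logic: "has nobody been speaking for a while?"
// rather than "is it very quiet?", which outdoor noise never satisfies.
const SPEECH_THRESHOLD_DB = -40; // above this counts as speech; walk noise sits around -44 to -49dB
const SILENCE_MS = 1000;         // assumed: how long without speech before the turn ends
const MAX_UTTERANCE_MS = 8000;   // hard cap so a turn can never hang forever

interface VadState {
  utteranceStartedAt: number;
  lastSpeechAt: number;
}

export function shouldEndTurn(levelDb: number, state: VadState, now = Date.now()): boolean {
  if (levelDb > SPEECH_THRESHOLD_DB) {
    state.lastSpeechAt = now; // someone is speaking; reset the silence window
  }
  const silentLongEnough = now - state.lastSpeechAt >= SILENCE_MS;
  const hitMaxUtterance = now - state.utteranceStartedAt >= MAX_UTTERANCE_MS;
  return silentLongEnough || hitMaxUtterance;
}
```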
After these three fixes: opening message ~3 seconds, per-turn ~2.5-4 seconds. If you're building a voice app people will use outdoors, calibrate your VAD against outdoor noise, not indoor silence. The gap between speech and silence is smaller than you'd expect.
Day 5: TTS Provider Exploration (February 10)
With latency at a reasonable level, I looked at whether I could do better on the TTS side. OpenAI's gpt-4o-mini-tts was the remaining bottleneck at 1-2.5 seconds per sentence.
I found a PR from another project that had done a thorough cost/quality analysis of streaming TTS providers:
| Provider | Per session cost | Streaming? |
|---|---|---|
| OpenAI TTS | $0.04 | No |
| Cartesia Sonic | $0.03 | Yes (WebSocket) |
| ElevenLabs | $0.61 | Yes |
ElevenLabs had the best voice quality but was 15x more expensive because of their subscription model. I built a TTS provider abstraction (a TTS_PROVIDER=openai|cartesia env var swaps the implementation) and tested Cartesia's buffered mode first. Results: ~500ms warm TTFT from Fly's iad region. Not the 40ms from their docs (that's streaming WebSocket, not REST), but a real improvement. The concurrent TTS drain architecture masked the remaining latency well: while sentence 0 plays back, sentences 1 and 2 are already resolved.
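The abstraction itself is small. A sketch of the shape, with the interface and class names invented for this post:

```typescript
// Sketch of a TTS provider abstraction selected by the TTS_PROVIDER env var.
export interface TtsProvider {
  synthesize(sentence: string): Promise<Uint8Array>;
}

class OpenAiTts implements TtsProvider {
  async synthesize(sentence: string): Promise<Uint8Array> {
    // ...call OpenAI's TTS endpoint, return audio bytes
    throw new Error("not implemented in this sketch");
  }
}

class CartesiaTts implements TtsProvider {
  async synthesize(sentence: string): Promise<Uint8Array> {
    // ...call Cartesia Sonic in buffered REST mode, return audio bytes
    throw new Error("not implemented in this sketch");
  }
}

export function createTtsProvider(): TtsProvider {
  switch (process.env.TTS_PROVIDER ?? "openai") {
    case "cartesia":
      return new CartesiaTts();
    default:
      return new OpenAiTts();
  }
}
```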
Day 6: The Tailscale Reckoning (February 11)
This was the worst day.
I tried to take the app for a walk over 5G. First voice turn worked. LLM 2.2s, TTS 1.2s, first audio at 3.5s. Fine. Then I spoke. STT completed in 3.7 seconds. Then nothing. No LLM response. No TTS. 24 seconds of silence. The logs eventually showed: magicsock: phone switched to different IPv6 (tower handoff). Tailscale's WireGuard tunnel broke during a cell tower handoff and took nearly a minute to recover via DERP relay. The session was dead after one exchange.
It got worse. Back home on WiFi, the web voice session that had been working fine was now unreliable. TTS calls taking 60+ seconds. Litestream replication spiking to 13-22 seconds. I thought Fly.io was having infrastructure problems.
I spent hours on this. I tried adding Tailscale Funnel. Different configurations. At one point, frustrated enough to type: "something odd is going on with the ios app. its just stuck and not building now. i think i'm over that. lets try to get the web version running reliably and i guess i'll just have to do without lock screen."
The mobile app dream was shelved. I pivoted to making the web app work in Safari on my phone. But even that was flaky with Tailscale in the network path.
The original architecture doc I'd written days earlier had entire sections praising Tailscale's "network-level authorization" model. It was elegant in theory.
The fix came in two PRs:
- PR #11: Expose the API publicly. Drop Tailscale as the sole access path. Add simple bearer token / cookie auth middleware to Hono (a sketch follows this list). Standard HTTPS. Standard TCP/TLS connections recover from IP changes in ~200ms; WireGuard takes 3-5 seconds. I had to learn that Fly IPs must be allocated explicitly (`fly ips allocate-v4 --shared` + `fly ips allocate-v6`). Without this, `fly.dev` DNS just doesn't resolve. Not an obvious error.
- PR #12: Remove Tailscale from the runtime entirely. This was the real fix. Tailscale wasn't just failing on mobile. Its daemon was actively interfering with all network operations on the VM: `magicsock` rebinding, WireGuard tunnel contention, iptables rules, all competing for resources on a tiny 256MB VM running a latency-sensitive voice pipeline.
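For the bearer token piece, Hono ships middleware for exactly this. A minimal sketch, where the route layout and env var name are assumptions:

```typescript
// Sketch of swapping Tailscale-as-auth for a shared bearer token on the public API.
import { Hono } from "hono";
import { bearerAuth } from "hono/bearer-auth";

const app = new Hono();

// Health check stays open so Fly can probe it without a token.
app.get("/health", (c) => c.json({ ok: true }));

// Everything under /api requires the shared token in an Authorization header.
app.use("/api/*", bearerAuth({ token: process.env.API_TOKEN! }));

export default app;
```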
Days 7-8: The Walk That Worked (February 12-13)
After removing Tailscale and bumping the VM to 512MB, I added hard timeouts to all external calls (STT 15s, TTS 10s, LLM 30s) and a health check endpoint. Then I went for a walk.
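The timeout wrapper is the kind of thing worth showing. A sketch, where the helper name and wiring are mine and only the budgets (STT 15s, TTS 10s, LLM 30s) come from the project:

```typescript
// Sketch of a hard timeout around any external call in the pipeline.
export async function withTimeout<T>(label: string, ms: number, work: Promise<T>): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Usage, assuming transcribe/generate/synthesize helpers exist elsewhere:
// const text     = await withTimeout("STT", 15_000, transcribe(audio));
// const reply    = await withTimeout("LLM", 30_000, generate(text));
// const audioOut = await withTimeout("TTS", 10_000, synthesize(reply));
```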
I came home and asked Claude to analyze the Fly logs. Average turn latency: 3.9 seconds. Range of 2-5 seconds, zero outliers. The day before, with Tailscale still running, the average had been 44.7 seconds per turn. An 11x difference, from removing a VPN daemon.
That evening I added a thinking tone: a subtle audio chirp (440Hz, 150ms, with a gain envelope) that plays when the system is processing speech. During a walk, dead silence while the LLM thinks feels like a dropped connection. The chirp, repeating every 2 seconds, says "I heard you, I'm working on it." Small detail. Big difference in whether you trust the app is still running.
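In the browser this is a few lines of Web Audio. A sketch, where everything beyond the 440Hz / 150ms / 2-second figures is an assumption:

```typescript
// Sketch of the "thinking tone": a short 440Hz chirp with a gain envelope,
// repeated every 2 seconds while the pipeline is working on a response.
export function startThinkingTone(ctx: AudioContext): () => void {
  const interval = setInterval(() => {
    const osc = ctx.createOscillator();
    const gain = ctx.createGain();
    osc.frequency.value = 440;

    // Quick fade in and out so the chirp doesn't click.
    const now = ctx.currentTime;
    gain.gain.setValueAtTime(0, now);
    gain.gain.linearRampToValueAtTime(0.2, now + 0.02);
    gain.gain.linearRampToValueAtTime(0, now + 0.15);

    osc.connect(gain).connect(ctx.destination);
    osc.start(now);
    osc.stop(now + 0.15);
  }, 2000);

  return () => clearInterval(interval); // call this once response audio starts playing
}
```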
I also fixed a bug where iOS Chrome kept re-prompting for microphone permission on every unmute. The fix: keep the VAD mounted for the entire session, mute/unmute by toggling the audio track's enabled property instead of destroying and recreating the MediaStream.
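The pattern is simple once you see it: request the microphone once, then flip the track on and off. A sketch under that assumption:

```typescript
// Sketch of the mute fix: keep one MediaStream alive for the whole session and
// toggle the audio track's `enabled` flag instead of tearing the stream down,
// which is what triggers repeated permission prompts in iOS Chrome.
let micStream: MediaStream | null = null;

export async function getMicStream(): Promise<MediaStream> {
  if (!micStream) {
    micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  }
  return micStream;
}

export function setMuted(muted: boolean): void {
  micStream?.getAudioTracks().forEach((track) => {
    track.enabled = !muted; // no new permission prompt, stream stays mounted
  });
}
```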
Days 9-10: Stabilization and CI (February 14)
The app worked. I'd done several walk sessions. Latency was consistent. The thinking tone provided good feedback. The web app in Safari on my phone was a perfectly adequate interface.
I set up a GitHub Actions CI/CD pipeline so merging to main automatically deploys to Fly.io after typecheck and lint pass. I used Codex CLI for some of this. It has a decent autonomous mode for straightforward infrastructure tasks.
What Building With AI Assistants Actually Felt Like
The speed was real. Going from a plan to a deployed, working application in 10 days, including a voice pipeline, mobile app attempt, multiple TTS providers, auth system, and WebSocket reconnection, would have taken me months alone. The assistants handled boilerplate. I focused on architecture and debugging.
Starting every feature with a plan mattered. Every major feature began with a /create_plan command. The plan gets reviewed, iterated on, then implemented. This caught problems early and kept the architecture coherent across dozens of sessions.
Claude Code's MEMORY.md system accumulated lessons across sessions. When I hit a deployment issue on Day 8, the memory already contained ".dockerignore is critical, without it macOS node_modules get copied into the linux container." I didn't have to rediscover that.
But the iOS audio session bug took a full day because every AI-suggested fix was plausible and wrong. The assistant doesn't have intuition about how iOS audio routing works in practice. It took seven failed attempts and a sideways question about ChatGPT to find the right approach.
And Tailscale worked perfectly in development, perfectly on the VM when I was on WiFi, and fell apart on a walk over 5G. The symptoms were subtle enough that I blamed Fly.io's infrastructure for two days before finding the real cause.
The biggest wins were often the smallest changes. Switching the LLM model: one line, 5-6x latency reduction. Raising the VAD threshold by 8dB: speech detection from 8 seconds to 1.1. Removing a daemon from the Dockerfile: 11x turn latency improvement. None of these required clever engineering. They required correct diagnosis.
Profiling is not optional for voice apps. "The voice pipeline is slow" was three independent problems, and the solution for each was completely different. Adding [timing] log markers at every pipeline stage (first token, first sentence boundary, TTS start, audio playback start) was the most valuable debugging investment of the whole project.
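The markers themselves can be trivial. A sketch, with the helper invented for this post and the stage names mirroring the ones above:

```typescript
// Sketch of [timing] markers: log each pipeline stage relative to the turn's start.
export function createTurnTimer() {
  const start = performance.now();
  return (stage: "first_token" | "first_sentence" | "tts_start" | "audio_start") => {
    console.log(`[timing] ${stage} +${Math.round(performance.now() - start)}ms`);
  };
}

// Usage:
// const mark = createTurnTimer();
// ...on first LLM token:         mark("first_token");
// ...on first sentence boundary: mark("first_sentence");
// ...when the TTS request fires: mark("tts_start");
// ...when playback begins:       mark("audio_start");
```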
And: iOS has one audio session, and everyone fights over it. Any iOS app that records and plays simultaneously needs to be careful about which library is calling setCategory(). If two libraries both want to configure the audio session, one of them needs to be replaced with something lower-level that doesn't.
Where It Stands
Ten days in, the Socratic Walk Tutor works. I upload a chapter, the AI generates a study guide in about a minute, and I can have a voice conversation about it during my morning walk. Turn latency seems to hover around 2-3 seconds over 5G, quick enough to feel snappy. The tutor picks up where the last session left off. It costs about a dollar or less per session.
The mobile app is on hold. The web app in Safari is good enough, and fighting with Expo native modules was consuming time better spent on the tutoring experience. Phases 5 (living understanding system with cross-chapter linking) and 6 (Obsidian export, pacing controls) are planned but not built.
Built with Claude Code (Anthropic) and Codex CLI (OpenAI). 16 PRs. ~117 Claude sessions. One very educational walk.
Glossary
| Term | What it is |
|---|---|
| AVAudioSession | Apple's iOS API for managing how your app interacts with the device's audio hardware. Every app shares one session, so if two libraries both try to configure it, they collide. |
| AVAudioPlayer | A lower-level Apple API for playing audio files. Unlike higher-level libraries, it doesn't reconfigure the AVAudioSession when it plays, which makes it safe to use alongside a recording library. |
| CI/CD | Continuous Integration / Continuous Deployment. An automated pipeline that runs tests, linting, and type checks on every code change, then deploys to production if everything passes. |
| DERP | Designated Encrypted Relay for Packets. Tailscale's fallback relay servers. When a direct WireGuard connection can't be established, traffic routes through DERP relays, adding significant latency. |
| Drizzle ORM | A lightweight TypeScript ORM for SQL databases. Lets you define your database schema in TypeScript and write type-safe queries. |
| Expo | A framework and platform for building React Native mobile apps. Provides managed tooling and pre-built native modules so you can write JavaScript and deploy to iOS and Android. |
| Fly.io | A platform for deploying applications on lightweight VMs close to your users. You give it a Dockerfile, it runs your app on hardware distributed across global regions. |
| Gemini | Google's family of large language models. In this project, gemini-2.5-flash-lite was used for fast conversational responses because of its low time-to-first-token. |
| Hono | A small, fast web framework for TypeScript. Similar to Express but designed for edge and serverless environments. Used here as a single server handling API routes, static files, and WebSocket connections. |
| iptables | The Linux kernel's firewall and packet routing system. Tailscale's daemon configures iptables rules to route traffic through its VPN tunnel, which can compete with other network operations on the same machine. |
| Litestream | A tool that continuously replicates a SQLite database to cloud storage (like S3 or Cloudflare R2). Gives you backups and disaster recovery without running a separate database server. |
| LLM | Large Language Model. The AI models (like Claude, GPT, Gemini) that generate text responses. In the voice pipeline, the LLM receives the transcribed speech, generates a tutoring response, and streams it back sentence by sentence. |
| MediaStream | A browser API representing a stream of audio or video data, typically from a microphone or camera. Destroying and recreating it triggers new permission prompts in some browsers. |
| SPA | Single Page Application. A web app that loads once and updates dynamically without full page reloads. The study guide dashboard was built as a React SPA. |
| SQLite | A self-contained database engine that stores everything in a single file. No separate server process needed. Good for applications where simplicity matters and you don't need distributed writes. |
| SSE | Server-Sent Events. A browser API for receiving a stream of updates from a server over HTTP. Used here to stream study guide generation progress to the web dashboard. |
| STT | Speech-to-Text. The process of converting spoken audio into written text. This project uses OpenAI's Whisper model for STT. |
| Tailscale | A mesh VPN built on WireGuard. Creates a private network between your devices. Used here (and later removed) as an authentication layer: if you're on the Tailscale network, you're authorized. |
| TTFT | Time to First Token. How long it takes an LLM to start generating its response after receiving a prompt. A key latency metric for real-time voice applications. |
| TTS | Text-to-Speech. Converting written text into spoken audio. The bottleneck in most voice AI pipelines because each sentence must be synthesized before it can be played. |
| Turborepo | A build system for JavaScript/TypeScript monorepos. Manages dependencies and build order across multiple packages in a single repository. |
| VAD | Voice Activity Detection. The system that determines when a person has started and stopped speaking. Threshold calibration is critical: too sensitive and background noise triggers it, too aggressive and it cuts off speech. |
| Vercel AI SDK | A TypeScript library for building AI-powered applications. Provides utilities like generateObject (structured LLM output with schema validation) and streaming helpers. |
| VPN | Virtual Private Network. Creates an encrypted tunnel between devices so they can communicate as if on the same local network, regardless of physical location. |
| WAL mode | Write-Ahead Logging. A SQLite configuration that allows concurrent reads while a write is in progress, instead of locking the entire database. Essential for any SQLite app serving web requests. |
| WebSocket | A protocol for persistent, two-way communication between a browser and server. Unlike HTTP (request/response), a WebSocket stays open, allowing real-time audio streaming in both directions. |
| Whisper | OpenAI's speech-to-text model. Transcribes spoken audio into text. Used as the STT step in the voice pipeline. |
| WireGuard | A modern VPN protocol. Fast and simple, but its tunnel can break during network transitions (like cell tower handoffs) and take several seconds to re-establish. |
| Zod | A TypeScript schema validation library. Used with the Vercel AI SDK to define the exact shape of structured output you want from an LLM, so the response is guaranteed to match your types. |
See the App Design
Click through the interactive prototype below to see what the Socratic Walk Tutor looks like. Navigate between pages using the sidebar.