Mounir RAJI

I Added Voice to Neog. Here's What Actually Broke.

Six hours for a 100% local voice pipeline: Kokoro GPU, faster-whisper via Tailscale, a Python proxy, and a backtick that broke my config at 3 AM.

· 7 min read

Personal AI Agent Series — Article 5 · Neog × OpenClaw × speaches × Kokoro


I spent a night making my AI agent talk. Not metaphorically — literally. Real Telegram voice messages arriving, transcribed locally, processed by a local LLM, turned into audio by a local voice model. Zero cloud. Zero API keys rotating on someone else’s server.

Except it didn’t go as planned.

I thought it would take two hours. It took six. And most of the problems weren’t where I expected them.


What I Was Trying to Build

Since the beginning of this series, I’ve been building Neog — my personal, self-hosted AI agent running on an Ubuntu VM that answers me on Telegram. Previous articles covered the cloud-to-local migration, the three-layer memory architecture, and the reasoning model configuration.

Today, I’m adding voice. The goal is simple on paper:

🎤 I speak → local transcription → local LLM → 🔊 local audio response

The full pipeline, without any data leaving my Tailscale network.


The First Problem Nobody Mentions

OpenClaw supports TTS through multiple providers: ElevenLabs, OpenAI, and Edge TTS. For STT, it supports cloud providers or local CLI wrappers.

I have a speaches instance running on my Windows PC with an RTX 3080. speaches exposes an OpenAI-compatible API — perfect, OpenClaw knows how to talk to that API. The plan: configure speaches as the TTS provider in OpenClaw.

Five minutes of config. Two minutes of testing. Then:

HTTP: 422 Unprocessable Entity

Again. Again. Again.

A 422 on every TTS call, with no explanation in the logs. Here’s the lesson nobody writes in tutorials: OpenClaw sends response_format: opus to its TTS provider for Telegram messages. Opus is required for the round voice note bubbles. And speaches doesn’t support Opus.

This isn’t a bug. It’s not a misconfiguration. It’s a format incompatibility that neither project documents.
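You can reproduce it outside OpenClaw with a bare request against the OpenAI-compatible endpoint. A minimal sketch — the host, port, and model id below are illustrative placeholders, not my exact values:

import requests

SPEACHES = "http://100.64.0.2:8010"  # illustrative speaches address

r = requests.post(
    f"{SPEACHES}/v1/audio/speech",
    json={
        "model": "speaches-ai/Kokoro-82M-v1.0-ONNX",  # assumption: your Kokoro model id
        "input": "Bonjour Neog",
        "voice": "ff_siwis",
        "response_format": "opus",  # what OpenClaw sends for Telegram voice notes
    },
)
print(r.status_code)  # 422 with opus; switch to "mp3" and it returns 200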


The Fix That Shouldn’t Exist but Works

A Python micro-proxy. Fifty lines. It sits between OpenClaw and speaches, intercepts each TTS request, and replaces response_format: opus with response_format: mp3 before forwarding.

# Intercept TTS calls and rewrite formats speaches rejects
if self.path == "/v1/audio/speech":
    data = json.loads(body)
    if data.get("response_format", "") in ("opus", "aac", ""):
        data["response_format"] = "mp3"
        body = json.dumps(data).encode()
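The rest of the shim is plain forwarding. A minimal complete version might look like this — the upstream IP and the proxy port 8011 are illustrative, and error handling is omitted:

import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://100.64.0.2:8010"  # illustrative: speaches on the GPU box

class ShimHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        if self.path == "/v1/audio/speech":
            data = json.loads(body)
            # Rewrite formats speaches rejects into one it accepts
            if data.get("response_format", "") in ("opus", "aac", ""):
                data["response_format"] = "mp3"
                body = json.dumps(data).encode()
        # Forward the (possibly patched) request and relay the audio back
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as upstream:
            audio = upstream.read()
        self.send_response(200)
        self.send_header("Content-Type", "audio/mpeg")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)

HTTPServer(("0.0.0.0", 8011), ShimHandler).serve_forever()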

This is the classic “shim” approach — a translation layer between two systems that don’t speak directly to each other. It’s been around since the 70s in various forms. We’re still using it in 2026 to make Kokoro talk to Telegram.

The result: Kokoro generates MP3. Telegram receives an audio file instead of a round voice bubble. Functionally identical — the audio plays the same. Visually different — a download icon instead of an inline play button.

I accepted that trade-off. The round bubble requires ElevenLabs or OpenAI TTS — two paid cloud providers. Zero cloud, one cosmetic compromise.


The GPU That Isn’t Where You Think

My Ubuntu VM runs on a Windows machine with an RTX 3080. Problem: the VM doesn’t have GPU access. VirtualBox doesn’t support native GPU passthrough for desktop NVIDIA GPUs in this configuration.

The intuitive solution would be to install speaches with CUDA support inside the VM. Wrong approach.

The correct solution: speaches runs directly on Windows under Docker Desktop (the latest-cuda image), exposed on the Tailscale network. OpenClaw inside the VM calls it via the PC’s Tailscale IP — exactly how it already calls LM Studio.

VM (OpenClaw)
  → LM Studio on Windows PC:11434  ✅ already working
  → speaches GPU on Windows PC:8010  ← same principle

The conceptual model is remote inference, not GPU passthrough. The GPU stays on the host machine, the API travels over the local network. With Tailscale, latency is negligible.
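From the VM’s point of view, the GPU box is just another HTTP endpoint. A quick sanity check of the remote STT — the Tailscale IP and model id below are illustrative assumptions:

import requests

SPEACHES = "http://100.64.0.2:8010"  # illustrative: the PC's Tailscale IP

with open("sample.wav", "rb") as f:
    r = requests.post(
        f"{SPEACHES}/v1/audio/transcriptions",
        files={"file": f},
        data={"model": "Systran/faster-whisper-large-v3-turbo"},  # assumption
    )
r.raise_for_status()
print(r.json()["text"])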

Result: STT transcription drops from 30 seconds (CPU inside VM) to 3-5 seconds (RTX 3080 over local network).


The Language Problem I Created Myself

Once the voice pipeline was working, I wanted to add automatic language detection. My agent should respond in French when I speak French, in English when I speak English.

faster-whisper detects language with a probability in its verbose_json output. The plan: the STT wrapper detects the language, writes the language code to a voice.conf file, the agent reads that file, responds in the correct language, and Kokoro uses the matching voice.
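A sketch of that wrapper step, assuming the OpenAI verbose_json response shape — the helper name and the language map are mine, not OpenClaw’s:

# Map verbose_json language names to the two-letter codes the agent expects
LANG_CODES = {"french": "fr", "english": "en", "arabic": "ar"}

def write_voice_conf(transcription: dict, path: str) -> str:
    lang = transcription.get("language", "english")
    code = LANG_CODES.get(lang.lower(), lang[:2])
    with open(path, "w") as f:
        f.write(f"LANG={code}\n")
    return code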

Simple. Except I created the voice.conf file from the Ubuntu host:

true > /opt/neog/config/bin/voice.conf

And from inside the Docker container, the file didn’t exist. Between file permissions, ownership, and mounted-volume synchronization, the file was visible on the host but invisible inside the container.

The rule I learned and should have already known: if a file needs to be written by a Docker process, it must be created from inside that container.

docker exec neog-gateway sh -c 'true > /home/node/.openclaw/bin/voice.conf'

Not from the host. Never.


What Happened Between 3 AM and 4 AM

The backtick.

OpenClaw’s Docker version has a poorly documented bug: it prepends a backtick (`) to openclaw.json on every startup. This single character invalidates the JSON. Every time I restarted the gateway to apply a config change, the file silently corrupted itself.

Detection took a while because Python scripts were failing without error messages — json.load() raises an exception that I was catching with except: pass.

Definitive fix: an entrypoint wrapper in Docker Compose that strips the backtick before launching the gateway.

entrypoint:
  - /bin/sh
  - -c
  - |
    sed -i '1s/^`//' /home/node/.openclaw/openclaw.json 2>/dev/null || true
    exec docker-entrypoint.sh openclaw gateway

One line of sed. Applied on every startup. The kind of fix that should have taken two minutes if I’d known where to look.


The Result

At 4 AM, the pipeline was complete.

I speak French in Telegram. An OGG voice note arrives in OpenClaw. ffmpeg converts it to WAV. faster-whisper transcribes it in 4 seconds with a language detection probability of 0.99. The agent reads voice.conf, sees LANG=fr, responds in French. Kokoro generates audio with the ff_siwis voice. speaches-proxy patches the format. The MP3 file arrives on Telegram.

I speak English. Same pipeline. faster-whisper detects English. voice.conf reads LANG=en. Kokoro uses af_heart. Response in English with an American English voice.

No data left my Tailscale network.


What It Actually Cost in Time

Step                     Estimated   Actual
Initial LLM config       30 min      1h
GitOps setup             20 min      30 min
STT working              30 min      1h
Kokoro TTS (422 issue)   30 min      2h
Remote GPU STT           20 min      45 min
Language detection       30 min      1h
Backtick fix             5 min       30 min
Total                    ~3h         ~6h

The gap is real. Most of it comes from problems nobody documents because nobody else has exactly this setup.


The Final Architecture

🎤 Telegram voice note
  → neog-gateway (OpenClaw, Ubuntu 22.04 VM)
  → speaches-gpu (RTX 3080, Windows PC, Tailscale)
      faster-whisper-large-v3-turbo · float16 · 3-5s
  → voice.conf: LANG=fr|en|ar
  → LM Studio (RTX 3080, Windows PC, Tailscale)
      Qwen3.5-9B distilled · ~80 tok/s
  → speaches-proxy (Python, VM)
      opus→mp3 patch
  → Kokoro GPU (RTX 3080, Windows PC)
      ff_siwis FR · af_heart EN · ~1-5s
🔊 MP3 audio → Telegram

Everything runs on hardware I own. No external requests. No expiring API keys.


What I Learned

On format incompatibilities: Open-source projects rarely assemble perfectly. The translation layer (shim, proxy, wrapper) is a normal part of architecture — not technical debt but an integration decision.

On remote GPU: “GPU in the cloud” and “GPU on a PC on the same network” are conceptually identical from a code perspective. One costs €50/month, the other is already paid for.

On Docker and shared files: Mounted volumes aren’t transparent. Permissions matter. Always create runtime files from inside the container.

On late-night debugging: Silent errors are expensive. except: pass should be banned from diagnostic scripts.


What’s Next

Arabic voice. Kokoro has no Arabic voice — I’ll need to integrate Piper with ar_JO-kareem-medium. That’s a future article.

And the three-layer memory architecture article is still pending. Next in the series.


Personal AI Agent Series — Article 1 (initial setup) · Article 2 (morning brief) · Article 3 (cloud→local) · Article 4 (three-layer memory)
