Mounir RAJI

My Voice Pipeline Was Working. Except It Wasn't.

I published an article describing FR/EN language switching as working. It wasn't. Here are the 4 bugs I missed, and how to fix them.


Personal AI Agent Series — Article 6
Neog × OpenClaw × speaches × Kokoro


Five days ago, I published an article that ended like this:

“I speak English. Same pipeline. LANG=en. Kokoro uses af_heart. Response in English.”

That was wrong.

Not wrong in the sense that I lied — wrong in the sense that the initial test appeared to work, and I didn’t push far enough to see the edge case. I spoke English once, got an English response, and concluded that language switching worked. It didn’t.

What I had was language detection on the first message. Not mid-session switching. Not a return to French afterward. Not consistency between voice and text.

I found out this week. Here’s what was broken and how I fixed it.


The Symptom

After using the agent for a few days in real conditions, I noticed something strange. I speak French. I get a French response — with an English voice. I speak English. I get a French response — with an English voice.

The voice had switched to English and never came back. The text response stayed French regardless.

Digging into it, I found not one bug but four. Nested. Each one hiding the next.


Bug 1 — The Whisper Script Only Updated voice.conf Once

The whisper script — the CLI wrapper handling STT transcription — contained this condition:

if [ ! -s "$CONF" ]; then
  echo "LANG=$LANG_CODE" > "$CONF"
fi

`-s` tests whether the file exists and is non-empty, so the negated condition only writes `voice.conf` when the file is missing or empty. On the first voice message, detection runs and the file is created. On the second, third, twentieth — the file already exists, so nothing changes.

Consequence: the language detected on the first message of a session stays locked until /new. If I start in English, the entire session runs in English even if I switch to French.

This is the kind of condition that seems logical when you write it — “only overwrite if empty” — and creates exactly the opposite of what you want.
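The failure is easy to reproduce in isolation. A minimal stand-alone sketch using a temp file in place of the real voice.conf:

```shell
CONF=$(mktemp)

# First voice message: the file is empty, so the condition fires and the language is written.
if [ ! -s "$CONF" ]; then echo "LANG=en" > "$CONF"; fi
FIRST=$(cat "$CONF")

# A later French message: the file is now non-empty, so the write is skipped.
if [ ! -s "$CONF" ]; then echo "LANG=fr" > "$CONF"; fi
SECOND=$(cat "$CONF")

echo "$FIRST then $SECOND"   # LANG=en then LANG=en — the first language sticks
rm -f "$CONF"
```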


Bug 2 — OpenClaw Loads openclaw.json Once

The whisper script was doing something else inside that conditional block. In addition to writing voice.conf, it patched openclaw.json directly:

python3 -c "
import json
p = '$OC_JSON'
c = json.load(open(p))
c['messages']['tts']['openai']['voice'] = '$VOICE'
json.dump(c, open(p, 'w'), indent=2)
"

The intention: tell OpenClaw to use af_heart for English by modifying its TTS config.

The problem: OpenClaw reads openclaw.json once at container startup, loads everything into memory, and never reads the file again. Modifying openclaw.json while the session is running has no effect whatsoever on the running process. The voice configured at startup is the voice used until the next restart.

This code ran on every first voice message. It was modifying a file nobody was reading.
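The failure mode can be demonstrated in a few lines. A minimal sketch — the nested keys mirror the path the whisper script patched:

```python
import json
import os
import tempfile

# Simulate a process that, like OpenClaw, loads its config once at startup.
path = os.path.join(tempfile.mkdtemp(), "openclaw.json")
with open(path, "w") as f:
    json.dump({"messages": {"tts": {"openai": {"voice": "ff_siwis"}}}}, f)

with open(path) as f:
    config = json.load(f)   # "startup": the config now lives in memory

# The runtime patch the whisper script performed:
with open(path, "w") as f:
    json.dump({"messages": {"tts": {"openai": {"voice": "af_heart"}}}}, f)

# The running "process" still sees the startup value — the file edit changed nothing.
print(config["messages"]["tts"]["openai"]["voice"])   # ff_siwis
```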


Bug 3 — The Proxy Ignored voice.conf

Between OpenClaw and speaches sits speaches-proxy.py — the 50-line Python micro-proxy from the previous article. Its original job: intercept TTS requests and replace response_format: opus with mp3.

The proxy never looked at voice.conf. It forwarded requests as-is, including the voice field sent by OpenClaw — which used its in-memory config loaded at startup: ff_siwis (French) forever.

Result: regardless of what voice.conf contained, the voice sent to Kokoro was always the one from the static config.

The proxy was the only place in the pipeline where runtime intervention was possible. It wasn’t doing it.


Bug 4 — SOUL.md Overrode Language Detection

Neog has a SOUL.md file that defines its identity and core rules. It contained:

## Language — Non-negotiable
Always respond in French to Moun, unless explicitly asked otherwise.

The general.md prompt correctly said “read voice.conf and respond in the detected language.” But SOUL.md is loaded first and treated as foundational. With a 9B model juggling multiple instructions, the explicit “always in French” rule beats the more contextual instruction in general.md.

Result: even if bugs 1-3 had been fixed, text responses would have stayed in French.


The 3 Fixes

Fix 1 — The whisper Script: Always Update voice.conf

Remove the condition. That’s it.

# Before
if [ ! -s "$CONF" ]; then
  echo "LANG=$LANG_CODE" > "$CONF"
  # ... python3 patching openclaw.json (dead code)
fi

# After
echo "LANG=$LANG_CODE" > "$CONF"

voice.conf is now updated on every voice message. Mid-session FR→EN→FR switching works. The python3 block that patched openclaw.json at runtime is removed — it was dead code from the start.

Fix 2 — The Proxy: Read voice.conf on Every TTS Request

The proxy already mounts the config/bin/ volume — it has access to voice.conf via /scripts/voice.conf. The fix is to read it before each TTS request and override the voice field:

VOICE_CONF = "/scripts/voice.conf"

VOICE_MAP = {
    "fr": "ff_siwis",
    "en": "af_heart",
    "ar": "ff_siwis",  # Kokoro has no native Arabic voice
}

def get_voice_from_conf():
    try:
        with open(VOICE_CONF, "r") as f:
            content = f.read().strip()
        for line in content.splitlines():
            if line.startswith("LANG="):
                lang = line.split("=", 1)[1].strip().lower()
                return VOICE_MAP.get(lang, "ff_siwis")
    except OSError:  # FileNotFoundError is a subclass of OSError
        pass
    return None

# In do_POST, on /v1/audio/speech:
voice = get_voice_from_conf()
if voice:
    data["voice"] = voice

The proxy becomes the runtime intervention point. OpenClaw sends ff_siwis in its request — the proxy reads voice.conf, sees LANG=en, overwrites with af_heart before forwarding to speaches. OpenClaw doesn’t need to reload its config.
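The mapping logic can be sanity-checked outside the container by making the conf path a parameter instead of the hard-coded `/scripts/voice.conf`. Here `voice_for` is a test-only variant, not the proxy's actual function name:

```python
import os
import tempfile

VOICE_MAP = {"fr": "ff_siwis", "en": "af_heart", "ar": "ff_siwis"}

def voice_for(conf_path):
    """Same parsing as the proxy, with the path injectable for testing."""
    try:
        with open(conf_path) as f:
            for line in f.read().splitlines():
                if line.startswith("LANG="):
                    lang = line.split("=", 1)[1].strip().lower()
                    return VOICE_MAP.get(lang, "ff_siwis")
    except OSError:
        pass
    return None

conf = os.path.join(tempfile.mkdtemp(), "voice.conf")
with open(conf, "w") as f:
    f.write("LANG=en\n")
print(voice_for(conf))                         # af_heart

with open(conf, "w") as f:
    f.write("LANG=fr\n")
print(voice_for(conf))                         # ff_siwis

print(voice_for("/nonexistent/voice.conf"))    # None — proxy keeps OpenClaw's voice
```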

Fix 3 — SOUL.md and general.md: Explicit Exception for Voice Messages

In SOUL.md:

## Language — Non-negotiable

Always respond in French to Moun, except:
- Moun writes in another language → respond in that language
- **Voice message + `voice.conf` indicates another language → respond in that language**

In general.md, replace “read voice.conf at the start of the session” with:

**Voice messages — strict rule, before each response:**
Read `/home/node/.openclaw/bin/voice.conf` using the `fs` tool before responding.
- `LANG=en` → respond in English (mandatory — overrides the French default from SOUL.md)

The exception is now explicit and cross-referenced in both files. The model has a clear instruction on which rule takes priority.


What This Reveals About OpenClaw’s Architecture

Bug 2 forced me to understand something the documentation doesn’t state explicitly: OpenClaw is a Node.js process that loads its config at startup and never re-reads it.

The TTS parameters in openclaw.json — provider, model, voice — are fixed at initialization. You can modify the file while the container is running; it has no effect until the next restart.

This changes how you think about dynamic adaptations. If you want to modify behavior at runtime, you need to intervene outside of OpenClaw — in a proxy, a wrapper, or a file the agent itself can read via its tools (`fs`). The proxy is the natural place for request-level modifications. The workspace files (voice.conf, SOUL.md) are the place for agent behavior modifications.

This distinction — static config vs dynamic behavior — is worth establishing from the start when designing an OpenClaw setup.


The Result

After the three fixes:

- I speak French → `voice.conf` = `LANG=fr` → proxy sends `ff_siwis` → agent responds in French
- I speak English → `voice.conf` = `LANG=en` → proxy sends `af_heart` → agent responds in English
- I switch back to French → `voice.conf` updated immediately → clean switch

Mid-session language switching works. Actually this time.


What I Learned

Testing the happy path isn’t enough. I tested “speak English → English response.” I didn’t test “speak English, switch to French, switch back to English.” The first test passes. The subsequent ones don’t.

“Seems to work” is dangerous information. The initial pipeline gave the impression of working because the first message was correctly detected. The `if [ ! -s ]` condition was invisible in a quick test.

A proxy is more powerful than it looks. Inserting a translation layer between two systems lets you modify behavior without touching either component. OpenClaw doesn’t know the voice changed. speaches doesn’t know where the instruction came from. The proxy is the only point that sees both.

Foundational rules need explicit exceptions. A general instruction (“read voice.conf and respond accordingly”) loses against a foundational rule (“always French”) when both apply. The exception has to be written where the rule is — not just where it’s used.


The original article this post corrects: Local TTS — Kokoro, speaches GPU, and What Broke

Next: Arabic voice support. Kokoro has no Arabic voice — an alternative will need to be integrated. Exploration in progress.


Personal AI Agent Series — Article 1 · Article 2 · Article 3 · Article 4 · Article 5

