Voice mode: all bundled ASR models fail silently — MultiModalProcessor routing bug for nemotron_speech (RNNT) in Foundry Local Core

### Describe the bug

`/voice` records audio successfully (level meter reacts, mic capture confirmed via raw PulseAudio capture) but **every transcription comes back empty**, for **all three** models offered in the `/voice` model picker:
- `nemotron-3.5-asr-streaming-0.6b`
- `nemotron-speech-streaming-en-0.6b`
- `nemotron-speech-streaming-es-0.6b`

All three share the same `nemotron_speech` (RNNT) architecture, so switching models does not help — there is no working model in the current picker.

### Root cause (traced to source)

The Foundry Local Core native audio transcription path throws on every call:

```
Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: MultiModalProcessor cannot be created. nemotron_speech is not a registered multi-modal model type.
   at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr)
   at Microsoft.ML.OnnxRuntimeGenAI.MultiModalProcessor..ctor(Model)
   at Microsoft.AI.Foundry.Local.AudioClient.Transcribe(String, String, Nullable`1)
```

This is **not** a stale/outdated `onnxruntime-genai` engine issue. I downloaded and compared the public `onnxruntime-genai` v0.14.0 release (published 2026-05-29, after the "Multilingual Streaming Nemotron ASR" PR merged 2026-05-22) against the bundled runtime (labeled `0.14.1` in `deps_versions.json`, but not a publicly tagged release). Both contain full, identical `NemotronSpeechModel`/`NemotronSpeechState` support (confirmed via symbol inspection of `libonnxruntime-genai.so`).

Cross-checking against `microsoft/onnxruntime-genai`'s `src/models/model.cpp` (https://github.com/microsoft/onnxruntime-genai/blob/main/src/models/model.cpp) confirms this is architecturally expected:

- `nemotron_speech` is an **RNNT-type** model (`ModelType::IsRNNT`) and is correctly routed to `NemotronSpeechModel` when the model is **loaded** (`CreateModel()` in `model.cpp`) — this is why `Model loaded successfully: nemotron-3.5-asr-streaming-0.6b-generic-cpu:3` appears in the Foundry log and the model shows as loaded.
- However, `MultiModalProcessor`'s constructor uses a **separate, hardcoded `processor_factory_` registry** that only contains vision+text chat model types (e.g. `phi4mm`, `gemma4`). RNNT/TDT/ALM (Whisper) audio model types were never meant to go through `MultiModalProcessor` at all.

So the bug is that **`Microsoft.AI.Foundry.Local.Core`'s `AudioClient.Transcribe`/streaming session path unconditionally constructs a `MultiModalProcessor` for audio transcription**, instead of dispatching RNNT/TDT/ALM model types (`nemotron_speech`, Parakeet TDT, Whisper) through their own dedicated processing path (the same one used successfully at model-load time). Because `Microsoft.AI.Foundry.Local.Core.so` is a closed-source component, this can't be worked around or patched client-side. No `onnxruntime-genai` version swap fixes it either: the engine itself already works correctly and simply refuses (correctly, by its own design) to construct a `MultiModalProcessor` for a non-multimodal-chat model type.
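
The mismatch can be modelled in a few lines. This is a toy Python sketch, not the `onnxruntime-genai` or Foundry Local Core source: the two tables and every function name in it (`MODEL_FACTORIES`, `PROCESSOR_FACTORIES`, `create_model`, `create_multimodal_processor`, `transcribe_unconditionally`) are hypothetical stand-ins. Only the model type strings and the error text are taken from this report.

```python
# Toy model of the two independent lookups described above.
# All table and function names are hypothetical; only the model type
# strings and the error message come from the report.

# Lookup used when a model is loaded: it knows about nemotron_speech.
MODEL_FACTORIES = {
    "phi4mm": "multimodal chat model",
    "gemma4": "multimodal chat model",
    "nemotron_speech": "NemotronSpeechModel",
}

# Separate lookup used by the processor constructor: chat types only.
PROCESSOR_FACTORIES = {
    "phi4mm": "phi4mm processor",
    "gemma4": "gemma4 processor",
}


def create_model(model_type: str) -> str:
    """Stand-in for model loading; succeeds for nemotron_speech."""
    return MODEL_FACTORIES[model_type]


def create_multimodal_processor(model_type: str) -> str:
    """Stand-in for the processor constructor; rejects unregistered types."""
    if model_type not in PROCESSOR_FACTORIES:
        raise RuntimeError(
            "MultiModalProcessor cannot be created. "
            f"{model_type} is not a registered multi-modal model type."
        )
    return PROCESSOR_FACTORIES[model_type]


def transcribe_unconditionally(model_type: str) -> str:
    """Mimics the reported behavior: always build a processor first."""
    create_model(model_type)  # load step succeeds
    return create_multimodal_processor(model_type)  # this step raises
```

In this model, `create_model("nemotron_speech")` succeeds while `transcribe_unconditionally("nemotron_speech")` raises, which matches the "model loaded, transcription fails" pattern in the logs.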

### Affected version

GitHub Copilot CLI 1.0.69-0 (also reproduced on 1.0.66-1 — same bundled runtime pins: `onnxruntime-genai` 0.14.1, `foundry-local-core` 1.2.3, so the bug is not CLI-version-specific).

### Steps to reproduce the behavior

1. On Linux x64 (reproduced under WSL2/WSLg, but likely affects any Linux x64 install), run `/voice`, enable it, and select any of the three offered models.
2. Press the voice-record shortcut, speak clearly for several seconds (confirmed via raw `parec` capture that real speech is captured at healthy signal levels, and the in-app level meter visibly reacts).
3. Stop recording.
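
To rule out a silent recording when reproducing step 2, the signal level of a capture can be checked offline. The following is a minimal sketch, assuming the capture was saved as a 16-bit PCM WAV file; the function name `wav_levels` is hypothetical, and what counts as a "healthy" level is left to the reader.

```python
import array
import math
import sys
import wave


def wav_levels(path: str) -> tuple[float, float]:
    """Return (peak, rms) of a 16-bit PCM WAV file, normalized to 0..1."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw = wf.readframes(wf.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    if not samples:
        return 0.0, 0.0

    peak = max(abs(s) for s in samples) / 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    return peak, rms
```

A peak and RMS near zero would point at the capture pipeline; clearly non-zero values together with an empty transcript point at the transcription path instead.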

### Expected behavior

Spoken audio is transcribed to text and inserted into the input box.

### Actual behavior

No error is shown to the user; the recording UI behaves normally, but the transcript is always empty. `~/.github-copilot-cli/logs/foundry.core*.log` shows:
```
[ERR] Error executing audio_transcribe: Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: MultiModalProcessor cannot be created. nemotron_speech is not a registered multi-modal model type.
...
Audio stream stop: session <id>, final flush text: '', full transcript: ''
```
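
Because nothing is surfaced in the UI, the log is the only place the failure is visible. The sketch below scans log lines for the two patterns shown above. It assumes the log line shapes match the excerpt exactly; the regular expressions and the names `summarize` and `scan_files` are hypothetical helpers, not part of any Copilot CLI or Foundry tooling.

```python
import glob
import os
import re

# Patterns modelled on the two log lines quoted in this report.
ERROR_RE = re.compile(
    r"\[ERR\] Error executing audio_transcribe: (?P<exc>[\w.]+): (?P<msg>.*)"
)
STOP_RE = re.compile(
    r"Audio stream stop: session (?P<session>[^,]+), "
    r"final flush text: '(?P<flush>.*?)', full transcript: '(?P<full>.*)'"
)


def summarize(lines):
    """Collect transcription error messages and count empty transcripts."""
    errors = []
    empty_transcripts = 0
    for line in lines:
        err = ERROR_RE.search(line)
        if err:
            errors.append(err.group("msg").strip())
            continue
        stop = STOP_RE.search(line)
        if stop and stop.group("full") == "":
            empty_transcripts += 1
    return {"errors": errors, "empty_transcripts": empty_transcripts}


def scan_files(pattern="~/.github-copilot-cli/logs/foundry.core*.log"):
    """Run summarize() over every log file matching the glob pattern."""
    lines = []
    for path in sorted(glob.glob(os.path.expanduser(pattern))):
        with open(path, encoding="utf-8", errors="replace") as fh:
            lines.extend(fh)
    return summarize(lines)
```

Calling `scan_files()` with the default pattern reads the log location given above; `summarize` can also be fed lines pasted from a single session.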

### Additional context

- Confirmed mic capture and PulseAudio pipeline are fully healthy (not a WSLg/audio-hardware issue) via direct raw `parec -d RDPSource` capture showing strong, correctly-timed speech signal.
- Confirmed via a standalone script using the bundled `foundry-local-sdk` package directly (bypassing the Copilot CLI JS entirely) that `audioClient.transcribe()` throws the identical `MultiModalProcessor`/`nemotron_speech` error against a real captured WAV file, ruling out any Copilot-CLI-specific bug in the JS layer.
- Suggested fix: in Foundry Local Core's audio transcription path, dispatch by `ModelType::IsRNNT/IsTDT/IsALM` (mirroring `CreateModel()`'s dispatch in `onnxruntime-genai`'s `model.cpp`) instead of unconditionally constructing a `MultiModalProcessor`.
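
The suggested fix amounts to choosing the processing path from the model family before any processor is constructed. Since Foundry Local Core is closed source, the following is only an illustrative Python sketch of that decision: `Family`, `SPEECH_FAMILIES`, and `select_audio_path` are hypothetical names, and the real fix would live in the native transcription path.

```python
from enum import Enum, auto


class Family(Enum):
    """Hypothetical model families, mirroring the IsRNNT/IsTDT/IsALM split."""
    RNNT = auto()             # e.g. nemotron_speech
    TDT = auto()              # e.g. Parakeet TDT
    ALM = auto()              # e.g. Whisper
    MULTIMODAL_CHAT = auto()  # e.g. phi4mm, gemma4


# Families that should bypass MultiModalProcessor entirely.
SPEECH_FAMILIES = frozenset({Family.RNNT, Family.TDT, Family.ALM})


def select_audio_path(family: Family) -> str:
    """Pick a processing path by family instead of always using one."""
    if family in SPEECH_FAMILIES:
        return "dedicated_speech_path"
    return "multimodal_processor"
```

With this shape, `select_audio_path(Family.RNNT)` never reaches the multimodal registry, so the "not a registered multi-modal model type" error cannot occur for speech models.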
