Voice mode: all bundled ASR models fail silently — MultiModalProcessor routing bug for nemotron_speech (RNNT) in Foundry Local Core #4024

@sylvanc

Describe the bug

/voice records audio successfully (level meter reacts, mic capture confirmed via raw PulseAudio capture) but every transcription comes back empty, for all three models offered in the /voice model picker:

  • nemotron-3.5-asr-streaming-0.6b
  • nemotron-speech-streaming-en-0.6b
  • nemotron-speech-streaming-es-0.6b

All three share the same nemotron_speech (RNNT) architecture, so switching models does not help — there is no working model in the current picker.

Root cause (traced to source)

The Foundry Local Core native audio transcription path throws on every call:

Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: MultiModalProcessor cannot be created. nemotron_speech is not a registered multi-modal model type.
   at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr)
   at Microsoft.ML.OnnxRuntimeGenAI.MultiModalProcessor..ctor(Model)
   at Microsoft.AI.Foundry.Local.AudioClient.Transcribe(String, String, Nullable`1)

This is not a stale or outdated onnxruntime-genai engine issue. I downloaded the public onnxruntime-genai v0.14.0 release (published 2026-05-29, after the "Multilingual Streaming Nemotron ASR" PR merged on 2026-05-22) and compared it against the bundled runtime (labeled 0.14.1 in deps_versions.json, which is not a publicly tagged release). Both contain full, identical NemotronSpeechModel/NemotronSpeechState support, confirmed via symbol inspection of libonnxruntime-genai.so.

Cross-checking against microsoft/onnxruntime-genai's src/models/model.cpp (https://github.com/microsoft/onnxruntime-genai/blob/main/src/models/model.cpp) confirms this is architecturally expected:

  • nemotron_speech is an RNNT-type model (ModelType::IsRNNT) and is correctly routed to NemotronSpeechModel when the model is loaded (CreateModel() in model.cpp) — this is why Model loaded successfully: nemotron-3.5-asr-streaming-0.6b-generic-cpu:3 appears in the Foundry log and the model shows as loaded.
  • However, MultiModalProcessor's constructor uses a separate, hardcoded processor_factory_ registry that only contains vision+text chat model types (e.g. phi4mm, gemma4). RNNT/TDT/ALM (Whisper) audio model types were never meant to go through MultiModalProcessor at all.

So the bug is in Microsoft.AI.Foundry.Local.Core: its AudioClient.Transcribe and streaming session path unconditionally constructs a MultiModalProcessor for audio transcription, instead of dispatching RNNT/TDT/ALM model types (nemotron_speech, Parakeet TDT, Whisper) through their own dedicated processing path (the same one used successfully at model-load time). Because Microsoft.AI.Foundry.Local.Core.so is a closed-source component, this can't be worked around or patched client-side. No onnxruntime-genai version swap fixes it either: the engine itself already works correctly and simply refuses (correctly, by its own design) to construct a MultiModalProcessor for a non-multimodal-chat model type.

Affected version

GitHub Copilot CLI 1.0.69-0 (also reproduced on 1.0.66-1 — same bundled runtime pins: onnxruntime-genai 0.14.1, foundry-local-core 1.2.3, so the bug is not CLI-version-specific).

Steps to reproduce the behavior

  1. On Linux x64 (reproduced under WSL2/WSLg, but likely affects any Linux x64 install), run /voice, enable it, and select any of the three offered models.
  2. Press the voice-record shortcut, speak clearly for several seconds (confirmed via raw parec capture that real speech is captured at healthy signal levels, and the in-app level meter visibly reacts).
  3. Stop recording.
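
For anyone repeating the signal-level check from step 2, here is a minimal Python sketch. It assumes the raw parec capture has already been converted to a 16-bit PCM WAV file; the function names are illustrative and are not part of any tool mentioned in this issue.

```python
import array
import math
import sys
import wave


def rms_dbfs(samples):
    """Return the RMS level of 16-bit signed samples in dBFS (-inf for silence)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    # 32768 is full scale for signed 16-bit audio.
    return 20 * math.log10(math.sqrt(mean_square) / 32768.0)


def read_pcm16(path):
    """Load all samples from a 16-bit PCM WAV file (assumed format)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples
```

Calling `rms_dbfs(read_pcm16("capture.wav"))` on a capture (the file name is a placeholder) gives a single level figure: a value far below full scale or `-inf` would point at the capture path rather than at transcription.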

Expected behavior

Spoken audio is transcribed to text and inserted into the input box.

Actual behavior

No error is shown to the user; the recording UI behaves normally, but the transcript is always empty. ~/.github-copilot-cli/logs/foundry.core*.log shows:

[ERR] Error executing audio_transcribe: Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: MultiModalProcessor cannot be created. nemotron_speech is not a registered multi-modal model type.
...
Audio stream stop: session <id>, final flush text: '', full transcript: ''
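
To check whether another install hits the same failure, the log lines above can be searched for mechanically. The following Python sketch assumes the log location and error wording quoted in this report; the helper names are illustrative.

```python
import glob
import os
import re

# Matches the error text quoted above and captures the rejected model type.
ERROR_RE = re.compile(
    r"MultiModalProcessor cannot be created\. "
    r"(\S+) is not a registered multi-modal model type"
)


def find_unroutable_model_types(lines):
    """Return the set of model types that MultiModalProcessor rejected."""
    found = set()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


def scan_logs(pattern="~/.github-copilot-cli/logs/foundry.core*.log"):
    """Scan every matching log file; the default path is the one from this report."""
    found = set()
    for path in glob.glob(os.path.expanduser(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            found |= find_unroutable_model_types(handle)
    return found
```

On an affected machine, `scan_logs()` would be expected to include `nemotron_speech`; an empty set means the error text was not found in the scanned files.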

Additional context

  • Confirmed mic capture and PulseAudio pipeline are fully healthy (not a WSLg/audio-hardware issue) via direct raw parec -d RDPSource capture showing strong, correctly-timed speech signal.
  • Confirmed via a standalone script using the bundled foundry-local-sdk package directly (bypassing the Copilot CLI JS entirely) that audioClient.transcribe() throws the identical MultiModalProcessor/nemotron_speech error against a real captured WAV file, ruling out any Copilot-CLI-specific bug in the JS layer.
  • Suggested fix: in Foundry Local Core's audio transcription path, dispatch by ModelType::IsRNNT/IsTDT/IsALM (mirroring CreateModel()'s dispatch in onnxruntime-genai's model.cpp) instead of unconditionally constructing a MultiModalProcessor.
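
The suggested dispatch can be illustrated with a short Python sketch. This is not Foundry Local Core or onnxruntime-genai code: the enum, the sets, and the function are hypothetical, and only the nemotron_speech type string comes from the error message above. The TDT and ALM type strings are placeholders.

```python
from enum import Enum


class AudioPath(Enum):
    RNNT = "rnnt"
    TDT = "tdt"
    ALM = "alm"
    MULTIMODAL = "multimodal"


# Only "nemotron_speech" is taken from the logged error; the other
# type strings are assumed placeholders for Parakeet TDT and Whisper.
RNNT_TYPES = frozenset({"nemotron_speech"})
TDT_TYPES = frozenset({"parakeet_tdt"})  # hypothetical type string
ALM_TYPES = frozenset({"whisper"})       # hypothetical type string


def select_audio_path(model_type):
    """Pick a processing path from the model type before building any processor."""
    if model_type in RNNT_TYPES:
        return AudioPath.RNNT
    if model_type in TDT_TYPES:
        return AudioPath.TDT
    if model_type in ALM_TYPES:
        return AudioPath.ALM
    # Only types outside the dedicated audio families fall through to the
    # MultiModalProcessor route.
    return AudioPath.MULTIMODAL
```

The point of the sketch is the ordering: the model type is checked first, so a MultiModalProcessor is only ever constructed for types that its registry can be expected to handle.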
