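voice-dictation: not a local voice input method, but a hold-to-talk chain of recording, transcription, and cursor backfill

Corresponding official documentation: Voice dictation, in claude-code-docs/docs/voice-dictation.md.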

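First, what this actually is

At its core, voice-dictation is not a local, offline speech recognizer built into Claude Code, nor is it a tap-once, always-listening mode like the one found in chat apps.

What it actually implements is a tightly integrated hold-to-talk runtime flow that spans local audio hardware, operating-system event listening, a cloud streaming transcription protocol, and real-time backfill in the terminal UI.

You can think of it as a "virtual keyboard" driven by local recording and remote real-time transcription, with the result inserted precisely at the current prompt cursor position. Its core is not a "voice setting" but the way it joins discrete stages (key repeat -> audio capture -> WebSocket transport -> interim preview backfill -> final text commit) into one continuous pipeline designed for low latency and reliability.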

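How it works

On the surface, the user enables the feature with the /voice command, but claude-code-opensource/src/commands/voice/voice.ts runs a series of pre-flight checks behind the scenes. It confirms that you are signed in with Claude.ai OAuth (currently the only supported authentication method), that a local recording dependency (such as SoX or arecord) is available, and that microphone permission has been granted.

The mechanisms that actually drive the interaction live in the following modules: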

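1. Key-driven activation and the warmup check

In claude-code-opensource/src/hooks/useVoiceIntegration.tsx, the system does not simply listen for a key press. For common character keys such as Space, it uses a warmup threshold to guard against accidental triggers: the recording state machine activates only once the key's rapid repeat events pass that threshold, and any extra spaces that have already leaked into the input box are stripped automatically. Modifier combinations such as Cmd+I or Ctrl+I, by contrast, trigger immediately.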
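
The warmup decision can be illustrated with a minimal sketch. It assumes a count-based threshold, as the source excerpt later in this article suggests. Every name here (WARMUP_THRESHOLD_SKETCH, WarmupState, handleBareCharRepeat) and the threshold value of 2 are hypothetical and are not taken from useVoiceIntegration.tsx.

```typescript
// Illustrative only: names and the threshold value are assumptions,
// not the constants used in useVoiceIntegration.tsx.
const WARMUP_THRESHOLD_SKETCH = 2;

interface WarmupState {
  seen: number;         // repeat events observed during this hold
  charsInInput: number; // characters allowed through to the text input
}

type WarmupAction = "passthrough" | "swallow";

// Decide what to do with a batch of repeated bare-char key events.
function handleBareCharRepeat(
  state: WarmupState,
  repeatCount: number,
): WarmupAction {
  // Compare against the count *before* this event, so the event that
  // crosses the threshold still reaches the text input.
  const countBefore = state.seen;
  state.seen += repeatCount;
  if (countBefore >= WARMUP_THRESHOLD_SKETCH) {
    return "swallow";
  }
  state.charsInInput += repeatCount;
  return "passthrough";
}
```

In this sketch a single press always types normally, and only a sustained hold reaches the "swallow" branch, which is where a real implementation would stop propagation and strip leaked characters.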

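2. Cursor anchoring and context preservation

Rather than simply appending text, useVoiceIntegration.tsx records the text before and after the cursor (the prefix and suffix) at the moment recording starts. The transcript can therefore be inserted into the middle of existing text, as in an editor's insert mode, instead of always landing at the end.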
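
As a rough illustration of the prefix/suffix idea, the sketch below captures an anchor when recording starts and rebuilds the input around each transcript update. The names Anchor, captureAnchor, and renderWithTranscript are hypothetical and do not come from the source.

```typescript
// Illustrative sketch; these names are assumptions, not the hook's API.
interface Anchor {
  prefix: string; // text before the cursor when recording started
  suffix: string; // text after the cursor when recording started
}

// Snapshot the input around the cursor at the moment recording begins.
function captureAnchor(text: string, cursor: number): Anchor {
  return { prefix: text.slice(0, cursor), suffix: text.slice(cursor) };
}

// Rebuild the input from the fixed anchor plus the latest transcript.
// Because the anchor never changes during a session, each interim
// update replaces the previous one instead of accumulating.
function renderWithTranscript(
  anchor: Anchor,
  transcript: string,
): { text: string; cursor: number } {
  return {
    text: anchor.prefix + transcript + anchor.suffix,
    cursor: anchor.prefix.length + transcript.length,
  };
}
```

For example, with the input "fix the bug" and the cursor at index 4, a transcript of "parser " yields "fix parser the bug" with the cursor placed just after the inserted text.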

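3. Record first, connect later: minimizing perceived latency

In claude-code-opensource/src/hooks/useVoice.ts, Claude Code starts recording before the network is ready. As soon as the user triggers recording, the local recording module begins capturing while the remote WebSocket connection is opened asynchronously. Audio captured before the handshake completes is buffered in memory and sent as soon as the connection succeeds. The design is intended to keep the first few words the user speaks from being lost to network latency.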
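
The buffering step described above can be sketched as a small class. This is an assumption-laden illustration: BufferedAudioSender and its methods are hypothetical names, and the real hook manages considerably more state (session lifecycle, errors, finalization).

```typescript
// Illustrative sketch of "record first, connect later"; not the real hook.
class BufferedAudioSender {
  private pending: Uint8Array[] = [];
  private send: ((chunk: Uint8Array) => void) | null = null;

  // Called by the recorder for every captured chunk.
  onAudioChunk(chunk: Uint8Array): void {
    if (this.send) {
      this.send(chunk); // connection is up: stream directly
    } else {
      this.pending.push(chunk); // still connecting: hold in memory
    }
  }

  // Called once the WebSocket handshake has completed.
  onConnected(send: (chunk: Uint8Array) => void): void {
    this.send = send;
    // Flush buffered audio in capture order before any new chunk.
    for (const chunk of this.pending) {
      send(chunk);
    }
    this.pending = [];
  }
}
```

Flushing in capture order matters here: a streaming transcriber consumes the audio as one continuous signal, so reordered chunks would corrupt the opening words rather than save them.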

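4. Priority fallback for local capture

claude-code-opensource/src/services/voice.ts adapts to the host environment. It first tries the built-in native audio module; if that fails (common on Linux), it falls back, in order, to arecord and then SoX (rec).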
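
A priority fallback of this kind might look like the following sketch. The CaptureBackend shape and pickCaptureBackend are hypothetical, and the availability probes are stand-ins; the real module's detection logic is not shown in this article.

```typescript
// Illustrative sketch; the backend shape and probes are assumptions.
interface CaptureBackend {
  name: string;
  isAvailable: () => boolean;
}

// Return the name of the first usable backend, in priority order.
function pickCaptureBackend(backends: CaptureBackend[]): string | null {
  for (const backend of backends) {
    try {
      if (backend.isAvailable()) return backend.name;
    } catch {
      // A probe that throws is treated as unavailable.
    }
  }
  return null;
}

// Example priority order matching the text: native, then arecord, then SoX.
const exampleBackends: CaptureBackend[] = [
  { name: "native", isAvailable: () => { throw new Error("module failed to load"); } },
  { name: "arecord", isAvailable: () => false },
  { name: "sox", isAvailable: () => true },
];
```

With the example list above, pickCaptureBackend(exampleBackends) skips the failing native probe and the missing arecord and selects "sox".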

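5. Cloud transcription and keyterm hints

Transcription is handled by claude-code-opensource/src/services/voiceStreamSTT.ts, which sends the PCM audio stream over WebSocket to /api/ws/speech_to_text/voice_stream. To address poor recognition of technical terms (such as specific code variable names or Git branch names), claude-code-opensource/src/services/voiceKeyterms.ts automatically extracts project context early in the connection and supplies it as keyterm hints. This is why it tends to recognize terms such as regex, JSON, or your project-specific paths better than a general-purpose STT service without such hints.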
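
To show how an identifier might be turned into keyterm candidates, here is a minimal sketch. It follows the documented behavior of splitIdentifier in the excerpt below (split camelCase, kebab-case, snake_case, and path segments, then drop fragments of 2 characters or fewer), but only the first replace call appears in that excerpt. The rest of the body, and the name splitIdentifierSketch, are assumptions.

```typescript
// Illustrative sketch; only the camelCase replace is taken from the
// excerpt. The separator set and filter are assumptions.
function splitIdentifierSketch(name: string): string[] {
  return name
    // Insert a space at each lower-to-upper boundary: useVoice -> use Voice
    .replace(/([a-z])([A-Z])/g, "$1 $2")
    // Split on whitespace, hyphens, underscores, dots, and slashes.
    .split(/[\s\-_./]+/)
    // Discard fragments of 2 characters or fewer to reduce noise.
    .filter((word) => word.length > 2);
}
```

For instance, splitIdentifierSketch("useVoiceIntegration") returns ["use", "Voice", "Integration"], and a path such as "src/services/voice_keyterms.ts" yields ["src", "services", "voice", "keyterms"], with the 2-character "ts" fragment dropped.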

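Boundary conditions

Not local: transcription depends entirely on Anthropic's cloud service, so you must be online, and only OAuth login is supported.
Environment limits: in a remote SSH session, the host machine's microphone cannot be used directly unless the audio stream is forwarded by some other means.
Key-repeat dependency: the stability of hold-to-talk depends heavily on how the terminal emulator handles key-repeat events. If the terminal intercepts repeats from a held key, the feature will not work properly.
Privacy boundary: the audio stream is uploaded in real time while recording. Although Claude Code states that it does not store this audio for training, keep in mind that this is a cloud transmission channel when handling highly sensitive business information.
Fallback behavior: if an unsupported language is detected, the system falls back to English transcription instead of raising an error.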

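What to read next

If you are interested in how keys are bound to voice:pushToTalk, dig into the Keybindings system.
If you want to see how transcripts get their dimmed preview rendering in the terminal UI, look at how the TextInput component renders voiceInterimRange.
For how voice input leads into subsequent task execution, see UserPromptSubmit: the last trim before input is submitted.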

Source anchors

claude-code-opensource/src/commands/voice/voice.ts: /voice command entry point and environment pre-flight checks.

📄 src/commands/voice/voice.ts: `/voice` command entry point and environment pre-flight checks. L14-43 of 151

typescript

const LANG_HINT_MAX_SHOWS = 2

export const call: LocalCommandCall = async () => {
  // Check auth and kill-switch before allowing voice mode
  if (!isVoiceModeEnabled()) {
    // Differentiate: OAuth-less users get an auth hint, everyone else
    // gets nothing (command shouldn't be reachable when the kill-switch is on).
    if (!isAnthropicAuthEnabled()) {
      return {
        type: 'text' as const,
        value:
          'Voice mode requires a Claude.ai account. Please run /login to sign in.',
      }
    }
    return {
      type: 'text' as const,
      value: 'Voice mode is not available.',
    }
  }

  const currentSettings = getInitialSettings()
  const isCurrentlyEnabled = currentSettings.voiceEnabled === true

  // Toggle OFF — no checks needed
  if (isCurrentlyEnabled) {
    const result = updateSettingsForSource('userSettings', {
      voiceEnabled: false,
    })
    if (result.error) {
      return {

claude-code-opensource/src/hooks/useVoiceIntegration.tsx: key-hold detection, the warmup check, and cursor backfill logic.

📄 src/hooks/useVoiceIntegration.tsx: key-hold detection, the warmup check, and cursor backfill logic. L603-620 of 677

tsx

    // ── Warmup (bare-char only; modifier combos activated above) ──
    // First WARMUP_THRESHOLD chars flow to the text input so normal
    // typing has zero latency (a single press types normally).
    // Subsequent rapid chars are swallowed so the input stays aligned
    // with the warmup UI. Strip defensively (listener order is not
    // guaranteed — text input may have already added the char). The
    // floor preserves the intentional warmup chars; the strip is a
    // no-op when nothing leaked. Check countBefore so the event that
    // crosses the threshold still flows through (terminal batching).
    if (countBefore >= WARMUP_THRESHOLD) {
      e.stopImmediatePropagation();
      stripTrailing(repeatCount, {
        char: bareChar,
        floor: charsInInputRef.current
      });
    } else {
      charsInInputRef.current += repeatCount;
    }

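claude-code-opensource/src/hooks/useVoice.ts: recording-session state machine, including the "record first, connect later" buffering logic.

📄 src/hooks/useVoice.ts: recording-session state machine, including the "record first, connect later" buffering logic. L19-23 of 1145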

typescript

  type FinalizeSource,
  isVoiceStreamAvailable,
  type VoiceStreamConnection,
} from '../services/voiceStreamSTT.js'
import { logForDebugging } from '../utils/debug.js'

claude-code-opensource/src/services/voice.ts: multi-backend audio capture adapter layer.

📄 src/services/voice.ts: multi-backend audio capture adapter layer. L20-36 of 526

typescript

type AudioNapi = typeof import('audio-capture-napi')
let audioNapi: AudioNapi | null = null
let audioNapiPromise: Promise<AudioNapi> | null = null

function loadAudioNapi(): Promise<AudioNapi> {
  audioNapiPromise ??= (async () => {
    const t0 = Date.now()
    const mod = await import('audio-capture-napi')
    // vendor/audio-capture-src/index.ts defers require(...node) until the
    // first function call — trigger it here so timing reflects real cost.
    mod.isNativeAudioAvailable()
    audioNapi = mod
    logForDebugging(`[voice] audio-capture-napi loaded in ${Date.now() - t0}ms`)
    return mod
  })()
  return audioNapiPromise
}

claude-code-opensource/src/services/voiceStreamSTT.ts: WebSocket transcription protocol implementation.

📄 src/services/voiceStreamSTT.ts: WebSocket transcription protocol implementation. L5-14 of 545

typescript

// Connects to Anthropic's voice_stream WebSocket endpoint using the same
// OAuth credentials as Claude Code.  The endpoint uses conversation_engine
// backed models for speech-to-text.  Designed for hold-to-talk: hold the
// keybinding to record, release to stop and submit.
//
// The wire protocol uses JSON control messages (KeepAlive, CloseStream) and
// binary audio frames.  The server responds with TranscriptText and
// TranscriptEndpoint JSON messages.

import type { ClientRequest, IncomingMessage } from 'http'

claude-code-opensource/src/services/voiceKeyterms.ts: extraction and injection of project-context keyterms.

📄 src/services/voiceKeyterms.ts: extraction and injection of project-context keyterms. L13-42 of 107

typescript

const GLOBAL_KEYTERMS: readonly string[] = [
  // Terms Deepgram consistently mangles without keyword hints.
  // Note: "Claude" and "Anthropic" are already server-side base keyterms.
  // Avoid terms nobody speaks aloud as-spelled (stdout → "standard out").
  'MCP',
  'symlink',
  'grep',
  'regex',
  'localhost',
  'codebase',
  'TypeScript',
  'JSON',
  'OAuth',
  'webhook',
  'gRPC',
  'dotfiles',
  'subagent',
  'worktree',
]

// ─── Helpers ────────────────────────────────────────────────────────

/**
 * Split an identifier (camelCase, PascalCase, kebab-case, snake_case, or
 * path segments) into individual words.  Fragments of 2 chars or fewer are
 * discarded to avoid noise.
 */
export function splitIdentifier(name: string): string[] {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')

claude-code-opensource/src/voice/voiceModeEnabled.ts: feature switch and feature-gate checks.

📄 src/voice/voiceModeEnabled.ts: feature switch and feature-gate checks. L21-32 of 55

typescript

    ? !getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_quartz_disabled', false)
    : false
}

/**
 * Auth-only check for voice mode. Returns true when the user has a valid
 * Anthropic OAuth token. Backed by the memoized getClaudeAIOAuthTokens —
 * first call spawns `security` on macOS (~20-50ms), subsequent calls are
 * cache hits. The memoize clears on token refresh (~once/hour), so one
 * cold spawn per refresh is expected. Cheap enough for usage-time checks.
 */
export function hasVoiceAuth(): boolean {
