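Source Code Analysis Handbook

voice-dictation: not a local voice input method, but a hold-to-talk chain of recording, transcription, and cursor backfill

Corresponding official documentation: Voice dictation, in claude-code-docs/docs/voice-dictation.md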

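First, what this actually is

At its core, voice-dictation is not an offline speech recognizer built into Claude Code, nor is it a chat-app-style mode where one click starts continuous listening.

What it actually implements is a tightly integrated hold-to-talk runtime flow that spans local audio hardware, operating-system key event handling, a cloud streaming transcription protocol, and live backfill into the terminal UI.

You can think of it as a "virtual keyboard" that is driven by local recording, transcribed remotely in real time, and whose output is inserted at the current prompt cursor position. The core problem is not "voice settings" but connecting several discrete stages (key repeat -> audio capture -> WebSocket transport -> interim preview backfill -> final text commit) into one continuous, low-latency, reliable pipeline.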

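How it works

On the surface, the user enables the feature with the /voice command, but claude-code-opensource/src/commands/voice/voice.ts runs a series of pre-flight checks behind the scenes. It confirms that you are signed in through Claude.ai OAuth (currently the only supported authentication method), that a local recording dependency (such as SoX or arecord) is available, and that microphone permission has been granted.

The mechanisms that actually drive the interaction live in the following modules: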

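1. Key-driven activation and the warmup check

In claude-code-opensource/src/hooks/useVoiceIntegration.tsx, the system does not simply listen for a key press. For a common character key such as Space, it applies a warmup threshold to avoid accidental triggers: the first few characters flow into the input as normal typing, and the recording state machine is activated only once rapid key-repeat events cross the threshold, at which point the surplus spaces that leaked into the input box are stripped. Modifier combinations such as Cmd+I or Ctrl+I, by contrast, trigger immediately.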
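The warmup rule described above can be sketched as a small counter. This is a minimal illustration, not the code in useVoiceIntegration.tsx: the class name WarmupTracker, the threshold of 2, and the 120 ms "rapid" gap are all assumptions chosen for the example.

```typescript
// Hypothetical sketch: names and numeric values are illustrative assumptions,
// not taken from useVoiceIntegration.tsx.
const WARMUP_THRESHOLD = 2; // chars allowed through before swallowing (assumed)
const RAPID_GAP_MS = 120;   // max gap between events of one hold (assumed)

type Verdict = 'pass' | 'swallow';

class WarmupTracker {
  private count = 0;
  private lastAt = Number.NEGATIVE_INFINITY;

  /** Feed one key event (possibly a batch of repeats) and decide its fate. */
  onKey(nowMs: number, repeatCount = 1): Verdict {
    // A long pause means the previous hold ended, so counting starts over.
    if (nowMs - this.lastAt > RAPID_GAP_MS) this.count = 0;
    this.lastAt = nowMs;

    const countBefore = this.count;
    this.count += repeatCount;
    // Compare the count *before* this event, so the event that crosses the
    // threshold still reaches the text input.
    return countBefore >= WARMUP_THRESHOLD ? 'swallow' : 'pass';
  }
}
```

A single tap therefore types a normal character, while a held key starts being swallowed once the threshold has been reached.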

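2. Cursor anchoring and context preservation

Rather than simply appending text, useVoiceIntegration.tsx records the text before and after the cursor (the prefix and suffix) at the moment recording starts. The transcription can then be inserted in the middle of existing text, as in insert mode, instead of only being appended at the end.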
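A minimal sketch of the prefix/suffix idea, assuming a plain string input and a numeric cursor offset. The names VoiceAnchor, captureAnchor, and renderWithTranscript are hypothetical and do not come from the source.

```typescript
// Hypothetical sketch of prefix/suffix anchoring; all names are illustrative.
interface VoiceAnchor {
  prefix: string; // text before the cursor when recording started
  suffix: string; // text after the cursor when recording started
}

/** Snapshot the input around the cursor at the moment recording begins. */
function captureAnchor(text: string, cursor: number): VoiceAnchor {
  const at = Math.max(0, Math.min(cursor, text.length));
  return { prefix: text.slice(0, at), suffix: text.slice(at) };
}

/** Rebuild the input with the current transcript spliced in at the anchor. */
function renderWithTranscript(
  anchor: VoiceAnchor,
  transcript: string,
): { text: string; cursor: number } {
  return {
    text: anchor.prefix + transcript + anchor.suffix,
    // Place the cursor right after the inserted transcript.
    cursor: anchor.prefix.length + transcript.length,
  };
}
```

Because the anchor is captured once, every interim update can be re-rendered from the same prefix and suffix without accumulating stale preview text. Whitespace handling around the insertion point is left out of this sketch.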

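3. Record first, connect later: minimizing latency

In claude-code-opensource/src/hooks/useVoice.ts, Claude Code uses a head-start strategy. As soon as the user triggers recording, the local recording module starts capturing immediately while the remote WebSocket connection is opened asynchronously. Audio captured before the network handshake completes is buffered in memory and sent as soon as the connection succeeds. This ensures that the first few words the user speaks are not lost to network latency.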
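The record-first behaviour can be sketched as a small buffer that sits between the recorder and the connection. This is an illustration under assumptions: PreConnectBuffer and AudioSink are hypothetical names, and the real hook may structure its buffering differently.

```typescript
// Hypothetical sketch of "record first, connect later"; names are illustrative.
interface AudioSink {
  send(chunk: Uint8Array): void;
}

class PreConnectBuffer {
  private pending: Uint8Array[] = [];
  private sink: AudioSink | null = null;

  /** Called for every chunk the recorder produces. */
  push(chunk: Uint8Array): void {
    if (this.sink) {
      this.sink.send(chunk); // connection is up: stream directly
    } else {
      this.pending.push(chunk); // still connecting: hold in memory
    }
  }

  /** Called once the connection is ready; flushes the backlog in order. */
  attach(sink: AudioSink): void {
    this.sink = sink;
    for (const chunk of this.pending) sink.send(chunk);
    this.pending = [];
  }
}
```

The recorder calls push from the first audio chunk onward, and the connection code calls attach when the handshake completes, so chunk order is preserved across the transition.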

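4. Prioritized fallback for local capture

claude-code-opensource/src/services/voice.ts adapts to the host environment. It first tries the bundled native audio module; if that fails (common on Linux), it falls back in order to arecord and then SoX (rec).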
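An ordered fallback of this kind can be sketched as a loop over availability probes. The probe interface and pickBackend function below are hypothetical, and the probes are synchronous only to keep the example short; real availability checks would likely be asynchronous.

```typescript
// Hypothetical sketch of an ordered backend fallback; the probes are stand-ins
// supplied by the caller, not the checks used in services/voice.ts.
type Backend = 'native' | 'arecord' | 'sox';

interface BackendProbe {
  name: Backend;
  isAvailable: () => boolean;
}

/** Return the first backend whose probe succeeds, or null if none does. */
function pickBackend(probes: readonly BackendProbe[]): Backend | null {
  for (const probe of probes) {
    try {
      if (probe.isAvailable()) return probe.name;
    } catch {
      // A probe that throws is treated as unavailable; try the next one.
    }
  }
  return null; // no usable recorder: the caller should report a clear error
}
```

Listing the probes in priority order (native, then arecord, then sox) reproduces the order described above.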

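5. Cloud transcription and keyterm hints

Transcription is handled by claude-code-opensource/src/services/voiceStreamSTT.ts, which streams PCM audio over a WebSocket to /api/ws/speech_to_text/voice_stream. To address poor recognition of technical terms (such as code variable names or Git branch names), claude-code-opensource/src/services/voiceKeyterms.ts automatically extracts project context early in the connection and supplies it as keyterm hints. This is why terms such as regex, JSON, or your project-specific paths tend to be recognized far more reliably than with a general-purpose STT service.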
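To make the keyterm idea concrete, here is a simplified sketch of turning identifiers into hint words. It is not the splitIdentifier implementation from voiceKeyterms.ts, of which only the first lines are visible in this article: splitWords, buildKeyterms, the separator set, and the limit of 50 are assumptions for illustration.

```typescript
// Hypothetical sketch of keyterm extraction; function names, separators and
// the default limit are assumptions, not the logic in voiceKeyterms.ts.

/** Split an identifier or path into words, dropping fragments of <= 2 chars. */
function splitWords(name: string): string[] {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2') // camelCase / PascalCase boundary
    .split(/[-_./\s]+/)                  // kebab, snake, path and space separators
    .filter((word) => word.length > 2);
}

/** Merge fixed global terms with words derived from project identifiers. */
function buildKeyterms(
  globalTerms: readonly string[],
  identifiers: readonly string[],
  limit = 50,
): string[] {
  const terms = new Set<string>(globalTerms);
  for (const id of identifiers) {
    for (const word of splitWords(id)) {
      if (terms.size >= limit) break; // keep the hint list bounded
      terms.add(word);
    }
  }
  return Array.from(terms);
}
```

Feeding in a branch name or file name yields speakable fragments that can be sent alongside the fixed global list.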

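Boundary conditions

  • Not local: transcription depends entirely on Anthropic's cloud service, so you must be online, and only OAuth login is supported.
  • Environment limits: in a remote SSH session, the microphone on your local machine cannot be used directly unless the audio stream is forwarded by some other means.
  • Key-repeat dependency: the stability of hold-to-talk depends heavily on how the terminal emulator handles key-repeat events. If the terminal intercepts long-press repeats, the feature will not work properly.
  • Privacy boundary: audio is uploaded in real time while recording. Although Claude Code states that this audio is not stored for training, keep in mind when handling highly sensitive trade secrets that this is a cloud transmission channel.
  • Fallback: if an unsupported language is detected, the system falls back to English transcription instead of raising an error.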
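The language fallback in the last item can be sketched as a simple lookup. The supported-language set below is a placeholder invented for the example, and resolveDictationLanguage is a hypothetical name; the actual list and logic are not shown in this article.

```typescript
// Hypothetical sketch of falling back to English; the language set is a
// placeholder for illustration, not the list Claude Code actually supports.
const SUPPORTED_LANGUAGES = new Set(['en', 'es', 'fr', 'de', 'ja']);
const FALLBACK_LANGUAGE = 'en';

function resolveDictationLanguage(
  requested: string,
): { code: string; fellBack: boolean } {
  // Reduce tags such as "pt-BR" or "EN_us" to the base language code.
  const [base = ''] = requested.trim().toLowerCase().split(/[-_]/);
  if (SUPPORTED_LANGUAGES.has(base)) return { code: base, fellBack: false };
  // Unknown language: use English rather than raising an error.
  return { code: FALLBACK_LANGUAGE, fellBack: true };
}
```

Returning a fellBack flag lets the caller show a one-time hint instead of failing silently.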

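What to read next

  • If you are interested in how keys are bound to voice:pushToTalk, dig into the Keybindings system.
  • If you want to see how interim transcription results get their "semi-transparent preview" effect in the terminal UI, look at how the TextInput component renders voiceInterimRange.
  • For how voice input triggers subsequent task execution, see "UserPromptSubmit: the final trim before input is submitted".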
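As background for the interim preview mentioned above, the sketch below shows one way a preview range could be derived from streamed transcript messages. The message type names TranscriptText and TranscriptEndpoint appear in the voiceStreamSTT.ts header comment, but everything else is an assumption: the data payload field, the idea that TranscriptText replaces the current interim text, and the helper names are all hypothetical.

```typescript
// Hypothetical sketch: payload shape and semantics are assumptions; only the
// two message type names come from the voiceStreamSTT.ts header comment.
type ServerMessage =
  | { type: 'TranscriptText'; data: string } // assumed: latest interim text
  | { type: 'TranscriptEndpoint' };          // assumed: interim text is final

interface PreviewState {
  committed: string; // finalized transcript so far
  interim: string;   // text still subject to revision
}

function joinWords(a: string, b: string): string {
  if (a === '') return b;
  if (b === '') return a;
  return `${a} ${b}`;
}

function applyMessage(state: PreviewState, msg: ServerMessage): PreviewState {
  switch (msg.type) {
    case 'TranscriptText':
      return { committed: state.committed, interim: msg.data };
    case 'TranscriptEndpoint':
      return { committed: joinWords(state.committed, state.interim), interim: '' };
  }
}

/** Character range of the interim text within the input, for dimmed rendering. */
function interimRange(
  prefixLength: number,
  state: PreviewState,
): { start: number; end: number } | null {
  if (state.interim === '') return null;
  const separator = state.committed === '' ? 0 : 1;
  const start = prefixLength + state.committed.length + separator;
  return { start, end: start + state.interim.length };
}
```

A renderer could then style the characters inside the returned range differently from committed text.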

Source anchors

  • claude-code-opensource/src/commands/voice/voice.ts: entry point for the /voice command and environment pre-flight checks. Excerpt (TypeScript), lines 14-43 of 151:
const LANG_HINT_MAX_SHOWS = 2

export const call: LocalCommandCall = async () => {
  // Check auth and kill-switch before allowing voice mode
  if (!isVoiceModeEnabled()) {
    // Differentiate: OAuth-less users get an auth hint, everyone else
    // gets nothing (command shouldn't be reachable when the kill-switch is on).
    if (!isAnthropicAuthEnabled()) {
      return {
        type: 'text' as const,
        value:
          'Voice mode requires a Claude.ai account. Please run /login to sign in.',
      }
    }
    return {
      type: 'text' as const,
      value: 'Voice mode is not available.',
    }
  }

  const currentSettings = getInitialSettings()
  const isCurrentlyEnabled = currentSettings.voiceEnabled === true

  // Toggle OFF — no checks needed
  if (isCurrentlyEnabled) {
    const result = updateSettingsForSource('userSettings', {
      voiceEnabled: false,
    })
    if (result.error) {
      return {
  • claude-code-opensource/src/hooks/useVoiceIntegration.tsx: key hold detection, the warmup check, and cursor backfill logic. Excerpt (TSX), lines 603-620 of 677:
    // ── Warmup (bare-char only; modifier combos activated above) ──
    // First WARMUP_THRESHOLD chars flow to the text input so normal
    // typing has zero latency (a single press types normally).
    // Subsequent rapid chars are swallowed so the input stays aligned
    // with the warmup UI. Strip defensively (listener order is not
    // guaranteed — text input may have already added the char). The
    // floor preserves the intentional warmup chars; the strip is a
    // no-op when nothing leaked. Check countBefore so the event that
    // crosses the threshold still flows through (terminal batching).
    if (countBefore >= WARMUP_THRESHOLD) {
      e.stopImmediatePropagation();
      stripTrailing(repeatCount, {
        char: bareChar,
        floor: charsInInputRef.current
      });
    } else {
      charsInInputRef.current += repeatCount;
    }
  • claude-code-opensource/src/hooks/useVoice.ts: recording session state machine, including the record-first, connect-later buffering logic. Excerpt (TypeScript), lines 19-23 of 1145:
  type FinalizeSource,
  isVoiceStreamAvailable,
  type VoiceStreamConnection,
} from '../services/voiceStreamSTT.js'
import { logForDebugging } from '../utils/debug.js'
  • claude-code-opensource/src/services/voice.ts: multi-backend audio capture adapter layer. Excerpt (TypeScript), lines 20-36 of 526:
type AudioNapi = typeof import('audio-capture-napi')
let audioNapi: AudioNapi | null = null
let audioNapiPromise: Promise<AudioNapi> | null = null

function loadAudioNapi(): Promise<AudioNapi> {
  audioNapiPromise ??= (async () => {
    const t0 = Date.now()
    const mod = await import('audio-capture-napi')
    // vendor/audio-capture-src/index.ts defers require(...node) until the
    // first function call — trigger it here so timing reflects real cost.
    mod.isNativeAudioAvailable()
    audioNapi = mod
    logForDebugging(`[voice] audio-capture-napi loaded in ${Date.now() - t0}ms`)
    return mod
  })()
  return audioNapiPromise
}
  • claude-code-opensource/src/services/voiceStreamSTT.ts: implementation of the WebSocket transcription protocol. Excerpt (TypeScript), lines 5-14 of 545:
// Connects to Anthropic's voice_stream WebSocket endpoint using the same
// OAuth credentials as Claude Code.  The endpoint uses conversation_engine
// backed models for speech-to-text.  Designed for hold-to-talk: hold the
// keybinding to record, release to stop and submit.
//
// The wire protocol uses JSON control messages (KeepAlive, CloseStream) and
// binary audio frames.  The server responds with TranscriptText and
// TranscriptEndpoint JSON messages.

import type { ClientRequest, IncomingMessage } from 'http'
  • claude-code-opensource/src/services/voiceKeyterms.ts: extraction and injection of project-context keyterms. Excerpt (TypeScript), lines 13-42 of 107:
const GLOBAL_KEYTERMS: readonly string[] = [
  // Terms Deepgram consistently mangles without keyword hints.
  // Note: "Claude" and "Anthropic" are already server-side base keyterms.
  // Avoid terms nobody speaks aloud as-spelled (stdout → "standard out").
  'MCP',
  'symlink',
  'grep',
  'regex',
  'localhost',
  'codebase',
  'TypeScript',
  'JSON',
  'OAuth',
  'webhook',
  'gRPC',
  'dotfiles',
  'subagent',
  'worktree',
]

// ─── Helpers ────────────────────────────────────────────────────────

/**
 * Split an identifier (camelCase, PascalCase, kebab-case, snake_case, or
 * path segments) into individual words.  Fragments of 2 chars or fewer are
 * discarded to avoid noise.
 */
export function splitIdentifier(name: string): string[] {
  return name
    .replace(/([a-z])([A-Z])/g, '$1 $2')
  • claude-code-opensource/src/voice/voiceModeEnabled.ts: feature switch and feature gate checks. Excerpt (TypeScript), lines 21-32 of 55:
    ? !getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_quartz_disabled', false)
    : false
}

/**
 * Auth-only check for voice mode. Returns true when the user has a valid
 * Anthropic OAuth token. Backed by the memoized getClaudeAIOAuthTokens —
 * first call spawns `security` on macOS (~20-50ms), subsequent calls are
 * cache hits. The memoize clears on token refresh (~once/hour), so one
 * cold spawn per refresh is expected. Cheap enough for usage-time checks.
 */
export function hasVoiceAuth(): boolean {

In-depth analysis based on the Claude Code v2.1.88 open-source snapshot