
Voice

Real-time speech-to-text in the chat composer. The user speaks, the runtime transcribes, the agent runs the resulting prompt.

You have a working chat surface and you want users to be able to speak instead of type. By the end of this guide, the chat composer will sprout a mic button, recorded audio will be transcribed by the runtime, and the transcript will auto-send to the agent like any other message.

When to use this

  • Hands-free or accessibility flows where typing isn't the right input modality.
  • Mobile or kiosk surfaces where a long voice query is faster than thumb-typing.
  • Demo and test loops where you want canned audio to drive the chat without a microphone.

If you only need file uploads (audio, images, video, documents), use Multimodal Attachments instead. Voice is specifically about live transcription of recorded speech into chat input.

Live demo: AG2 voice (Open full demo →)

Frontend

<CopilotChat /> renders the mic button automatically when the runtime advertises audioFileTranscriptionEnabled: true on its /info endpoint. There's nothing to wire up on the chat surface itself:

frontend/src/app/page.tsx — chat surface (lines 20–69)
"use client";

// The imports and constants below reconstruct the elided lines 1–19 of this
// file; the package paths and constant values are assumptions, not verbatim
// source.
import { useCallback } from "react";
import { CopilotKit } from "@copilotkit/react-core";
import { CopilotChat } from "@copilotkit/react-ui";
import { SampleAudioButton } from "./sample-audio-button";

const RUNTIME_URL = "/api/copilotkit-voice"; // assumed: matches the backend basePath
const AGENT_ID = "voice-demo"; // assumed: matches the runtime's agents map
const SAMPLE_TEXT = "…"; // sample transcript (value elided in the source)

export default function VoiceDemoPage() {
  // Push transcribed text into the composer. React controls the textarea, so
  // go through the native value setter and dispatch a synthetic "input" event
  // to keep React's managed state in sync.
  const handleTranscribed = useCallback((text: string) => {
    if (typeof document === "undefined") return;
    const textarea = document.querySelector<HTMLTextAreaElement>(
      '[data-testid="copilot-chat-textarea"]',
    );
    if (!textarea) {
      console.warn(
        "[voice-demo] could not find copilot-chat-textarea to populate",
      );
      return;
    }
    const nativeSetter = Object.getOwnPropertyDescriptor(
      window.HTMLTextAreaElement.prototype,
      "value",
    )?.set;
    if (nativeSetter) {
      nativeSetter.call(textarea, text);
    } else {
      textarea.value = text;
    }
    textarea.dispatchEvent(new Event("input", { bubbles: true }));
    textarea.focus();
  }, []);

  return (
    <CopilotKit
      runtimeUrl={RUNTIME_URL}
      agent={AGENT_ID}
      useSingleEndpoint={false}
    >
      <div className="flex h-screen flex-col gap-3 p-6">
        <header>
          <h1 className="text-lg font-semibold">Voice input</h1>
          <p className="text-sm text-black/60 dark:text-white/60">
            Click the microphone to record, or play the bundled sample audio.
            Speech is transcribed into the input field — you click send.
          </p>
        </header>
        <SampleAudioButton
          onTranscribed={handleTranscribed}
          sampleText={SAMPLE_TEXT}
        />
        <div className="min-h-0 flex-1 overflow-hidden rounded-md border border-black/10 dark:border-white/10">
          <CopilotChat agentId={AGENT_ID} className="h-full" />
        </div>
      </div>
    </CopilotKit>
  );
}

When the user clicks the mic, the chat captures audio, POSTs it to the runtime's /transcribe endpoint, drops the resulting transcript into the composer, and submits.
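
If you're building a headless composer instead of <CopilotChat />, you can reproduce that exchange by hand. The following is a minimal sketch, assuming /transcribe accepts multipart form data with a file field and returns a { text } payload; both wire-format details are assumptions rather than documented API.

Sketch: recording and transcribing by hand
async function recordAndTranscribe(durationMs = 5000): Promise<string> {
  // Record a short clip with MediaRecorder.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: BlobPart[] = [];
  recorder.ondataavailable = (event) => chunks.push(event.data);

  const stopped = new Promise<void>((resolve) => {
    recorder.onstop = () => resolve();
  });
  recorder.start();
  await new Promise((resolve) => setTimeout(resolve, durationMs));
  recorder.stop();
  await stopped;
  stream.getTracks().forEach((track) => track.stop());

  // POST the clip to the runtime. The "file" field name and { text }
  // response shape are assumptions.
  const form = new FormData();
  form.append("file", new Blob(chunks, { type: recorder.mimeType }), "clip.webm");
  const res = await fetch("/api/copilotkit-voice/transcribe", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);

  const { text } = (await res.json()) as { text: string };
  return text;
}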

Driving the demo without a mic

For Playwright runs, screenshots, or any flow where prompting for mic permissions is awkward, ship a button that drives the chat without a microphone. The trimmed excerpt below feeds a canned transcript straight into the composer; a sketch after it shows the variant that POSTs a bundled audio clip to the same /transcribe endpoint:

frontend/src/app/sample-audio-button.tsx (lines 25–47)
// Props shape inferred from usage; the original file elides lines 1–24.
type SampleAudioButtonProps = {
  onTranscribed: (text: string) => void;
  sampleText: string;
};

export function SampleAudioButton({
  onTranscribed,
  sampleText,
}: SampleAudioButtonProps) {
  return (
    <div
      data-testid="voice-sample-audio"
      className="flex items-center gap-3 rounded-md border border-black/10 bg-black/[0.02] px-3 py-2 text-sm dark:border-white/10 dark:bg-white/[0.02]"
    >
      <button
        type="button"
        data-testid="voice-sample-audio-button"
        // This trimmed excerpt short-circuits with the known transcript; the
        // sketch below shows the variant that POSTs real audio to /transcribe.
        onClick={() => onTranscribed(sampleText)}
        className="rounded border border-black/10 bg-white px-3 py-1 text-xs font-medium hover:bg-black/5 dark:border-white/10 dark:bg-black/30 dark:hover:bg-white/10"
      >
        Play sample
      </button>
      <span className="text-black/60 dark:text-white/60">
        Sample: &ldquo;{sampleText}&rdquo;
      </span>
    </div>
  );
}

The caller can drop the resulting text into the composer's textarea (matched via data-testid="copilot-chat-textarea") using the native value setter and a synthetic input event so React's managed state updates correctly.
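
To exercise the real audio path instead of short-circuiting with the canned transcript, the button can fetch a bundled clip and POST it to /transcribe. A sketch under the same wire-format assumptions as above; the /sample-audio.webm asset path is hypothetical:

Sketch: POSTing the bundled clip to /transcribe
async function transcribeSampleClip(): Promise<string> {
  // Hypothetical bundled asset, e.g. shipped under the app's public/ directory.
  const clip = await fetch("/sample-audio.webm");
  const form = new FormData();
  form.append("file", await clip.blob(), "sample-audio.webm");

  const res = await fetch("/api/copilotkit-voice/transcribe", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);

  const { text } = (await res.json()) as { text: string };
  return text;
}

// Usage inside SampleAudioButton:
//   onClick={async () => onTranscribed(await transcribeSampleClip())}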

Backend

Wire up the V2 runtime with a TranscriptionService. The V1 wrapper drops the transcriptionService option, so use createCopilotRuntimeHandler from @copilotkit/runtime/v2 directly:

app/api/copilotkit-voice/[[...slug]]/route.ts (lines 56–75)
// NOTE: the imports and getHandler() scaffolding reconstruct the elided
// lines 1–55 from context; the package paths and cache shape are assumptions.
// voiceDemoAgent and GuardedOpenAITranscriptionService are defined elsewhere
// in this file/project.
import { NextRequest } from "next/server";
import {
  CopilotRuntime,
  createCopilotRuntimeHandler,
} from "@copilotkit/runtime/v2";

let cachedHandler: ReturnType<typeof createCopilotRuntimeHandler> | null = null;

function getHandler() {
  if (cachedHandler) return cachedHandler;

  const runtime = new CopilotRuntime({
    // @ts-ignore -- see main route.ts; published agents type generic mismatch
    agents: {
      "voice-demo": voiceDemoAgent,
      default: voiceDemoAgent,
    },
    transcriptionService: new GuardedOpenAITranscriptionService(),
  });

  cachedHandler = createCopilotRuntimeHandler({
    runtime,
    basePath: "/api/copilotkit-voice",
  });
  return cachedHandler;
}

export const POST = (req: NextRequest) => getHandler()(req);
export const GET = (req: NextRequest) => getHandler()(req);
export const PUT = (req: NextRequest) => getHandler()(req);
export const DELETE = (req: NextRequest) => getHandler()(req);

With transcriptionService set, the runtime advertises audioFileTranscriptionEnabled: true on /info (which is what tells the chat to render the mic button) and routes POST /transcribe to the service.
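
You can check the flag directly before debugging the UI. A quick probe against the route configured above (the /info response may include other fields not covered here):

Sketch: checking the capability flag
const res = await fetch("/api/copilotkit-voice/info");
const info = await res.json();
// Should log true once transcriptionService is configured.
console.log(info.audioFileTranscriptionEnabled);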

Custom transcription backends

TranscriptionService from @copilotkit/runtime/v2 is an abstract class. Subclass it to plug in any transcription provider — Whisper, AssemblyAI, Deepgram, your own model. The library ships TranscriptionServiceOpenAI as the canonical reference implementation.

A useful pattern is wrapping your service in a guard that returns a clean 4xx when credentials aren't configured, instead of an opaque 5xx from the underlying SDK:

backend — guarded transcription service (lines 30–50)
// NOTE: import paths are assumptions; the original file elides lines 1–29.
import OpenAI from "openai";
import {
  TranscriptionService,
  TranscriptionServiceOpenAI,
  type TranscribeFileOptions,
} from "@copilotkit/runtime/v2";

class GuardedOpenAITranscriptionService extends TranscriptionService {
  // Delegate to the stock OpenAI service when credentials exist; otherwise
  // fail fast with an actionable message instead of an opaque SDK error.
  private delegate: TranscriptionServiceOpenAI | null;

  constructor() {
    super();
    const apiKey = process.env.OPENAI_API_KEY;
    this.delegate = apiKey
      ? new TranscriptionServiceOpenAI({ openai: new OpenAI({ apiKey }) })
      : null;
  }

  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    if (!this.delegate) {
      throw new Error(
        "OPENAI_API_KEY not configured for this deployment. " +
          "Set OPENAI_API_KEY to enable voice transcription.",
      );
    }
    return this.delegate.transcribeFile(options);
  }
}
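
The same subclassing pattern extends to any other provider. Below is a hypothetical sketch targeting a self-hosted Whisper server; the class name, the server's wire format, and the assumption that TranscribeFileOptions exposes the uploaded audio as a File are all illustrative rather than documented API.

Sketch: hypothetical Whisper-server backend
class WhisperServerTranscriptionService extends TranscriptionService {
  constructor(private readonly baseUrl: string) {
    super();
  }

  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    const form = new FormData();
    // Assumption: the options object exposes the uploaded audio as a File.
    form.append("file", (options as unknown as { file: File }).file);

    const res = await fetch(`${this.baseUrl}/transcribe`, {
      method: "POST",
      body: form,
    });
    if (!res.ok) throw new Error(`Whisper server returned ${res.status}`);

    const { text } = (await res.json()) as { text: string };
    return text;
  }
}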
Supported by: Built-in Agent (TanStack AI), LangGraph (Python), LangGraph (TypeScript), LangGraph (FastAPI), Google ADK, Mastra, CrewAI (Crews), PydanticAI, Claude Agent SDK (Python), Claude Agent SDK (TypeScript), Agno, AG2, LlamaIndex, AWS Strands, Langroid, MS Agent Framework (Python), MS Agent Framework (.NET), Spring AI.