
Voice

Real-time speech-to-text in the chat composer. The user speaks, the runtime transcribes, the agent runs the resulting prompt.

You have a working chat surface and you want users to be able to speak instead of type. By the end of this guide, the chat composer will sprout a mic button, recorded audio will be transcribed by the runtime, and the transcript will auto-send to the agent like any other message.

When to use this

  • Hands-free or accessibility flows where typing isn't the right input modality.
  • Mobile or kiosk surfaces where a long voice query is faster than thumb-typing.
  • Demo and test loops where you want canned audio to drive the chat without a microphone.

If you only need file uploads (audio, images, video, documents), use Multimodal Attachments instead. Voice is specifically about live transcription of recorded speech into chat input.

Live Demo: Langroid — voice. Open full demo →

Frontend

<CopilotChat /> renders the mic button automatically when the runtime advertises audioFileTranscriptionEnabled: true on its /info endpoint. There's nothing to wire up on the chat surface itself:

frontend/src/app/page.tsx — chat surface
"use client";

import { useCallback } from "react";
import { CopilotKit, CopilotChat } from "@copilotkit/react-core/v2";
import { SampleAudioButton } from "./sample-audio-button";

const RUNTIME_URL = "/api/copilotkit-voice";
const AGENT_ID = "voice-demo";
const SAMPLE_TEXT = "What is the weather in Tokyo?";

// Voice demo for the Langroid integration.
//
// Two affordances live on this page:
//
// 1. The default mic button rendered by <CopilotChat /> when the runtime at
//    RUNTIME_URL advertises `audioFileTranscriptionEnabled: true`. Click it,
//    speak, click again — text is transcribed into the composer.
//
// 2. The <SampleAudioButton /> below the chat, which synchronously injects a
//    canned phrase into the chat's textarea (bypassing mic permissions and
//    the runtime so Playwright and screenshot flows work too). Real
//    transcription only flows through the mic path.
//
// Injecting text into the composer goes through the DOM: CopilotChat owns its
// input state internally (no external controlled-input API on v2 today), but
// the textarea is tagged `data-testid="copilot-chat-textarea"`. Setting its
// value via the native HTMLTextareaElement value setter and dispatching a
// synthetic `input` event is the React-compatible way to flip the managed
// state without reaching into CopilotChat's internals.
export default function VoiceDemoPage() {
  const handleTranscribed = useCallback((text: string) => {
    if (typeof document === "undefined") return;
    const textarea = document.querySelector<HTMLTextAreaElement>(
      '[data-testid="copilot-chat-textarea"]',
    );
    if (!textarea) {
      console.warn(
        "[voice-demo] could not find copilot-chat-textarea to populate",
      );
      return;
    }
    const nativeSetter = Object.getOwnPropertyDescriptor(
      window.HTMLTextAreaElement.prototype,
      "value",
    )?.set;
    if (nativeSetter) {
      nativeSetter.call(textarea, text);
    } else {
      textarea.value = text;
    }
    textarea.dispatchEvent(new Event("input", { bubbles: true }));
    textarea.focus();
  }, []);

  return (
    <CopilotKit
      runtimeUrl={RUNTIME_URL}
      agent={AGENT_ID}
      useSingleEndpoint={false}
    >
      <div className="flex h-screen flex-col gap-3 p-6">
        <header>
          <h1 className="text-lg font-semibold">Voice input</h1>
          <p className="text-sm text-black/60 dark:text-white/60">
            Click the microphone to record, or play the bundled sample audio.
            Speech is transcribed into the input field — you click send.
          </p>
        </header>
        <SampleAudioButton
          onTranscribed={handleTranscribed}
          sampleText={SAMPLE_TEXT}
        />
        <div className="min-h-0 flex-1 overflow-hidden rounded-md border border-black/10 dark:border-white/10">
          <CopilotChat agentId={AGENT_ID} className="h-full" />
        </div>
      </div>
    </CopilotKit>
  );
}

When the user clicks the mic, the chat captures audio, POSTs it to the runtime's /transcribe endpoint, drops the resulting transcript into the composer, and submits.
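
If you need to exercise that endpoint outside the chat, say from a smoke test or a custom recorder, a manual call is straightforward. A minimal sketch follows; the multipart field name and the response shape are assumptions rather than a documented contract, so verify them against the V2 runtime's transcribe handler:

sketch: manual call to /transcribe
async function transcribeClip(blob: Blob): Promise<string> {
  // ASSUMPTION: the endpoint accepts multipart/form-data with a `file` field.
  const form = new FormData();
  form.append("file", blob, "clip.webm");
  const res = await fetch("/api/copilotkit-voice/transcribe", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`transcription failed: HTTP ${res.status}`);
  // ASSUMPTION: the handler responds with JSON shaped like { text: string }.
  const { text } = (await res.json()) as { text: string };
  return text;
}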

Driving the demo without a mic

For Playwright runs, screenshots, or any flow where prompting for mic permissions is awkward, ship a button that injects a canned transcript straight into the composer, bypassing the mic and the runtime entirely:

frontend/src/app/sample-audio-button.tsx
// Props for the sample-audio affordance (inferred from its usage above).
type SampleAudioButtonProps = {
  onTranscribed: (text: string) => void;
  sampleText: string;
};

export function SampleAudioButton({
  onTranscribed,
  sampleText,
}: SampleAudioButtonProps) {
  return (
    <div
      data-testid="voice-sample-audio"
      className="flex items-center gap-3 rounded-md border border-black/10 bg-black/[0.02] px-3 py-2 text-sm dark:border-white/10 dark:bg-white/[0.02]"
    >
      <button
        type="button"
        data-testid="voice-sample-audio-button"
        onClick={() => onTranscribed(sampleText)}
        className="rounded border border-black/10 bg-white px-3 py-1 text-xs font-medium hover:bg-black/5 dark:border-white/10 dark:bg-black/30 dark:hover:bg-white/10"
      >
        Play sample
      </button>
      <span className="text-black/60 dark:text-white/60">
        Sample: &ldquo;{sampleText}&rdquo;
      </span>
    </div>
  );
}

The caller can drop the resulting text into the composer's textarea (matched via data-testid="copilot-chat-textarea") using the native value setter and a synthetic input event so React's managed state updates correctly.
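
In a Playwright spec, that wiring reduces to one click and one assertion. A minimal sketch, assuming the demo page is routed at / and keeps the data-testids from the snippets above:

tests/voice-demo.spec.ts (sketch)
import { test, expect } from "@playwright/test";

test("sample audio populates the composer", async ({ page }) => {
  await page.goto("/"); // adjust to wherever the voice demo page is mounted
  await page.getByTestId("voice-sample-audio-button").click();
  // The synthetic input event should have updated CopilotChat's managed state.
  await expect(page.getByTestId("copilot-chat-textarea")).toHaveValue(
    "What is the weather in Tokyo?",
  );
});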

Backend

Wire up the V2 runtime with a TranscriptionService. The V1 wrapper drops the transcriptionService option, so use createCopilotRuntimeHandler from @copilotkit/runtime/v2 directly:

app/api/copilotkit-voice/[[...slug]]/route.ts
import type { NextRequest } from "next/server";
import {
  CopilotRuntime,
  TranscriptionService,
  createCopilotRuntimeHandler,
} from "@copilotkit/runtime/v2";
import type { TranscribeFileOptions } from "@copilotkit/runtime/v2";
import { HttpAgent } from "@ag-ui/client";
import { TranscriptionServiceOpenAI } from "@copilotkit/voice";
import OpenAI from "openai";

const AGENT_URL = process.env.AGENT_URL || "http://localhost:8000";

const voiceDemoAgent = new HttpAgent({ url: `${AGENT_URL}/` });

/**
 * Transcription service wrapper that reports a clean, typed auth error when
 * OPENAI_API_KEY is not configured. When the key is present we delegate to
 * the real OpenAI-backed service.
 *
 * "api key" substring in the thrown error is matched by the V2 runtime's
 * `handleTranscribe` and mapped to `AUTH_FAILED → HTTP 401`.
 */
class GuardedOpenAITranscriptionService extends TranscriptionService {
  private delegate: TranscriptionServiceOpenAI | null;

  constructor() {
    super();
    const apiKey = process.env.OPENAI_API_KEY;
    this.delegate = apiKey
      ? new TranscriptionServiceOpenAI({ openai: new OpenAI({ apiKey }) })
      : null;
  }

  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    if (!this.delegate) {
      throw new Error(
        "OPENAI_API_KEY not configured for this deployment (api key missing). " +
          "Set OPENAI_API_KEY to enable voice transcription.",
      );
    }
    return this.delegate.transcribeFile(options);
  }
}

// Cache the runtime + handler across invocations so the transcription service
// is constructed once per Node process instead of per request.
let cachedHandler: ((req: Request) => Promise<Response>) | null = null;
function getHandler(): (req: Request) => Promise<Response> {
  if (cachedHandler) return cachedHandler;

  const runtime = new CopilotRuntime({
    // @ts-ignore -- Published CopilotRuntime agents type wraps Record in
    // MaybePromise<NonEmptyRecord<...>> which rejects plain Records; fixed in
    // source, pending release.
    agents: {
      // The page mounts <CopilotKit agent="voice-demo">; resolve that to the
      // unified Langroid AG-UI endpoint.
      "voice-demo": voiceDemoAgent,
      // useAgent() with no args defaults to "default"; alias so any internal
      // default-agent lookups resolve against the same agent.
      default: voiceDemoAgent,
    },
    transcriptionService: new GuardedOpenAITranscriptionService(),
  });

  cachedHandler = createCopilotRuntimeHandler({
    runtime,
    basePath: "/api/copilotkit-voice",
  });
  return cachedHandler;
}

// Next.js App Router bindings. Catchall slug forwards every sub-path
// (`/info`, `/agent/:id/run`, `/transcribe`, ...) to the V2 handler.
export const POST = (req: NextRequest) => getHandler()(req);
export const GET = (req: NextRequest) => getHandler()(req);
export const PUT = (req: NextRequest) => getHandler()(req);
export const DELETE = (req: NextRequest) => getHandler()(req);

With transcriptionService set, the runtime advertises audioFileTranscriptionEnabled: true on /info (which is what tells the chat to render the mic button) and routes POST /transcribe to the service.
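
You can confirm that flag end to end from the browser console or a script. A quick sketch; only the audioFileTranscriptionEnabled field is relied on here, and nothing else about the /info payload shape is assumed:

sketch: /info smoke check
const info = await fetch("/api/copilotkit-voice/info").then((res) => res.json());
// Should log `true` once transcriptionService is set on the runtime.
console.log(info.audioFileTranscriptionEnabled);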

Custom transcription backends

TranscriptionService from @copilotkit/runtime/v2 is an abstract class. Subclass it to plug in any transcription provider — Whisper, AssemblyAI, Deepgram, your own model. The library ships TranscriptionServiceOpenAI as the canonical reference implementation.
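
For illustration, here is a hedged sketch of a subclass that forwards audio to a self-hosted Whisper HTTP server. The WHISPER_URL endpoint and its request/response contract are hypothetical, and the sketch assumes TranscribeFileOptions exposes the uploaded audio as a Blob-like file property, mirroring the OpenAI reference implementation; check the exported type before relying on either assumption:

backend (sketch): hypothetical Whisper-server transcription service
import { TranscriptionService } from "@copilotkit/runtime/v2";
import type { TranscribeFileOptions } from "@copilotkit/runtime/v2";

class WhisperHttpTranscriptionService extends TranscriptionService {
  // Hypothetical env var pointing at a self-hosted Whisper HTTP server.
  private baseUrl = process.env.WHISPER_URL ?? "http://localhost:9000";

  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    // ASSUMPTION: the options object carries the uploaded audio as a
    // Blob-like `file` property; verify against TranscribeFileOptions.
    const { file } = options as unknown as { file: Blob };
    const form = new FormData();
    form.append("file", file, "audio.webm");
    // Hypothetical contract: POST multipart audio, receive { text: string }.
    const res = await fetch(`${this.baseUrl}/transcribe`, {
      method: "POST",
      body: form,
    });
    if (!res.ok) throw new Error(`whisper server returned HTTP ${res.status}`);
    const { text } = (await res.json()) as { text: string };
    return text;
  }
}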

A useful pattern is wrapping your service in a guard that returns a clean 4xx when credentials aren't configured, instead of an opaque 5xx from the underlying SDK. The GuardedOpenAITranscriptionService in the Backend section above is exactly that guard around TranscriptionServiceOpenAI: it throws an error whose message contains "api key", which the V2 runtime's handleTranscribe maps to AUTH_FAILED and HTTP 401.
