Voice
Real-time speech-to-text in the chat composer. The user speaks, the runtime transcribes, the agent runs the resulting prompt.
You have a working chat surface and you want users to be able to speak instead of type. By the end of this guide, the chat composer will sprout a mic button, recorded audio will be transcribed by the runtime, and the transcript will auto-send to the agent like any other message.
When to use this
- Hands-free or accessibility flows where typing isn't the right input modality.
- Mobile or kiosk surfaces where a long voice query is faster than thumb-typing.
- Demo and test loops where you want canned audio to drive the chat without a microphone.
If you only need file uploads (audio, images, video, documents), use Multimodal Attachments instead. Voice is specifically about live transcription of recorded speech into chat input.
Frontend
<CopilotChat /> renders the mic button automatically when the runtime advertises audioFileTranscriptionEnabled: true on its /info endpoint. Nothing needs wiring for the mic itself; the handleTranscribed callback below exists only to serve the sample-audio button described in the next section:
"use client";

// Import paths are illustrative -- package names vary across CopilotKit
// versions, and SampleAudioButton/constants are this demo's own local modules.
import { useCallback } from "react";
import { CopilotKit } from "@copilotkit/react-core";
import { CopilotChat } from "@copilotkit/react-ui";
import { SampleAudioButton } from "./sample-audio-button";
import { AGENT_ID, RUNTIME_URL, SAMPLE_TEXT } from "./constants";

export default function VoiceDemoPage() {
  const handleTranscribed = useCallback((text: string) => {
    if (typeof document === "undefined") return;
    const textarea = document.querySelector<HTMLTextAreaElement>(
      '[data-testid="copilot-chat-textarea"]',
    );
    if (!textarea) {
      console.warn(
        "[voice-demo] could not find copilot-chat-textarea to populate",
      );
      return;
    }
    // Use the native value setter so React's controlled-input machinery sees
    // the change; assigning textarea.value directly would be swallowed.
    const nativeSetter = Object.getOwnPropertyDescriptor(
      window.HTMLTextAreaElement.prototype,
      "value",
    )?.set;
    if (nativeSetter) {
      nativeSetter.call(textarea, text);
    } else {
      textarea.value = text;
    }
    // A bubbling synthetic input event makes React re-read the new value.
    textarea.dispatchEvent(new Event("input", { bubbles: true }));
    textarea.focus();
  }, []);

  return (
    <CopilotKit
      runtimeUrl={RUNTIME_URL}
      agent={AGENT_ID}
      useSingleEndpoint={false}
    >
      <div className="flex h-screen flex-col gap-3 p-6">
        <header>
          <h1 className="text-lg font-semibold">Voice input</h1>
          <p className="text-sm text-black/60 dark:text-white/60">
            Click the microphone to record, or play the bundled sample audio.
            Speech is transcribed into the input field — you click send.
          </p>
        </header>
        <SampleAudioButton
          onTranscribed={handleTranscribed}
          sampleText={SAMPLE_TEXT}
        />
        <div className="min-h-0 flex-1 overflow-hidden rounded-md border border-black/10 dark:border-white/10">
          <CopilotChat agentId={AGENT_ID} className="h-full" />
        </div>
      </div>
    </CopilotKit>
  );
}

When the user clicks the mic, the chat captures audio, POSTs it to the runtime's /transcribe endpoint, drops the resulting transcript into the composer, and submits.
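If the mic button doesn't render, check the capability flag first. A minimal probe, assuming only what this guide states about /info (the audioFileTranscriptionEnabled field; nothing else about the payload shape):

// RUNTIME_URL is the same constant the page passes to <CopilotKit />.
const res = await fetch(`${RUNTIME_URL}/info`);
const info = (await res.json()) as { audioFileTranscriptionEnabled?: boolean };
if (!info.audioFileTranscriptionEnabled) {
  console.warn("Runtime did not advertise transcription; no mic button will render.");
}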
Driving the demo without a mic
For Playwright runs, screenshots, or any flow where prompting for mic permissions is awkward, ship a button that bypasses the microphone. The version below takes the simplest shortcut and injects a canned transcript directly; to exercise the transcription backend as well, have the button POST a bundled audio clip to the same /transcribe endpoint and feed the response through the same callback:
// SampleAudioButtonProps is reconstructed from usage; in the demo this type
// lives alongside the component.
type SampleAudioButtonProps = {
  onTranscribed: (text: string) => void;
  sampleText: string;
};

export function SampleAudioButton({
  onTranscribed,
  sampleText,
}: SampleAudioButtonProps) {
  return (
    <div
      data-testid="voice-sample-audio"
      className="flex items-center gap-3 rounded-md border border-black/10 bg-black/[0.02] px-3 py-2 text-sm dark:border-white/10 dark:bg-white/[0.02]"
    >
      <button
        type="button"
        data-testid="voice-sample-audio-button"
        onClick={() => onTranscribed(sampleText)}
        className="rounded border border-black/10 bg-white px-3 py-1 text-xs font-medium hover:bg-black/5 dark:border-white/10 dark:bg-black/30 dark:hover:bg-white/10"
      >
        Play sample
      </button>
      <span className="text-black/60 dark:text-white/60">
        Sample: “{sampleText}”
      </span>
    </div>
  );
}

The caller can drop the resulting text into the composer's textarea (matched via data-testid="copilot-chat-textarea") using the native value setter and a synthetic input event so React's managed state updates correctly.
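In a Playwright spec this becomes a two-step flow: click the sample button, then assert the composer received the text. A hypothetical sketch; the /voice-demo route and the non-empty-value assertion are assumptions, while the test IDs come from the components above:

import { test, expect } from "@playwright/test";

test("sample audio populates the composer", async ({ page }) => {
  await page.goto("/voice-demo"); // adjust to wherever the demo page is mounted
  await page.getByTestId("voice-sample-audio-button").click();
  // Match SAMPLE_TEXT exactly if the test knows it; here we only assert non-empty.
  await expect(page.getByTestId("copilot-chat-textarea")).toHaveValue(/\S/);
});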
Backend
Wire up the V2 runtime with a TranscriptionService. The V1 wrapper drops the transcriptionService option, so use createCopilotRuntimeHandler from @copilotkit/runtime/v2 directly:
// route.ts for the app router. The getHandler wrapper and imports are
// reconstructed from the excerpt; local module paths are illustrative.
import { NextRequest } from "next/server";
import { CopilotRuntime, createCopilotRuntimeHandler } from "@copilotkit/runtime/v2";
import { voiceDemoAgent } from "./agent";
import { GuardedOpenAITranscriptionService } from "./transcription";

let cachedHandler: ReturnType<typeof createCopilotRuntimeHandler> | null = null;

function getHandler() {
  if (cachedHandler) return cachedHandler;
  const runtime = new CopilotRuntime({
    // @ts-ignore -- see main route.ts; published agents type generic mismatch
    agents: {
      "voice-demo": voiceDemoAgent,
      default: voiceDemoAgent,
    },
    transcriptionService: new GuardedOpenAITranscriptionService(),
  });
  cachedHandler = createCopilotRuntimeHandler({
    runtime,
    basePath: "/api/copilotkit-voice",
  });
  return cachedHandler;
}

export const POST = (req: NextRequest) => getHandler()(req);
export const GET = (req: NextRequest) => getHandler()(req);
export const PUT = (req: NextRequest) => getHandler()(req);
export const DELETE = (req: NextRequest) => getHandler()(req);

With transcriptionService set, the runtime advertises audioFileTranscriptionEnabled: true on /info (which is what tells the chat to render the mic button) and routes POST /transcribe to the service.
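You can smoke-test the route without the UI by POSTing a clip yourself. A sketch only: the multipart field name ("file") and the response shape ({ text }) are assumptions, so mirror whatever your chat surface actually sends:

// audioBlob: any recorded or bundled audio clip loaded as a Blob.
const form = new FormData();
form.append("file", audioBlob, "sample.webm");
const res = await fetch("/api/copilotkit-voice/transcribe", {
  method: "POST",
  body: form,
});
const { text } = (await res.json()) as { text: string }; // shape assumed
console.log("transcript:", text);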
Custom transcription backends
TranscriptionService from @copilotkit/runtime/v2 is an abstract class. Subclass it to plug in any transcription provider — Whisper, AssemblyAI, Deepgram, your own model. The library ships TranscriptionServiceOpenAI as the canonical reference implementation.
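As a rough sketch of what a custom subclass can look like, here is one that forwards audio to a self-hosted, Whisper-style HTTP endpoint. The shape of TranscribeFileOptions (a file Blob below) and the endpoint's { text } response are assumptions; check the type definitions in your installed runtime version:

import { TranscriptionService } from "@copilotkit/runtime/v2";
import type { TranscribeFileOptions } from "@copilotkit/runtime/v2";

export class WhisperHttpTranscriptionService extends TranscriptionService {
  constructor(private endpoint: string) {
    super();
  }

  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    const form = new FormData();
    // `file` on the options object is an assumption; adjust to your version.
    form.append("file", (options as unknown as { file: Blob }).file);

    const res = await fetch(this.endpoint, { method: "POST", body: form });
    if (!res.ok) {
      throw new Error(`Transcription backend returned ${res.status}`);
    }
    const { text } = (await res.json()) as { text: string };
    return text;
  }
}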
A useful pattern is wrapping your service in a guard that fails fast with a clear, actionable error when credentials aren't configured, rather than surfacing an opaque 5xx from the underlying SDK:
// OpenAI comes from the official "openai" package. TranscriptionServiceOpenAI
// and TranscribeFileOptions are assumed to be exported from the same module as
// TranscriptionService; check your installed version.
import OpenAI from "openai";
import {
  TranscriptionService,
  TranscriptionServiceOpenAI,
  type TranscribeFileOptions,
} from "@copilotkit/runtime/v2";

export class GuardedOpenAITranscriptionService extends TranscriptionService {
  private delegate: TranscriptionServiceOpenAI | null;

  constructor() {
    super();
    // Only build the real service when credentials exist; otherwise fail at
    // transcription time with an actionable message.
    const apiKey = process.env.OPENAI_API_KEY;
    this.delegate = apiKey
      ? new TranscriptionServiceOpenAI({ openai: new OpenAI({ apiKey }) })
      : null;
  }

  async transcribeFile(options: TranscribeFileOptions): Promise<string> {
    if (!this.delegate) {
      throw new Error(
        "OPENAI_API_KEY is not configured for this deployment. " +
          "Set OPENAI_API_KEY to enable voice transcription.",
      );
    }
    return this.delegate.transcribeFile(options);
  }
}
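One trade-off of the guard: because a transcriptionService is always registered, /info still advertises audioFileTranscriptionEnabled: true and the mic button still renders on deployments without credentials. The misconfiguration only surfaces when a recording is actually submitted, but when it does, the error says exactly what to fix.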