# Plan: Stream AI Output to Deepgram for Faster TTS Synthesis

**Priority:** MEDIUM
**Dependencies:** None
**Affects:** Voice interaction latency, user experience

## Overview

Currently, the app waits for the complete AI response before sending it to Deepgram for TTS. This creates a laggy experience. By streaming the AI output to Deepgram as it is generated, we can start playing audio much sooner and create a more responsive voice interaction.

## Current Implementation

### Current Flow (SLOW)

```
User speaks → Deepgram transcribe → Send to AI
        ↓
Wait for full response (3-10s)
        ↓
Send complete text to Deepgram TTS
        ↓
Wait for audio generation (1-3s)
        ↓
Play audio
```

**Total latency:** 4-13 seconds before first audio plays

## Proposed Implementation

### New Flow (FAST)

```
User speaks → Deepgram transcribe → Stream to AI
        ↓
Stream chunks to Deepgram TTS
        ↓ (chunks arrive)
Play audio chunks immediately
```

**Total latency:** 1-2 seconds before first audio plays

## Technical Approach

### 1. Modify AI SDK Integration

Currently using `useChat` from the Vercel AI SDK with async completion:

```typescript
// Current (app/api/chat/route.ts)
const result = await streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
});
return result.toDataStreamResponse();
```

We need to add TTS streaming:

```typescript
// New approach
const result = streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
  async onChunk({ chunk }) {
    // Stream each text delta to Deepgram TTS as it arrives
    if (chunk.type === 'text-delta') {
      await streamToDeepgram(chunk.textDelta);
    }
  },
});
return result.toDataStreamResponse();
```
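One caveat with streaming raw token deltas to TTS: deltas often split words and clauses, which can produce choppy prosody. A common mitigation is to buffer deltas and forward text only at sentence boundaries. A minimal sketch, where `SentenceBuffer` is a hypothetical helper (not part of either SDK) and the boundary detection is deliberately naive:

```typescript
// Buffers token deltas and emits text only at sentence boundaries.
// Naive: does not handle abbreviations like "U.S." or decimal numbers.
class SentenceBuffer {
  private buf = '';

  // Append a delta; returns any complete sentences accumulated so far.
  push(delta: string): string[] {
    this.buf += delta;
    const sentences: string[] = [];
    let idx: number;
    // A sentence ends at . ! or ? followed by whitespace or end of buffer
    while ((idx = this.buf.search(/[.!?](\s|$)/)) !== -1) {
      sentences.push(this.buf.slice(0, idx + 1).trim());
      this.buf = this.buf.slice(idx + 1).trimStart();
    }
    return sentences;
  }

  // Return whatever is left (e.g. when the AI stream finishes mid-sentence).
  flush(): string {
    const rest = this.buf.trim();
    this.buf = '';
    return rest;
  }
}
```

The `onChunk` handler would then call `push(chunk.textDelta)` and forward each returned sentence to Deepgram, calling `flush()` when the stream finishes.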
### 2. Create Deepgram TTS Streaming Service

#### `lib/deepgram-tts-stream.ts`

```typescript
import { createClient } from '@deepgram/sdk';

export class DeepgramTTSStream {
  // Live TTS connection. The event and method names used here ('data',
  // send, close) should be verified against the current @deepgram/sdk
  // speak.live API, which has changed across SDK versions.
  private client: ReturnType<ReturnType<typeof createClient>['speak']['live']>;
  private audioQueue: Uint8Array[] = [];
  private isPlaying = false;

  constructor(apiKey: string) {
    const deepgram = createClient(apiKey);
    this.client = deepgram.speak.live({
      model: 'aura-asteria-en',
      encoding: 'linear16',
      sample_rate: 24000,
    });

    this.client.on('data', (data: Buffer) => {
      this.audioQueue.push(new Uint8Array(data));
      this.playNextChunk();
    });
  }

  async streamText(text: string) {
    // Send a text chunk to Deepgram for synthesis
    this.client.send(text);
  }

  async flush() {
    // Signal end of the text stream and close the connection
    this.client.close();
  }

  private async playNextChunk() {
    if (this.isPlaying || this.audioQueue.length === 0) return;
    this.isPlaying = true;
    const chunk = this.audioQueue.shift()!;
    // Play the audio chunk using the Web Audio API
    await this.playAudioChunk(chunk);
    this.isPlaying = false;
    this.playNextChunk(); // Play the next chunk if available
  }

  private async playAudioChunk(chunk: Uint8Array) {
    // Note: in production, reuse a single AudioContext rather than creating
    // one per chunk.
    const audioContext = new AudioContext({ sampleRate: 24000 });
    const sampleCount = chunk.length / 2; // 16-bit samples
    const audioBuffer = audioContext.createBuffer(1 /* mono */, sampleCount, 24000);
    const channelData = audioBuffer.getChannelData(0);
    const view = new DataView(chunk.buffer, chunk.byteOffset, chunk.byteLength);
    for (let i = 0; i < sampleCount; i++) {
      // Convert signed little-endian 16-bit PCM to float32 in [-1, 1]
      channelData[i] = view.getInt16(i * 2, true) / 32768;
    }

    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);

    return new Promise((resolve) => {
      source.onended = resolve;
      source.start();
    });
  }
}
```
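The PCM decode in `playAudioChunk` must treat each sample as a signed little-endian 16-bit value; getting the sign bit wrong (e.g. a plain `lo | (hi << 8)`) badly distorts negative samples. An isolated version of the conversion that can be unit-tested outside the Web Audio API, where `pcm16ToFloat32` is a hypothetical helper:

```typescript
// Converts little-endian signed 16-bit PCM bytes to float32 samples in [-1, 1].
function pcm16ToFloat32(chunk: Uint8Array): Float32Array {
  const view = new DataView(chunk.buffer, chunk.byteOffset, chunk.byteLength);
  const out = new Float32Array(chunk.byteLength / 2);
  for (let i = 0; i < out.length; i++) {
    // getInt16 with littleEndian = true handles the sign bit correctly
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```

For example, the bytes `00 80` decode to -32768 (-1.0 as a float), not 32768 as the unsigned form would give.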
### 3. Create Server-Sent Events (SSE) Endpoint for TTS

#### `app/api/chat-with-tts/route.ts`

Note: `DeepgramTTSStream` plays audio with the Web Audio API, which is unavailable in a server runtime. For this route to work as written, the synthesized audio would need to be relayed to the client over the stream, or synthesis moved client-side (see the alternative below).

```typescript
import { DeepgramTTSStream } from '@/lib/deepgram-tts-stream';
import { streamText } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages } = await request.json();

  // Create a TransformStream for SSE
  const stream = new TransformStream();
  const writer = stream.writable.getWriter();
  const encoder = new TextEncoder();

  // Start streaming the AI response
  (async () => {
    const ttsStream = new DeepgramTTSStream(process.env.DEEPGRAM_API_KEY!);

    try {
      const result = streamText({
        model: google('gemini-2.0-flash-exp'),
        messages,
        async onChunk({ chunk }) {
          if (chunk.type === 'text-delta') {
            // Send the text delta to the client
            await writer.write(
              encoder.encode(`data: ${JSON.stringify({ text: chunk.textDelta })}\n\n`)
            );
            // Stream the same delta to Deepgram TTS
            await ttsStream.streamText(chunk.textDelta);
          }
        },
      });

      await result.text; // Wait for completion
      await ttsStream.flush();
      await writer.write(encoder.encode('data: [DONE]\n\n'));
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      await writer.write(
        encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`)
      );
    } finally {
      await writer.close();
    }
  })();

  return new Response(stream.readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```
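On the consuming side, a single network read is not guaranteed to contain whole SSE events; an event can arrive split across reads. A small buffering parser for the `data:` frames this route emits (a sketch; `SSEBuffer` is a hypothetical helper, not a library class):

```typescript
// Accumulates raw SSE text and yields complete `data:` payloads,
// handling events that are split across network reads.
class SSEBuffer {
  private pending = '';

  feed(chunk: string): string[] {
    this.pending += chunk;
    const events: string[] = [];
    let idx: number;
    // A complete SSE event ends with a blank line ("\n\n")
    while ((idx = this.pending.indexOf('\n\n')) !== -1) {
      const raw = this.pending.slice(0, idx);
      this.pending = this.pending.slice(idx + 2);
      for (const line of raw.split('\n')) {
        if (line.startsWith('data: ')) events.push(line.slice(6));
      }
    }
    return events;
  }
}
```

The client loop would call `feed(decoder.decode(value))` on each read and process only the complete payloads it returns.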
### 4. Update Client to Consume SSE with TTS

#### `components/ChatInterface.tsx`

```typescript
const [isTTSEnabled, setIsTTSEnabled] = useState(false);
const ttsStreamRef = useRef<DeepgramTTSStream | null>(null);

async function sendMessageWithTTS(message: string) {
  const response = await fetch('/api/chat-with-tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...messages, { role: 'user', content: message }] }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  // Initialize the TTS stream. The constructor requires a key; in the
  // browser, use a short-lived token fetched from the server (ttsToken here
  // is a hypothetical placeholder) rather than embedding the API key.
  if (isTTSEnabled) {
    ttsStreamRef.current = new DeepgramTTSStream(ttsToken);
  }

  let fullText = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Note: a network read can split an SSE event across chunks; production
    // code should buffer partial lines instead of splitting each chunk directly.
    const chunk = decoder.decode(value);
    const lines = chunk.split('\n');

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);

        if (data === '[DONE]') {
          if (ttsStreamRef.current) {
            await ttsStreamRef.current.flush();
          }
          break;
        }

        try {
          const parsed = JSON.parse(data);
          if (parsed.text) {
            fullText += parsed.text;

            // Update the UI with the incremental text
            setMessages((prev) => {
              const last = prev[prev.length - 1];
              if (last && last.role === 'assistant') {
                return [...prev.slice(0, -1), { ...last, content: fullText }];
              }
              return [...prev, { role: 'assistant', content: fullText }];
            });

            // Stream the delta to TTS
            if (ttsStreamRef.current) {
              await ttsStreamRef.current.streamText(parsed.text);
            }
          }
        } catch (e) {
          console.error('Failed to parse SSE data:', e);
        }
      }
    }
  }
}
```

## Alternative: Use Deepgram's Native Streaming TTS

Deepgram has a WebSocket-based streaming TTS API that is even more efficient:

```typescript
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

connection.on('open', () => {
  console.log('TTS connection established');
});

connection.on('data', (audioData: Buffer) => {
  // Play the audio chunk immediately
  playAudioBuffer(audioData);
});
// As AI chunks arrive, send them to Deepgram
aiStream.on('text-delta', (text) => {
  connection.send(text);
});

// When the AI completes
aiStream.on('finish', () => {
  connection.close();
});
```

## Implementation Steps

1. **Research Deepgram TTS Streaming API**
   - Review docs: https://developers.deepgram.com/docs/tts-streaming
   - Test WebSocket connection manually
   - Understand audio format and buffering
2. **Create TTS streaming service**
   - `lib/deepgram-tts-stream.ts`
   - Implement audio queue and playback
   - Handle reconnection and errors
3. **Modify API route for streaming**
   - Create `/api/chat-with-tts` route
   - Implement SSE response
   - Connect AI stream to TTS stream
4. **Update client components**
   - Add TTS toggle in UI
   - Implement SSE consumption
   - Connect to audio playback
5. **Test with Playwright MCP**
   - Enable TTS
   - Send message
   - Verify audio starts playing quickly (< 2s)
   - Verify audio quality
   - Test error handling (network drop, TTS failure)
6. **Add Magnitude test**

   ```typescript
   test('TTS streams audio with low latency', async (agent) => {
     await agent.open('http://localhost:3000/chat');
     await agent.act('Enable TTS in settings');
     await agent.act('Send message "Hello"');
     await agent.check('Audio starts playing within 2 seconds');
     await agent.check('Audio continues as AI generates response');
     await agent.check('Audio completes without gaps');
   });
   ```

## Performance Targets

- **Time to first audio:** < 2 seconds (vs. current 4-13s)
- **Perceived latency:** near real-time streaming
- **Audio quality:** no degradation from the current implementation
- **Reliability:** graceful fallback if streaming fails

## Success Criteria

- ✅ TTS audio starts playing within 2 seconds of the AI response beginning
- ✅ Audio streams continuously as the AI generates text
- ✅ No perceptible gaps or stuttering in audio playback
- ✅ Graceful fallback to batch TTS if streaming fails
- ✅ Playwright MCP manual test passes
- ✅ Magnitude test passes
- ✅ No regression in audio quality

## Files to Create
1. `lib/deepgram-tts-stream.ts` - TTS streaming service
2. `app/api/chat-with-tts/route.ts` - SSE endpoint for TTS
3. `tests/playwright/tts-streaming.spec.ts` - Manual Playwright test
4. `tests/magnitude/tts-streaming.mag.ts` - Magnitude test

## Files to Update

1. `components/ChatInterface.tsx` - Add TTS streaming consumption
2. `app/theme.ts` - Add TTS toggle styling if needed
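The "graceful fallback to batch TTS" success criterion can be sketched as a simple wrapper: try the streaming path first, and fall back to a one-shot batch synthesis call if it throws. Both function parameters here are hypothetical stand-ins for the real streaming and batch implementations:

```typescript
// Attempts streaming synthesis; on failure, retries the full text via batch TTS.
// Returns which path produced the audio, so the caller can log degradations.
async function synthesizeWithFallback(
  text: string,
  streamingTTS: (t: string) => Promise<void>,
  batchTTS: (t: string) => Promise<void>,
): Promise<'streamed' | 'batch'> {
  try {
    await streamingTTS(text);
    return 'streamed';
  } catch (err) {
    console.warn('Streaming TTS failed, falling back to batch:', err);
    await batchTTS(text);
    return 'batch';
  }
}
```

In practice the streaming path would wrap `DeepgramTTSStream` and the batch path a regular Deepgram TTS request on the complete response text.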