# Plan: Stream AI Output to Deepgram for Faster TTS Synthesis

**Priority:** MEDIUM | **Dependencies:** None | **Affects:** Voice interaction latency, user experience

## Overview
Currently, the app waits for the complete AI response before sending it to Deepgram for TTS. This creates a laggy experience. By streaming the AI output directly to Deepgram as it's generated, we can start playing audio much faster and create a more responsive voice interaction.
## Current Implementation

### Current Flow (SLOW)
```
User speaks → Deepgram transcribe → Send to AI
  ↓
Wait for full response (3-10s)
  ↓
Send complete text to Deepgram TTS
  ↓
Wait for audio generation (1-3s)
  ↓
Play audio
```

**Total latency: 4-13 seconds before first audio plays** (3-10 s for the full response plus 1-3 s for audio generation)
## Proposed Implementation

### New Flow (FAST)
```
User speaks → Deepgram transcribe → Stream to AI
  ↓
Stream chunks to Deepgram TTS
  ↓ (chunks arrive)
Play audio chunks immediately
```

**Total latency: 1-2 seconds before first audio plays**
## Technical Approach

### 1. Modify AI SDK Integration
The chat route currently uses `streamText` from the Vercel AI SDK (consumed on the client via `useChat`) and returns only a text stream:
```typescript
// Current (app/api/chat/route.ts)
const result = await streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
});

return result.toDataStreamResponse();
```
To add TTS streaming, hook each text delta via `onChunk`:
```typescript
// New approach
const result = streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
  async onChunk({ chunk }) {
    // Stream each chunk to Deepgram TTS
    if (chunk.type === 'text-delta') {
      await streamToDeepgram(chunk.textDelta);
    }
  },
});

return result.toDataStreamResponse();
```
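One practical detail: `onChunk` fires with deltas that are often only a few characters long, and pushing each one to TTS individually yields choppy prosody. A minimal sketch of the `streamToDeepgram` helper, assuming we buffer deltas up to clause boundaries first (`sendToTTS` is a placeholder, not a real API):

```typescript
// Minimal sketch: coalesce tiny text deltas into clause-sized pieces before
// synthesis. `sendToTTS` is a hypothetical transport, not a Deepgram API.
declare function sendToTTS(clause: string): Promise<void>;

let buffer = '';

async function streamToDeepgram(textDelta: string): Promise<void> {
  buffer += textDelta;
  // Flush on a clause boundary (punctuation followed by whitespace/end) so
  // Deepgram synthesizes natural phrases instead of fragments.
  const match = buffer.match(/^[\s\S]*?[.!?;:,](\s|$)/);
  if (match) {
    const clause = match[0];
    buffer = buffer.slice(clause.length);
    await sendToTTS(clause.trim());
  }
  // When the AI stream finishes, whatever remains in `buffer` still needs a
  // final flush.
}
```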
### 2. Create Deepgram TTS Streaming Service

`lib/deepgram-tts-stream.ts`:
```typescript
import { createClient } from '@deepgram/sdk';

// NOTE: the live TTS client's method and event names vary across
// @deepgram/sdk versions (v3 exposes sendText()/flush() and LiveTTSEvents);
// verify against the installed SDK before shipping. Typed loosely here.
export class DeepgramTTSStream {
  private client: any;
  private audioQueue: Uint8Array[] = [];
  private isPlaying = false;
  // One shared AudioContext: creating a new context per chunk leaks
  // resources and runs into browser limits.
  private audioContext = new AudioContext({ sampleRate: 24000 });

  constructor(apiKey: string) {
    const deepgram = createClient(apiKey);
    this.client = deepgram.speak.live({
      model: 'aura-asteria-en',
      encoding: 'linear16',
      sample_rate: 24000,
    });

    this.client.on('data', (data: Buffer) => {
      this.audioQueue.push(new Uint8Array(data));
      this.playNextChunk();
    });
  }

  async streamText(text: string) {
    // Send a text chunk to Deepgram for synthesis
    this.client.send(text);
  }

  async flush() {
    // Signal the end of the text stream
    this.client.close();
  }

  private async playNextChunk() {
    if (this.isPlaying || this.audioQueue.length === 0) return;
    this.isPlaying = true;
    const chunk = this.audioQueue.shift()!;
    await this.playAudioChunk(chunk);
    this.isPlaying = false;
    this.playNextChunk(); // play the next chunk if one is queued
  }

  private async playAudioChunk(chunk: Uint8Array) {
    const sampleCount = chunk.length / 2; // 16-bit samples
    const audioBuffer = this.audioContext.createBuffer(1 /* mono */, sampleCount, 24000);
    const channelData = audioBuffer.getChannelData(0);
    for (let i = 0; i < sampleCount; i++) {
      // Little-endian 16-bit PCM → signed int → float32 in [-1, 1)
      let sample = chunk[i * 2] | (chunk[i * 2 + 1] << 8);
      if (sample >= 0x8000) sample -= 0x10000; // restore the sign
      channelData[i] = sample / 32768;
    }

    const source = this.audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.audioContext.destination);
    return new Promise<void>((resolve) => {
      source.onended = () => resolve();
      source.start();
    });
  }
}
```
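A usage sketch, assuming the class runs in the browser (the Web Audio API does not exist server-side) and that the key comes from a hypothetical short-lived-token endpoint rather than the server secret:

```typescript
// Usage sketch (browser-side). '/api/deepgram-token' is a hypothetical
// endpoint that returns a short-lived key; never ship the server secret.
const { key } = await (await fetch('/api/deepgram-token')).json();

const tts = new DeepgramTTSStream(key);
await tts.streamText('Hello, ');
await tts.streamText('world.');
await tts.flush(); // end of input; already-queued audio plays out
```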
### 3. Create a Server-Sent Events (SSE) Endpoint for TTS

`app/api/chat-with-tts/route.ts`:
```typescript
import { DeepgramTTSStream } from '@/lib/deepgram-tts-stream';
import { streamText } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages } = await request.json();

  // Create a TransformStream for SSE
  const stream = new TransformStream();
  const writer = stream.writable.getWriter();
  const encoder = new TextEncoder();

  // Start streaming the AI response in the background
  (async () => {
    // Caveat: DeepgramTTSStream as sketched above plays audio via the Web
    // Audio API, which only exists in the browser. Server-side, the audio
    // chunks must instead be forwarded to the client (see the SSE sketch
    // after this block).
    const ttsStream = new DeepgramTTSStream(process.env.DEEPGRAM_API_KEY!);

    try {
      const result = streamText({
        model: google('gemini-2.0-flash-exp'),
        messages,
        async onChunk({ chunk }) {
          if (chunk.type === 'text-delta') {
            // Send text to the client
            await writer.write(
              encoder.encode(`data: ${JSON.stringify({ text: chunk.textDelta })}\n\n`)
            );
            // Stream to Deepgram TTS
            await ttsStream.streamText(chunk.textDelta);
          }
        },
      });

      await result.text; // wait for completion
      await ttsStream.flush();
      await writer.write(encoder.encode('data: [DONE]\n\n'));
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      await writer.write(
        encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`)
      );
    } finally {
      await writer.close();
    }
  })();

  return new Response(stream.readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```
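Since the Web Audio playback inside `DeepgramTTSStream` cannot run in a server route, one option (a sketch, not a settled design) is to forward Deepgram's audio chunks to the client over the same SSE stream, base64-encoded alongside the text events:

```typescript
// Sketch: forward TTS audio over the existing SSE stream instead of playing
// it server-side. `ttsConnection` stands in for the underlying live client
// (the one the DeepgramTTSStream constructor listens to); `writer` and
// `encoder` are the same ones used for the text events above.
ttsConnection.on('data', async (audio: Buffer) => {
  await writer.write(
    encoder.encode(`data: ${JSON.stringify({ audio: audio.toString('base64') })}\n\n`)
  );
});
// On the client, decode with atob() into a Uint8Array and push the bytes
// onto the same playback queue the class already maintains.
```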
### 4. Update the Client to Consume SSE with TTS

`components/ChatInterface.tsx`:
```typescript
const [isTTSEnabled, setIsTTSEnabled] = useState(false);
const ttsStreamRef = useRef<DeepgramTTSStream | null>(null);

async function sendMessageWithTTS(message: string) {
  const response = await fetch('/api/chat-with-tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...messages, { role: 'user', content: message }] }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  // Initialize the TTS stream. The key must be browser-safe (e.g. a
  // short-lived key from a token endpoint), never the server secret.
  if (isTTSEnabled) {
    ttsStreamRef.current = new DeepgramTTSStream(deepgramBrowserKey);
  }

  let fullText = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Note: a network read can end mid-line; production code should carry
    // the partial line over to the next chunk instead of dropping it.
    const chunk = decoder.decode(value);
    const lines = chunk.split('\n');

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);

      if (data === '[DONE]') {
        if (ttsStreamRef.current) {
          await ttsStreamRef.current.flush();
        }
        return; // end of stream (a plain `break` would only exit the inner loop)
      }

      try {
        const parsed = JSON.parse(data);
        if (parsed.text) {
          fullText += parsed.text;
          // Update the UI with the incremental text
          setMessages((prev) => {
            const last = prev[prev.length - 1];
            if (last && last.role === 'assistant') {
              return [...prev.slice(0, -1), { ...last, content: fullText }];
            }
            return [...prev, { role: 'assistant', content: fullText }];
          });
          // Stream to TTS
          if (ttsStreamRef.current) {
            await ttsStreamRef.current.streamText(parsed.text);
          }
        }
      } catch (e) {
        console.error('Failed to parse SSE data:', e);
      }
    }
  }
}
```
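For the graceful-fallback requirement in the success criteria, a sketch of the batch path the client could drop to when streaming fails; `/api/tts-fallback` is a hypothetical proxy route that would call Deepgram's REST `POST /v1/speak` with the server-side key and return the complete audio file:

```typescript
// Sketch: batch-TTS fallback when the streaming path fails.
// '/api/tts-fallback' is hypothetical; it proxies Deepgram's REST speak
// endpoint so the API key stays on the server.
async function speakFallback(text: string): Promise<void> {
  const res = await fetch('/api/tts-fallback', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });
  const ctx = new AudioContext();
  // decodeAudioData handles the returned container/encoding for us.
  const buffer = await ctx.decodeAudioData(await res.arrayBuffer());
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.connect(ctx.destination);
  source.start();
}
```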
## Alternative: Use Deepgram's Native Streaming TTS

Deepgram's WebSocket-based streaming TTS can also be wired directly to the AI stream, giving an even simpler pipeline:
```typescript
// Imports as in the SSE route above (streamText, google, createClient).
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

connection.on('open', () => {
  console.log('TTS connection established');
});

connection.on('data', (audioData: Buffer) => {
  // Play (or forward) each audio chunk as soon as it arrives
  playAudioBuffer(audioData);
});

// The AI SDK exposes text deltas as an async iterable, so instead of
// pseudo-events we can loop over result.textStream:
const result = streamText({ model: google('gemini-2.0-flash-exp'), messages });
for await (const textDelta of result.textStream) {
  connection.send(textDelta);
}

// AI stream finished: close the TTS connection
connection.close();
```
## Implementation Steps

- **Research the Deepgram TTS streaming API**
  - Review the docs: https://developers.deepgram.com/docs/tts-streaming
  - Test the WebSocket connection manually (a smoke-test sketch follows this list)
  - Understand the audio format and buffering
- **Create the TTS streaming service**
  - `lib/deepgram-tts-stream.ts`
  - Implement the audio queue and playback
  - Handle reconnection and errors
- **Modify the API route for streaming**
  - Create the `/api/chat-with-tts` route
  - Implement the SSE response
  - Connect the AI stream to the TTS stream
- **Update client components**
  - Add a TTS toggle in the UI
  - Implement SSE consumption
  - Connect to audio playback
- **Test with Playwright MCP**
  - Enable TTS
  - Send a message
  - Verify audio starts playing quickly (< 2s)
  - Verify audio quality
  - Test error handling (network drop, TTS failure)
- **Add a Magnitude test**

  ```typescript
  test('TTS streams audio with low latency', async (agent) => {
    await agent.open('http://localhost:3000/chat');
    await agent.act('Enable TTS in settings');
    await agent.act('Send message "Hello"');
    await agent.check('Audio starts playing within 2 seconds');
    await agent.check('Audio continues as AI generates response');
    await agent.check('Audio completes without gaps');
  });
  ```
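For the manual WebSocket check in the research step above, a smoke-test sketch against Deepgram's raw TTS socket; the `Speak`/`Flush` message shapes follow Deepgram's documented protocol, but verify them against the current docs before relying on this:

```typescript
// Smoke-test sketch for Deepgram's TTS WebSocket (run under Node with `ws`).
// Message types (Speak/Flush/Close) are from Deepgram's docs; double-check
// them, as the protocol may change.
import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.deepgram.com/v1/speak?model=aura-asteria-en&encoding=linear16&sample_rate=24000',
  { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
);

ws.on('open', () => {
  ws.send(JSON.stringify({ type: 'Speak', text: 'Testing streaming TTS.' }));
  ws.send(JSON.stringify({ type: 'Flush' }));
});

ws.on('message', (data, isBinary) => {
  // Binary frames carry raw PCM audio; text frames are JSON status messages.
  console.log(isBinary ? `audio chunk: ${(data as Buffer).length} bytes` : data.toString());
});
```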
## Performance Targets

- **Time to first audio:** < 2 seconds (vs. the current 4-13 s)
- **Perceived latency:** near real-time streaming
- **Audio quality:** no degradation from the current implementation
- **Reliability:** graceful fallback if streaming fails
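To make the time-to-first-audio target measurable during testing, a small instrumentation sketch (where the hooks attach is an assumption; wire `markSent` into the send path and `markFirstAudio` into the first audio-chunk callback):

```typescript
// Sketch: time-to-first-audio measurement. Hook placement is an assumption.
let sentAt = 0;
let firstAudioSeen = false;

export function markSent(): void {
  sentAt = performance.now();
  firstAudioSeen = false;
}

export function markFirstAudio(): void {
  if (firstAudioSeen) return;
  firstAudioSeen = true;
  const ttfaMs = performance.now() - sentAt;
  console.log(`time to first audio: ${ttfaMs.toFixed(0)} ms (target: < 2000)`);
}
```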
## Success Criteria
- ✅ TTS audio starts playing within 2 seconds of AI response beginning
- ✅ Audio streams continuously as AI generates text
- ✅ No perceptible gaps or stuttering in audio playback
- ✅ Graceful fallback to batch TTS if streaming fails
- ✅ Playwright MCP manual test passes
- ✅ Magnitude test passes
- ✅ No regression in audio quality
## Files to Create

- `lib/deepgram-tts-stream.ts` - TTS streaming service
- `app/api/chat-with-tts/route.ts` - SSE endpoint for TTS
- `tests/playwright/tts-streaming.spec.ts` - manual Playwright test
- `tests/magnitude/tts-streaming.mag.ts` - Magnitude test
## Files to Update

- `components/ChatInterface.tsx` - add TTS streaming consumption
- `app/theme.ts` - add TTS toggle styling if needed