# Plan: Stream AI Output to Deepgram for Faster TTS Synthesis
**Priority:** MEDIUM
**Dependencies:** None
**Affects:** Voice interaction latency, user experience
## Overview
Currently, the app waits for the complete AI response before sending it to Deepgram for TTS. This creates a laggy experience. By streaming the AI output directly to Deepgram as it's generated, we can start playing audio much faster and create a more responsive voice interaction.
## Current Implementation
### Current Flow (SLOW)
```
User speaks → Deepgram transcribe → Send to AI
        ↓
Wait for full response (3-10s)
        ↓
Send complete text to Deepgram TTS
        ↓
Wait for audio generation (1-3s)
        ↓
Play audio
```
**Total latency:** 4-13 seconds before first audio plays
## Proposed Implementation
### New Flow (FAST)
```
User speaks → Deepgram transcribe → Stream to AI
        ↓ (text deltas arrive)
Stream chunks to Deepgram TTS
        ↓ (audio chunks arrive)
Play audio chunks immediately
```
**Total latency:** 1-2 seconds before first audio plays
## Technical Approach
### 1. Modify AI SDK Integration
The chat currently pairs `useChat` from the Vercel AI SDK with a `streamText` route; text streams to the UI, but TTS is only invoked once the full response has arrived:
```typescript
// Current (app/api/chat/route.ts)
const result = await streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
});

return result.toDataStreamResponse();
```
Need to add TTS streaming:
```typescript
// New approach
const result = streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
  async onChunk({ chunk }) {
    // Stream each text delta to Deepgram TTS as it is generated
    if (chunk.type === 'text-delta') {
      await streamToDeepgram(chunk.textDelta);
    }
  },
});

return result.toDataStreamResponse();
```
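One refinement worth considering (our assumption, not yet part of the plan): token-level deltas are often only a few characters, and synthesized prosody tends to be more natural when Deepgram receives clause- or sentence-sized text. A small buffer can batch deltas before they are forwarded; `sendToTTS` below is a hypothetical callback standing in for whatever actually sends text to Deepgram:
```typescript
// Sketch: buffer AI text deltas and forward them at clause boundaries.
// `sendToTTS` is a hypothetical callback (e.g. wrapping connection.sendText).
export class ClauseBuffer {
  private buffer = '';

  constructor(private sendToTTS: (text: string) => void) {}

  push(delta: string) {
    this.buffer += delta;
    let match: RegExpMatchArray | null;
    // Flush every complete clause currently sitting in the buffer.
    while ((match = this.buffer.match(/^[\s\S]*?[.!?,;:]\s+/)) !== null) {
      this.sendToTTS(match[0]);
      this.buffer = this.buffer.slice(match[0].length);
    }
  }

  flush() {
    // Speak any trailing text that never hit terminal punctuation.
    if (this.buffer.trim()) this.sendToTTS(this.buffer);
    this.buffer = '';
  }
}
```
Calling `flush()` when the AI stream finishes ensures trailing text without punctuation is still spoken.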
### 2. Create Deepgram TTS Streaming Service
#### `lib/deepgram-tts-stream.ts`
Note that this class drives playback through the Web Audio API, so it must run in the browser, not in an API route.
```typescript
import { createClient, LiveTTSEvents } from '@deepgram/sdk';

// Connection type inferred from the SDK; event/method names below follow the
// v3 @deepgram/sdk streaming TTS interface, so verify against the installed version.
type SpeakLiveConnection = ReturnType<ReturnType<typeof createClient>['speak']['live']>;

export class DeepgramTTSStream {
  private connection: SpeakLiveConnection;
  // Reuse one AudioContext; creating a new one per chunk causes gaps and
  // eventually exhausts the browser's context limit.
  private audioContext = new AudioContext({ sampleRate: 24000 });
  private audioQueue: Uint8Array[] = [];
  private isPlaying = false;

  constructor(apiKey: string) {
    const deepgram = createClient(apiKey);
    this.connection = deepgram.speak.live({
      model: 'aura-asteria-en',
      encoding: 'linear16',
      sample_rate: 24000,
    });

    this.connection.on(LiveTTSEvents.Audio, (data: Buffer) => {
      this.audioQueue.push(new Uint8Array(data));
      void this.playNextChunk();
    });
  }

  streamText(text: string) {
    // Send a text chunk to Deepgram for synthesis
    this.connection.sendText(text);
  }

  flush() {
    // Force synthesis of any buffered text, then close the connection
    this.connection.flush();
    this.connection.requestClose();
  }

  private async playNextChunk(): Promise<void> {
    if (this.isPlaying || this.audioQueue.length === 0) return;
    this.isPlaying = true;
    const chunk = this.audioQueue.shift()!;
    await this.playAudioChunk(chunk);
    this.isPlaying = false;
    void this.playNextChunk(); // play the next chunk if one is queued
  }

  private playAudioChunk(chunk: Uint8Array): Promise<void> {
    const sampleCount = Math.floor(chunk.byteLength / 2); // 16-bit samples
    if (sampleCount === 0) return Promise.resolve();

    const audioBuffer = this.audioContext.createBuffer(1 /* mono */, sampleCount, 24000);
    const channelData = audioBuffer.getChannelData(0);
    const view = new DataView(chunk.buffer, chunk.byteOffset, chunk.byteLength);
    for (let i = 0; i < sampleCount; i++) {
      // Little-endian signed 16-bit PCM → float32 in [-1, 1).
      // (A plain `lo | (hi << 8)` yields an unsigned value and distorts audio.)
      channelData[i] = view.getInt16(i * 2, true) / 32768;
    }

    const source = this.audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.audioContext.destination);
    return new Promise((resolve) => {
      source.onended = () => resolve();
      source.start();
    });
  }
}
```
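For orientation, a minimal usage sketch. `fetchShortLivedDeepgramKey` is a hypothetical helper that asks our server for a temporary key, since shipping a long-lived Deepgram key to the browser would expose it:
```typescript
// Sketch: drive the TTS stream from any async text source.
const apiKey = await fetchShortLivedDeepgramKey(); // hypothetical server endpoint
const tts = new DeepgramTTSStream(apiKey);

tts.streamText('Hello there. ');
tts.streamText('This sentence arrives in pieces');
tts.streamText(' and still plays as one utterance.');
tts.flush(); // end of text: synthesize whatever remains, then close
```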
### 3. Create Server-Sent Events (SSE) Endpoint for TTS
#### `app/api/chat-with-tts/route.ts`
```typescript
import { streamText } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages } = await request.json();

  // Create a TransformStream for SSE
  const stream = new TransformStream();
  const writer = stream.writable.getWriter();
  const encoder = new TextEncoder();

  // Stream the AI response. Synthesis and playback live client-side
  // (section 4): DeepgramTTSStream depends on the Web Audio API, which does
  // not exist in a server runtime, so this route only relays text deltas.
  (async () => {
    try {
      const result = streamText({
        model: google('gemini-2.0-flash-exp'),
        messages,
        async onChunk({ chunk }) {
          if (chunk.type === 'text-delta') {
            // Forward each text delta to the client
            await writer.write(
              encoder.encode(`data: ${JSON.stringify({ text: chunk.textDelta })}\n\n`)
            );
          }
        },
      });

      await result.text; // wait for the stream to complete
      await writer.write(encoder.encode('data: [DONE]\n\n'));
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      await writer.write(
        encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`)
      );
    } finally {
      await writer.close();
    }
  })();

  return new Response(stream.readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```
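Before wiring up the UI, the endpoint can be smoke-tested with a small script (a sketch; assumes the dev server is running on localhost:3000 and Node 18+ with global `fetch`, run as an ES module):
```typescript
// Sketch: consume the SSE stream and log each text delta as it arrives.
const res = await fetch('http://localhost:3000/api/chat-with-tts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages: [{ role: 'user', content: 'Hello' }] }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (line.startsWith('data: ')) console.log(line.slice(6));
  }
}
```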
### 4. Update Client to Consume SSE with TTS
#### `components/ChatInterface.tsx`
```typescript
import { useRef, useState } from 'react';
import { DeepgramTTSStream } from '@/lib/deepgram-tts-stream';

const [isTTSEnabled, setIsTTSEnabled] = useState(false);
const ttsStreamRef = useRef<DeepgramTTSStream | null>(null);

async function sendMessageWithTTS(message: string) {
  const response = await fetch('/api/chat-with-tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [...messages, { role: 'user', content: message }],
    }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  // Initialize the TTS stream. NOTE: never expose a long-lived Deepgram key
  // in the browser; `deepgramKey` stands in for a short-lived key minted
  // server-side.
  if (isTTSEnabled) {
    ttsStreamRef.current = new DeepgramTTSStream(deepgramKey);
  }

  let fullText = '';
  let buffered = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // SSE events can be split across network chunks; keep partial lines around.
    buffered += decoder.decode(value, { stream: true });
    const lines = buffered.split('\n');
    buffered = lines.pop() ?? '';

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);

      if (data === '[DONE]') {
        ttsStreamRef.current?.flush();
        continue;
      }

      try {
        const parsed = JSON.parse(data);
        if (parsed.text) {
          fullText += parsed.text;
          // Update the UI with the incremental text
          setMessages((prev) => {
            const last = prev[prev.length - 1];
            if (last && last.role === 'assistant') {
              return [...prev.slice(0, -1), { ...last, content: fullText }];
            }
            return [...prev, { role: 'assistant', content: fullText }];
          });
          // Stream the delta to TTS
          ttsStreamRef.current?.streamText(parsed.text);
        }
      } catch (e) {
        console.error('Failed to parse SSE data:', e);
      }
    }
  }
}
```
## Alternative: Use Deepgram's Native Streaming TTS
The wrapper class above already rides on Deepgram's WebSocket streaming TTS API; the connection can also be driven directly, without the wrapper:
```typescript
import { createClient, LiveTTSEvents } from '@deepgram/sdk';

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

connection.on(LiveTTSEvents.Open, () => {
  console.log('TTS connection established');
});

connection.on(LiveTTSEvents.Audio, (audioData: Buffer) => {
  // Play each audio chunk as soon as it arrives
  playAudioBuffer(audioData);
});

// As AI chunks arrive, send them to Deepgram.
// (Illustrative only: the AI SDK exposes deltas via onChunk, not events.)
aiStream.on('text-delta', (text) => {
  connection.sendText(text);
});

// When the AI completes
aiStream.on('finish', () => {
  connection.flush();
  connection.requestClose();
});
```
## Implementation Steps
1. **Research Deepgram TTS Streaming API**
   - Review docs: https://developers.deepgram.com/docs/tts-streaming
   - Test WebSocket connection manually
   - Understand audio format and buffering
2. **Create TTS streaming service**
   - `lib/deepgram-tts-stream.ts`
   - Implement audio queue and playback
   - Handle reconnection and errors
3. **Modify API route for streaming**
   - Create `/api/chat-with-tts` route
   - Implement SSE response
   - Relay AI text deltas to the client for TTS
4. **Update client components**
   - Add TTS toggle in UI
   - Implement SSE consumption
   - Connect to audio playback
5. **Test with Playwright MCP**
   - Enable TTS
   - Send message
   - Verify audio starts playing quickly (< 2s)
   - Verify audio quality
   - Test error handling (network drop, TTS failure)
6. **Add Magnitude test**
```typescript
test('TTS streams audio with low latency', async (agent) => {
  await agent.open('http://localhost:3000/chat');
  await agent.act('Enable TTS in settings');
  await agent.act('Send message "Hello"');
  await agent.check('Audio starts playing within 2 seconds');
  await agent.check('Audio continues as AI generates response');
  await agent.check('Audio completes without gaps');
});
```
## Performance Targets
- **Time to first audio:** < 2 seconds (vs current 4-13s)
- **Perceived latency:** Near real-time streaming
- **Audio quality:** No degradation from current implementation
- **Reliability:** Graceful fallback if streaming fails
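To verify the time-to-first-audio target, the client can timestamp the request and the first decoded audio chunk. A minimal sketch; `markRequestStart` and `markFirstAudio` are hypothetical hooks to be called from `sendMessageWithTTS` and the first `Audio` event in `DeepgramTTSStream`:
```typescript
// Sketch: measure time-to-first-audio with performance.now().
let requestStartedAt = 0;
let firstAudioLogged = false;

export function markRequestStart() {
  requestStartedAt = performance.now();
  firstAudioLogged = false;
}

export function markFirstAudio() {
  if (firstAudioLogged) return;
  firstAudioLogged = true;
  const ttfa = performance.now() - requestStartedAt;
  console.log(`time-to-first-audio: ${ttfa.toFixed(0)}ms`); // target: < 2000ms
}
```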
## Success Criteria
- ✅ TTS audio starts playing within 2 seconds of AI response beginning
- ✅ Audio streams continuously as AI generates text
- ✅ No perceptible gaps or stuttering in audio playback
- ✅ Graceful fallback to batch TTS if streaming fails
- ✅ Playwright MCP manual test passes
- ✅ Magnitude test passes
- ✅ No regression in audio quality
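For the fallback criterion, Deepgram's batch REST TTS can serve as the degraded path when the WebSocket fails to open or errors mid-stream. A sketch assuming the v3 SDK's `speak.request` shape (verify the exact API against the installed SDK):
```typescript
import { createClient } from '@deepgram/sdk';

// Sketch: batch REST synthesis as the fallback when streaming TTS fails.
async function batchSpeakFallback(text: string, apiKey: string) {
  const deepgram = createClient(apiKey);
  // Assumed v3 SDK call (speak.request / getStream); verify before relying on it.
  const response = await deepgram.speak.request(
    { text },
    { model: 'aura-asteria-en', encoding: 'linear16', sample_rate: 24000 }
  );
  const audioStream = await response.getStream();
  // Decode and play `audioStream` through the existing Web Audio path.
  return audioStream;
}
```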
## Files to Create
1. `lib/deepgram-tts-stream.ts` - TTS streaming service
2. `app/api/chat-with-tts/route.ts` - SSE endpoint for TTS
3. `tests/playwright/tts-streaming.spec.ts` - Manual Playwright test
4. `tests/magnitude/tts-streaming.mag.ts` - Magnitude test
## Files to Update
1. `components/ChatInterface.tsx` - Add TTS streaming consumption
2. `app/theme.ts` - Add TTS toggle styling if needed