docs: Add comprehensive implementation plans for all todo items

Created detailed markdown plans for all items in todo.md:

1. 01-playwright-scaffolding.md - Base Playwright infrastructure
2. 02-magnitude-tests-comprehensive.md - Complete test coverage
3. 03-stream-ai-to-deepgram-tts.md - TTS latency optimization
4. 04-fix-galaxy-node-clicking.md - Galaxy navigation bugs
5. 05-dark-light-mode-theme.md - Dark/light mode with dynamic favicons
6. 06-fix-double-border-desktop.md - UI polish
7. 07-delete-backup-files.md - Code cleanup
8. 08-ai-transition-to-edit.md - Intelligent node creation flow
9. 09-umap-minimum-nodes-analysis.md - Technical analysis

Each plan includes:

- Detailed problem analysis
- Proposed solutions with code examples
- Manual Playwright MCP testing strategy
- Magnitude test specifications
- Implementation steps
- Success criteria

Ready to implement in sequence.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
# Plan: Stream AI Output to Deepgram for Faster TTS Synthesis

**Priority:** MEDIUM
**Dependencies:** None
**Affects:** Voice interaction latency, user experience

## Overview

Currently, the app waits for the complete AI response before sending it to Deepgram for TTS. This creates a laggy experience. By streaming the AI output directly to Deepgram as it's generated, we can start playing audio much faster and create a more responsive voice interaction.

## Current Implementation

### Current Flow (SLOW)

```
User speaks → Deepgram transcribe → Send to AI
      ↓
Wait for full response (3-10s)
      ↓
Send complete text to Deepgram TTS
      ↓
Wait for audio generation (1-3s)
      ↓
Play audio
```

**Total latency:** 4-13 seconds before first audio plays

## Proposed Implementation

### New Flow (FAST)

```
User speaks → Deepgram transcribe → Stream to AI
      ↓
Stream chunks to Deepgram TTS
      ↓ (chunks arrive)
Play audio chunks immediately
```

**Total latency:** 1-2 seconds before first audio plays
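The gain is easy to see with a back-of-envelope model. The numbers below are midpoints of the estimates above, not measurements:

```typescript
// Illustrative latency model for the two pipelines.
interface Timings {
  aiFirstChunkMs: number;   // time until the AI emits its first text delta
  aiFullResponseMs: number; // time until the AI finishes the whole response
  ttsFirstAudioMs: number;  // TTS time from text arriving to first audio bytes
  ttsFullAudioMs: number;   // TTS time to synthesize the complete text
}

// Batch: wait for the whole response, then synthesize all of it.
function batchTimeToFirstAudio(t: Timings): number {
  return t.aiFullResponseMs + t.ttsFullAudioMs;
}

// Streaming: first audio depends only on the first AI chunk + first TTS chunk.
function streamingTimeToFirstAudio(t: Timings): number {
  return t.aiFirstChunkMs + t.ttsFirstAudioMs;
}

const estimate: Timings = {
  aiFirstChunkMs: 500,
  aiFullResponseMs: 6000, // mid-range of the 3-10s estimate
  ttsFirstAudioMs: 300,
  ttsFullAudioMs: 2000,   // mid-range of the 1-3s estimate
};

console.log(batchTimeToFirstAudio(estimate));     // 8000
console.log(streamingTimeToFirstAudio(estimate)); // 800
```

The key point: streaming decouples time-to-first-audio from total response length, so long answers benefit the most.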
## Technical Approach

### 1. Modify AI SDK Integration

The client currently uses `useChat` from the Vercel AI SDK, and the route waits on the completed stream:

```typescript
// Current (app/api/chat/route.ts)
const result = await streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
});

return result.toDataStreamResponse();
```

Need to add TTS streaming:

```typescript
// New approach
const result = streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
  async onChunk({ chunk }) {
    // Stream each chunk to Deepgram TTS
    if (chunk.type === 'text-delta') {
      await streamToDeepgram(chunk.textDelta);
    }
  },
});

return result.toDataStreamResponse();
```
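One practical wrinkle with `onChunk`: `text-delta` chunks are token-sized fragments, and forwarding each one to TTS individually can hurt prosody. A small buffering helper can coalesce deltas into clause-sized pieces first (a sketch; the boundary regex and length threshold are illustrative choices, not Deepgram requirements):

```typescript
// Coalesce token-sized text deltas into clause-sized chunks before TTS.
class SentenceChunker {
  private buffer = '';

  constructor(
    private emit: (chunk: string) => void,
    private maxLen = 200, // force a flush even without punctuation
  ) {}

  push(delta: string): void {
    this.buffer += delta;
    // Flush on sentence/clause boundaries so TTS gets natural phrases.
    let boundary: number;
    while ((boundary = this.buffer.search(/[.!?;:]\s/)) !== -1) {
      this.emit(this.buffer.slice(0, boundary + 1));
      this.buffer = this.buffer.slice(boundary + 2);
    }
    if (this.buffer.length >= this.maxLen) {
      this.emit(this.buffer);
      this.buffer = '';
    }
  }

  flush(): void {
    // Emit whatever remains when the AI stream finishes.
    if (this.buffer.trim()) this.emit(this.buffer);
    this.buffer = '';
  }
}

const out: string[] = [];
const chunker = new SentenceChunker((c) => out.push(c));
for (const delta of ['Hel', 'lo there', '. How', ' are you', '?']) {
  chunker.push(delta);
}
chunker.flush();
console.log(out); // ['Hello there.', 'How are you?']
```

`streamToDeepgram` would then be called from the `emit` callback instead of directly from `onChunk`.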

### 2. Create Deepgram TTS Streaming Service

#### `lib/deepgram-tts-stream.ts`

Note: this class drives playback through the Web Audio API, so it must run in the browser, not inside an API route.

```typescript
import { createClient, LiveClient } from '@deepgram/sdk';

export class DeepgramTTSStream {
  private client: LiveClient;
  private audioQueue: Uint8Array[] = [];
  private isPlaying = false;

  constructor(apiKey: string) {
    const deepgram = createClient(apiKey);
    this.client = deepgram.speak.live({
      model: 'aura-asteria-en',
      encoding: 'linear16',
      sample_rate: 24000,
    });

    this.client.on('data', (data: Buffer) => {
      this.audioQueue.push(new Uint8Array(data));
      this.playNextChunk();
    });
  }

  async streamText(text: string) {
    // Send text chunk to Deepgram for synthesis
    this.client.send(text);
  }

  async flush() {
    // Signal end of text stream
    this.client.close();
  }

  private async playNextChunk() {
    if (this.isPlaying || this.audioQueue.length === 0) return;

    this.isPlaying = true;
    const chunk = this.audioQueue.shift()!;

    // Play audio chunk using Web Audio API
    await this.playAudioChunk(chunk);

    this.isPlaying = false;
    this.playNextChunk(); // Play next if available
  }

  private async playAudioChunk(chunk: Uint8Array) {
    const audioContext = new AudioContext({ sampleRate: 24000 });
    const audioBuffer = audioContext.createBuffer(
      1, // mono
      chunk.length / 2, // 16-bit samples
      24000
    );

    const channelData = audioBuffer.getChannelData(0);
    for (let i = 0; i < chunk.length / 2; i++) {
      // Decode 16-bit little-endian PCM to float32, reinterpreting as signed
      let sample = chunk[i * 2] | (chunk[i * 2 + 1] << 8);
      if (sample >= 0x8000) sample -= 0x10000;
      channelData[i] = sample / 32768.0;
    }

    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(audioContext.destination);

    return new Promise((resolve) => {
      source.onended = resolve;
      source.start();
    });
  }
}
```
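The inner PCM loop is the easiest part of this class to get wrong: linear16 samples are little-endian and signed, so the high byte must be sign-extended before scaling. A self-contained version of just that conversion:

```typescript
// Decode 16-bit little-endian signed PCM bytes to float32 samples in [-1, 1).
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const out = new Float32Array(bytes.length / 2);
  for (let i = 0; i < out.length; i++) {
    let sample = bytes[i * 2] | (bytes[i * 2 + 1] << 8);
    if (sample >= 0x8000) sample -= 0x10000; // reinterpret as signed
    out[i] = sample / 32768;
  }
  return out;
}

// 0x0000 → 0, 0x7FFF → ~0.99997, 0x8000 → -1
const samples = pcm16ToFloat32(new Uint8Array([0x00, 0x00, 0xff, 0x7f, 0x00, 0x80]));
console.log(Array.from(samples));
```

Without the sign reinterpretation, negative samples decode as loud positive values and playback becomes harsh static.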

### 3. Create Server-Sent Events (SSE) Endpoint for TTS

#### `app/api/chat-with-tts/route.ts`

```typescript
import { DeepgramTTSStream } from '@/lib/deepgram-tts-stream';
import { streamText } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages } = await request.json();

  // Create a TransformStream for SSE
  const stream = new TransformStream();
  const writer = stream.writable.getWriter();
  const encoder = new TextEncoder();

  // Start streaming AI response
  (async () => {
    const ttsStream = new DeepgramTTSStream(process.env.DEEPGRAM_API_KEY!);

    try {
      const result = streamText({
        model: google('gemini-2.0-flash-exp'),
        messages,
        async onChunk({ chunk }) {
          if (chunk.type === 'text-delta') {
            // Send text to client
            await writer.write(
              encoder.encode(`data: ${JSON.stringify({ text: chunk.textDelta })}\n\n`)
            );

            // Stream to Deepgram TTS
            await ttsStream.streamText(chunk.textDelta);
          }
        },
      });

      await result.text; // Wait for completion
      await ttsStream.flush();

      await writer.write(encoder.encode('data: [DONE]\n\n'));
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      await writer.write(
        encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`)
      );
    } finally {
      await writer.close();
    }
  })();

  return new Response(stream.readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```

### 4. Update Client to Consume SSE with TTS

#### `components/ChatInterface.tsx`

```typescript
const [isTTSEnabled, setIsTTSEnabled] = useState(false);
const ttsStreamRef = useRef<DeepgramTTSStream | null>(null);

async function sendMessageWithTTS(message: string) {
  const response = await fetch('/api/chat-with-tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...messages, { role: 'user', content: message }] }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  // Initialize TTS stream. The constructor needs a client-safe key or
  // short-lived token -- never expose the server API key to the browser.
  if (isTTSEnabled) {
    ttsStreamRef.current = new DeepgramTTSStream(deepgramClientToken);
  }

  let fullText = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    const chunk = decoder.decode(value);
    const lines = chunk.split('\n');

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') {
          if (ttsStreamRef.current) {
            await ttsStreamRef.current.flush();
          }
          break;
        }

        try {
          const parsed = JSON.parse(data);
          if (parsed.text) {
            fullText += parsed.text;
            // Update UI with incremental text
            setMessages((prev) => {
              const last = prev[prev.length - 1];
              if (last && last.role === 'assistant') {
                return [...prev.slice(0, -1), { ...last, content: fullText }];
              }
              return [...prev, { role: 'assistant', content: fullText }];
            });

            // Stream to TTS
            if (ttsStreamRef.current) {
              await ttsStreamRef.current.streamText(parsed.text);
            }
          }
        } catch (e) {
          console.error('Failed to parse SSE data:', e);
        }
      }
    }
  }
}
```
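One caveat for the loop above: a single `reader.read()` can end mid-line, so splitting each chunk on `\n` in isolation will occasionally tear a `data:` frame in half and fail the `JSON.parse`. A small stateful parser that carries the trailing partial line between reads avoids this (a sketch):

```typescript
// Incremental SSE "data:" line parser that tolerates frames split across reads.
class SSELineParser {
  private pending = '';

  // Feed one decoded network chunk; returns the complete data payloads found.
  feed(chunk: string): string[] {
    this.pending += chunk;
    const lines = this.pending.split('\n');
    this.pending = lines.pop() ?? ''; // keep the (possibly partial) last line
    return lines
      .filter((line) => line.startsWith('data: '))
      .map((line) => line.slice(6));
  }
}

const parser = new SSELineParser();
const events: string[] = [];
// Simulate a frame torn across two reads.
for (const chunk of ['data: {"text":"Hel', 'lo"}\n\ndata: [DONE]\n\n']) {
  events.push(...parser.feed(chunk));
}
console.log(events); // ['{"text":"Hello"}', '[DONE]']
```

In the component, `parser.feed(decoder.decode(value))` would replace the per-chunk `split('\n')`.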

## Alternative: Use Deepgram's Native Streaming TTS

Deepgram has a WebSocket-based streaming TTS API that's even more efficient:

```typescript
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

connection.on('open', () => {
  console.log('TTS connection established');
});

connection.on('data', (audioData: Buffer) => {
  // Play audio chunk immediately
  playAudioBuffer(audioData);
});

// As AI chunks arrive, send to Deepgram
aiStream.on('text-delta', (text) => {
  connection.send(text);
});

// When AI completes
aiStream.on('finish', () => {
  connection.close();
});
```

## Implementation Steps

1. **Research Deepgram TTS Streaming API**
   - Review docs: https://developers.deepgram.com/docs/tts-streaming
   - Test WebSocket connection manually
   - Understand audio format and buffering

2. **Create TTS streaming service**
   - `lib/deepgram-tts-stream.ts`
   - Implement audio queue and playback
   - Handle reconnection and errors

3. **Modify API route for streaming**
   - Create `/api/chat-with-tts` route
   - Implement SSE response
   - Connect AI stream to TTS stream

4. **Update client components**
   - Add TTS toggle in UI
   - Implement SSE consumption
   - Connect to audio playback

5. **Test with Playwright MCP**
   - Enable TTS
   - Send message
   - Verify audio starts playing quickly (< 2s)
   - Verify audio quality
   - Test error handling (network drop, TTS failure)

6. **Add Magnitude test**

   ```typescript
   test('TTS streams audio with low latency', async (agent) => {
     await agent.open('http://localhost:3000/chat');
     await agent.act('Enable TTS in settings');
     await agent.act('Send message "Hello"');

     await agent.check('Audio starts playing within 2 seconds');
     await agent.check('Audio continues as AI generates response');
     await agent.check('Audio completes without gaps');
   });
   ```

## Performance Targets

- **Time to first audio:** < 2 seconds (vs current 4-13s)
- **Perceived latency:** Near real-time streaming
- **Audio quality:** No degradation from current implementation
- **Reliability:** Graceful fallback if streaming fails
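The reliability target can be met by racing the streaming path against a timeout and reverting to batch synthesis on any failure. A sketch, where `speakStreaming` and `speakBatch` are hypothetical stand-ins for the two paths:

```typescript
// Try streaming TTS first; fall back to batch synthesis on error or timeout.
async function speakWithFallback(
  text: string,
  speakStreaming: (t: string) => Promise<void>,
  speakBatch: (t: string) => Promise<void>,
  timeoutMs = 3000,
): Promise<'streaming' | 'batch'> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('TTS stream timeout')), timeoutMs);
  });
  try {
    await Promise.race([speakStreaming(text), timeout]);
    return 'streaming';
  } catch {
    await speakBatch(text); // degraded but working path
    return 'batch';
  } finally {
    clearTimeout(timer); // avoid a stray rejection after success
  }
}

(async () => {
  const mode = await speakWithFallback(
    'hello',
    async () => { throw new Error('stream unavailable'); }, // simulated failure
    async () => { /* pretend batch synthesis succeeded */ },
    50,
  );
  console.log(mode); // 'batch'
})();
```

Clearing the timer in `finally` matters: otherwise a late timeout rejection after a successful stream becomes an unhandled rejection.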
## Success Criteria

- ✅ TTS audio starts playing within 2 seconds of AI response beginning
- ✅ Audio streams continuously as AI generates text
- ✅ No perceptible gaps or stuttering in audio playback
- ✅ Graceful fallback to batch TTS if streaming fails
- ✅ Playwright MCP manual test passes
- ✅ Magnitude test passes
- ✅ No regression in audio quality

## Files to Create

1. `lib/deepgram-tts-stream.ts` - TTS streaming service
2. `app/api/chat-with-tts/route.ts` - SSE endpoint for TTS
3. `tests/playwright/tts-streaming.spec.ts` - Manual Playwright test
4. `tests/magnitude/tts-streaming.mag.ts` - Magnitude test

## Files to Update

1. `components/ChatInterface.tsx` - Add TTS streaming consumption
2. `app/theme.ts` - Add TTS toggle styling if needed