# Plan: Stream AI Output to Deepgram for Faster TTS Synthesis

**Priority:** MEDIUM

**Dependencies:** None

**Affects:** Voice interaction latency, user experience

## Overview

Currently, the app waits for the complete AI response before sending it to Deepgram for TTS. This creates a laggy experience. By streaming the AI output directly to Deepgram as it's generated, we can start playing audio much faster and create a more responsive voice interaction.

## Current Implementation

### Current Flow (SLOW)

```
User speaks → Deepgram transcribe → Send to AI
        ↓
Wait for full response (3-10s)
        ↓
Send complete text to Deepgram TTS
        ↓
Wait for audio generation (1-3s)
        ↓
Play audio
```

**Total latency:** 4-13 seconds before first audio plays

## Proposed Implementation

### New Flow (FAST)

```
User speaks → Deepgram transcribe → Stream to AI
        ↓
Stream chunks to Deepgram TTS
        ↓ (chunks arrive)
Play audio chunks immediately
```

**Total latency:** 1-2 seconds before first audio plays

## Technical Approach

### 1. Modify AI SDK Integration

The client currently uses `useChat` from the Vercel AI SDK, and TTS only runs once the completion has fully arrived:

```typescript
// Current (app/api/chat/route.ts)
const result = await streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
});

return result.toDataStreamResponse();
```

Need to add TTS streaming:

```typescript
// New approach
const result = streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
  async onChunk({ chunk }) {
    // Stream each text delta to Deepgram TTS as it arrives
    if (chunk.type === 'text-delta') {
      await streamToDeepgram(chunk.textDelta);
    }
  },
});

return result.toDataStreamResponse();
```
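
`streamToDeepgram` above doesn't exist yet; a minimal sketch follows, assuming a module-level `speak.live` connection. It buffers deltas until a clause boundary before sending, since phrase-sized inputs tend to synthesize with better prosody than single tokens. The helper name, the buffering heuristic, and the 120-character cap are all assumptions:

```typescript
import { createClient } from '@deepgram/sdk';

// Module-level connection for illustration; real code would manage its
// lifecycle (reconnects, one connection per response, etc.).
const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

let buffer = '';

// Accumulate deltas; flush to Deepgram at punctuation or a length cap so it
// receives phrases rather than individual tokens.
export async function streamToDeepgram(textDelta: string) {
  buffer += textDelta;
  if (/[.!?,;:]\s*$/.test(buffer) || buffer.length > 120) {
    connection.send(buffer); // same send() call used by the service class below
    buffer = '';
  }
}
```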

### 2. Create Deepgram TTS Streaming Service

#### `lib/deepgram-tts-stream.ts`

```typescript
import { createClient, LiveClient } from '@deepgram/sdk';

// NOTE: playback here uses the Web Audio API, which only exists in the
// browser. If this class is instantiated server-side (as in section 3),
// the audio chunks must be forwarded to the client instead of played.
export class DeepgramTTSStream {
  // deepgram.speak.live() may have a dedicated client type depending on
  // SDK version; LiveClient is used loosely here.
  private client: LiveClient;
  private audioQueue: Uint8Array[] = [];
  private isPlaying = false;
  // One shared AudioContext; creating one per chunk is wasteful.
  private audioContext = new AudioContext({ sampleRate: 24000 });

  constructor(apiKey: string) {
    const deepgram = createClient(apiKey);
    this.client = deepgram.speak.live({
      model: 'aura-asteria-en',
      encoding: 'linear16',
      sample_rate: 24000,
    });

    this.client.on('data', (data: Buffer) => {
      this.audioQueue.push(new Uint8Array(data));
      this.playNextChunk();
    });
  }

  async streamText(text: string) {
    // Send a text chunk to Deepgram for synthesis
    this.client.send(text);
  }

  async flush() {
    // Signal end of the text stream; already-queued audio keeps playing
    this.client.close();
  }

  private async playNextChunk() {
    if (this.isPlaying || this.audioQueue.length === 0) return;

    this.isPlaying = true;
    const chunk = this.audioQueue.shift()!;

    // Play the chunk, then drain the rest of the queue
    await this.playAudioChunk(chunk);

    this.isPlaying = false;
    this.playNextChunk();
  }

  private async playAudioChunk(chunk: Uint8Array) {
    const sampleCount = chunk.length / 2; // 16-bit samples
    const audioBuffer = this.audioContext.createBuffer(1 /* mono */, sampleCount, 24000);

    // Convert little-endian *signed* 16-bit PCM to float32. A DataView
    // handles the sign bit; a plain bitwise OR of the two bytes would yield
    // unsigned values and badly distorted audio.
    const view = new DataView(chunk.buffer, chunk.byteOffset, chunk.byteLength);
    const channelData = audioBuffer.getChannelData(0);
    for (let i = 0; i < sampleCount; i++) {
      channelData[i] = view.getInt16(i * 2, true) / 32768.0;
    }

    const source = this.audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.audioContext.destination);

    return new Promise<void>((resolve) => {
      source.onended = () => resolve();
      source.start();
    });
  }
}
```
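
Usage from a browser component might look like the following; `fetchTtsToken` is a hypothetical endpoint that returns a short-lived Deepgram key, since the real key must not ship to the browser:

```typescript
// Hypothetical usage; fetchTtsToken() is an assumed helper, not existing code.
const tts = new DeepgramTTSStream(await fetchTtsToken());

await tts.streamText('Hello, ');
await tts.streamText('world.');
await tts.flush(); // end of input; already-queued audio finishes playing
```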

### 3. Create Server-Sent Events (SSE) Endpoint for TTS

#### `app/api/chat-with-tts/route.ts`

```typescript
import { DeepgramTTSStream } from '@/lib/deepgram-tts-stream';
import { streamText } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages } = await request.json();

  // Create a TransformStream for SSE
  const stream = new TransformStream();
  const writer = stream.writable.getWriter();
  const encoder = new TextEncoder();

  // Start streaming the AI response without blocking the Response below
  (async () => {
    // CAVEAT: DeepgramTTSStream plays audio via the Web Audio API, which
    // does not exist on the server. See the forwarding sketch below for one
    // way to get the audio to the browser from here.
    const ttsStream = new DeepgramTTSStream(process.env.DEEPGRAM_API_KEY!);

    try {
      const result = streamText({
        model: google('gemini-2.0-flash-exp'),
        messages,
        async onChunk({ chunk }) {
          if (chunk.type === 'text-delta') {
            // Send the text delta to the client
            await writer.write(
              encoder.encode(`data: ${JSON.stringify({ text: chunk.textDelta })}\n\n`)
            );

            // Stream to Deepgram TTS
            await ttsStream.streamText(chunk.textDelta);
          }
        },
      });

      await result.text; // Wait for completion
      await ttsStream.flush();

      await writer.write(encoder.encode('data: [DONE]\n\n'));
    } catch (error) {
      // `error` is unknown in TypeScript; narrow before reading .message
      const message = error instanceof Error ? error.message : String(error);
      await writer.write(encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`));
    } finally {
      await writer.close();
    }
  })();

  return new Response(stream.readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
```
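
As written, the audio Deepgram returns stays on the server and is never delivered to the browser. One way to close that gap, sketched under the assumption that `DeepgramTTSStream` grows a hypothetical `onAudio` hook exposing its raw `data` events, is to relay chunks over the same SSE stream (placed inside the async block above, before `streamText` runs):

```typescript
// Hypothetical onAudio() hook: surfaces the chunks from the class's 'data'
// listener instead of playing them, so the route can relay them as SSE events.
ttsStream.onAudio(async (audio: Uint8Array) => {
  await writer.write(
    encoder.encode(
      `data: ${JSON.stringify({ audio: Buffer.from(audio).toString('base64') })}\n\n`
    )
  );
});
```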

### 4. Update Client to Consume SSE with TTS

#### `components/ChatInterface.tsx`

```typescript
const [isTTSEnabled, setIsTTSEnabled] = useState(false);
const ttsStreamRef = useRef<DeepgramTTSStream | null>(null);

async function sendMessageWithTTS(message: string) {
  const response = await fetch('/api/chat-with-tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...messages, { role: 'user', content: message }] }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  // Initialize the TTS stream. The constructor requires an API key;
  // fetchTtsToken() is the hypothetical short-lived-key endpoint from the
  // usage sketch in section 2, since the real key can't live in the browser.
  if (isTTSEnabled) {
    ttsStreamRef.current = new DeepgramTTSStream(await fetchTtsToken());
  }

  let fullText = '';
  let finished = false;

  while (!finished) {
    const { done, value } = await reader.read();
    if (done) break;

    // stream: true keeps multi-byte characters intact across reads.
    // (A production parser should also buffer partial SSE lines.)
    const chunk = decoder.decode(value, { stream: true });
    const lines = chunk.split('\n');

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;

      const data = line.slice(6);
      if (data === '[DONE]') {
        if (ttsStreamRef.current) {
          await ttsStreamRef.current.flush();
        }
        finished = true; // a bare break would only exit the inner loop
        break;
      }

      try {
        const parsed = JSON.parse(data);
        if (parsed.text) {
          fullText += parsed.text;
          // Update UI with incremental text
          setMessages((prev) => {
            const last = prev[prev.length - 1];
            if (last && last.role === 'assistant') {
              return [...prev.slice(0, -1), { ...last, content: fullText }];
            }
            return [...prev, { role: 'assistant', content: fullText }];
          });

          // Stream to TTS
          if (ttsStreamRef.current) {
            await ttsStreamRef.current.streamText(parsed.text);
          }
        }
      } catch (e) {
        console.error('Failed to parse SSE data:', e);
      }
    }
  }
}
```
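
The success criteria below call for a graceful fallback to batch TTS when streaming fails. A minimal shape for that, assuming a `synthesizeBatch` wrapper around the current non-streaming TTS call and a plain `sendMessagePlain` chat path (both hypothetical names):

```typescript
// Hedged sketch: prefer streaming TTS, fall back to the existing batch path.
async function speakWithFallback(message: string) {
  try {
    await sendMessageWithTTS(message);
  } catch (err) {
    console.warn('Streaming TTS failed, falling back to batch TTS:', err);
    const fullText = await sendMessagePlain(message); // existing non-TTS chat call
    await synthesizeBatch(fullText); // existing batch Deepgram TTS wrapper
  }
}
```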

## Alternative: Use Deepgram's Native Streaming TTS

Deepgram's WebSocket-based streaming TTS API (the same `speak.live` interface the service class above wraps) can also be driven directly from the AI stream, which is even more efficient:

```typescript
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

connection.on('open', () => {
  console.log('TTS connection established');
});

connection.on('data', (audioData: Buffer) => {
  // Play the audio chunk immediately (playAudioBuffer as defined by the
  // client playback layer)
  playAudioBuffer(audioData);
});

// As AI chunks arrive, send them to Deepgram.
// (Pseudocode: streamText results are consumed via onChunk or async
// iteration, not an EventEmitter; adapt these handlers accordingly.)
aiStream.on('text-delta', (text) => {
  connection.send(text);
});

// When the AI completes
aiStream.on('finish', () => {
  connection.close();
});
```

## Implementation Steps

1. **Research Deepgram TTS Streaming API**
   - Review docs: https://developers.deepgram.com/docs/tts-streaming
   - Test the WebSocket connection manually
   - Understand audio format and buffering

2. **Create TTS streaming service**
   - `lib/deepgram-tts-stream.ts`
   - Implement audio queue and playback
   - Handle reconnection and errors

3. **Modify API route for streaming**
   - Create `/api/chat-with-tts` route
   - Implement SSE response
   - Connect AI stream to TTS stream

4. **Update client components**
   - Add TTS toggle in UI
   - Implement SSE consumption
   - Connect to audio playback

5. **Test with Playwright MCP** (see the spec sketch after this list)
   - Enable TTS
   - Send message
   - Verify audio starts playing quickly (< 2s)
   - Verify audio quality
   - Test error handling (network drop, TTS failure)

6. **Add Magnitude test**

   ```typescript
   test('TTS streams audio with low latency', async (agent) => {
     await agent.open('http://localhost:3000/chat');
     await agent.act('Enable TTS in settings');
     await agent.act('Send message "Hello"');

     await agent.check('Audio starts playing within 2 seconds');
     await agent.check('Audio continues as AI generates response');
     await agent.check('Audio completes without gaps');
   });
   ```
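
For step 5, the manual Playwright spec might take the shape below. The accessible-switch selector and the `data-tts-playing` attribute are assumptions about the UI, not hooks that exist today:

```typescript
import { test, expect } from '@playwright/test';

test('TTS audio starts within 2 seconds of sending a message', async ({ page }) => {
  await page.goto('http://localhost:3000/chat');

  // Assumed selector: the TTS toggle is exposed as an accessible switch.
  await page.getByRole('switch', { name: /tts/i }).click();

  const t0 = Date.now();
  await page.getByRole('textbox').fill('Hello');
  await page.keyboard.press('Enter');

  // Assumed hook: the app sets data-tts-playing="true" when playback begins.
  await page.waitForSelector('[data-tts-playing="true"]', { timeout: 2000 });
  expect(Date.now() - t0).toBeLessThan(2000);
});
```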

## Performance Targets

- **Time to first audio:** < 2 seconds (vs current 4-13s; see the measurement sketch below)
- **Perceived latency:** Near real-time streaming
- **Audio quality:** No degradation from current implementation
- **Reliability:** Graceful fallback if streaming fails
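
A quick way to verify the time-to-first-audio target during manual testing, using the standard Performance API; the mark names and the idea of instrumenting `playAudioChunk` are ours:

```typescript
// Mark when the user's message is sent...
performance.mark('tts:send');

// ...and again when the first chunk starts playing (e.g. at the top of the
// first playAudioChunk() call in DeepgramTTSStream):
performance.mark('tts:first-audio');

performance.measure('time-to-first-audio', 'tts:send', 'tts:first-audio');
const [measure] = performance.getEntriesByName('time-to-first-audio');
console.log(`time to first audio: ${measure.duration.toFixed(0)} ms`); // target < 2000
```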

## Success Criteria

- ✅ TTS audio starts playing within 2 seconds of the AI response beginning
- ✅ Audio streams continuously as the AI generates text
- ✅ No perceptible gaps or stuttering in audio playback
- ✅ Graceful fallback to batch TTS if streaming fails
- ✅ Playwright MCP manual test passes
- ✅ Magnitude test passes
- ✅ No regression in audio quality

## Files to Create

1. `lib/deepgram-tts-stream.ts` - TTS streaming service
2. `app/api/chat-with-tts/route.ts` - SSE endpoint for TTS
3. `tests/playwright/tts-streaming.spec.ts` - Manual Playwright test
4. `tests/magnitude/tts-streaming.mag.ts` - Magnitude test

## Files to Update

1. `components/ChatInterface.tsx` - Add TTS streaming consumption
2. `app/theme.ts` - Add TTS toggle styling if needed