
Plan: Stream AI Output to Deepgram for Faster TTS Synthesis

Priority: MEDIUM
Dependencies: None
Affects: Voice interaction latency, user experience

Overview

Currently, the app waits for the complete AI response before sending it to Deepgram for TTS. This creates a laggy experience. By streaming the AI output directly to Deepgram as it's generated, we can start playing audio much faster and create a more responsive voice interaction.

Current Implementation

Current Flow (SLOW)

User speaks → Deepgram transcribe → Send to AI
                                      ↓
                                  Wait for full response (3-10s)
                                      ↓
                                  Send complete text to Deepgram TTS
                                      ↓
                                  Wait for audio generation (1-3s)
                                      ↓
                                  Play audio

Total latency: 4-13 seconds before first audio plays

Proposed Implementation

New Flow (FAST)

User speaks → Deepgram transcribe → Stream to AI
                                      ↓
                                  Stream chunks to Deepgram TTS
                                      ↓ (chunks arrive)
                                  Play audio chunks immediately

Total latency: 1-2 seconds before first audio plays

Technical Approach

1. Modify AI SDK Integration

The chat route currently uses streamText from the Vercel AI SDK (consumed by useChat on the client); TTS only runs after the complete response has arrived:

// Current (app/api/chat/route.ts)
const result = await streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
});

return result.toDataStreamResponse();

To start synthesis early, we tap the stream and forward each text delta as it arrives (streamToDeepgram is sketched below):

// New approach
const result = streamText({
  model: google('gemini-2.0-flash-exp'),
  messages,
  system: systemPrompt,
  async onChunk({ chunk }) {
    // Forward each text delta to Deepgram TTS as it is generated
    if (chunk.type === 'text-delta') {
      await streamToDeepgram(chunk.textDelta);
    }
  },
});

return result.toDataStreamResponse();
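
A minimal sketch of the streamToDeepgram helper referenced above — it is not part of the codebase yet. Deepgram's streaming TTS buffers incoming text before synthesizing, so batching deltas at sentence boundaries tends to produce smoother audio than sending every token. ttsConnection is assumed to be the live TTS connection from the service in the next section, and sendText follows @deepgram/sdk v3 naming:

// Hypothetical helper: buffer AI deltas and forward complete sentences
let sentenceBuffer = '';

async function streamToDeepgram(textDelta: string) {
  sentenceBuffer += textDelta;

  // Flush everything up to the last sentence-ending punctuation mark
  // followed by whitespace; keep the trailing fragment buffered.
  const match = sentenceBuffer.match(/^[\s\S]*[.!?]["')\]]?\s/);
  if (match) {
    ttsConnection.sendText(match[0]);
    sentenceBuffer = sentenceBuffer.slice(match[0].length);
  }
}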

2. Create Deepgram TTS Streaming Service

lib/deepgram-tts-stream.ts

import { createClient, LiveTTSEvents, SpeakLiveClient } from '@deepgram/sdk';

// Runs in the browser: playback uses the Web Audio API. Event and method
// names follow @deepgram/sdk v3 (LiveTTSEvents, sendText, flush,
// requestClose) — verify against the installed SDK version.
export class DeepgramTTSStream {
  private client: SpeakLiveClient;
  private audioContext = new AudioContext({ sampleRate: 24000 });
  private audioQueue: Uint8Array[] = [];
  private isPlaying = false;

  constructor(apiKey: string) {
    const deepgram = createClient(apiKey);
    this.client = deepgram.speak.live({
      model: 'aura-asteria-en',
      encoding: 'linear16',
      sample_rate: 24000,
    });

    this.client.on(LiveTTSEvents.Audio, (data: Buffer) => {
      this.audioQueue.push(new Uint8Array(data));
      this.playNextChunk();
    });
  }

  streamText(text: string) {
    // Send a text chunk to Deepgram for synthesis
    this.client.sendText(text);
  }

  flush() {
    // Force synthesis of any buffered text, then close the connection
    this.client.flush();
    this.client.requestClose();
  }

  private async playNextChunk() {
    if (this.isPlaying || this.audioQueue.length === 0) return;

    this.isPlaying = true;
    const chunk = this.audioQueue.shift()!;

    // Play this chunk to completion, then drain the rest of the queue
    await this.playAudioChunk(chunk);

    this.isPlaying = false;
    this.playNextChunk();
  }

  private async playAudioChunk(chunk: Uint8Array) {
    const audioBuffer = this.audioContext.createBuffer(
      1, // mono
      chunk.length / 2, // 16-bit samples
      24000
    );

    const channelData = audioBuffer.getChannelData(0);
    const view = new DataView(chunk.buffer, chunk.byteOffset, chunk.byteLength);
    for (let i = 0; i < chunk.length / 2; i++) {
      // Interpret as signed 16-bit little-endian PCM, scale to [-1, 1]
      channelData[i] = view.getInt16(i * 2, true) / 32768;
    }

    const source = this.audioContext.createBufferSource();
    source.buffer = audioBuffer;
    source.connect(this.audioContext.destination);

    return new Promise<void>((resolve) => {
      source.onended = () => resolve();
      source.start();
    });
  }
}
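
A usage sketch, where deepgramKey and aiTextDeltas are placeholders — any string key and any async iterable of text deltas (for example the AI SDK's result.textStream) will do:

const tts = new DeepgramTTSStream(deepgramKey);

for await (const delta of aiTextDeltas) {
  tts.streamText(delta); // synthesis starts while the AI is still writing
}
tts.flush();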

3. Create Server-Sent Events (SSE) Endpoint for Streaming Text

app/api/chat-with-tts/route.ts

import { streamText } from 'ai';
import { google } from '@ai-sdk/google';

export async function POST(request: Request) {
  const { messages } = await request.json();

  // Create a TransformStream for SSE
  const stream = new TransformStream();
  const writer = stream.writable.getWriter();
  const encoder = new TextEncoder();

  // Relay AI text deltas to the client as SSE events. TTS synthesis and
  // playback run in the browser (section 4): DeepgramTTSStream depends on
  // the Web Audio API, which does not exist in a server route.
  (async () => {
    try {
      const result = streamText({
        model: google('gemini-2.0-flash-exp'),
        messages,
        async onChunk({ chunk }) {
          if (chunk.type === 'text-delta') {
            await writer.write(
              encoder.encode(`data: ${JSON.stringify({ text: chunk.textDelta })}\n\n`)
            );
          }
        },
      });

      await result.text; // Wait for completion
      await writer.write(encoder.encode('data: [DONE]\n\n'));
    } catch (error) {
      await writer.write(
        encoder.encode(`data: ${JSON.stringify({ error: (error as Error).message })}\n\n`)
      );
    } finally {
      await writer.close();
    }
  })();

  return new Response(stream.readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      Connection: 'keep-alive',
    },
  });
}
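
If the Deepgram key should stay server-side instead, the route could own the TTS connection and relay audio to the browser as base64 SSE events. A hedged sketch reusing the writer and encoder from the route above (event names per @deepgram/sdk v3):

import { createClient, LiveTTSEvents } from '@deepgram/sdk';

const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);
const tts = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

tts.on(LiveTTSEvents.Audio, async (audio: Buffer) => {
  // The client decodes base64 back to 16-bit PCM and queues it for playback
  await writer.write(
    encoder.encode(`data: ${JSON.stringify({ audio: audio.toString('base64') })}\n\n`)
  );
});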

4. Update Client to Consume SSE with TTS

components/ChatInterface.tsx

const [isTTSEnabled, setIsTTSEnabled] = useState(false);
const ttsStreamRef = useRef<DeepgramTTSStream | null>(null);

async function sendMessageWithTTS(message: string) {
  const response = await fetch('/api/chat-with-tts', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...messages, { role: 'user', content: message }] }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();

  // Initialize the browser-side TTS connection. This needs a Deepgram key
  // in the browser; prefer minting a short-lived key from a server endpoint
  // (see the sketch after this function) over a long-lived NEXT_PUBLIC_ one.
  if (isTTSEnabled) {
    ttsStreamRef.current = new DeepgramTTSStream(process.env.NEXT_PUBLIC_DEEPGRAM_API_KEY!);
  }

  let fullText = '';
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // SSE events can be split across reads, so keep partial lines buffered
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') {
          ttsStreamRef.current?.flush();
          return;
        }

        try {
          const parsed = JSON.parse(data);
          if (parsed.text) {
            fullText += parsed.text;
            // Update UI with incremental text
            setMessages((prev) => {
              const last = prev[prev.length - 1];
              if (last && last.role === 'assistant') {
                return [...prev.slice(0, -1), { ...last, content: fullText }];
              }
              return [...prev, { role: 'assistant', content: fullText }];
            });

            // Stream the delta to the TTS connection
            ttsStreamRef.current?.streamText(parsed.text);
          }
        } catch (e) {
          console.error('Failed to parse SSE data:', e);
        }
      }
    }
  }
}
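
The short-lived key endpoint mentioned above might look like this — a hedged sketch assuming the Deepgram management API's temporary project keys (createProjectKey with time_to_live_in_seconds); the route path, env var names, and scopes are illustrative:

// app/api/tts-token/route.ts (hypothetical)
import { createClient } from '@deepgram/sdk';

export async function GET() {
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY!);

  const { result, error } = await deepgram.manage.createProjectKey(
    process.env.DEEPGRAM_PROJECT_ID!,
    {
      comment: 'browser TTS session',
      scopes: ['usage:write'],
      time_to_live_in_seconds: 300, // key expires after 5 minutes
    }
  );

  if (error) {
    return Response.json({ error: error.message }, { status: 500 });
  }
  return Response.json({ key: result.key });
}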

Alternative: Drive Deepgram's Streaming TTS Directly

The service above already rides on Deepgram's WebSocket-based streaming TTS API; driven directly, without the wrapper class, the wiring looks like this:

import { createClient, LiveTTSEvents } from '@deepgram/sdk';

const deepgram = createClient(process.env.DEEPGRAM_API_KEY);

const connection = deepgram.speak.live({
  model: 'aura-asteria-en',
  encoding: 'linear16',
  sample_rate: 24000,
});

connection.on(LiveTTSEvents.Open, () => {
  console.log('TTS connection established');
});

connection.on(LiveTTSEvents.Audio, (audioData: Buffer) => {
  // Play each audio chunk immediately (playAudioBuffer is a placeholder)
  playAudioBuffer(audioData);
});

// As AI text deltas arrive, forward them (result is streamText's return value)
for await (const textDelta of result.textStream) {
  connection.sendText(textDelta);
}

// When the AI stream completes, flush buffered text and close
connection.flush();
connection.requestClose();

Implementation Steps

  1. Research Deepgram TTS Streaming API

  2. Create TTS streaming service

    • lib/deepgram-tts-stream.ts
    • Implement audio queue and playback
    • Handle reconnection and errors (see the backoff sketch after this list)
  3. Modify API route for streaming

    • Create /api/chat-with-tts route
    • Implement SSE response
    • Connect AI stream to TTS stream
  4. Update client components

    • Add TTS toggle in UI
    • Implement SSE consumption
    • Connect to audio playback
  5. Test with Playwright MCP

    • Enable TTS
    • Send message
    • Verify audio starts playing quickly (< 2s)
    • Verify audio quality
    • Test error handling (network drop, TTS failure)
  6. Add Magnitude test

    test('TTS streams audio with low latency', async (agent) => {
      await agent.open('http://localhost:3000/chat');
      await agent.act('Enable TTS in settings');
      await agent.act('Send message "Hello"');
    
      await agent.check('Audio starts playing within 2 seconds');
      await agent.check('Audio continues as AI generates response');
      await agent.check('Audio completes without gaps');
    });
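
The reconnection handling called for in step 2 might look like this — a standalone sketch with capped exponential backoff; connect and onFatalError are hypothetical hooks, not members of the service above:

// Retry the TTS connection at 500ms, 1s, 2s, then give up so the caller
// can fall back to batch (non-streaming) TTS. The caller re-invokes this
// with attempt + 1 each time the connection drops again.
function reconnectWithBackoff(
  connect: () => void,
  onFatalError: (err: Error) => void,
  attempt = 0
) {
  if (attempt >= 3) {
    onFatalError(new Error('TTS stream lost after 3 reconnect attempts'));
    return;
  }
  setTimeout(connect, 500 * 2 ** attempt);
}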

Performance Targets

  • Time to first audio: < 2 seconds (vs current 4-13s)
  • Perceived latency: Near real-time streaming
  • Audio quality: No degradation from current implementation
  • Reliability: Graceful fallback if streaming fails
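
To verify the first target, a hypothetical onFirstAudioChunk callback (not implemented on the service above) could timestamp playback start in the browser:

// Measure time-to-first-audio, assuming an onFirstAudioChunk hook is added
const t0 = performance.now();
ttsStream.onFirstAudioChunk = () => {
  console.log(`Time to first audio: ${Math.round(performance.now() - t0)} ms`);
};
sendMessageWithTTS('Hello');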

Success Criteria

  • TTS audio starts playing within 2 seconds of AI response beginning
  • Audio streams continuously as AI generates text
  • No perceptible gaps or stuttering in audio playback
  • Graceful fallback to batch TTS if streaming fails
  • Playwright MCP manual test passes
  • Magnitude test passes
  • No regression in audio quality

Files to Create

  1. lib/deepgram-tts-stream.ts - TTS streaming service
  2. app/api/chat-with-tts/route.ts - SSE endpoint streaming AI text to the client
  3. tests/playwright/tts-streaming.spec.ts - Manual Playwright test
  4. tests/magnitude/tts-streaming.mag.ts - Magnitude test

Files to Update

  1. components/ChatInterface.tsx - Add TTS streaming consumption
  2. app/theme.ts - Add TTS toggle styling if needed