Voice Mode PRD

User Flows

Flow 1: Starting Voice Conversation (No Previous Messages)

  1. User clicks "Start Voice Conversation" button
  2. System enters listening mode
  3. Button shows "Listening... Start speaking"
  4. Microphone indicator appears
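
Flow 1's listening step is where the microphone is opened. The PRD does not name a speech-to-text provider; the sketch below assumes the browser's built-in SpeechRecognition API purely for illustration (a hosted streaming STT service would replace this with its own client).

```typescript
// Sketch only: assumes the browser SpeechRecognition API, which is not
// confirmed by the PRD. The status labels match the button text in Flow 1/3.
export function startListening(onStatus: (label: string) => void) {
  const SR =
    (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;
  const recognition = new SR();
  recognition.continuous = true;      // keep the mic open across phrases
  recognition.interimResults = true;  // needed for interim transcript updates

  recognition.onstart = () => onStatus("Listening... Start speaking");
  recognition.onspeechstart = () => onStatus("Speaking...");

  recognition.start();
  return recognition; // caller keeps this handle so Flow 6 can close the mic
}
```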

Flow 2: Starting Voice Conversation (With Previous AI Message)

  1. User clicks "Start Voice Conversation" button
  2. System checks for most recent AI message
  3. If found and not already spoken in this session:
    • System generates and plays TTS for that message
    • Button shows "Generating speech..." then "AI is speaking..."
    • Skip button appears
  4. After audio finishes OR user clicks skip:
    • System enters listening mode
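
Flow 2 step 3 ("not already spoken in this session") implies the client tracks which assistant messages have been read aloud. A minimal sketch of that check follows; the message shape (id, role, content) is an assumption, not the app's actual type.

```typescript
// Sketch of the Flow 2 check. "Session" is read as one voice-mode session:
// the set is cleared on exit so re-entering replays the latest response,
// which matches Test 3.
interface ChatMessage { id: string; role: "user" | "assistant"; content: string }

const spokenThisSession = new Set<string>();

export function resetSpokenSession(): void {
  spokenThisSession.clear(); // call when leaving voice mode (Flow 6)
}

export function messageToSpeakOnEntry(messages: ChatMessage[]): ChatMessage | null {
  // Only the most recent assistant message is ever eligible (Critical Rule 1).
  const latest = [...messages].reverse().find((m) => m.role === "assistant");
  if (!latest || spokenThisSession.has(latest.id)) return null;
  spokenThisSession.add(latest.id);
  return latest;
}
```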

Flow 3: User Speaks

  1. User speaks (while in listening state)
  2. System detects speech, button shows "Speaking..."
  3. System receives interim transcripts (updates display)
  4. System receives finalized phrases (appends to transcript)
  5. After each finalized phrase, 3-second silence timer starts
  6. Button shows countdown: "Speaking... (auto-submits in 2.1s)"
  7. If user continues speaking, timer resets
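
The 3-second silence timer in steps 5-7 restarts on every finalized phrase and drives the countdown label. A sketch of that behavior, where SILENCE_MS and the callback names are illustrative rather than the app's real identifiers:

```typescript
// Sketch of the Flow 3 silence timer: starts after each finalized phrase,
// resets if another phrase arrives, and reports a countdown for the button.
const SILENCE_MS = 3000;

export function createSilenceTimer(
  onCountdown: (remainingSeconds: string) => void, // "auto-submits in 2.1s"
  onTimeout: () => void,                           // fires SILENCE_TIMEOUT
) {
  let deadline = 0;
  let tick: ReturnType<typeof setInterval> | undefined;

  const stop = () => {
    if (tick) clearInterval(tick);
    tick = undefined;
  };

  // Called after every FINALIZED_PHRASE; calling it again resets the countdown.
  const restart = () => {
    stop();
    deadline = Date.now() + SILENCE_MS;
    tick = setInterval(() => {
      const remaining = deadline - Date.now();
      if (remaining <= 0) {
        stop();
        onTimeout();
      } else {
        onCountdown((remaining / 1000).toFixed(1));
      }
    }, 100);
  };

  return { restart, stop };
}
```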

Flow 4: Submit and AI Response

  1. After 3 seconds of silence, transcript is submitted
  2. Button shows "Processing..."
  3. User message appears in chat
  4. AI streams response (appears in chat)
  5. When streaming completes:
    • System generates TTS for AI response
    • Button shows "Generating speech..."
    • When TTS ready, plays audio
    • Button shows "AI is speaking..."
    • Skip button appears
  6. After audio finishes OR user clicks skip:
    • System returns to listening mode
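
Steps 5-6 hinge on generating TTS only after streaming completes, then playing it while keeping a handle available for Skip. The sketch below assumes a hypothetical /api/tts endpoint returning an audio blob; the real endpoint and response shape may differ.

```typescript
// Sketch of Flow 4 steps 5-6. The /api/tts endpoint is an assumption made for
// illustration, not the app's confirmed API.
export async function speakAssistantMessage(
  text: string,
  onStatus: (label: string) => void,
): Promise<HTMLAudioElement> {
  onStatus("Generating speech...");
  const res = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const audio = new Audio(URL.createObjectURL(await res.blob()));

  audio.onended = () => onStatus("Listening... Start speaking"); // back to listening
  onStatus("AI is speaking...");
  await audio.play();
  return audio; // kept so a Skip click can stop it (see Flow 5)
}
```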

Flow 5: Skipping AI Audio

  1. While AI is generating or speaking (button shows "Generating speech..." or "AI is speaking...")
  2. Skip button is visible
  3. User clicks Skip
  4. Audio stops immediately
  5. System enters listening mode
  6. Button shows "Listening... Start speaking"
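
The essential property of Flow 5 is that Skip stops playback synchronously before any state change. A sketch, with currentAudio and enterListening as placeholder names:

```typescript
// Sketch of the Skip handler. Stopping the audio and clearing onended first
// prevents the normal "finished" transition from firing later.
export function onSkip(
  currentAudio: HTMLAudioElement | null,
  enterListening: () => void,
) {
  if (currentAudio) {
    currentAudio.onended = null;  // suppress the TTS_FINISHED path
    currentAudio.pause();
    currentAudio.currentTime = 0;
  }
  enterListening(); // button shows "Listening... Start speaking"
}
```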

Flow 6: Exiting Voice Mode

  1. User clicks voice button (at any time)
  2. System stops all audio
  3. System closes microphone connection
  4. Returns to text mode
  5. Button shows "Start Voice Conversation"
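
Exiting voice mode is a teardown of whatever the earlier sketches created; the handle names below are illustrative.

```typescript
// Sketch of the Flow 6 teardown: stop audio, close the microphone, return to
// text mode. The recognition/audio handles come from the earlier sketches.
export function exitVoiceMode(
  recognition: { stop(): void } | null,
  audio: HTMLAudioElement | null,
  setTextMode: () => void,
) {
  audio?.pause();       // stop any AI speech mid-sentence
  recognition?.stop();  // close the microphone connection
  setTextMode();        // button shows "Start Voice Conversation"
}
```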

Critical Rules

  1. Latest Message Only: AI ONLY plays the most recent assistant message. Never re-play old messages.
  2. Skip Always Works: Skip button must IMMEDIATELY stop audio and return to listening.
  3. One Message Per Turn: Each user speech → one submission → one AI response → one audio playback.
  4. Clean State: Every state transition should cancel any incompatible ongoing operations.
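
Rule 4 is the easiest to violate in practice (for example, a TTS request resolving after the user has already skipped). One possible way to enforce it, shown here as an assumption rather than the app's actual mechanism, is a single per-turn AbortController that every transition aborts before starting new work.

```typescript
// Sketch of one "Clean State" strategy: abort everything tied to the previous
// state before entering the next one, so stale fetches or timers cannot act
// after the state has moved on.
let turnController = new AbortController();

export function beginTransition(): AbortSignal {
  turnController.abort();                 // cancel leftovers from the previous state
  turnController = new AbortController();
  return turnController.signal;           // pass to fetch(), timers, playback wrappers
}
```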

State Machine

text
  ├─ TOGGLE_VOICE_MODE → voice.idle

voice.idle
  ├─ Check for latest AI message not yet spoken
  │  ├─ If found → Send AI_RESPONSE_READY → voice.aiGenerating
  │  └─ If not found → Send START_LISTENING → voice.listening
  └─ TOGGLE_VOICE_MODE → text

voice.listening
  ├─ USER_STARTED_SPEAKING → voice.userSpeaking
  ├─ TRANSCRIPT_UPDATE → (update context.input for display)
  └─ TOGGLE_VOICE_MODE → text

voice.userSpeaking
  ├─ FINALIZED_PHRASE → voice.timingOut (starts 3s timer)
  ├─ TRANSCRIPT_UPDATE → (update context.input for display)
  └─ TOGGLE_VOICE_MODE → text

voice.timingOut
  ├─ FINALIZED_PHRASE → voice.timingOut (restart 3s timer)
  ├─ TRANSCRIPT_UPDATE → (update context.input for display)
  ├─ SILENCE_TIMEOUT → voice.processing
  └─ TOGGLE_VOICE_MODE → text

voice.processing
  ├─ (Effect: submit if not submitted, wait for AI response)
  ├─ When AI response ready → Send AI_RESPONSE_READY → voice.aiGenerating
  └─ TOGGLE_VOICE_MODE → text

voice.aiGenerating
  ├─ TTS_PLAYING → voice.aiSpeaking
  ├─ SKIP_AUDIO → voice.listening
  └─ TOGGLE_VOICE_MODE → text

voice.aiSpeaking
  ├─ TTS_FINISHED → voice.listening
  ├─ SKIP_AUDIO → voice.listening
  └─ TOGGLE_VOICE_MODE → text
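
A minimal sketch of the table above as a pure reducer in plain TypeScript (no state-machine library assumed; the actual implementation may use XState or similar). Side effects such as starting the silence timer, submitting the transcript, or launching TTS would be triggered by whoever consumes the returned state.

```typescript
// Sketch only: encodes the transition table above as a pure function.
// TRANSCRIPT_UPDATE is included but never changes state (it only updates
// context.input for display), matching the diagram.
export type State =
  | "text"
  | "voice.idle"
  | "voice.listening"
  | "voice.userSpeaking"
  | "voice.timingOut"
  | "voice.processing"
  | "voice.aiGenerating"
  | "voice.aiSpeaking";

export type Event =
  | "TOGGLE_VOICE_MODE"
  | "START_LISTENING"
  | "USER_STARTED_SPEAKING"
  | "TRANSCRIPT_UPDATE"
  | "FINALIZED_PHRASE"
  | "SILENCE_TIMEOUT"
  | "AI_RESPONSE_READY"
  | "TTS_PLAYING"
  | "TTS_FINISHED"
  | "SKIP_AUDIO";

const transitions: Partial<Record<State, Partial<Record<Event, State>>>> = {
  "text":               { TOGGLE_VOICE_MODE: "voice.idle" },
  "voice.idle":         { AI_RESPONSE_READY: "voice.aiGenerating", START_LISTENING: "voice.listening" },
  "voice.listening":    { USER_STARTED_SPEAKING: "voice.userSpeaking" },
  "voice.userSpeaking": { FINALIZED_PHRASE: "voice.timingOut" },
  "voice.timingOut":    { FINALIZED_PHRASE: "voice.timingOut", SILENCE_TIMEOUT: "voice.processing" },
  "voice.processing":   { AI_RESPONSE_READY: "voice.aiGenerating" },
  "voice.aiGenerating": { TTS_PLAYING: "voice.aiSpeaking", SKIP_AUDIO: "voice.listening" },
  "voice.aiSpeaking":   { TTS_FINISHED: "voice.listening", SKIP_AUDIO: "voice.listening" },
};

export function next(state: State, event: Event): State {
  // TOGGLE_VOICE_MODE exits to text from every voice.* state.
  if (event === "TOGGLE_VOICE_MODE" && state !== "text") return "text";
  return transitions[state]?.[event] ?? state; // events with no entry leave the state unchanged
}
```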

Test Cases

Test 1: Basic Conversation

  1. Click "Start Voice Conversation"
  2. Skip initial greeting
  3. Say "Hello"
  4. Wait for AI response
  5. Let AI audio play completely
  6. Say "How are you?"
  7. Skip AI audio
  8. Say "Goodbye"

Expected: 3 exchanges, AI only plays latest message each time

Test 2: Multiple Skips

  1. Start voice mode
  2. Skip greeting immediately
  3. Say "Test one"
  4. Skip AI response immediately
  5. Say "Test two"
  6. Skip AI response immediately

Expected: All skips work instantly, no audio bleeding
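
For illustration only, the "skip always returns to listening" expectation could also be pinned down as a unit test against the reducer sketch in the State Machine section (the vitest-style runner and the ./voiceMachine module path are assumptions, not part of the PRD):

```typescript
// Hedged example: checks the SKIP_AUDIO transitions from the reducer sketch.
import { describe, it, expect } from "vitest";
import { next } from "./voiceMachine";

describe("SKIP_AUDIO", () => {
  it("returns to listening from both AI states", () => {
    expect(next("voice.aiGenerating", "SKIP_AUDIO")).toBe("voice.listening");
    expect(next("voice.aiSpeaking", "SKIP_AUDIO")).toBe("voice.listening");
  });
});
```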

Test 3: Re-entering Voice Mode

  1. Start voice mode
  2. Say "Hello"
  3. Let AI respond
  4. Exit voice mode (click button again)
  5. Re-enter voice mode

Expected: AI reads the most recent message (its last response)

Test 4: Long Speech

  1. Start voice mode
  2. Skip greeting
  3. Say a long sentence with multiple pauses, each shorter than 3 seconds
  4. Wait for final 3s timeout

Expected: All speech is captured in one transcript