# Voice Mode PRD

## User Flows

### Flow 1: Starting Voice Conversation (No Previous Messages)

1. User clicks "Start Voice Conversation" button
2. System enters listening mode
3. Button shows "Listening... Start speaking"
4. Microphone indicator appears

### Flow 2: Starting Voice Conversation (With Previous AI Message)

1. User clicks "Start Voice Conversation" button
2. System checks for most recent AI message
3. If found and not already spoken in this session:
   - System generates and plays TTS for that message
   - Button shows "Generating speech..." then "AI is speaking..."
   - Skip button appears
4. After audio finishes OR user clicks skip:
   - System enters listening mode

### Flow 3: User Speaks

1. User speaks (while in listening state)
2. System detects speech, button shows "Speaking..."
3. System receives interim transcripts (updates display)
4. System receives finalized phrases (appends to transcript)
5. After each finalized phrase, 3-second silence timer starts
6. Button shows countdown: "Speaking... (auto-submits in 2.1s)"
7. If user continues speaking, timer resets

### Flow 4: Submit and AI Response

1. After 3 seconds of silence, transcript is submitted
2. Button shows "Processing..."
3. User message appears in chat
4. AI streams response (appears in chat)
5. When streaming completes:
   - System generates TTS for AI response
   - Button shows "Generating speech..."
   - When TTS ready, plays audio
   - Button shows "AI is speaking..."
   - Skip button appears
6. After audio finishes OR user clicks skip:
   - System returns to listening mode

### Flow 5: Skipping AI Audio

1. While AI is generating or speaking (button shows "Generating speech..." or "AI is speaking...")
2. Skip button is visible
3. User clicks Skip
4. Audio stops immediately
5. System enters listening mode
6. Button shows "Listening... Start speaking"

### Flow 6: Exiting Voice Mode

1. User clicks voice button (at any time)
2. System stops all audio
3. System closes microphone connection
4. Returns to text mode
5. Button shows "Start Voice Conversation"

## Critical Rules

1. **Latest Message Only**: AI ONLY plays the most recent assistant message. Never re-play old messages.
2. **Skip Always Works**: Skip button must IMMEDIATELY stop audio and return to listening.
3. **One Message Per Turn**: Each user speech -> one submission -> one AI response -> one audio playback.
4. **Clean State**: Every state transition should cancel any incompatible ongoing operations.

## State Machine

```
text
├─ TOGGLE_VOICE_MODE → voice.idle

voice.idle
├─ Check for latest AI message not yet spoken
│  ├─ If found → Send AI_RESPONSE_READY → voice.aiGenerating
│  └─ If not found → Send START_LISTENING → voice.listening
└─ TOGGLE_VOICE_MODE → text

voice.listening
├─ USER_STARTED_SPEAKING → voice.userSpeaking
├─ TRANSCRIPT_UPDATE → (update context.input for display)
└─ TOGGLE_VOICE_MODE → text

voice.userSpeaking
├─ FINALIZED_PHRASE → voice.timingOut (starts 3s timer)
├─ TRANSCRIPT_UPDATE → (update context.input for display)
└─ TOGGLE_VOICE_MODE → text

voice.timingOut
├─ FINALIZED_PHRASE → voice.timingOut (restart 3s timer)
├─ TRANSCRIPT_UPDATE → (update context.input for display)
├─ SILENCE_TIMEOUT → voice.processing
└─ TOGGLE_VOICE_MODE → text

voice.processing
├─ (Effect: submit if not submitted, wait for AI response)
├─ When AI response ready → Send AI_RESPONSE_READY → voice.aiGenerating
└─ TOGGLE_VOICE_MODE → text

voice.aiGenerating
├─ TTS_PLAYING → voice.aiSpeaking
├─ SKIP_AUDIO → voice.listening
└─ TOGGLE_VOICE_MODE → text

voice.aiSpeaking
├─ TTS_FINISHED → voice.listening
├─ SKIP_AUDIO → voice.listening
└─ TOGGLE_VOICE_MODE → text
```

## Test Cases

### Test 1: Basic Conversation

1. Click "Start Voice Conversation"
2. Skip initial greeting
3. Say "Hello"
4. Wait for AI response
5. Let AI audio play completely
6. Say "How are you?"
7. Skip AI audio
8. Say "Goodbye"

Expected: 3 exchanges, AI only plays latest message each time

### Test 2: Multiple Skips

1. Start voice mode
2. Skip greeting immediately
3. Say "Test one"
4. Skip AI response immediately
5. Say "Test two"
6. Skip AI response immediately

Expected: All skips work instantly, no audio bleeding

### Test 3: Re-entering Voice Mode

1. Start voice mode
2. Say "Hello"
3. Let AI respond
4. Exit voice mode (click button again)
5. Re-enter voice mode

Expected: AI reads the most recent message (its last response)

### Test 4: Long Speech

1. Start voice mode
2. Skip greeting
3. Say a long sentence with multiple pauses < 3 seconds
4. Wait for final 3s timeout

Expected: All speech is captured in one transcript
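
The transitions listed under State Machine can be checked against these test cases with a small lookup table. The following is a minimal sketch, not the actual implementation: the state and event names come from the diagram, while `VoiceState`, `VoiceEvent`, `transitions`, `transition`, and `run` are hypothetical names introduced for illustration. It models only state changes, so side effects such as timers, TTS playback, and transcript updates are assumed to live elsewhere.

```typescript
// Hypothetical types; the string values mirror the State Machine diagram.
type VoiceState =
  | "text"
  | "voice.idle"
  | "voice.listening"
  | "voice.userSpeaking"
  | "voice.timingOut"
  | "voice.processing"
  | "voice.aiGenerating"
  | "voice.aiSpeaking";

type VoiceEvent =
  | "TOGGLE_VOICE_MODE"
  | "START_LISTENING"
  | "AI_RESPONSE_READY"
  | "USER_STARTED_SPEAKING"
  | "TRANSCRIPT_UPDATE"
  | "FINALIZED_PHRASE"
  | "SILENCE_TIMEOUT"
  | "TTS_PLAYING"
  | "TTS_FINISHED"
  | "SKIP_AUDIO";

// One entry per state. Events missing from an entry leave the state unchanged,
// which is how TRANSCRIPT_UPDATE (a context-only update) is treated here.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  "text": { TOGGLE_VOICE_MODE: "voice.idle" },
  "voice.idle": {
    AI_RESPONSE_READY: "voice.aiGenerating",
    START_LISTENING: "voice.listening",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.listening": {
    USER_STARTED_SPEAKING: "voice.userSpeaking",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.userSpeaking": {
    FINALIZED_PHRASE: "voice.timingOut",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.timingOut": {
    FINALIZED_PHRASE: "voice.timingOut", // re-entry restarts the 3s timer
    SILENCE_TIMEOUT: "voice.processing",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.processing": {
    AI_RESPONSE_READY: "voice.aiGenerating",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.aiGenerating": {
    TTS_PLAYING: "voice.aiSpeaking",
    SKIP_AUDIO: "voice.listening",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.aiSpeaking": {
    TTS_FINISHED: "voice.listening",
    SKIP_AUDIO: "voice.listening",
    TOGGLE_VOICE_MODE: "text",
  },
};

// Returns the next state, or the current state if the event is not handled.
function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  return transitions[state][event] ?? state;
}

// Replays a sequence of events from a starting state.
function run(start: VoiceState, events: VoiceEvent[]): VoiceState {
  return events.reduce<VoiceState>((s, e) => transition(s, e), start);
}

// Test 2 (Multiple Skips), first exchange: skip the greeting, speak, skip the reply.
const afterSkips = run("text", [
  "TOGGLE_VOICE_MODE",
  "AI_RESPONSE_READY",
  "SKIP_AUDIO",
  "USER_STARTED_SPEAKING",
  "FINALIZED_PHRASE",
  "SILENCE_TIMEOUT",
  "AI_RESPONSE_READY",
  "SKIP_AUDIO",
]);
console.log(afterSkips);
```

Keeping the table as plain data makes the "Skip Always Works" rule easy to audit: both AI audio states map `SKIP_AUDIO` to `voice.listening`, and every voice state maps `TOGGLE_VOICE_MODE` back to `text`.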