# Voice Mode PRD

## User Flows

### Flow 1: Starting Voice Conversation (No Previous Messages)
1. User clicks "Start Voice Conversation" button
2. System enters listening mode
3. Button shows "Listening... Start speaking"
4. Microphone indicator appears

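A minimal sketch of this flow, assuming hypothetical `setMode`/`setButtonLabel` setters on the voice store (names are illustrative, not from the codebase); the speech-recognition plumbing behind listening mode is omitted:

```typescript
// Flow 1 sketch: request mic access, then enter listening mode.
type Mode = "text" | "voice";

async function startVoiceConversation(
  setMode: (mode: Mode) => void,
  setButtonLabel: (label: string) => void,
): Promise<MediaStream> {
  // Browser permission prompt; throws if the user denies microphone access.
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  setMode("voice");
  setButtonLabel("Listening... Start speaking");
  return micStream; // kept so Flow 6 can stop the tracks on exit
}
```
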
### Flow 2: Starting Voice Conversation (With Previous AI Message)
1. User clicks "Start Voice Conversation" button
2. System checks for most recent AI message
3. If found and not already spoken in this session:
   - System generates and plays TTS for that message
   - Button shows "Generating speech..." then "AI is speaking..."
   - Skip button appears
4. After audio finishes OR user clicks skip:
   - System enters listening mode

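The check in step 3 could look like the following sketch, assuming a hypothetical `Message` shape and a session-level set of already-spoken message ids (both illustrative):

```typescript
// Flow 2 sketch: find the latest assistant message that has not yet been
// spoken in this session; return null to go straight to listening mode.
interface Message {
  id: string;
  role: "user" | "assistant";
  content: string;
}

function findUnspokenLatestAiMessage(
  messages: Message[],
  spokenMessageIds: Set<string>, // reset when a new voice session starts
): Message | null {
  const latest = [...messages].reverse().find((m) => m.role === "assistant");
  if (!latest || spokenMessageIds.has(latest.id)) return null;
  return latest;
}
```

Marking the id as spoken right before TTS starts also enforces Critical Rule 1 below (never re-play old messages).
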
### Flow 3: User Speaks
1. User speaks (while in listening state)
2. System detects speech, button shows "Speaking..."
3. System receives interim transcripts (updates display)
4. System receives finalized phrases (appends to transcript)
5. After each finalized phrase, 3-second silence timer starts
6. Button shows countdown: "Speaking... (auto-submits in 2.1s)"
7. If user continues speaking, timer resets

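The 3-second window in steps 5-7 is effectively a debounce keyed to finalized phrases. A minimal sketch (callback names are illustrative):

```typescript
// Flow 3 sketch: restart a 3s timer on every finalized phrase and keep the
// "auto-submits in X.Xs" countdown label up to date while it runs.
class SilenceTimer {
  private deadline = 0;
  private timeout: ReturnType<typeof setTimeout> | null = null;
  private ticker: ReturnType<typeof setInterval> | null = null;

  constructor(
    private onAutoSubmit: () => void,
    private setButtonLabel: (label: string) => void,
    private ms = 3000,
  ) {}

  // Called on every finalized phrase; restarts the 3s window.
  reset(): void {
    this.clear();
    this.deadline = Date.now() + this.ms;
    this.timeout = setTimeout(() => {
      this.clear();
      this.onAutoSubmit();
    }, this.ms);
    // Refresh the countdown roughly every 100ms.
    this.ticker = setInterval(() => {
      const remaining = Math.max(0, this.deadline - Date.now()) / 1000;
      this.setButtonLabel(`Speaking... (auto-submits in ${remaining.toFixed(1)}s)`);
    }, 100);
  }

  clear(): void {
    if (this.timeout) clearTimeout(this.timeout);
    if (this.ticker) clearInterval(this.ticker);
    this.timeout = null;
    this.ticker = null;
  }
}
```

Note that interim transcripts (step 3) only update the display; only finalized phrases reset the timer, matching the FINALIZED_PHRASE transitions in the state machine below.
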
### Flow 4: Submit and AI Response
1. After 3 seconds of silence, transcript is submitted
2. Button shows "Processing..."
3. User message appears in chat
4. AI streams response (appears in chat)
5. When streaming completes:
   - System generates TTS for AI response
   - Button shows "Generating speech..."
   - When TTS ready, plays audio
   - Button shows "AI is speaking..."
   - Skip button appears
6. After audio finishes OR user clicks skip:
   - System returns to listening mode

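Steps 5-6 can be sketched as a small pipeline once streaming completes. `generateTts` and `playAudio` are hypothetical helpers (the latter is assumed to resolve when playback ends or is skipped):

```typescript
// Flow 4 sketch: generate speech for the finished AI response, play it,
// then drop back into listening mode.
async function speakAiResponse(
  responseText: string,
  deps: {
    generateTts: (text: string) => Promise<Blob>;
    playAudio: (audio: Blob) => Promise<void>; // resolves when playback ends
    setButtonLabel: (label: string) => void;
    enterListening: () => void;
  },
): Promise<void> {
  deps.setButtonLabel("Generating speech...");
  const audio = await deps.generateTts(responseText);

  deps.setButtonLabel("AI is speaking...");
  await deps.playAudio(audio); // a Skip click should resolve/abort this early

  deps.enterListening(); // either way, return to listening mode
}
```
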
### Flow 5: Skipping AI Audio
1. AI is generating or speaking (button shows "Generating speech..." or "AI is speaking...")
2. Skip button is visible
3. User clicks Skip
4. Audio stops immediately
5. System enters listening mode
6. Button shows "Listening... Start speaking"

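A minimal sketch of the skip handler, assuming the TTS output plays through an `HTMLAudioElement` held by the voice controller (an assumption, not confirmed by this PRD):

```typescript
// Flow 5 sketch: stop playback immediately and return to listening.
function skipAiAudio(
  currentAudio: HTMLAudioElement | null,
  enterListening: () => void,
  setButtonLabel: (label: string) => void,
): void {
  if (currentAudio) {
    currentAudio.pause();        // stop playback immediately
    currentAudio.currentTime = 0;
    currentAudio.src = "";       // drop the buffered TTS audio
  }
  enterListening();
  setButtonLabel("Listening... Start speaking");
}
```
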
### Flow 6: Exiting Voice Mode
1. User clicks voice button (at any time)
2. System stops all audio
3. System closes microphone connection
4. Returns to text mode
5. Button shows "Start Voice Conversation"

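Exit is the teardown of everything the other flows opened. A sketch, reusing the hypothetical handles from the earlier sketches:

```typescript
// Flow 6 sketch: stop audio, release the microphone, return to text mode.
function exitVoiceMode(
  currentAudio: HTMLAudioElement | null,
  micStream: MediaStream | null,
  setMode: (mode: "text" | "voice") => void,
  setButtonLabel: (label: string) => void,
): void {
  currentAudio?.pause();                                    // stop any AI audio
  micStream?.getTracks().forEach((track) => track.stop());  // close the mic
  setMode("text");
  setButtonLabel("Start Voice Conversation");
}
```
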
## Critical Rules

1. **Latest Message Only**: AI ONLY plays the most recent assistant message. Never re-play old messages.
2. **Skip Always Works**: Skip button must IMMEDIATELY stop audio and return to listening.
3. **One Message Per Turn**: Each user speech → one submission → one AI response → one audio playback.
4. **Clean State**: Every state transition should cancel any incompatible ongoing operations.

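Rule 4 is the easiest to get wrong. One way to sketch it is a per-transition `AbortController`, so any in-flight work started by the previous state is cancelled when the machine moves on (the `/api/tts` endpoint below is purely hypothetical):

```typescript
// "Clean State" sketch: every transition aborts whatever the previous state started.
let currentTurn: AbortController | null = null;

function beginTransition(): AbortSignal {
  currentTurn?.abort();          // cancel work tied to the previous state
  currentTurn = new AbortController();
  return currentTurn.signal;
}

// Example: TTS generation that dies cleanly if the state changes mid-request.
async function generateTtsForTurn(text: string): Promise<Blob | null> {
  const signal = beginTransition();
  try {
    const res = await fetch("/api/tts", {   // hypothetical endpoint
      method: "POST",
      body: JSON.stringify({ text }),
      signal,
    });
    return await res.blob();
  } catch (err) {
    if (signal.aborted) return null;        // superseded by a newer transition
    throw err;
  }
}
```
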
## State Machine

```
text
├─ TOGGLE_VOICE_MODE → voice.idle

voice.idle
├─ Check for latest AI message not yet spoken
│  ├─ If found → Send AI_RESPONSE_READY → voice.aiGenerating
│  └─ If not found → Send START_LISTENING → voice.listening
└─ TOGGLE_VOICE_MODE → text

voice.listening
├─ USER_STARTED_SPEAKING → voice.userSpeaking
├─ TRANSCRIPT_UPDATE → (update context.input for display)
└─ TOGGLE_VOICE_MODE → text

voice.userSpeaking
├─ FINALIZED_PHRASE → voice.timingOut (starts 3s timer)
├─ TRANSCRIPT_UPDATE → (update context.input for display)
└─ TOGGLE_VOICE_MODE → text

voice.timingOut
├─ FINALIZED_PHRASE → voice.timingOut (restart 3s timer)
├─ TRANSCRIPT_UPDATE → (update context.input for display)
├─ SILENCE_TIMEOUT → voice.processing
└─ TOGGLE_VOICE_MODE → text

voice.processing
├─ (Effect: submit if not submitted, wait for AI response)
├─ When AI response ready → Send AI_RESPONSE_READY → voice.aiGenerating
└─ TOGGLE_VOICE_MODE → text

voice.aiGenerating
├─ TTS_PLAYING → voice.aiSpeaking
├─ SKIP_AUDIO → voice.listening
└─ TOGGLE_VOICE_MODE → text

voice.aiSpeaking
├─ TTS_FINISHED → voice.listening
├─ SKIP_AUDIO → voice.listening
└─ TOGGLE_VOICE_MODE → text
```

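For reference, the states and events above as TypeScript unions, together with the button copy used in the flows. The event payloads and the `BUTTON_LABELS` table are assumptions for illustration; the real machine may shape them differently:

```typescript
// Illustrative typing of the state machine above (not the implementation).
type VoiceState =
  | "text"
  | "voice.idle"
  | "voice.listening"
  | "voice.userSpeaking"
  | "voice.timingOut"
  | "voice.processing"
  | "voice.aiGenerating"
  | "voice.aiSpeaking";

type VoiceEvent =
  | { type: "TOGGLE_VOICE_MODE" }
  | { type: "START_LISTENING" }
  | { type: "USER_STARTED_SPEAKING" }
  | { type: "TRANSCRIPT_UPDATE"; text: string }
  | { type: "FINALIZED_PHRASE"; text: string }
  | { type: "SILENCE_TIMEOUT" }
  | { type: "AI_RESPONSE_READY" }
  | { type: "TTS_PLAYING" }
  | { type: "TTS_FINISHED" }
  | { type: "SKIP_AUDIO" };

// Button copy per state, taken from the user flows above.
const BUTTON_LABELS: Record<VoiceState, string> = {
  "text": "Start Voice Conversation",
  "voice.idle": "Start Voice Conversation", // transient state; label is an assumption
  "voice.listening": "Listening... Start speaking",
  "voice.userSpeaking": "Speaking...",
  "voice.timingOut": "Speaking... (auto-submits in 3.0s)", // countdown updates live
  "voice.processing": "Processing...",
  "voice.aiGenerating": "Generating speech...",
  "voice.aiSpeaking": "AI is speaking...",
};
```
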
## Test Cases

### Test 1: Basic Conversation
1. Click "Start Voice Conversation"
2. Skip initial greeting
3. Say "Hello"
4. Wait for AI response
5. Let AI audio play completely
6. Say "How are you?"
7. Skip AI audio
8. Say "Goodbye"

Expected: 3 exchanges, AI only plays latest message each time

### Test 2: Multiple Skips
1. Start voice mode
2. Skip greeting immediately
3. Say "Test one"
4. Skip AI response immediately
5. Say "Test two"
6. Skip AI response immediately

Expected: All skips work instantly, no audio bleeding

### Test 3: Re-entering Voice Mode
1. Start voice mode
2. Say "Hello"
3. Let AI respond
4. Exit voice mode (click button again)
5. Re-enter voice mode

Expected: AI reads the most recent message (its last response)

### Test 4: Long Speech
1. Start voice mode
2. Skip greeting
3. Say a long sentence with multiple pauses < 3 seconds
4. Wait for final 3s timeout

Expected: All speech is captured in one transcript

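Test 4 is the one most worth automating, since it pins down the timer-reset behavior. A sketch using Vitest fake timers and an inline copy of the silence-timer logic (framework choice and helper are assumptions):

```typescript
// Automated sketch of Test 4: pauses shorter than 3s must not trigger submission.
import { describe, it, expect, vi } from "vitest";

// Hypothetical debounced silence timer: fires onTimeout 3s after the last reset().
class SilenceTimer {
  private handle: ReturnType<typeof setTimeout> | null = null;
  constructor(private onTimeout: () => void, private ms = 3000) {}
  reset(): void {
    if (this.handle) clearTimeout(this.handle);
    this.handle = setTimeout(this.onTimeout, this.ms);
  }
}

describe("Test 4: long speech with pauses under 3s", () => {
  it("submits only after a full 3s of silence", () => {
    vi.useFakeTimers();
    const submit = vi.fn();
    const timer = new SilenceTimer(submit);

    timer.reset();                         // first finalized phrase
    vi.advanceTimersByTime(2000);
    timer.reset();                         // user keeps talking: pause was < 3s
    vi.advanceTimersByTime(2000);
    expect(submit).not.toHaveBeenCalled(); // no submission yet

    vi.advanceTimersByTime(1000);          // now 3s of silence has elapsed
    expect(submit).toHaveBeenCalledTimes(1);

    vi.useRealTimers();
  });
});
```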