feat: Improve UI layout and navigation

- Increase logo size (48x48 desktop, 56x56 mobile) for better visibility
- Add logo as favicon
- Add logo to mobile header
- Move user menu to navigation bars (sidebar on desktop, bottom bar on mobile)
- Fix desktop chat layout - container structure prevents voice controls cutoff
- Fix mobile bottom bar - use icon-only ActionIcons instead of truncated text buttons
- Hide Create Node/New Conversation buttons on mobile to save header space
- Make fixed header and voice controls work properly with containers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
149
docs/voice-mode-prd.md
Normal file
@@ -0,0 +1,149 @@
# Voice Mode PRD

## User Flows

### Flow 1: Starting Voice Conversation (No Previous Messages)
1. User clicks "Start Voice Conversation" button
2. System enters listening mode
3. Button shows "Listening... Start speaking"
4. Microphone indicator appears

### Flow 2: Starting Voice Conversation (With Previous AI Message)
1. User clicks "Start Voice Conversation" button
2. System checks for most recent AI message
3. If found and not already spoken in this session:
   - System generates and plays TTS for that message
   - Button shows "Generating speech..." then "AI is speaking..."
   - Skip button appears
4. After audio finishes OR user clicks skip:
   - System enters listening mode

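The check in steps 2 and 3 can be sketched as a small pure function. This is only an illustration under stated assumptions: the `ChatMessage` shape, the `spokenIds` set, and the name `pickMessageToSpeak` are hypothetical and not defined by this PRD. It also assumes "most recent AI message" means the latest assistant message anywhere in the history.

```typescript
// Hypothetical message shape; the PRD does not define one.
interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  text: string;
}

// Returns the message to read aloud when voice mode starts, or null when the
// system should go straight to listening. Only the most recent assistant
// message is ever a candidate, so older messages are never replayed.
function pickMessageToSpeak(
  messages: readonly ChatMessage[],
  spokenIds: ReadonlySet<string>, // ids already spoken in this session (assumed)
): ChatMessage | null {
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message !== undefined && message.role === "assistant") {
      return spokenIds.has(message.id) ? null : message;
    }
  }
  return null; // no assistant message yet, which is the Flow 1 case
}
```

Returning `null` covers both Flow 1 (no previous messages) and the case where the latest AI message was already spoken.
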
### Flow 3: User Speaks
1. User speaks (while in listening state)
2. System detects speech, button shows "Speaking..."
3. System receives interim transcripts (updates display)
4. System receives finalized phrases (appends to transcript)
5. After each finalized phrase, 3-second silence timer starts
6. Button shows countdown: "Speaking... (auto-submits in 2.1s)"
7. If user continues speaking, timer resets

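The silence timer in steps 5 to 7 can be modelled without real timers by passing the current time in. The sketch below is an assumption-laden illustration: the class name `SilenceCountdown` and its methods are hypothetical, and it restarts the window only on finalized phrases, which is how the state machine in this document describes the reset.

```typescript
const SILENCE_MS = 3000; // the 3-second window from this flow

class SilenceCountdown {
  private deadlineMs: number | null = null;

  // Call on every finalized phrase; this starts or restarts the window.
  onFinalizedPhrase(nowMs: number): void {
    this.deadlineMs = nowMs + SILENCE_MS;
  }

  // Milliseconds left before auto-submit, or null if no window is running.
  remainingMs(nowMs: number): number | null {
    if (this.deadlineMs === null) return null;
    return Math.max(0, this.deadlineMs - nowMs);
  }

  // True once the window has fully elapsed without another finalized phrase.
  shouldSubmit(nowMs: number): boolean {
    return this.deadlineMs !== null && nowMs >= this.deadlineMs;
  }

  // Button text for the speaking state, with the countdown when one is running.
  label(nowMs: number): string {
    const left = this.remainingMs(nowMs);
    if (left === null) return "Speaking...";
    return `Speaking... (auto-submits in ${(left / 1000).toFixed(1)}s)`;
  }
}
```

Taking `nowMs` as a parameter keeps the logic deterministic, so a UI loop can call `label(Date.now())` on each frame while tests supply fixed timestamps.
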
### Flow 4: Submit and AI Response
1. After 3 seconds of silence, transcript is submitted
2. Button shows "Processing..."
3. User message appears in chat
4. AI streams response (appears in chat)
5. When streaming completes:
   - System generates TTS for AI response
   - Button shows "Generating speech..."
   - When TTS ready, plays audio
   - Button shows "AI is speaking..."
   - Skip button appears
6. After audio finishes OR user clicks skip:
   - System returns to listening mode

### Flow 5: Skipping AI Audio
1. AI is generating or speaking (button shows "Generating speech..." or "AI is speaking...")
2. Skip button is visible
3. User clicks Skip
4. Audio stops immediately
5. System enters listening mode
6. Button shows "Listening... Start speaking"

### Flow 6: Exiting Voice Mode
1. User clicks voice button (at any time)
2. System stops all audio
3. System closes microphone connection
4. Returns to text mode
5. Button shows "Start Voice Conversation"

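The button text and Skip visibility named across the flows above can be collected in one lookup. This is a minimal sketch, not the app's actual code: the `Phase` names, `BUTTON_LABELS`, `buttonLabel`, and `skipVisible` are hypothetical, and the countdown variant of the "Speaking..." label is left out.

```typescript
// Hypothetical UI phases, named after what the flows describe.
type Phase =
  | "text"
  | "listening"
  | "userSpeaking"
  | "processing"
  | "aiGenerating"
  | "aiSpeaking";

// Label strings are copied verbatim from the flows.
const BUTTON_LABELS: Record<Phase, string> = {
  text: "Start Voice Conversation",
  listening: "Listening... Start speaking",
  userSpeaking: "Speaking...",
  processing: "Processing...",
  aiGenerating: "Generating speech...",
  aiSpeaking: "AI is speaking...",
};

function buttonLabel(phase: Phase): string {
  return BUTTON_LABELS[phase];
}

// Skip is shown only while TTS is being generated or played (Flows 2, 4, 5).
function skipVisible(phase: Phase): boolean {
  return phase === "aiGenerating" || phase === "aiSpeaking";
}
```

Using `Record<Phase, string>` means the compiler flags any phase that is added later without a label.
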
## Critical Rules

1. **Latest Message Only**: AI ONLY plays the most recent assistant message. Never re-play old messages.
2. **Skip Always Works**: Skip button must IMMEDIATELY stop audio and return to listening.
3. **One Message Per Turn**: Each user speech -> one submission -> one AI response -> one audio playback.
4. **Clean State**: Every state transition should cancel any incompatible ongoing operations.

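One way to approach the Clean State rule is to register a cancel callback for every operation a state starts and run them all on the next transition. The sketch below assumes this pattern purely for illustration; `OperationScope`, `onCancel`, and `cancelAll` are hypothetical names, and the PRD does not prescribe a mechanism.

```typescript
type Cleanup = () => void;

class OperationScope {
  private cleanups: Cleanup[] = [];

  // Register how to cancel something started in the current state,
  // for example stopping audio playback or clearing the silence timer.
  onCancel(cleanup: Cleanup): void {
    this.cleanups.push(cleanup);
  }

  // Run every registered cleanup once, newest first, then forget them.
  // Calling this again without new registrations does nothing.
  cancelAll(): void {
    const pending = this.cleanups;
    this.cleanups = [];
    for (const cleanup of pending.reverse()) {
      cleanup();
    }
  }
}
```

A Skip handler could then call `cancelAll()` before returning to listening, which keeps the "Skip Always Works" rule independent of which operations happen to be in flight.
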
## State Machine

```
text
├─ TOGGLE_VOICE_MODE → voice.idle

voice.idle
├─ Check for latest AI message not yet spoken
│  ├─ If found → Send AI_RESPONSE_READY → voice.aiGenerating
│  └─ If not found → Send START_LISTENING → voice.listening
└─ TOGGLE_VOICE_MODE → text

voice.listening
├─ USER_STARTED_SPEAKING → voice.userSpeaking
├─ TRANSCRIPT_UPDATE → (update context.input for display)
└─ TOGGLE_VOICE_MODE → text

voice.userSpeaking
├─ FINALIZED_PHRASE → voice.timingOut (starts 3s timer)
├─ TRANSCRIPT_UPDATE → (update context.input for display)
└─ TOGGLE_VOICE_MODE → text

voice.timingOut
├─ FINALIZED_PHRASE → voice.timingOut (restart 3s timer)
├─ TRANSCRIPT_UPDATE → (update context.input for display)
├─ SILENCE_TIMEOUT → voice.processing
└─ TOGGLE_VOICE_MODE → text

voice.processing
├─ (Effect: submit if not submitted, wait for AI response)
├─ When AI response ready → Send AI_RESPONSE_READY → voice.aiGenerating
└─ TOGGLE_VOICE_MODE → text

voice.aiGenerating
├─ TTS_PLAYING → voice.aiSpeaking
├─ SKIP_AUDIO → voice.listening
└─ TOGGLE_VOICE_MODE → text

voice.aiSpeaking
├─ TTS_FINISHED → voice.listening
├─ SKIP_AUDIO → voice.listening
└─ TOGGLE_VOICE_MODE → text
```

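The diagram above translates directly into a transition table. The following is a minimal sketch rather than the project's implementation: state and event names are taken from the diagram, while `transitions` and `transition` are hypothetical names, and side effects (timers, submission, TTS) are left out. Events with no entry for the current state, including `TRANSCRIPT_UPDATE`, leave the state unchanged.

```typescript
type State =
  | "text"
  | "voice.idle"
  | "voice.listening"
  | "voice.userSpeaking"
  | "voice.timingOut"
  | "voice.processing"
  | "voice.aiGenerating"
  | "voice.aiSpeaking";

type EventName =
  | "TOGGLE_VOICE_MODE"
  | "START_LISTENING"
  | "AI_RESPONSE_READY"
  | "USER_STARTED_SPEAKING"
  | "TRANSCRIPT_UPDATE"
  | "FINALIZED_PHRASE"
  | "SILENCE_TIMEOUT"
  | "TTS_PLAYING"
  | "TTS_FINISHED"
  | "SKIP_AUDIO";

// One row per state; only events that change state are listed.
const transitions: Record<State, Partial<Record<EventName, State>>> = {
  text: { TOGGLE_VOICE_MODE: "voice.idle" },
  "voice.idle": {
    AI_RESPONSE_READY: "voice.aiGenerating",
    START_LISTENING: "voice.listening",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.listening": {
    USER_STARTED_SPEAKING: "voice.userSpeaking",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.userSpeaking": {
    FINALIZED_PHRASE: "voice.timingOut",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.timingOut": {
    FINALIZED_PHRASE: "voice.timingOut", // caller restarts the 3s timer
    SILENCE_TIMEOUT: "voice.processing",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.processing": {
    AI_RESPONSE_READY: "voice.aiGenerating",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.aiGenerating": {
    TTS_PLAYING: "voice.aiSpeaking",
    SKIP_AUDIO: "voice.listening",
    TOGGLE_VOICE_MODE: "text",
  },
  "voice.aiSpeaking": {
    TTS_FINISHED: "voice.listening",
    SKIP_AUDIO: "voice.listening",
    TOGGLE_VOICE_MODE: "text",
  },
};

// Pure transition function: returns the next state, or the same state
// when the event is not handled in the current state.
function transition(state: State, event: EventName): State {
  return transitions[state][event] ?? state;
}
```

Because every state lists `TOGGLE_VOICE_MODE → text`, exiting voice mode works from anywhere, which matches Flow 6.
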
## Test Cases

### Test 1: Basic Conversation
1. Click "Start Voice Conversation"
2. Skip initial greeting
3. Say "Hello"
4. Wait for AI response
5. Let AI audio play completely
6. Say "How are you?"
7. Skip AI audio
8. Say "Goodbye"

Expected: 3 exchanges, AI only plays latest message each time

### Test 2: Multiple Skips
1. Start voice mode
2. Skip greeting immediately
3. Say "Test one"
4. Skip AI response immediately
5. Say "Test two"
6. Skip AI response immediately

Expected: All skips work instantly, no audio bleeding

### Test 3: Re-entering Voice Mode
1. Start voice mode
2. Say "Hello"
3. Let AI respond
4. Exit voice mode (click button again)
5. Re-enter voice mode

Expected: AI reads the most recent message (its last response)

### Test 4: Long Speech
1. Start voice mode
2. Skip greeting
3. Say a long sentence with multiple pauses shorter than 3 seconds
4. Wait for final 3s timeout

Expected: All speech is captured in one transcript