Voice Interaction Modes

Project Overview
py-xiaozhi is a Python-based AI voice assistant client built on an asynchronous architecture. It combines speech recognition, natural language processing, visual recognition, and IoT device control to support multimodal interaction.
Core Features
- Multi-protocol Support: WebSocket/MQTT dual-protocol communication
- MCP Tool Ecosystem: Integrates 10+ specialized tool modules
- IoT Device Integration: Thing-based architecture for device management
- Visual Recognition: GLM-4V-based multimodal understanding
- Audio Processing: Opus encoding + WebRTC enhancement
- Global Shortcuts: System-level interaction control
Voice Interaction Modes
The system provides four voice interaction modes, ranging from fully manual control to automatic voice detection:
1. Manual Press Mode
- How to use: Hold the shortcut key to record, release to auto-send
- Default shortcut: Ctrl+J (configurable)
- Use cases: Precise control over recording duration, avoiding ambient noise interference
- Advantages:
- Prevents accidental recording triggers
- Full control over recording duration
- Suitable for noisy environments
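The hold-to-record, release-to-send behavior can be sketched as a small recorder object. This is a minimal illustration, not py-xiaozhi's implementation: the class name `PushToTalkRecorder`, its method names, and the `send` callback are all hypothetical, and real key and audio events would come from the hotkey and audio layers.

```python
class PushToTalkRecorder:
    """Buffers audio frames only while the shortcut key is held (hypothetical sketch)."""

    def __init__(self, send):
        self._send = send        # callback that receives the recorded bytes
        self._frames = []
        self._recording = False

    def on_key_press(self):
        # Key down: start a fresh recording (repeat key events are ignored).
        if not self._recording:
            self._frames = []
            self._recording = True

    def on_audio_frame(self, frame: bytes):
        # Frames that arrive while the key is up are discarded.
        if self._recording:
            self._frames.append(frame)

    def on_key_release(self):
        # Key up: stop recording and send what was captured.
        if self._recording:
            self._recording = False
            self._send(b"".join(self._frames))


sent = []
recorder = PushToTalkRecorder(sent.append)
recorder.on_audio_frame(b"noise")  # key not held: dropped
recorder.on_key_press()
recorder.on_audio_frame(b"hel")
recorder.on_audio_frame(b"lo")
recorder.on_key_release()
print(sent)  # [b'hello']
```

Because nothing is buffered while the key is up, ambient noise between presses never reaches the server.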
2. Turn-Based Conversation Mode (AUTO_STOP)
- How to use: Press the shortcut key, or click the manual conversation mode toggle in the bottom-right corner of the GUI, to switch to auto-conversation
- Default shortcut: Ctrl+K (configurable)
- Use cases: Quiet environments, traditional conversational interaction, when AEC is disabled
- How it works:
- User speaks → AI replies → User speaks again
- Each conversation turn waits for the AI to finish replying
- Prevents echo and both sides speaking simultaneously
- System automatically disables microphone input while the AI is speaking
- Technical features:
- Default mode when AEC is disabled
- Suitable for unidirectional audio devices or environments with echo issues
- More stable conversation experience, avoiding audio conflicts
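The turn-taking rule (microphone input is disabled while the AI is speaking) can be sketched as a gate in front of the audio uplink. The class and method names below are hypothetical and only illustrate the idea.

```python
class TurnBasedMicGate:
    """Drops microphone frames while the AI reply is playing (hypothetical sketch)."""

    def __init__(self):
        self.ai_speaking = False

    def on_ai_reply_start(self):
        self.ai_speaking = True

    def on_ai_reply_end(self):
        self.ai_speaking = False

    def filter(self, frame: bytes):
        # Return the frame when it is the user's turn, otherwise None.
        return None if self.ai_speaking else frame


gate = TurnBasedMicGate()
first = gate.filter(b"user speech")      # user's turn: passed through
gate.on_ai_reply_start()
during = gate.filter(b"speaker echo")    # AI's turn: dropped
gate.on_ai_reply_end()
after = gate.filter(b"next question")    # user's turn again
print(first, during, after)
```

Dropping frames during playback is what prevents the speaker output from being picked up as user speech when no echo cancellation is available.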
3. Real-Time Conversation Mode (REALTIME)
- How to use: Automatically activated when AEC echo cancellation is enabled
- Configuration requirement: "AEC_OPTIONS.ENABLED": true
- Use cases: Natural conversation, bidirectional interaction, complex environments, scenarios requiring the ability to interrupt the AI
- Note: Requires an audio device that provides its own AEC; the client's built-in AEC is currently non-functional
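The relationship between the AEC setting and the conversation mode can be sketched as below. The function name is hypothetical, and the sketch assumes the dotted key `AEC_OPTIONS.ENABLED` maps to a nested dictionary; the actual configuration layout may differ.

```python
def select_listening_mode(config: dict) -> str:
    """Pick REALTIME when AEC is enabled, otherwise fall back to AUTO_STOP.

    Assumes "AEC_OPTIONS.ENABLED" is stored as config["AEC_OPTIONS"]["ENABLED"].
    """
    aec_enabled = bool(config.get("AEC_OPTIONS", {}).get("ENABLED", False))
    return "REALTIME" if aec_enabled else "AUTO_STOP"


print(select_listening_mode({"AEC_OPTIONS": {"ENABLED": True}}))   # REALTIME
print(select_listening_mode({"AEC_OPTIONS": {"ENABLED": False}}))  # AUTO_STOP
print(select_listening_mode({}))                                   # AUTO_STOP
```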
4. Wake Word Mode
- How to use: Speak the preset wake word to activate the system
- Default wake words: "小智", "小美" (customizable in configuration)
- Model support: Based on Vosk offline speech recognition
- Configuration requirement: Requires downloading the corresponding speech recognition model
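Once the offline recognizer has produced a transcript, wake word detection reduces to a text match. The sketch below covers only that matching step and does not call Vosk; the function name is hypothetical, and stripping whitespace is an assumption made because recognizers may emit tokens separated by spaces.

```python
WAKE_WORDS = ("小智", "小美")  # default wake words listed above


def find_wake_word(transcript: str, wake_words=WAKE_WORDS):
    """Return the first wake word found in the transcript, or None."""
    # Remove all whitespace so that "小 智" still matches "小智".
    normalized = "".join(transcript.split())
    for word in wake_words:
        if word in normalized:
            return word
    return None


print(find_wake_word("你好 小 智"))   # 小智
print(find_wake_word("今天天气不错"))  # None
```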
System State Management
The system uses an event-driven state machine architecture with the following operating states:
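The states and transitions shown in this section's flow diagram can be sketched as a transition table. This is an illustration under assumptions: the event names and the `REPLYING` member name are hypothetical, and a failed connection is assumed to return to `IDLE`.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    LISTENING = "listening"
    REPLYING = "replying"  # hypothetical name for the "AI is speaking" state


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "wake_or_button"): State.CONNECTING,
    (State.CONNECTING, "connected"): State.LISTENING,
    (State.CONNECTING, "connection_failed"): State.IDLE,  # assumption
    (State.LISTENING, "recognition_done"): State.REPLYING,  # complete or timeout
    (State.REPLYING, "playback_done"): State.IDLE,
    (State.REPLYING, "abort"): State.IDLE,
}


def next_state(state: State, event: str) -> State:
    # Events that are not valid in the current state leave it unchanged.
    return TRANSITIONS.get((state, event), state)


state = State.IDLE
for event in ("wake_or_button", "connected", "recognition_done", "playback_done"):
    state = next_state(state, event)
    print(event, "->", state.name)
```

Keeping the transitions in one table makes it easy to check that every event is handled in exactly the states where it is meaningful.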
┌───────────────────────────────────────────────────────────────┐
│                   System State Flow Diagram                   │
└───────────────────────────────────────────────────────────────┘
    IDLE                       CONNECTING             LISTENING
┌──────────┐ wake word/button ┌──────────┐ connected ┌──────────┐
│   Idle   │ ───────────────> │Connecting│ ────────> │Listening │
│ Standby  │                  │  Server  │           │Recording │
└──────────┘                  └──────────┘           └──────────┘
     ↑                             │                      │
     │                     connection failed              │ speech recognition
     │                             │                      │ complete/timeout
     │                             ↓                      │
     │                        ┌──────────┐                │
     └── playback done/abort ─│ Replying │ <──────────────┘
                              │  AI is   │
                              │ speaking │
                              └──────────┘
Run Modes and Deployment
GUI Mode (Default)
Graphical user interface mode, providing an intuitive interactive experience:
# Standard launch
python main.py
# Using MQTT protocol
python main.py --protocol mqtt
GUI Mode Features:
- Visual operation interface
- Real-time status display
- Audio waveform visualization
- System tray support
- Graphical settings interface
CLI Mode
Command-line interface mode, suitable for server deployment:
# CLI mode launch
python main.py --mode cli
# CLI + MQTT protocol
python main.py --mode cli --protocol mqtt
CLI Mode Features:
- Low resource usage
- Server-friendly
- Detailed log output
- Keyboard shortcut support
- Scriptable deployment
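The `--mode` and `--protocol` flags used in the launch commands can be modeled with `argparse`. This is a sketch, not the project's actual parser: the literal values `gui` and `websocket` are assumptions (the source only states that GUI is the default mode and shows `cli`, `gpio`, and `mqtt` explicitly).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    # "gui" as the default value is an assumption; GUI is the documented default mode.
    parser.add_argument("--mode", choices=["gui", "cli", "gpio"], default="gui")
    # "websocket" as the default value is an assumption.
    parser.add_argument("--protocol", choices=["websocket", "mqtt"], default="websocket")
    return parser


args = build_parser().parse_args(["--mode", "cli", "--protocol", "mqtt"])
print(args.mode, args.protocol)  # cli mqtt

defaults = build_parser().parse_args([])
print(defaults.mode, defaults.protocol)  # gui websocket
```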
GPIO Mode
GPIO button control mode, suitable for Raspberry Pi and other embedded devices:
# GPIO mode launch (Linux only)
python main.py --mode gpio
# GPIO + MQTT protocol
python main.py --mode gpio --protocol mqtt
GPIO Mode Features:
- Linux only (Raspberry Pi)
- Controlled via physical buttons
- No screen or keyboard required
- Suitable for embedded deployment
Default Button Functions:
| Button | GPIO Pin | Function |
|---|---|---|
| KEY1 | GPIO 17 | Start/Stop conversation |
| KEY2 | GPIO 27 | Interrupt current speech |
| KEY3 | GPIO 22 | Toggle manual/auto mode |
| KEY4 | GPIO 23 | Exit program |
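The default button layout can be expressed as a lookup table that maps a pressed pin to an action. The sketch below is hardware-free and does not use any GPIO library; the action names and the `dispatch` helper are hypothetical.

```python
# Default layout from the table above: GPIO pin -> (button, action name).
# The action names are hypothetical labels, not identifiers from py-xiaozhi.
DEFAULT_BUTTONS = {
    17: ("KEY1", "toggle_conversation"),
    27: ("KEY2", "interrupt_speech"),
    22: ("KEY3", "toggle_manual_auto"),
    23: ("KEY4", "exit_program"),
}


def dispatch(pin: int, handlers: dict):
    """Look up the action for a pressed pin and call its handler, if any."""
    entry = DEFAULT_BUTTONS.get(pin)
    if entry is None:
        return None  # unmapped pin: ignore
    _button, action = entry
    handler = handlers.get(action)
    return handler() if handler else None


log = []
handlers = {"interrupt_speech": lambda: log.append("interrupted")}
dispatch(27, handlers)  # KEY2
dispatch(5, handlers)   # unmapped pin, nothing happens
print(log)  # ['interrupted']
```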
Build Features:
- Cross-platform support
- Single-file mode
- Dependency packaging
- Automated configuration
Platform Compatibility
Windows
- Fully Compatible: All features supported
- Audio Enhancement: Windows Audio API support
- Volume Control: Integrated pycaw volume management
- System Tray: Full tray functionality
- Global Hotkeys: Full shortcut key support
macOS
- Fully Compatible: Core features fully supported
- Status Bar: Tray icon displayed in the top status bar
- Permission Management: May need to grant microphone/camera permissions
- Shortcuts: Some shortcuts require system permissions
- Audio: Native CoreAudio support
Linux
- Compatibility: Supports mainstream distributions (Ubuntu/CentOS/Debian)
- Desktop Environments:
- GNOME: Full support
- KDE: Full support
- Xfce: Requires additional tray support
- Audio Systems:
- PulseAudio: Recommended (auto-detected)
- ALSA: Fallback option
- Dependencies: May need to install system tray support packages
# Ubuntu/Debian tray support
sudo apt-get install libappindicator3-1
# CentOS/RHEL tray support
sudo yum install libappindicator-gtk3
Troubleshooting Guide
Common Issues
1. Speech recognition not working
- Use simple wake words like 小美, 小朋 for easier recognition
2. Camera not working
# Test camera
python scripts/camera_scanner.py
# Check camera permissions and device index
3. Shortcuts not responding
- Check whether another program has already registered the same shortcuts
- Try running with administrator privileges (Windows)
- Check whether system security software is blocking the shortcuts
4. Network connection issues
- Check firewall settings
- Verify WebSocket/MQTT server addresses
- Test network connectivity
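A basic connectivity test can be done with a plain TCP connection attempt to the server's host and port. This is a generic sketch using only the standard library: the helper name `can_connect` is hypothetical, and it checks only that the port accepts connections, not that the WebSocket or MQTT handshake succeeds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example (placeholder host; 1883 is the standard unencrypted MQTT port):
# print(can_connect("mqtt.example.com", 1883))
```

If this returns False for the configured server address, the problem lies in the network path or firewall rather than in the client.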