|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: mit |
|
|
library_name: same |
|
|
tags: |
|
|
- vision-language |
|
|
- navigation |
|
|
- embodied-ai |
|
|
- visual-navigation |
|
|
- mixture-of-experts |
|
|
- multimodal |
|
|
- pytorch |
|
|
datasets: |
|
|
- R2R |
|
|
- REVERIE |
|
|
- RXR |
|
|
- CVDN |
|
|
- SOON |
|
|
- ObjectNav-MP3D |
|
|
metrics: |
|
|
- success_rate |
|
|
- spl |
|
|
pipeline_tag: visual-question-answering |
|
|
model-index: |
|
|
- name: SAME |
|
|
results: |
|
|
- task: |
|
|
type: visual-navigation |
|
|
name: Vision-and-Language Navigation |
|
|
dataset: |
|
|
type: R2R |
|
|
name: Room-to-Room (R2R) |
|
|
metrics: |
|
|
- type: success_rate |
|
|
value: 76 |
|
|
name: SR (val_unseen) |
|
|
- type: spl |
|
|
value: 66 |
|
|
name: SPL (val_unseen) |
|
|
- type: success_rate |
|
|
value: 74 |
|
|
name: SR (test_unseen) |
|
|
- type: spl |
|
|
value: 64 |
|
|
name: SPL (test_unseen) |
|
|
- task: |
|
|
type: visual-navigation |
|
|
name: Vision-and-Language Navigation |
|
|
dataset: |
|
|
type: REVERIE |
|
|
name: REVERIE |
|
|
metrics: |
|
|
- type: success_rate |
|
|
value: 46.4 |
|
|
name: SR (val_unseen) |
|
|
- type: spl |
|
|
value: 36.1 |
|
|
name: SPL (val_unseen) |
|
|
- type: success_rate |
|
|
value: 48.6 |
|
|
name: SR (test_unseen) |
|
|
- type: spl |
|
|
value: 37.1 |
|
|
name: SPL (test_unseen) |
|
|
- task: |
|
|
type: visual-navigation |
|
|
name: Multilingual VLN |
|
|
dataset: |
|
|
type: RXR |
|
|
name: RxR-EN |
|
|
metrics: |
|
|
- type: success_rate |
|
|
value: 50.5 |
|
|
name: SR (val_unseen) |
|
|
- type: ndtw |
|
|
value: 51.2 |
|
|
name: nDTW (val_unseen) |
|
|
- task: |
|
|
type: visual-navigation |
|
|
name: Dialog Navigation |
|
|
dataset: |
|
|
type: CVDN |
|
|
name: CVDN |
|
|
metrics: |
|
|
- type: goal_progress |
|
|
value: 6.94 |
|
|
name: GP (val) |
|
|
- type: goal_progress |
|
|
value: 7.07 |
|
|
name: GP (test) |
|
|
- task: |
|
|
type: visual-navigation |
|
|
name: Object-Oriented Navigation |
|
|
dataset: |
|
|
type: SOON |
|
|
name: SOON |
|
|
metrics: |
|
|
- type: success_rate |
|
|
value: 36.1 |
|
|
name: SR (val_unseen) |
|
|
- type: spl |
|
|
value: 25.4 |
|
|
name: SPL (val_unseen) |
|
|
- type: success_rate |
|
|
value: 38.2 |
|
|
name: SR (test_unseen) |
|
|
- type: spl |
|
|
value: 27.1 |
|
|
name: SPL (test_unseen) |
|
|
- task: |
|
|
type: object-navigation |
|
|
name: Object Navigation |
|
|
dataset: |
|
|
type: ObjectNav-MP3D |
|
|
name: ObjectNav-MP3D |
|
|
metrics: |
|
|
- type: success_rate |
|
|
value: 76.3 |
|
|
name: SR (val) |
|
|
- type: spl |
|
|
value: 42.7 |
|
|
name: SPL (val) |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent;font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1> |
|
|
|
|
|
<div> |
|
|
<a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>; |
|
|
<a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>; |
|
|
<a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>; |
|
|
<a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>; |
|
|
<a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>; |
|
|
<a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a> |
|
|
</div> |
|
|
<sup>🍕</sup>AIML, University of Adelaide |
|
|
<sup>🌭</sup>Adobe Research |
|
|
<sup>🍔</sup>UNC Chapel Hill
|
|
<sup>🌮</sup>UNSW Sydney |
|
|
|
|
|
<br> |
|
|
|
|
|
<div> |
|
|
<a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a> |
|
|
<a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> |
|
|
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a> |
|
|
</div> |
|
|
|
|
|
</div> |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Multi-Task Capability**: A single model handles 9 navigation datasets simultaneously
|
|
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations) |
|
|
- **Simulator-Free**: Works entirely with pre-computed CLIP ViT-B/16 features; no simulator installation is required
|
|
- **Flexible Architecture**: MoE can be placed at attention query, key-value, or feed-forward network positions |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
SAME is built on a transformer-based architecture with the following key components: |
|
|
|
|
|
| Component | Description | |
|
|
|-----------|-------------| |
|
|
| **Language Encoder** | 9-layer BERT-based transformer encoder | |
|
|
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features | |
|
|
| **Local VP Encoder** | Encodes viewpoint-level (local) observations with cross-modal fusion |
|
|
| **Global Map Encoder** | Global spatial graph with dynamic routing | |
|
|
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing | |
|
|
|
|
|
### MoE Routing |
|
|
|
|
|
The State-Adaptive MoE uses multimodal features (fused text and visual embeddings) to dynamically route tokens to specialized experts; a minimal sketch follows the list below. This allows the model to adapt its behavior based on:
|
|
- The granularity of language instructions |
|
|
- Current visual observations |
|
|
- Navigation task requirements |
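
The following is a minimal sketch of this routing idea, assuming a pooled text state and a pooled visual state as the router input and the top-2-of-8 configuration described in this card; the class and argument names are illustrative, not the repository's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StateAdaptiveMoE(nn.Module):
    """Illustrative top-2-of-8 expert layer routed by a pooled multimodal state."""

    def __init__(self, dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # scores experts from the fused state
        self.top_k = top_k

    def forward(self, tokens, text_state, visual_state):
        # Fuse the instruction and observation summaries into one routing state.
        state = text_state + visual_state                      # (batch, dim)
        gates = F.softmax(self.router(state), dim=-1)          # (batch, num_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)          # keep the two best experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise their gate weights

        out = torch.zeros_like(tokens)                         # (batch, seq, dim)
        for b in range(tokens.size(0)):
            for k in range(self.top_k):
                out[b] += weights[b, k] * self.experts[idx[b, k]](tokens[b])
        return out
```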
|
|
|
|
|
## Intended Uses |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments |
|
|
- **Object Navigation**: Finding target objects given category names |
|
|
- **Dialog-based Navigation**: Multi-turn conversational navigation |
|
|
- **Remote Object Grounding**: Navigating to and identifying remote objects |
|
|
|
|
|
### Supported Tasks |
|
|
|
|
|
| Task | Dataset | Description | |
|
|
|------|---------|-------------| |
|
|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following | |
|
|
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects | |
|
|
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
|
|
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation | |
|
|
| Object Search | SOON | Semantic object-oriented navigation | |
|
|
| Object Navigation | ObjectNav-MP3D | Category-based object finding | |
|
|
|
|
|
## How to Use |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/GengzeZhou/SAME.git |
|
|
cd SAME |
|
|
conda create --name SAME python=3.10 |
|
|
conda activate SAME |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
### Download Data and Models |
|
|
|
|
|
```bash |
|
|
# Download all datasets and features |
|
|
python download.py --data |
|
|
|
|
|
# Download pretrained models |
|
|
python download.py --pretrain |
|
|
|
|
|
# Download trained checkpoints (optional) |
|
|
python download.py --checkpoints |
|
|
``` |
|
|
|
|
|
### Training |
|
|
|
|
|
```bash |
|
|
cd src |
|
|
|
|
|
# Single GPU training |
|
|
python run.py --config_dir configs/main_multi_q.yaml |
|
|
|
|
|
# Multi-GPU distributed training |
|
|
torchrun --nproc_per_node=4 --master_port=29500 \ |
|
|
run.py --config_dir configs/main_multi_q.yaml |
|
|
``` |
|
|
|
|
|
### Evaluation |
|
|
|
|
|
```bash |
|
|
cd src |
|
|
python run.py --config_dir configs/test.yaml \ |
|
|
--options experiment.resume_file=/path/to/checkpoint.pt |
|
|
``` |
|
|
|
|
|
### Configuration Options |
|
|
|
|
|
```yaml |
|
|
model: |
|
|
use_moe_layer: true |
|
|
moe_type: "Task" # Task-based MoE |
|
|
moe_position: "Attn_q" # Attn_q, Attn_kv, or FFN |
|
|
task_routing_feature: "multi" # Multimodal routing (recommended) |
|
|
num_experts: 8 |
|
|
num_experts_per_tok: 2 # Top-2 expert selection |
|
|
``` |
|
|
## Training Details |
|
|
### Training Data |
|
|
SAME is trained on 9 navigation datasets with weighted sampling (a small sampling sketch follows the table):
|
|
| Dataset | Environment | Sampling Weight | |
|
|
|---------|-------------|-----------------| |
|
|
| R2R-ScaleVLN | HM3D | 10-20 | |
|
|
| R2R-PREVALENT | MP3D | 1 | |
|
|
| R2R | MP3D | 1 | |
|
|
| REVERIE-ScaleVLN | HM3D | 1-10 | |
|
|
| REVERIE | MP3D | 1 | |
|
|
| RXR-EN | MP3D | 1 | |
|
|
| CVDN | MP3D | 1 | |
|
|
| SOON | MP3D | 1 | |
|
|
| ObjectNav-MP3D | MP3D (Habitat) | 2 | |
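
For intuition, here is a hedged sketch of how such per-dataset weights can drive mixed-batch sampling; the representative weight values are picked from the ranges above, and the helper itself is illustrative rather than the repository's data loader.

```python
import random

# Representative per-dataset weights picked from the ranges in the table above;
# the actual schedule used during training may differ.
DATASET_WEIGHTS = {
    "R2R-ScaleVLN": 15, "R2R-PREVALENT": 1, "R2R": 1,
    "REVERIE-ScaleVLN": 5, "REVERIE": 1, "RXR-EN": 1,
    "CVDN": 1, "SOON": 1, "ObjectNav-MP3D": 2,
}

def sample_dataset(rng=random):
    """Pick which dataset the next mini-batch is drawn from."""
    names, weights = zip(*DATASET_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]

# With these example weights, roughly 15 of every 28 mini-batches come from R2R-ScaleVLN.
```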
|
|
### Training Hyperparameters |
|
|
- **Optimizer**: AdamW |
|
|
- **Learning Rate**: 1e-5 |
|
|
- **Total Iterations**: 500,000 |
|
|
- **Batch Size**: 16 |
|
|
- **Gradient Clipping**: 0.5 |
|
|
- **Training Algorithm**: DAgger (Dataset Aggregation) |
|
|
- **MoE Auxiliary Loss Coefficient**: 0.8 |
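
As a concrete reference, here is a minimal PyTorch sketch wiring the listed optimizer, learning-rate, gradient-clipping, and auxiliary-loss settings together; `model` returning both losses is a placeholder, and the actual training loop (DAgger rollouts, multi-task batching) is more involved.

```python
import torch

# Hypothetical setup mirroring the hyperparameters above:
#   optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(model, batch, optimizer, moe_aux_coef=0.8):
    """One illustrative optimization step; `model` returning both losses is a placeholder."""
    nav_loss, moe_aux_loss = model(batch)                  # placeholder forward pass
    loss = nav_loss + moe_aux_coef * moe_aux_loss          # add the MoE auxiliary loss (coef 0.8)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)  # gradient clipping at 0.5
    optimizer.step()
    return loss.item()
```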
|
|
### Visual Features |
|
|
- **Feature Extractor**: CLIP ViT-B/16 |
|
|
- **Feature Dimension**: 512 |
|
|
- **Format**: HDF5 / LMDB |
|
|
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D |
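
Below is a hedged example of reading pre-extracted panoramic features from an HDF5 store; the file name, key layout, and the 36-view panorama assumption are illustrative and may differ from the released feature files.

```python
import h5py
import numpy as np

FEATURE_FILE = "clip_vit_b16_mp3d_pano.hdf5"  # hypothetical file name

def load_pano_features(scan_id: str, viewpoint_id: str) -> np.ndarray:
    """Return panoramic CLIP features for one viewpoint (assumed shape: [36 views, 512])."""
    with h5py.File(FEATURE_FILE, "r") as f:
        key = f"{scan_id}_{viewpoint_id}"      # assumed key format
        feats = f[key][...].astype(np.float32)
    assert feats.shape[-1] == 512, "CLIP ViT-B/16 features are 512-dimensional"
    return feats
```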
|
|
## Evaluation Results |
|
|
SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases. |
|
|
### Main Results (Unified Model) |
|
|
#### Room-to-Room (R2R) |
|
|
| Split | SR ↑ | SPL ↑ | |
|
|
|-------|------|-------| |
|
|
| Val Unseen | **76** | 66 | |
|
|
| Test Unseen | **74** | **64** | |
|
|
#### REVERIE |
|
|
| Split | SR ↑ | SPL ↑ | |
|
|
|-------|------|-------| |
|
|
| Val Unseen | **46.4** | **36.1** | |
|
|
| Test Unseen | **48.6** | **37.1** | |
|
|
#### RxR-EN (Multilingual VLN) |
|
|
| Split | SR ↑ | nDTW ↑ | |
|
|
|-------|------|--------| |
|
|
| Val Unseen | **50.5** | **51.2** | |
|
|
#### CVDN (Dialog Navigation) |
|
|
| Split | GP ↑ | |
|
|
|-------|------| |
|
|
| Val | **6.94** | |
|
|
| Test | 7.07 | |
|
|
#### SOON (Object-Oriented Navigation) |
|
|
| Split | SR ↑ | SPL ↑ | |
|
|
|-------|------|-------| |
|
|
| Val Unseen | 36.1 | 25.4 | |
|
|
| Test Unseen | **38.2** | **27.1** | |
|
|
#### ObjectNav-MP3D |
|
|
| Split | SR ↑ | SPL ↑ | |
|
|
|-------|------|-------| |
|
|
| Val | **76.3** | 42.7 | |
|
|
### Evaluation Metrics |
|
|
- **SR (Success Rate)**: Percentage of successful navigations (within 3m of goal) |
|
|
- **SPL (Success weighted by Path Length)**: Efficiency-weighted success rate |
|
|
- **nDTW (normalized Dynamic Time Warping)**: Path similarity to ground truth |
|
|
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation |
|
|
- **NE (Navigation Error)**: Distance to goal at episode end |
|
|
- **OSR (Oracle Success Rate)**: Success rate with oracle stop action |
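
For reference, here is a short sketch of how SR and SPL are computed under their standard definitions (success within 3 m of the goal; SPL as in Anderson et al., 2018); the episode field names are illustrative.

```python
def success_rate_and_spl(episodes, success_dist=3.0):
    """episodes: list of dicts with 'nav_error', 'path_length', 'shortest_path' in metres."""
    sr = spl = 0.0
    for ep in episodes:
        success = ep["nav_error"] <= success_dist           # within 3 m of the goal
        sr += success
        if success:
            # SPL weights each success by shortest-path / max(taken-path, shortest-path).
            spl += ep["shortest_path"] / max(ep["path_length"], ep["shortest_path"])
    n = len(episodes)
    return 100 * sr / n, 100 * spl / n
```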
|
|
## Model Variants |
|
|
| Variant | MoE Position | Routing | Checkpoint | |
|
|
|---------|--------------|---------|------------| |
|
|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` | |
|
|
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` | |
|
|
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` | |
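
To load one of these variants programmatically, a minimal sketch is shown below; it assumes the checkpoint file contains a (possibly nested) PyTorch state dict, which should be verified against the released files and paired with the matching MoE-position config.

```python
import torch

CKPT = "Attnq_pretrained_ckpt.pt"  # SAME-Q row in the table above

ckpt = torch.load(CKPT, map_location="cpu")
# Checkpoints are assumed to hold a state dict, possibly nested under a key such as "state_dict".
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"Loaded {len(state_dict)} entries from {CKPT}")
# model.load_state_dict(state_dict)  # after building SAME with the corresponding config
```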
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Indoor Environments Only**: Trained and evaluated on indoor navigation datasets |
|
|
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly |
|
|
- **English Language**: Primary support for English instructions (though RXR provides multilingual data) |
|
|
- **Static Environments**: Assumes static environments without dynamic obstacles or agents |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware**: Training conducted on NVIDIA A100 GPUs |
|
|
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find this work helpful, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{zhou2024same, |
|
|
title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts}, |
|
|
author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu}, |
|
|
journal={arXiv preprint arXiv:2412.05552}, |
|
|
year={2024}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Authors |
|
|
|
|
|
- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io)) |
|
|
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me)) |
|
|
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io)) |
|
|
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5)) |
|
|
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/)) |
|
|
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me)) |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
We extend our gratitude to: |
|
|
- [Matterport3D](https://niessner.github.io/Matterport/) for the open-source platform
|
|
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture |
|
|
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data |
|
|
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights |
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. |
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors. |