---
language:
- en
license: mit
library_name: same
tags:
- vision-language
- navigation
- embodied-ai
- visual-navigation
- mixture-of-experts
- multimodal
- pytorch
datasets:
- R2R
- REVERIE
- RXR
- CVDN
- SOON
- ObjectNav-MP3D
metrics:
- success_rate
- spl
pipeline_tag: visual-question-answering
model-index:
- name: SAME
results:
- task:
type: visual-navigation
name: Vision-and-Language Navigation
dataset:
type: R2R
name: Room-to-Room (R2R)
metrics:
- type: success_rate
value: 76
name: SR (val_unseen)
- type: spl
value: 66
name: SPL (val_unseen)
- type: success_rate
value: 74
name: SR (test_unseen)
- type: spl
value: 64
name: SPL (test_unseen)
- task:
type: visual-navigation
name: Vision-and-Language Navigation
dataset:
type: REVERIE
name: REVERIE
metrics:
- type: success_rate
value: 46.4
name: SR (val_unseen)
- type: spl
value: 36.1
name: SPL (val_unseen)
- type: success_rate
value: 48.6
name: SR (test_unseen)
- type: spl
value: 37.1
name: SPL (test_unseen)
- task:
type: visual-navigation
name: Multilingual VLN
dataset:
type: RXR
name: RxR-EN
metrics:
- type: success_rate
value: 50.5
name: SR (val_unseen)
- type: ndtw
value: 51.2
name: nDTW (val_unseen)
- task:
type: visual-navigation
name: Dialog Navigation
dataset:
type: CVDN
name: CVDN
metrics:
- type: goal_progress
value: 6.94
name: GP (val)
- type: goal_progress
value: 7.07
name: GP (test)
- task:
type: visual-navigation
name: Object-Oriented Navigation
dataset:
type: SOON
name: SOON
metrics:
- type: success_rate
value: 36.1
name: SR (val_unseen)
- type: spl
value: 25.4
name: SPL (val_unseen)
- type: success_rate
value: 38.2
name: SR (test_unseen)
- type: spl
value: 27.1
name: SPL (test_unseen)
- task:
type: object-navigation
name: Object Navigation
dataset:
type: ObjectNav-MP3D
name: ObjectNav-MP3D
metrics:
- type: success_rate
value: 76.3
name: SR (val)
- type: spl
value: 42.7
name: SPL (val)
---
<div align="center">
<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent;font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1>
<div>
<a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>;
<a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>;
<a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>;
<a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>;
<a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>;
<a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a>
</div>
<sup>🍕</sup>AIML, University of Adelaide
<sup>🌭</sup>Adobe Research
<sup>🍔</sup>UNC, Chapel Hill
<sup>🌮</sup>UNSW Sydney
<br>
<div>
<a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a>
<a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>
</div>
## Model Description
**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.
### Key Features
- **Multi-Task Capability**: Single model handles 9 different navigation datasets simultaneously
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations)
- **Simulator-Free**: Works entirely with pre-computed CLIP ViT-B/16 features; no simulator installation required
- **Flexible Architecture**: MoE can be placed at attention query, key-value, or feed-forward network positions
## Model Architecture
SAME is built on a transformer-based architecture with the following key components:
| Component | Description |
|-----------|-------------|
| **Language Encoder** | 9-layer BERT-based transformer encoder |
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features |
| **Local VP Encoder** | Viewpoint-level information with cross-modal fusion |
| **Global Map Encoder** | Global spatial graph with dynamic routing |
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing |
### MoE Routing
The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This allows the model to adapt its behavior based on:
- The granularity of language instructions
- Current visual observations
- Navigation task requirements
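The routing mechanism can be pictured with a short PyTorch sketch. This is an illustrative top-2 router, not the repository's implementation: the class name `StateAdaptiveRouter`, the hidden dimension, and the use of plain linear layers as experts are assumptions; only the expert count (8) and top-2 selection come from this card.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateAdaptiveRouter(nn.Module):
    """Illustrative top-2 MoE router conditioned on a fused multimodal state feature."""

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts)   # routing logits from the state feature
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )

    def forward(self, tokens, state_feature):
        # state_feature: fused text + visual embedding, shape (batch, hidden_dim)
        logits = self.gate(state_feature)                                   # (batch, num_experts)
        weights, indices = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)               # renormalize top-k weights

        out = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):                                     # simple loops for clarity
            for k in range(self.top_k):
                expert = self.experts[int(indices[b, k])]
                out[b] = out[b] + weights[b, k] * expert(tokens[b])
        return out

# Example: route a batch of 2 sequences of 5 tokens.
tokens = torch.randn(2, 5, 768)
state = torch.randn(2, 768)
print(StateAdaptiveRouter()(tokens, state).shape)   # torch.Size([2, 5, 768])
```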
## Intended Uses
### Primary Use Cases
- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments
- **Object Navigation**: Finding target objects given category names
- **Dialog-based Navigation**: Multi-turn conversational navigation
- **Remote Object Grounding**: Navigating to and identifying remote objects
### Supported Tasks
| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects |
| Long-Horizon VLN | RXR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |
## How to Use
### Installation
```bash
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
```
### Download Data and Models
```bash
# Download all datasets and features
python download.py --data
# Download pretrained models
python download.py --pretrain
# Download trained checkpoints (optional)
python download.py --checkpoints
```
### Training
```bash
cd src
# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml
# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
run.py --config_dir configs/main_multi_q.yaml
```
### Evaluation
```bash
cd src
python run.py --config_dir configs/test.yaml \
--options experiment.resume_file=/path/to/checkpoint.pt
```
### Configuration Options
```yaml
model:
use_moe_layer: true
moe_type: "Task" # Task-based MoE
moe_position: "Attn_q" # Attn_q, Attn_kv, or FFN
task_routing_feature: "multi" # Multimodal routing (recommended)
num_experts: 8
num_experts_per_tok: 2 # Top-2 expert selection
```
## Training Details
### Training Data
SAME is trained on 9 navigation datasets with weighted sampling:
| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
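The weights in the table above can be realized with a simple weighted scheduler. The sketch below is illustrative, not the repository's data loader; the `sample_task` helper and the single fixed weights chosen for R2R-ScaleVLN and REVERIE-ScaleVLN (which the table gives as ranges) are assumptions.
```python
import random

# Illustrative sampling weights; the ScaleVLN weights are scheduled over a range during training.
dataset_weights = {
    "R2R-ScaleVLN": 10,
    "R2R-PREVALENT": 1,
    "R2R": 1,
    "REVERIE-ScaleVLN": 1,
    "REVERIE": 1,
    "RXR-EN": 1,
    "CVDN": 1,
    "SOON": 1,
    "ObjectNav-MP3D": 2,
}

def sample_task(weights):
    """Pick the dataset to draw the next training batch from, proportional to its weight."""
    names, w = zip(*weights.items())
    return random.choices(names, weights=w, k=1)[0]

# Example: a mixed mini-schedule of 5 batches.
print([sample_task(dataset_weights) for _ in range(5)])
```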
### Training Hyperparameters
- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation)
- **MoE Auxiliary Loss Coefficient**: 0.8
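A minimal PyTorch sketch of how these hyperparameters combine in one optimization step is shown below. The placeholder network, loss terms, and tensor shapes are assumptions; only the learning rate, clip norm, batch size, and auxiliary-loss coefficient come from the list above.
```python
import torch
import torch.nn as nn

# A tiny placeholder network stands in for SAME; only the hyperparameter values come from this card.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
moe_aux_coef = 0.8          # MoE auxiliary (load-balancing) loss coefficient
clip_norm = 0.5             # gradient clipping threshold

def training_step(features, targets, moe_aux_loss):
    """One step combining the navigation objective with the MoE auxiliary loss."""
    nav_loss = nn.functional.mse_loss(model(features), targets)   # placeholder objective
    loss = nav_loss + moe_aux_coef * moe_aux_loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    return loss.item()

# Example call with random tensors (batch size 16) and a zero auxiliary loss.
x, y = torch.randn(16, 512), torch.randn(16, 512)
print(training_step(x, y, torch.tensor(0.0)))
```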
### Visual Features
- **Feature Extractor**: CLIP ViT-B/16
- **Feature Dimension**: 512
- **Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
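Pre-computed features are loaded directly from disk rather than extracted at runtime. The sketch below assumes an HDF5 layout keyed by scan and viewpoint IDs; the file name and key convention are illustrative, so check the downloaded feature files for the actual layout.
```python
import h5py
import numpy as np

# Hypothetical file path and key; consult the downloaded feature files for the real convention.
feature_file = "img_features/clip_vit_b16_mp3d.hdf5"
scan_viewpoint_key = "scanID_viewpointID"

with h5py.File(feature_file, "r") as f:
    pano_features = f[scan_viewpoint_key][...]          # e.g. (36, 512): 36 views x 512-dim CLIP
    pano_features = pano_features.astype(np.float32)

print(pano_features.shape)
```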
## Evaluation Results
SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.
### Main Results (Unified Model)
#### Room-to-Room (R2R)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |
#### REVERIE
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |
#### RxR-EN (Multilingual VLN)
| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |
#### CVDN (Dialog Navigation)
| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |
#### SOON (Object-Oriented Navigation)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |
#### ObjectNav-MP3D
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |
### Evaluation Metrics
- **SR (Success Rate)**: Percentage of successful navigations (within 3m of goal)
- **SPL (Success weighted by Path Length)**: Efficiency-weighted success rate
- **nDTW (normalized Dynamic Time Warping)**: Path similarity to ground truth
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance to goal at episode end
- **OSR (Oracle Success Rate)**: Success rate with oracle stop action
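For reference, SR and SPL can be computed from per-episode records as in the sketch below. The episode dictionary fields are hypothetical names; the 3 m success threshold and the SPL formula follow the standard definitions.
```python
def success_rate(episodes, threshold=3.0):
    """Fraction of episodes ending within `threshold` meters of the goal."""
    return sum(ep["nav_error"] < threshold for ep in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by shortest_path_length / max(agent_path_length, shortest_path_length)."""
    total = 0.0
    for ep in episodes:
        success = ep["nav_error"] < threshold
        total += success * ep["shortest_path_length"] / max(ep["path_length"], ep["shortest_path_length"])
    return total / len(episodes)

# Example with two hypothetical episodes.
eps = [
    {"nav_error": 1.2, "path_length": 12.0, "shortest_path_length": 10.0},
    {"nav_error": 5.0, "path_length": 8.0,  "shortest_path_length": 7.5},
]
print(success_rate(eps), spl(eps))
```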
## Model Variants
| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` |
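Loading one of these checkpoints with `torch.load` looks roughly as follows; the internal key layout of the checkpoint file is an assumption, so inspect it before mapping the weights onto a model.
```python
import torch

# Checkpoint file name taken from the table above; whether weights sit under a
# "state_dict" key is an assumption made for illustration.
ckpt_path = "Attnq_pretrained_ckpt.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(type(ckpt), len(state_dict) if hasattr(state_dict, "__len__") else "n/a")
```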
## Limitations
- **Indoor Environments Only**: Trained and evaluated on indoor navigation datasets
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly
- **English Language**: Primary support for English instructions (though RXR provides multilingual data)
- **Static Environments**: Assumes static environments without dynamic obstacles or agents
## Environmental Impact
- **Hardware**: Training conducted on NVIDIA A100 GPUs
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs
## Citation
If you find this work helpful, please cite:
```bibtex
@article{zhou2024same,
title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
journal={arXiv preprint arXiv:2412.05552},
year={2024},
}
```
## Authors
- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io))
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me))
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io))
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5))
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/))
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me))
## Acknowledgements
We extend our gratitude to:
- [MatterPort3D](https://niessner.github.io/Matterport/) for the open-source platform
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors.