SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

---
language:
- en
license: mit
library_name: same
tags:
- vision-language
- navigation
- embodied-ai
- visual-navigation
- mixture-of-experts
- multimodal
- pytorch
datasets:
- R2R
- REVERIE
- RXR
- CVDN
- SOON
- ObjectNav-MP3D
metrics:
- success_rate
- spl
pipeline_tag: visual-question-answering
model-index:
- name: SAME
  results:
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: R2R
      name: Room-to-Room (R2R)
    metrics:
    - type: success_rate
      value: 76
      name: SR (val_unseen)
    - type: spl
      value: 66
      name: SPL (val_unseen)
    - type: success_rate
      value: 74
      name: SR (test_unseen)
    - type: spl
      value: 64
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Vision-and-Language Navigation
    dataset:
      type: REVERIE
      name: REVERIE
    metrics:
    - type: success_rate
      value: 46.4
      name: SR (val_unseen)
    - type: spl
      value: 36.1
      name: SPL (val_unseen)
    - type: success_rate
      value: 48.6
      name: SR (test_unseen)
    - type: spl
      value: 37.1
      name: SPL (test_unseen)
  - task:
      type: visual-navigation
      name: Multilingual VLN
    dataset:
      type: RXR
      name: RxR-EN
    metrics:
    - type: success_rate
      value: 50.5
      name: SR (val_unseen)
    - type: ndtw
      value: 51.2
      name: nDTW (val_unseen)
  - task:
      type: visual-navigation
      name: Dialog Navigation
    dataset:
      type: CVDN
      name: CVDN
    metrics:
    - type: goal_progress
      value: 6.94
      name: GP (val)
    - type: goal_progress
      value: 7.07
      name: GP (test)
  - task:
      type: visual-navigation
      name: Object-Oriented Navigation
    dataset:
      type: SOON
      name: SOON
    metrics:
    - type: success_rate
      value: 36.1
      name: SR (val_unseen)
    - type: spl
      value: 25.4
      name: SPL (val_unseen)
    - type: success_rate
      value: 38.2
      name: SR (test_unseen)
    - type: spl
      value: 27.1
      name: SPL (test_unseen)
  - task:
      type: object-navigation
      name: Object Navigation
    dataset:
      type: ObjectNav-MP3D
      name: ObjectNav-MP3D
    metrics:
    - type: success_rate
      value: 76.3
      name: SR (val)
    - type: spl
      value: 42.7
      name: SPL (val)
---

<div align="center">

<h1><span style="background: linear-gradient(to right, #007BA7, #99B5D2); -webkit-background-clip: text; color: transparent;font-style: italic;"> SAME</span>: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts</h1>

<div>
    <a href='https://gengzezhou.github.io' target='_blank'>Gengze Zhou<sup>🍕</sup></a>;
    <a href='http://www.yiconghong.me' target='_blank'>Yicong Hong<sup>🌭</sup></a>;
    <a href='https://zunwang1.github.io' target='_blank'>Zun Wang<sup>🍔</sup></a>;
    <a href='https://github.com/zhaoc5' target='_blank'>Chongyang Zhao<sup>🌮</sup></a>;
    <a href='https://www.cs.unc.edu/~mbansal/' target='_blank'>Mohit Bansal<sup>🍔</sup></a>;
    <a href='http://www.qi-wu.me' target='_blank'>Qi Wu<sup>🍕</sup></a>
</div>
<sup>🍕</sup>AIML, University of Adelaide 
<sup>🌭</sup>Adobe Research 
<sup>🍔</sup>UNC, Chapel Hill
<sup>🌮</sup>UNSW Sydney

<br>

<div>
    <a href='https://github.com/GengzeZhou/SAME' target='_blank'><img alt="Static Badge" src="https://img.shields.io/badge/VLNBench-v0.1-blue"></a>
    <a href='https://arxiv.org/abs/2412.05552' target='_blank'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
    <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>
</div>

</div>

## Model Description

**SAME** (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both **high-level category-specific search** (e.g., "find a chair") and **low-level language-guided navigation** (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.

### Key Features

- **Multi-Task Capability**: Single model handles 9 different navigation datasets simultaneously
- **State-Adaptive MoE**: Dynamic expert routing based on multimodal features (text + visual observations)
- **Simulator-Free**: Works entirely with pre-computed CLIP ViT-B/16 features - no simulator installation required
- **Flexible Architecture**: MoE can be placed at attention query, key-value, or feed-forward network positions

## Model Architecture

SAME is built on a transformer-based architecture with the following key components:

| Component | Description |
|-----------|-------------|
| **Language Encoder** | 9-layer BERT-based transformer encoder |
| **Image Embeddings** | Processes 512-dim CLIP ViT-B/16 panoramic features |
| **Local VP Encoder** | Viewport-level information with crossmodal fusion |
| **Global Map Encoder** | Global spatial graph with dynamic routing |
| **State-Adaptive MoE** | 8 experts with top-2 selection, multimodal routing |

### MoE Routing

The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This allows the model to adapt its behavior based on:
- The granularity of language instructions
- Current visual observations
- Navigation task requirements

## Intended Uses

### Primary Use Cases

- **Vision-and-Language Navigation (VLN)**: Following natural language instructions in indoor environments
- **Object Navigation**: Finding target objects given category names
- **Dialog-based Navigation**: Multi-turn conversational navigation
- **Remote Object Grounding**: Navigating to and identifying remote objects

### Supported Tasks

| Task | Dataset | Description |
|------|---------|-------------|
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate and ground remote objects |
| Long Horizontal VLN | RXR-EN | Long horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |

## How to Use

### Installation

```bash
git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt
```

### Download Data and Models

```bash
# Download all datasets and features
python download.py --data

# Download pretrained models
python download.py --pretrain

# Download trained checkpoints (optional)
python download.py --checkpoints
```

### Training

```bash
cd src

# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml
```

### Evaluation

```bash
cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt
```

### Configuration Options

```yaml
model:
  use_moe_layer: true
  moe_type: "Task"              # Task-based MoE
  moe_position: "Attn_q"        # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi" # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2        # Top-2 expert selection
```
## Training Details
### Training Data
SAME is trained on 9 navigation datasets with weighted sampling:
| Dataset | Environment | Sampling Weight |
|---------|-------------|-----------------|
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RXR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
### Training Hyperparameters
- **Optimizer**: AdamW
- **Learning Rate**: 1e-5
- **Total Iterations**: 500,000
- **Batch Size**: 16
- **Gradient Clipping**: 0.5
- **Training Algorithm**: DAgger (Dataset Aggregation)
- **MoE Auxiliary Loss Coefficient**: 0.8
### Visual Features
- **Feature Extractor**: CLIP ViT-B/16
- **Feature Dimension**: 512
- **Format**: HDF5 / LMDB
- **Environments**: MatterSim, Habitat-MP3D, Habitat-HM3D
## Evaluation Results
SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a **unified model**, outperforming task-specific approaches in many cases.
### Main Results (Unified Model)
#### Room-to-Room (R2R)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **76** | 66 |
| Test Unseen | **74** | **64** |
#### REVERIE
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | **46.4** | **36.1** |
| Test Unseen | **48.6** | **37.1** |
#### RxR-EN (Multilingual VLN)
| Split | SR ↑ | nDTW ↑ |
|-------|------|--------|
| Val Unseen | **50.5** | **51.2** |
#### CVDN (Dialog Navigation)
| Split | GP ↑ |
|-------|------|
| Val | **6.94** |
| Test | 7.07 |
#### SOON (Object-Oriented Navigation)
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | **38.2** | **27.1** |
#### ObjectNav-MP3D
| Split | SR ↑ | SPL ↑ |
|-------|------|-------|
| Val | **76.3** | 42.7 |
### Evaluation Metrics
- **SR (Success Rate)**: Percentage of successful navigations (within 3m of goal)
- **SPL (Success weighted by Path Length)**: Efficiency-weighted success rate
- **nDTW (normalized Dynamic Time Warping)**: Path similarity to ground truth
- **GP (Goal Progress)**: Progress towards the goal in dialog navigation
- **NE (Navigation Error)**: Distance to goal at episode end
- **OSR (Oracle Success Rate)**: Success rate with oracle stop action
## Model Variants
| Variant | MoE Position | Routing | Checkpoint |
|---------|--------------|---------|------------|
| SAME-Q | Attention Query | Multimodal | `Attnq_pretrained_ckpt.pt` |
| SAME-KV | Attention K/V | Multimodal | `Attnkv_pretrained_ckpt.pt` |
| SAME-FFN | Feed-Forward | Multimodal | `FFN_pretrained_ckpt.pt` |

## Limitations

- **Indoor Environments Only**: Trained and evaluated on indoor navigation datasets
- **Pre-computed Features**: Requires pre-extracted CLIP features; cannot process raw images directly
- **English Language**: Primary support for English instructions (though RXR provides multilingual data)
- **Static Environments**: Assumes static environments without dynamic obstacles or agents

## Environmental Impact

- **Hardware**: Training conducted on NVIDIA A100 GPUs
- **Training Time**: Approximately 2-3 days on 4x A100 GPUs

## Citation

If you find this work helpful, please cite:

```bibtex
@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024},
}
```

## Authors

- **Gengze Zhou** - AIML, University of Adelaide ([Website](https://gengzezhou.github.io))
- **Yicong Hong** - Adobe Research ([Website](http://www.yiconghong.me))
- **Zun Wang** - UNC Chapel Hill ([Website](https://zunwang1.github.io))
- **Chongyang Zhao** - UNSW Sydney ([GitHub](https://github.com/zhaoc5))
- **Mohit Bansal** - UNC Chapel Hill ([Website](https://www.cs.unc.edu/~mbansal/))
- **Qi Wu** - University of Adelaide ([Website](http://www.qi-wu.me))

## Acknowledgements

We extend our gratitude to:
- [MatterPort3D](https://niessner.github.io/Matterport/) for the open-source platform
- [DUET](https://github.com/cshizhe/VLN-DUET) for the foundational architecture
- [ScaleVLN](https://github.com/wz0919/ScaleVLN) for augmented training data
- [NaviLLM](https://github.com/zd11024/NaviLLM) for additional insights

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/GengzeZhou/SAME) or contact the authors.