---
language:
  - en
license: mit
library_name: same
tags:
  - vision-language
  - navigation
  - embodied-ai
  - visual-navigation
  - mixture-of-experts
  - multimodal
  - pytorch
datasets:
  - R2R
  - REVERIE
  - RXR
  - CVDN
  - SOON
  - ObjectNav-MP3D
metrics:
  - success_rate
  - spl
pipeline_tag: visual-question-answering
model-index:
  - name: SAME
    results:
      - task:
          type: visual-navigation
          name: Vision-and-Language Navigation
        dataset:
          type: R2R
          name: Room-to-Room (R2R)
        metrics:
          - type: success_rate
            value: 76
            name: SR (val_unseen)
          - type: spl
            value: 66
            name: SPL (val_unseen)
          - type: success_rate
            value: 74
            name: SR (test_unseen)
          - type: spl
            value: 64
            name: SPL (test_unseen)
      - task:
          type: visual-navigation
          name: Vision-and-Language Navigation
        dataset:
          type: REVERIE
          name: REVERIE
        metrics:
          - type: success_rate
            value: 46.4
            name: SR (val_unseen)
          - type: spl
            value: 36.1
            name: SPL (val_unseen)
          - type: success_rate
            value: 48.6
            name: SR (test_unseen)
          - type: spl
            value: 37.1
            name: SPL (test_unseen)
      - task:
          type: visual-navigation
          name: Multilingual VLN
        dataset:
          type: RXR
          name: RxR-EN
        metrics:
          - type: success_rate
            value: 50.5
            name: SR (val_unseen)
          - type: ndtw
            value: 51.2
            name: nDTW (val_unseen)
      - task:
          type: visual-navigation
          name: Dialog Navigation
        dataset:
          type: CVDN
          name: CVDN
        metrics:
          - type: goal_progress
            value: 6.94
            name: GP (val)
          - type: goal_progress
            value: 7.07
            name: GP (test)
      - task:
          type: visual-navigation
          name: Object-Oriented Navigation
        dataset:
          type: SOON
          name: SOON
        metrics:
          - type: success_rate
            value: 36.1
            name: SR (val_unseen)
          - type: spl
            value: 25.4
            name: SPL (val_unseen)
          - type: success_rate
            value: 38.2
            name: SR (test_unseen)
          - type: spl
            value: 27.1
            name: SPL (test_unseen)
      - task:
          type: object-navigation
          name: Object Navigation
        dataset:
          type: ObjectNav-MP3D
          name: ObjectNav-MP3D
        metrics:
          - type: success_rate
            value: 76.3
            name: SR (val)
          - type: spl
            value: 42.7
            name: SPL (val)
---

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

๐Ÿ•AIML, University of Adelaide ๐ŸŒญAdobe Research ๐Ÿ”UNC, Chapel Hill ๐ŸŒฎUNSW Sydney

Model Description

SAME (State-Adaptive Mixture of Experts) is a unified framework for language-guided visual navigation that consolidates diverse navigation tasks into a single versatile agent. Unlike previous task-specific approaches, SAME can handle both high-level category-specific search (e.g., "find a chair") and low-level language-guided navigation (e.g., detailed turn-by-turn instructions) through a novel state-adaptive Mixture of Experts (MoE) architecture.

Key Features

  • Multi-Task Capability: Single model handles 9 different navigation datasets simultaneously
  • State-Adaptive MoE: Dynamic expert routing based on multimodal features (text + visual observations)
  • Simulator-Free: Works entirely with pre-computed CLIP ViT-B/16 features - no simulator installation required
  • Flexible Architecture: MoE can be placed at attention query, key-value, or feed-forward network positions

Model Architecture

SAME is built on a transformer-based architecture with the following key components:

| Component | Description |
| --- | --- |
| Language Encoder | 9-layer BERT-based transformer encoder |
| Image Embeddings | Processes 512-dim CLIP ViT-B/16 panoramic features |
| Local VP Encoder | Viewpoint-level information with cross-modal fusion |
| Global Map Encoder | Global spatial graph with dynamic routing |
| State-Adaptive MoE | 8 experts with top-2 selection, multimodal routing |

MoE Routing

The State-Adaptive MoE uses multimodal features (fused text + visual embeddings) to dynamically route tokens to specialized experts. This allows the model to adapt its behavior based on:

  • The granularity of language instructions
  • Current visual observations
  • Navigation task requirements
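
The sketch below illustrates this top-2 routing pattern with a multimodal routing feature. It is a minimal, illustrative PyTorch module: the dimensions, the expert form, and the way text and visual states are fused into the routing feature are assumptions, not the repository's implementation.

import torch
import torch.nn as nn

class StateAdaptiveMoESketch(nn.Module):
    """Minimal top-2 MoE routed by a fused text + visual state (illustrative only)."""

    def __init__(self, hidden_dim=768, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts)]
        )

    def forward(self, tokens, text_state, visual_state):
        # Multimodal routing feature: a simple fusion of pooled text and visual states.
        routing_feature = text_state + visual_state               # (batch, hidden_dim)
        logits = self.router(routing_feature)                     # (batch, num_experts)
        weights, indices = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)         # renormalize top-2 weights

        output = torch.zeros_like(tokens)
        for b in range(tokens.size(0)):                           # route each sample to its top-2 experts
            for k in range(self.top_k):
                expert = self.experts[int(indices[b, k])]
                output[b] = output[b] + weights[b, k] * expert(tokens[b])
        return output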

Intended Uses

Primary Use Cases

  • Vision-and-Language Navigation (VLN): Following natural language instructions in indoor environments
  • Object Navigation: Finding target objects given category names
  • Dialog-based Navigation: Multi-turn conversational navigation
  • Remote Object Grounding: Navigating to and identifying remote objects

Supported Tasks

| Task | Dataset | Description |
| --- | --- | --- |
| Low-Level Navigation | R2R, R2R-PREVALENT, R2R-ScaleVLN | Fine-grained instruction following |
| Object Grounding | REVERIE, REVERIE-ScaleVLN | Navigate to and ground remote objects |
| Long-Horizon VLN | RxR-EN | Long-horizon navigation (English) |
| Dialog Navigation | CVDN | Cooperative vision-and-dialog navigation |
| Object Search | SOON | Semantic object-oriented navigation |
| Object Navigation | ObjectNav-MP3D | Category-based object finding |

How to Use

Installation

git clone https://github.com/GengzeZhou/SAME.git
cd SAME
conda create --name SAME python=3.10
conda activate SAME
pip install -r requirements.txt

Download Data and Models

# Download all datasets and features
python download.py --data

# Download pretrained models
python download.py --pretrain

# Download trained checkpoints (optional)
python download.py --checkpoints

Training

cd src

# Single GPU training
python run.py --config_dir configs/main_multi_q.yaml

# Multi-GPU distributed training
torchrun --nproc_per_node=4 --master_port=29500 \
    run.py --config_dir configs/main_multi_q.yaml

Evaluation

cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/checkpoint.pt

Configuration Options

model:
  use_moe_layer: true
  moe_type: "Task"              # Task-based MoE
  moe_position: "Attn_q"        # Attn_q, Attn_kv, or FFN
  task_routing_feature: "multi" # Multimodal routing (recommended)
  num_experts: 8
  num_experts_per_tok: 2        # Top-2 expert selection
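
These settings live in the training YAML (e.g. configs/main_multi_q.yaml). Based on the --options key=value syntax shown in the Evaluation section, a dotted override such as the one below should also work from the command line; the exact key path is an assumption, so check the config files if it does not match.

python run.py --config_dir configs/main_multi_q.yaml \
    --options model.moe_position=FFN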

Training Details

Training Data

SAME is trained on 9 navigation datasets with weighted sampling:

| Dataset | Environment | Sampling Weight |
| --- | --- | --- |
| R2R-ScaleVLN | HM3D | 10-20 |
| R2R-PREVALENT | MP3D | 1 |
| R2R | MP3D | 1 |
| REVERIE-ScaleVLN | HM3D | 1-10 |
| REVERIE | MP3D | 1 |
| RxR-EN | MP3D | 1 |
| CVDN | MP3D | 1 |
| SOON | MP3D | 1 |
| ObjectNav-MP3D | MP3D (Habitat) | 2 |
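
A simple way to realize the weighted mixture above is to pick the source dataset for each training batch in proportion to its weight. The helper below is a minimal sketch under that assumption (a single value is used where the table lists a weight range); it is not the repository's data loader.

import random

# Illustrative weights from the table above; ranges (e.g. 10-20) are collapsed to one value here.
DATASET_WEIGHTS = {
    "R2R-ScaleVLN": 15.0,
    "R2R-PREVALENT": 1.0,
    "R2R": 1.0,
    "REVERIE-ScaleVLN": 5.0,
    "REVERIE": 1.0,
    "RxR-EN": 1.0,
    "CVDN": 1.0,
    "SOON": 1.0,
    "ObjectNav-MP3D": 2.0,
}

def sample_dataset(rng=random):
    """Pick which dataset the next training batch is drawn from."""
    names = list(DATASET_WEIGHTS)
    weights = [DATASET_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]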

Training Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 1e-5
  • Total Iterations: 500,000
  • Batch Size: 16
  • Gradient Clipping: 0.5
  • Training Algorithm: DAgger (Dataset Aggregation)
  • MoE Auxiliary Loss Coefficient: 0.8
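
As a rough illustration of how these hyperparameters fit together in PyTorch, the snippet below wires AdamW (lr 1e-5) and gradient clipping at 0.5 into a single training step; the model and loss are placeholders, not the SAME agent or its DAgger objective.

import torch
import torch.nn as nn

model = nn.Linear(512, 512)                                    # placeholder for the SAME agent
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(features, targets):
    """One optimization step with the hyperparameters listed above."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(features), targets)   # placeholder for the task + MoE auxiliary loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)   # gradient clipping at 0.5
    optimizer.step()
    return loss.item()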

Visual Features

  • Feature Extractor: CLIP ViT-B/16
  • Feature Dimension: 512
  • Format: HDF5 / LMDB
  • Environments: MatterSim, Habitat-MP3D, Habitat-HM3D
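
As a rough sketch of how such pre-computed features are typically consumed, the snippet below reads the 512-dim panoramic CLIP features for one viewpoint from an HDF5 file; the file name and key layout are assumptions, not the repository's actual schema.

import h5py
import numpy as np

# File name and key layout are hypothetical; adapt them to the downloaded feature files.
with h5py.File("clip_vit_b16_features.hdf5", "r") as f:
    key = "<scan_id>_<viewpoint_id>"                 # hypothetical key format
    pano_features = np.asarray(f[key])               # e.g. (num_views, 512) CLIP ViT-B/16 features
print(pano_features.shape)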

Evaluation Results

SAME achieves state-of-the-art or highly competitive performance across all navigation benchmarks as a unified model, outperforming task-specific approaches in many cases.

Main Results (Unified Model)

Room-to-Room (R2R)

| Split | SR ↑ | SPL ↑ |
| --- | --- | --- |
| Val Unseen | 76 | 66 |
| Test Unseen | 74 | 64 |

REVERIE

| Split | SR ↑ | SPL ↑ |
| --- | --- | --- |
| Val Unseen | 46.4 | 36.1 |
| Test Unseen | 48.6 | 37.1 |

RxR-EN (Multilingual VLN)

| Split | SR ↑ | nDTW ↑ |
| --- | --- | --- |
| Val Unseen | 50.5 | 51.2 |

CVDN (Dialog Navigation)

| Split | GP ↑ |
| --- | --- |
| Val | 6.94 |
| Test | 7.07 |

SOON (Object-Oriented Navigation)

| Split | SR ↑ | SPL ↑ |
| --- | --- | --- |
| Val Unseen | 36.1 | 25.4 |
| Test Unseen | 38.2 | 27.1 |

ObjectNav-MP3D

| Split | SR ↑ | SPL ↑ |
| --- | --- | --- |
| Val | 76.3 | 42.7 |

Evaluation Metrics

  • SR (Success Rate): Percentage of successful navigations (within 3m of goal)
  • SPL (Success weighted by Path Length): Efficiency-weighted success rate
  • nDTW (normalized Dynamic Time Warping): Path similarity to ground truth
  • GP (Goal Progress): Progress towards the goal in dialog navigation
  • NE (Navigation Error): Distance to goal at episode end
  • OSR (Oracle Success Rate): Success rate with oracle stop action
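
SR and SPL follow the standard definitions used across these benchmarks; the minimal computation below shows how they relate (the per-episode field names are illustrative):

def success_rate(episodes, threshold=3.0):
    """Fraction of episodes that end within `threshold` meters of the goal."""
    return sum(ep["dist_to_goal"] <= threshold for ep in episodes) / len(episodes)

def spl(episodes, threshold=3.0):
    """Success weighted by shortest_path / max(agent_path, shortest_path), averaged over episodes."""
    total = 0.0
    for ep in episodes:
        success = ep["dist_to_goal"] <= threshold
        total += success * ep["shortest_path_length"] / max(ep["path_length"], ep["shortest_path_length"])
    return total / len(episodes)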

Model Variants

| Variant | MoE Position | Routing | Checkpoint |
| --- | --- | --- | --- |
| SAME-Q | Attention Query | Multimodal | Attnq_pretrained_ckpt.pt |
| SAME-KV | Attention Key/Value | Multimodal | Attnkv_pretrained_ckpt.pt |
| SAME-FFN | Feed-Forward | Multimodal | FFN_pretrained_ckpt.pt |
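
To evaluate one of these variants, point the evaluation command at the corresponding checkpoint (the checkpoint path below is a placeholder):

cd src
python run.py --config_dir configs/test.yaml \
    --options experiment.resume_file=/path/to/Attnq_pretrained_ckpt.pt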

Limitations

  • Indoor Environments Only: Trained and evaluated on indoor navigation datasets
  • Pre-computed Features: Requires pre-extracted CLIP features; cannot process raw images directly
  • English Language: Primary support for English instructions (though RXR provides multilingual data)
  • Static Environments: Assumes static environments without dynamic obstacles or agents

Environmental Impact

  • Hardware: Training conducted on NVIDIA A100 GPUs
  • Training Time: Approximately 2-3 days on 4x A100 GPUs

Citation

If you find this work helpful, please cite:

@article{zhou2024same,
  title={SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts},
  author={Gengze Zhou and Yicong Hong and Zun Wang and Chongyang Zhao and Mohit Bansal and Qi Wu},
  journal={arXiv preprint arXiv:2412.05552},
  year={2024},
}

Authors

  • Gengze Zhou - AIML, University of Adelaide
  • Yicong Hong - Adobe Research
  • Zun Wang - UNC Chapel Hill
  • Chongyang Zhao - UNSW Sydney
  • Mohit Bansal - UNC Chapel Hill
  • Qi Wu - University of Adelaide

Acknowledgements

We extend our gratitude to:

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or issues, please open an issue on the GitHub repository or contact the authors.