ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal
Paper: arXiv:2508.14689
ECHO (frEquenCy-aware Hierarchical encOding for variable-length signal) is a general machine-signal representation learning model based on Masked Autoencoders (MAE), using band splitting and frequency positional encoding to handle variable-length inputs.
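As a rough sketch of the band-splitting idea (not the model's internal code), the frequency axis of a spectrogram can be sliced into overlapping sub-bands, each of which is encoded separately and tagged with a frequency positional encoding. The window and hop values below mirror the band_width=32 and shift_size=16 used in the configuration further down; the spectrogram sizes are placeholders.

import torch

# Conceptual sketch only: split the frequency axis of a (batch, freq, time)
# spectrogram into overlapping sub-bands of width 32 with a shift of 16.
spec = torch.randn(1, 128, 2000)      # (batch, n_freq_bins, time) -- example sizes
band_width, shift_size = 32, 16
bands = [
    spec[:, start:start + band_width, :]
    for start in range(0, spec.shape[1] - band_width + 1, shift_size)
]
print(len(bands), bands[0].shape)     # 7 overlapping bands, each (1, 32, 2000)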
For an overall performance summary (DCASE anomaly detection and fault classification), see the paper.
from huggingface_hub import snapshot_download

# Download the model to a local directory
model_path = snapshot_download(
    repo_id="yucongzh/echo-tiny-0828",
    local_dir="./echo-tiny",
    local_dir_use_symlinks=False,
)
print(f"Model downloaded to: {model_path}")
import sys
import torch
import torchaudio
from safetensors.torch import load_file

# Add the downloaded model directory to the Python path
sys.path.append('./echo-tiny')

# Import the model architecture
from audioMAE_band_upgrade import AudioMAEWithBand

# Create a model instance with your configuration
model = AudioMAEWithBand(
    spec_len=2000,
    band_width=32,
    shift_size=16,
    in_chans=1,
    embed_dim=192,
    encoder_depth=12,
    num_heads=3,
    mlp_ratio=4.0,
    freq_pos_emb_dim=192,
)

# Load the pre-trained weights from the downloaded snapshot
state_dict = load_file('./echo-tiny/model.safetensors')
model.load_state_dict(state_dict, strict=False)

# Set to evaluation mode
model.eval()
# Example usage
audio_signal = torch.randn(1, 240000)  # 5 seconds at 48 kHz
sample_rate = 48000

# Method 1: Extract features directly from audio (recommended)
with torch.inference_mode():
    utterance_level_features, segment_level_features = model.extract_features_from_audio(
        audio_signal, sample_rate=sample_rate
    )
print(f"Utterance-level feature shape: {utterance_level_features.shape}")
print(f"Segment-level feature shape: {segment_level_features.shape}")

# Method 2: Use preprocessing separately, then extract features
spec = model.preprocess_audio_to_spectrogram(audio_signal, sample_rate=sample_rate)
print(f"Spectrogram shape: {spec.shape}")

# Extract features from the preprocessed spectrogram
with torch.inference_mode():
    utterance_level_features, segment_level_features = model.extract_features(
        spec, sample_rate=sample_rate
    )
print(f"Utterance-level feature shape: {utterance_level_features.shape}")
print(f"Segment-level feature shape: {segment_level_features.shape}")
The ECHO model outputs two types of features:
- Utterance-level features: [N×D] — CLS tokens from all frequency bands, concatenated.
- Segment-level features: [T, N×D] — temporal features for each patch, concatenated across bands.
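Because ECHO is designed for variable-length signals, the number of segment-level time steps T scales with clip duration while the utterance-level dimension stays fixed. A quick check with two hypothetical clip lengths (exact shapes depend on the model's preprocessing):

# Compare feature shapes for clips of different durations.
with torch.inference_mode():
    for seconds in (2, 10):
        clip = torch.randn(1, seconds * sample_rate)
        utt, seg = model.extract_features_from_audio(clip, sample_rate=sample_rate)
        print(f"{seconds}s clip -> utterance {tuple(utt.shape)}, segment {tuple(seg.shape)}")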
If you find ECHO helpful, please consider citing our paper:
@article{echo2025,
title={ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal},
author={Yucong Zhang and Juan Liu and Ming Li},
journal={arXiv preprint arXiv:2508.14689},
year={2025},
}