XEUS tops the [ML-SUPERB]() multilingual speech recognition leaderboard, outperforming [MMS](), [w2v-BERT 2.0](), and [XLS-R](). XEUS also sets a new state-of-the-art on 4 tasks in the monolingual [SUPERB]() benchmark.

## Requirements

The code for XEUS is still being merged into the main ESPnet repository. In the meantime, it can be installed from the following fork:

```
pip install -e git+https://github.com/wanchichen/espnet.git@ssl#egg=espnet
```

XEUS supports [FlashAttention](), which can be installed as follows:

```
pip install flash-attn --no-build-isolation
```

## Usage

```
import torch
import soundfile as sf
from torch.nn.utils.rnn import pad_sequence

from espnet2.tasks.ssl import SSLTask

device = "cuda" if torch.cuda.is_available() else "cpu"

xeus_model, xeus_train_args = SSLTask.build_model_from_file(
    config_file=None,
    model_file='/path/to/checkpoint/here/checkpoint.pth',
    device=device,
)

wav, sampling_rate = sf.read('/path/to/audio.wav')  # sampling rate should be 16000
wav = torch.tensor(wav, dtype=torch.float)
wav_lengths = torch.LongTensor([len(wav)]).to(device)
wavs = pad_sequence([wav], batch_first=True).to(device)

# we recommend use_mask=True during fine-tuning
feats = xeus_model.encode(wavs, wav_lengths, use_mask=False, use_final_output=False)[0][-1]  # take the output of the last layer
```
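The `[0][-1]` indexing above keeps only the final layer, but `encode` exposes every layer's hidden states, and downstream tasks often pool them with a learned weighted sum instead (SUPERB-style probing). A minimal sketch of that pooling, with random tensors standing in for the per-layer XEUS features (the layer count and feature size here are illustrative, not the model's):

```python
import torch

# Stand-in for the per-layer features returned by xeus_model.encode(...)[0]:
# a list of [batch, time, dim] tensors, one per encoder layer (dummy sizes).
num_layers, batch, time, dim = 3, 1, 50, 8
layer_feats = [torch.randn(batch, time, dim) for _ in range(num_layers)]

# One learnable scalar per layer, normalized to a convex combination.
layer_weights = torch.nn.Parameter(torch.zeros(num_layers))
norm_weights = torch.softmax(layer_weights, dim=0)

# Weighted sum across layers keeps the [batch, time, dim] shape.
pooled = sum(w * f for w, f in zip(norm_weights, layer_feats))
print(pooled.shape)  # torch.Size([1, 50, 8])
```

Because the weights are a `Parameter`, they can simply be added to the downstream model's optimizer and trained jointly with the probe head.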
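The usage example assumes the input audio is already at 16 kHz. If it is not, resample it before calling `encode`; a minimal sketch using `scipy.signal.resample_poly` on a synthetic tone (the rates and the tone are illustrative; `torchaudio`'s `resample` would work equally well):

```python
import numpy as np
from scipy.signal import resample_poly

orig_sr, target_sr = 44100, 16000

# One second of a 440 Hz sine as stand-in audio at the original rate.
t = np.arange(orig_sr) / orig_sr
wav = np.sin(2 * np.pi * 440.0 * t)

# Polyphase resampling by the rational factor target_sr / orig_sr.
wav_16k = resample_poly(wav, target_sr, orig_sr)
print(len(wav_16k))  # 16000 samples: one second at 16 kHz
```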
## Results