---
license: cc-by-nc-4.0
base_model:
- mistralai/Mistral-7B-Instruct-v0.3
tags:
- video
- audio
- multimodal
---
# [Vidi: Large Multimodal Models for Video Understanding and Editing](https://arxiv.org/pdf/2504.15681)
Homepage: [https://bytedance.github.io/vidi-website/](https://bytedance.github.io/vidi-website/)
Github: [https://github.com/bytedance/vidi](https://github.com/bytedance/vidi)
Demo: [https://vidi.byteintl.com/](https://vidi.byteintl.com/)
> We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query.
This model is the first Vidi release and targets temporal retrieval.
Inference and evaluation code are available at [https://github.com/bytedance/vidi](https://github.com/bytedance/vidi).
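For orientation, the sketch below illustrates the shape of a temporal-retrieval call: a video and a free-text query go in, and a list of (start, end) time ranges comes out. The function name, signature, and returned values here are hypothetical placeholders for illustration only; the actual model loading and inference entry points are in the GitHub repository linked above.

```python
# Illustrative sketch only; the real inference code lives in
# https://github.com/bytedance/vidi. Names and signatures here are hypothetical.
from typing import List, Tuple

def retrieve_time_ranges(video_path: str, query: str) -> List[Tuple[float, float]]:
    """Hypothetical temporal-retrieval call: given a video and a free-text query,
    return the (start_sec, end_sec) spans where the queried content appears."""
    # Placeholder result standing in for actual model output.
    return [(12.5, 18.0), (47.2, 52.6)]

# Temporal retrieval (TR): each returned span marks a time range in the input
# video that corresponds to the text query.
for start, end in retrieve_time_ranges("demo.mp4", "the moment the cake is cut"):
    print(f"match: {start:.1f}s - {end:.1f}s")
```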
## Citation
If you find Vidi useful for your research and applications, please cite using this BibTeX:
```
@article{Vidi2025vidi2,
  title={Vidi2: Large Multimodal Models for Video Understanding and Creation},
  author={Vidi Team, Celong Liu, Chia-Wen Kuo, Chuang Huang, Dawei Du, Fan Chen,
          Guang Chen, Haoji Zhang, Haojun Zhao, Lingxi Zhang, Lu Guo, Lusha Li,
          Longyin Wen, Qihang Fan, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew,
          Tong Jin, Weiyan Tao, Wen Zhong, Xiaohui Shen, Xin Gu, Zhenfang Chen, Zuhua Lin},
  journal={arXiv preprint arXiv:2511.19529},
  year={2025}
}

@article{Vidi2025vidi,
  title={Vidi: Large Multimodal Models for Video Understanding and Editing},
  author={Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen,
          Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen,
          Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong,
          Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu, Zhenfang Chen},
  journal={arXiv preprint arXiv:2504.15681},
  year={2025}
}
```