Multimodal AI Models - an adarshzolekar Collection

Updated Jan 23

Purpose: Models that understand text, images, and audio together.

llava-hf/llava-1.5-7b-hf

Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 3.59M downloads • 365 likes

Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 2.47M downloads • 859 likes

google/pix2struct-base

Image-to-Text • 0.3B • Updated Dec 24, 2023 • 3.56k downloads • 79 likes

microsoft/kosmos-2-patch14-224

Image-to-Text • 2B • Updated Nov 28, 2023 • 164k downloads • 185 likes

openbmb/MiniCPM-V-4_5

Image-Text-to-Text • 9B • Updated Mar 10 • 111k downloads • 1.09k likes