adarshzolekar's Collections
Multimodal AI Models
Updated Jan 23
Purpose: Models that understand text + image + audio together.
llava-hf/llava-1.5-7b-hf • Image-Text-to-Text • 7B • Updated Jun 6, 2025 • 3.59M downloads • 365 likes
Salesforce/blip-image-captioning-base • Image-to-Text • Updated Feb 3, 2025 • 2.47M downloads • 859 likes
google/pix2struct-base • Image-to-Text • 0.3B • Updated Dec 24, 2023 • 3.56k downloads • 79 likes
microsoft/kosmos-2-patch14-224 • Image-to-Text • 2B • Updated Nov 28, 2023 • 164k downloads • 185 likes
openbmb/MiniCPM-V-4_5 • Image-Text-to-Text • 9B • Updated Mar 10 • 111k downloads • 1.09k likes