Purpose: Models that understand text + image + audio together.
- llava-hf/llava-1.5-7b-hf
  Image-Text-to-Text • 7B • 4.37M downloads • 349 likes
- Salesforce/blip-image-captioning-base
  Image-to-Text • 3.28M downloads • 846 likes
- google/pix2struct-base
  Image-to-Text • 0.3B • 2.8k downloads • 79 likes
- microsoft/kosmos-2-patch14-224
  Image-to-Text • 173k downloads • 184 likes