🎥 Visualizing How Multimodal Models Think
This tool generates a video that visualizes how a multimodal model (image + text) attends to different parts of an image while generating text.
📌 What it does:
- Takes an input image and a text prompt.
- Shows how the model’s attention shifts over the image for each generated token.
- Helps explain the model’s behavior and decision-making.
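A minimal sketch of how the per-token signals for this visualization can be collected, assuming a LLaVA-style checkpoint from Hugging Face `transformers`; the model id, prompt template, and patch-grid details below are assumptions for illustration, not necessarily what this tool uses internally.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint and prompt format; swap in whichever multimodal model you target.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")
prompt = "USER: <image>\nDescribe this picture. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Ask generate() for attentions and scores so each decoding step can be visualized later.
out = model.generate(
    **inputs,
    max_new_tokens=32,
    output_attentions=True,
    output_scores=True,
    return_dict_in_generate=True,
)

# out.attentions[t][layer] has shape (batch, heads, query_len, key_len).
# For step t: average the last layer over heads, take the query row of the new
# token, keep only the key positions that belong to image patches, and reshape
# that slice to the vision encoder's patch grid (24 x 24 for LLaVA-1.5) to get
# the heatmap. Which key positions are image patches depends on the model and
# transformers version, so that mapping is left as a comment here.

# Top next-token guesses at step t, for the prediction table.
t = 0
probs = out.scores[t].softmax(dim=-1)
top_probs, top_ids = probs[0].topk(5)
top_tokens = processor.tokenizer.convert_ids_to_tokens(top_ids.tolist())
```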
🖼️ Video layout (per frame):
Each frame in the video includes:
1. 🔥 Heatmap over image: shows which area of the image the model focuses on.
2. 📝 Generated text: the context generated so far, with the current token highlighted.
3. 📊 Token prediction table: the model’s top guesses for the next token.
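A minimal sketch of how one such frame could be composed with matplotlib; `render_frame` and its arguments are illustrative names rather than the tool’s actual API, and the heatmap is assumed to be already resized or aligned to the image.

```python
import numpy as np
import matplotlib.pyplot as plt

def render_frame(image, heatmap, context_text, current_token, top_tokens, top_probs):
    """image: HxWx3 uint8 array; heatmap: 2D array in [0, 1];
    top_tokens/top_probs: the model's top next-token guesses for this step."""
    fig, (ax_img, ax_txt, ax_tab) = plt.subplots(1, 3, figsize=(15, 5))

    # 1. Heatmap over image: where the model is looking while producing this token.
    h, w = image.shape[:2]
    ax_img.imshow(image)
    ax_img.imshow(heatmap, cmap="jet", alpha=0.5, extent=(0, w, h, 0))
    ax_img.axis("off")

    # 2. Generated text so far, with the current token highlighted.
    ax_txt.axis("off")
    ax_txt.text(0.0, 0.7, context_text, fontsize=11, wrap=True, va="top")
    ax_txt.text(0.0, 0.2, current_token, fontsize=13,
                bbox=dict(facecolor="yellow", alpha=0.6))

    # 3. Top next-token predictions as a small table.
    ax_tab.axis("off")
    rows = [[tok, f"{p:.2f}"] for tok, p in zip(top_tokens, top_probs)]
    ax_tab.table(cellText=rows, colLabels=["token", "prob"], loc="center")

    # Rasterize the figure into an RGB frame for the video.
    fig.canvas.draw()
    frame = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()
    plt.close(fig)
    return frame
```

Frames rendered this way can then be stacked and written out with a library such as imageio, e.g. `imageio.mimsave("attention.mp4", frames, fps=2)` (mp4 output needs the imageio-ffmpeg backend).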