---
pipeline_tag: image-text-to-text
tags:
- MLX
- mlx
base_model:
- Qwen/Qwen3-VL-4B-Instruct
---

# Qwen3-VL-4B-Instruct

Run **Qwen3-VL-4B-Instruct**, optimized for **Apple Silicon** with MLX, using [NexaSDK](https://github.com/NexaAI/nexa-sdk).

## Quickstart

1. **Install [NexaSDK](https://github.com/NexaAI/nexa-sdk)**
2. Run the model locally with a single command:

```bash
nexa infer NexaAI/qwen3vl-4B-4bit-mlx
```
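
Because this build targets Apple Silicon, it can help to check the host before running the command. The following is a minimal sketch, not part of NexaSDK: the function names `is_apple_silicon` and `preflight` are hypothetical, and it assumes the MLX build is meant for macOS on arm64 and that the installer puts a `nexa` executable on your `PATH`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers (illustration only, not part of NexaSDK).

# Succeeds when the OS/arch pair looks like macOS on Apple Silicon.
# Arguments are optional and default to the values reported by uname,
# which makes the check easy to exercise on other machines.
is_apple_silicon() {
  local os="${1:-$(uname -s)}"
  local arch="${2:-$(uname -m)}"
  [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]
}

# Prints "ok" when the host looks suitable and the nexa CLI is on PATH;
# otherwise prints a short reason to stderr and returns non-zero.
# Note: a shell running under Rosetta reports x86_64 and would be rejected.
preflight() {
  if ! is_apple_silicon "$@"; then
    echo "unsupported: this MLX build expects macOS on Apple Silicon" >&2
    return 1
  fi
  if ! command -v nexa >/dev/null 2>&1; then
    echo "missing: nexa CLI not found on PATH" >&2
    return 1
  fi
  echo "ok"
}
```

With these functions sourced into your shell, one possible usage is `preflight && nexa infer NexaAI/qwen3vl-4B-4bit-mlx`, which starts the model only if both checks pass.
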

## Model Description

**Qwen3-VL-4B-Instruct** is a 4-billion-parameter instruction-tuned multimodal large language model from Alibaba Cloud’s Qwen team.
As part of the **Qwen3-VL** series, it combines vision-language understanding with conversational fine-tuning and targets real-world applications such as chat-based reasoning, document analysis, and visual dialogue.

The *Instruct* variant is tuned to follow user prompts naturally and safely, producing concise, relevant responses across text, image, and video contexts.

## Features

- **Instruction-Following**: Optimized for dialogue, explanation, and user-friendly task completion.
- **Vision-Language Fusion**: Understands and reasons across text, images, and video frames.
- **Multilingual Capability**: Handles prompts and responses in multiple languages.
- **Contextual Coherence**: Balances reasoning ability with a natural, grounded conversational tone.
- **Lightweight & Deployable**: At 4B parameters, it is efficient enough for edge and on-device inference.

## Use Cases

- Visual chatbots and assistants
- Image captioning and scene understanding
- Chart, document, or screenshot analysis
- Educational or tutoring systems with visual inputs
- Multilingual, multimodal question answering

## Inputs and Outputs

**Input:**
- Text prompts, image(s), or mixed multimodal instructions.

**Output:**
- Natural-language responses or visual reasoning explanations.
- Structured text (summaries, captions, answers, etc.), depending on the prompt.

## License

Refer to the [official Qwen license](https://huggingface.co/Qwen) for terms of use and redistribution.