Qwen3VL-8B QLoRA 4-bit - Japanese Photo Conversation
🎨 Vision-Language Model | 📸 Photo Description | 🇯🇵 Japanese Specialized
Built with Qwen3-VL | Fine-tuning: 4-bit QLoRA (84MB adapter) | Framework: LLaMA-Factory | Hardware: 2×RTX 4090 24GB (Multi-node)
A Japanese-specialized vision-language model fine-tuned from Qwen/Qwen3-VL-8B-Instruct using 4-bit QLoRA on multi-node distributed infrastructure.
Model Description
This model adapts the general-purpose Qwen3-VL-8B into a specialized Japanese photo description system. Through LoRA fine-tuning on 11,808 Japanese photo-conversation pairs, it learns to produce concise, objective descriptions while significantly reducing hallucinations common in the base model.
Key Improvements Over Base Model
- Eliminated location hallucinations - Base model frequently guessed specific place names incorrectly
- Fixed infinite generation loops - Base model got stuck repeating text in some cases
- Concise objective descriptions - Matches human annotation style instead of encyclopedia-like responses
- No more OCR over-reliance - Focuses on visual understanding rather than text reading
- Consistent output format - Predictable response length and structure
Quick Start
What is This Model?
This is a LoRA adapter (not a full model). You need to:
- Load the base model: Qwen/Qwen3-VL-8B-Instruct
- Apply this LoRA adapter on top of it
Advantage: Only ~84MB download instead of ~8.7GB full model!
Installation
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
Usage
from llamafactory.chat import ChatModel
# Initialize model with LoRA adapter
chat_model = ChatModel(args={
    "model_name_or_path": "Qwen/Qwen3-VL-8B-Instruct",
    "adapter_name_or_path": "WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation",
    "template": "qwen3_vl_nothink",
    "quantization_bit": 4,
    "trust_remote_code": True,
    "flash_attn": "fa2",  # Optional: enable FlashAttention-2 for faster inference
    "infer_backend": "huggingface",
})

# Ask a question about an image ("What is shown in this image?")
messages = [{"role": "user", "content": "<image>\nこの画像には何が写っていますか?"}]
responses = chat_model.chat(messages=messages, images=["your_image.jpg"])
print(responses[0].response_text)
See inference_example.py for a complete working example with multiple question types.
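If you prefer plain transformers at inference time, the adapter can also be attached with peft. The following is a minimal sketch, not part of this repository: it assumes a recent transformers release with Qwen3-VL support plus peft, bitsandbytes, and accelerate installed, and the `AutoModelForImageTextToText` / chat-template calls should be verified against your installed versions.

```python
# Sketch (not from this repo): load the base model in 4-bit with transformers +
# bitsandbytes, then attach the ~84MB LoRA adapter with peft.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

BASE = "Qwen/Qwen3-VL-8B-Instruct"
ADAPTER = "WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(BASE)
model = AutoModelForImageTextToText.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)  # apply the LoRA adapter

image = Image.open("your_image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "この画像には何が写っていますか?"},  # "What is shown in this image?"
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```

For the configuration actually tested by the author, use the LLaMA-Factory snippet above or inference_example.py.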
Hardware Requirements
| Configuration | VRAM Required |
|---|---|
| 4-bit Quantization (as used in training) | ~10-12GB |
Recommended GPU: RTX 3090 / 4090 / A100 or equivalent with 12GB+ VRAM
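As a quick sanity check before loading the model, you can confirm how much memory the visible GPU actually has. This is a small PyTorch sketch; the 12GB threshold simply mirrors the recommendation above.

```python
# Sketch: verify the visible GPU has roughly 12GB+ of memory for 4-bit inference.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    print(f"{props.name}: {total_gib:.1f} GiB total VRAM")
    if total_gib < 12:
        print("Warning: below the recommended 12GB of VRAM for 4-bit inference.")
else:
    print("No CUDA device visible; GPU inference is not possible.")
```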
Training Details
Base Model
- Source: Qwen/Qwen3-VL-8B-Instruct
- Parameters: 8.7 billion
- Architecture: Qwen3-VL (Vision-Language)
- Context Length: 262,144 tokens
- Vision Encoder: ViT-based with spatial merge
Training Data
Primary Dataset: WayBob/Japanese_Photo_conversation_cleaned
This dataset is cleaned and organized from:
- llm-jp/japanese-photos-conversation
- ThePioneer/japanese-photos
We thank the original authors for their excellent work.
| Split | Samples | Percentage |
|---|---|---|
| Training | 11,217 | 95% |
| Validation | 591 | 5% |
| Total | 11,808 | 100% |
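For reference, the 95/5 split above can be reproduced approximately with the datasets library. This is a sketch only: the split name, seed, and schema of WayBob/Japanese_Photo_conversation_cleaned are assumptions, not values taken from the training scripts.

```python
# Sketch: load the cleaned dataset and create a 95/5 train/validation split.
# Split name ("train") and seed are assumptions; check the dataset card for the
# actual schema and any predefined splits.
from datasets import load_dataset

ds = load_dataset("WayBob/Japanese_Photo_conversation_cleaned", split="train")
splits = ds.train_test_split(test_size=0.05, seed=42)

train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # expect roughly 11,217 / 591
print(train_ds[0])                 # inspect one conversation/image example
```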
Training Configuration
Hardware & Infrastructure:
- GPUs: 2× NVIDIA RTX 4090 24GB
- Distributed Setup: Multi-node DDP (Distributed Data Parallel)
- Master Node: 192.168.11.8
- Worker Node: 192.168.11.10
- 1 GPU per node
- Framework: LLaMA-Factory
Hyperparameters:
method: qlora_4bit
quantization: 4-bit (BitsAndBytes)
lora_rank: 8
lora_alpha: 16
lora_target: all (all linear layers)
lora_dropout: 0.05
learning_rate: 1.0e-4
batch_size: 4 per device × 2 gradient accumulation × 2 GPUs = 16 effective (global)
epochs: 3
optimizer: AdamW
lr_scheduler: cosine
warmup_ratio: 0.1
precision: bfloat16
Training Duration: ~2.85 hours (10,253 seconds)
Memory Usage: ~10-12GB VRAM per GPU
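Training was driven by the LLaMA-Factory YAML (qwen3vl_8b_japanese_4bit.yaml), but as a rough translation of the hyperparameters above into raw peft/bitsandbytes objects, the sketch below may help. It is illustrative only; in particular, `target_modules="all-linear"` is peft's shorthand and may not match LLaMA-Factory's `lora_target: all` exactly.

```python
# Illustrative sketch only: approximate peft / bitsandbytes equivalents of the
# QLoRA hyperparameters listed above. The actual run used LLaMA-Factory's YAML.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantization: 4-bit (BitsAndBytes)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # precision: bfloat16
)

lora_config = LoraConfig(
    r=8,                          # lora_rank
    lora_alpha=16,                # lora_alpha
    lora_dropout=0.05,            # lora_dropout
    target_modules="all-linear",  # lora_target: all (all linear layers)
    task_type="CAUSAL_LM",
)
```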
Training Results
| Metric | Value |
|---|---|
| Final Training Loss | 0.8665 |
| Final Validation Loss | 0.8579 |
| Training Throughput | 3.28 samples/second |
| Total Training Steps | 2,106 |
Training Loss Curve: see training_loss.png
Validation Loss Curve: see training_eval_loss.png
Performance Evaluation
We conducted comprehensive comparisons between the base model and fine-tuned model on both training and validation sets.
Example 1: Coastal Scene (Training Set)
Question: この画像はどのような場所を映していますか? ("What kind of place does this image show?")
| Model | Response |
|---|---|
| Base Model | この画像は日本、愛知県知多郡美浜町の「知多半岛」にある「知多海岸」(特に「知多海岸の浜」や「知多海岸の遊歩道」付近)を撮影したものです... ISSUE: Hallucinates specific location name with confidence |
| Fine-tuned | この画像は海岸沿いの風景を映しています。海、砂浜、そしてその背後に山々が見えます。 |
| Ground Truth | この画像は、海岸の風景を映しています。波打ち際には砂利や石があり、奥には山と空が広がっています。右側には舗装された道が見え、その横には石が多く並んでいます。 |
Result: HALLUCINATION ELIMINATED - Model provides objective description without guessing specific locations.
Example 2: Indoor Building (Training Set)
Question: この画像はどのような場所で撮影されたものですか? ("Where was this image taken?")
| Model | Response |
|---|---|
| Base Model | この画像は東京都港区の虎ノ門・新宿エリアにある外務省・旧外務省庁舎...または「虎ノ門ヒルズ」の地下通路...または「虎ノ門ヒルズ」の地下通路...[repeats "虎ノ門ヒルズ" 30+ times] ISSUE: Infinite repetition loop |
| Fine-tuned | この画像は、商業施設やオフィスビルのような公共の室内で撮影されたものです。エスカレーターがあり、案内表示が見られます。 |
| Ground Truth | この画像は大きなビルの内部で、エスカレーターや案内標識が見られる場所で撮影されたものです。公共の施設やオフィスビルのように見えます。 |
Result: CRITICAL BUG FIXED - Eliminates infinite loop, generates coherent response.
Example 3: Night Cityscape (Training Set)
Question: この画像に写っているのは昼間の景色ですか、夜の景色ですか? ("Is this a daytime or a nighttime scene?")
| Model | Response |
|---|---|
| Base Model | この画像は夜の景色です。 理由:天空は暗く、星や月が見えません...中央に見える高層塔(東京タワー)も夜にライトアップされているのが確認できます。 ISSUE: Misidentifies Sky Tree as Tokyo Tower; over-explains simple question |
| Fine-tuned | この画像は夜の景色です。 |
| Ground Truth | この画像に写っているのは夜の景色です。 |
Result: ACCURATE & CONCISE - Correctly answers without misidentification or unnecessary explanation.
For detailed comparison with all 6 examples (3 training + 3 validation), see model_comparison_report.md.
Quantitative Improvements Summary
| Metric | Base Model | Fine-tuned Model |
|---|---|---|
| Location Hallucination | Frequent (50%+ samples) | Eliminated (0%) |
| Generation Errors | Infinite loops in edge cases | Clean stops (100%) |
| Response Style | Verbose, encyclopedia-like | Concise, objective |
| Misidentification | Common (e.g., Tokyo Tower) | Rare |
| Output Consistency | Variable length/format | Consistent format |
Training Reproduction
Single-Node Training
llamafactory-cli train qwen3vl_8b_japanese_4bit.yaml
Multi-Node Distributed Training (Used in this model)
This model was trained using 2 nodes with the provided scripts.
On Master Node (192.168.11.8):
bash train_qwen3vl_8b_4bit_master.sh
On Worker Node (192.168.11.10):
bash train_qwen3vl_8b_4bit_worker.sh
The scripts are included in this repository with the exact configuration used for training.
Key environment variables for multi-node setup:
export FORCE_TORCHRUN=1
export NNODES=2
export NODE_RANK=0 # 0 for master, 1 for worker
export MASTER_ADDR=192.168.11.8
export MASTER_PORT=29500
export CUDA_VISIBLE_DEVICES=0
Model Files
Model Weights & Config
- adapter_config.json - LoRA adapter configuration
- adapter_model.safetensors - LoRA adapter weights (~84MB)
- training_args.bin - Training arguments
Training Results
- training_loss.png - Training loss curve
- training_eval_loss.png - Validation loss curve
- trainer_log.jsonl - Detailed training logs
- all_results.json - Final training metrics
- train_results.json - Training statistics
- eval_results.json - Evaluation statistics
Training Configuration
- qwen3vl_8b_japanese_4bit.yaml - LLaMA-Factory training configuration
- train_qwen3vl_8b_4bit_master.sh - Multi-node training script (master node)
- train_qwen3vl_8b_4bit_worker.sh - Multi-node training script (worker node)
Documentation & Examples
- model_comparison_report.md - Detailed comparison report with 6 examples
- inference_example.py - Simple inference example using LLaMA-Factory
- sample_images/ - Sample images used in the comparison report (6 images)
Limitations
- Language: Primarily trained on Japanese; performance on other languages not guaranteed
- Domain: Specialized for photo description; may not generalize to other vision tasks (e.g., OCR, diagram analysis)
- Quantization: Designed for 4-bit quantization; full precision inference not tested
- Output Style: Trained to produce concise descriptions; may not provide detailed analysis when needed
- Context: inputs were truncated to a 2,048-token cutoff during training
Intended Use Cases
Recommended:
- Japanese photo description and captioning
- Visual question answering in Japanese
- Content moderation (image understanding)
- Accessibility applications (image-to-text for visually impaired)
- Dataset annotation assistance
Not Recommended:
- Medical image diagnosis
- Fine-grained object detection (use specialized models)
- OCR tasks (use dedicated OCR models)
- Video understanding (trained on static images only)
Citation
@misc{qwen3vl-8b-qlora-japanese-photo,
author = {WayBob},
title = {Qwen3VL-8B QLoRA 4-bit Japanese Photo Conversation},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation}
}
Acknowledgements
Base Model:
- Qwen3-VL-8B-Instruct - Alibaba Cloud Qwen Team
- Licensed under Apache 2.0
Training Datasets:
- llm-jp/japanese-photos-conversation - Original Japanese photo conversation dataset
- ThePioneer/japanese-photos - Japanese photo collection
- WayBob/Japanese_Photo_conversation_cleaned - Cleaned and organized version
Training Framework:
- LLaMA-Factory by hiyouga
Method:
- QLoRA: Efficient Finetuning of Quantized LLMs by Dettmers et al.
Infrastructure:
- Local multi-node setup with 2× NVIDIA RTX 4090 24GB GPUs
License
This model is licensed under Creative Commons Attribution 2.0 (CC-BY-2.0).
Key License Terms
- Share: You can copy and redistribute the material in any medium or format
- Adapt: You can remix, transform, and build upon the material for any purpose, even commercially
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made
- No Additional Restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits
Full License: See CC-BY-2.0 License for complete terms.
Contact
- HuggingFace: WayBob
- Model Repository: Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation
- Dataset Repository: Japanese_Photo_conversation_cleaned