Qwen3VL-8B QLora 4-bit - Japanese Photo Conversation

🎨 Vision-Language Model | 📸 Photo Description | 🇯🇵 Japanese Specialized


Built with Qwen3-VL | Fine-tuning: 4-bit QLoRA (84MB adapter) | Framework: LLaMA-Factory | Hardware: 2×RTX 4090 24GB (Multi-node)


A Japanese-specialized vision-language model fine-tuned from Qwen/Qwen3-VL-8B-Instruct using 4-bit QLoRA on multi-node distributed infrastructure.

Model Description

This model adapts the general-purpose Qwen3-VL-8B into a specialized Japanese photo-description system. Through 4-bit QLoRA fine-tuning on 11,808 Japanese photo-conversation pairs, it learns to produce concise, objective descriptions while significantly reducing the hallucinations common in the base model.

Key Improvements Over Base Model

  • Eliminated location hallucinations - Base model frequently guessed specific place names incorrectly
  • Fixed infinite generation loops - Base model got stuck repeating text in some cases
  • Concise objective descriptions - Matches human annotation style instead of encyclopedia-like responses
  • No more OCR over-reliance - Focuses on visual understanding rather than text reading
  • Consistent output format - Predictable response length and structure


Quick Start

What is This Model?

This is a LoRA adapter (not a full model). You need to:

  1. Load the base model: Qwen/Qwen3-VL-8B-Instruct
  2. Apply this LoRA adapter on top of it

Advantage: Only ~84MB download instead of ~8.7GB full model!

Installation

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

Usage

from llamafactory.chat import ChatModel

# Initialize model with LoRA adapter
chat_model = ChatModel(args={
    "model_name_or_path": "Qwen/Qwen3-VL-8B-Instruct",
    "adapter_name_or_path": "WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation",
    "template": "qwen3_vl_nothink",
    "quantization_bit": 4,
    "trust_remote_code": True,
    "flash_attn": "fa2",  # Optional: enable flash attention for faster inference
    "infer_backend": "huggingface",
})

# Ask questions about images
messages = [{"role": "user", "content": "<image>\nこの画像には何が写っていますか?"}]  # "What is shown in this image?"
responses = chat_model.chat(messages=messages, images=["your_image.jpg"])
print(responses[0].response_text)

See inference_example.py for a complete working example with multiple question types.
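
If you would rather not go through LLaMA-Factory, the same two steps (load the 4-bit base model, then apply the adapter) can be sketched with transformers and peft. This is an untested sketch; it assumes a transformers version with Qwen3-VL support and that peft and bitsandbytes are installed:

# Minimal sketch: load the 4-bit base model and apply the LoRA adapter with peft.
# Assumes a recent transformers release with Qwen3-VL support, plus peft and bitsandbytes.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # match the 4-bit setting used in training
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply the ~84MB adapter on top of the quantized base model
model = PeftModel.from_pretrained(
    base_model,
    "WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
# From here, inference follows the standard Qwen3-VL chat-template flow
# (processor.apply_chat_template + model.generate).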

Hardware Requirements

| Configuration | VRAM Required |
|---|---|
| 4-bit Quantization (as used in training) | ~10-12GB |

Recommended GPU: RTX 3090 / 4090 / A100 or equivalent with 12GB+ VRAM

Training Details

Base Model

  • Source: Qwen/Qwen3-VL-8B-Instruct
  • Parameters: 8.7 billion
  • Architecture: Qwen3-VL (Vision-Language)
  • Context Length: 262,144 tokens
  • Vision Encoder: ViT-based with spatial merge

Training Data

Primary Dataset: WayBob/Japanese_Photo_conversation_cleaned

This dataset was cleaned and organized from an existing upstream Japanese photo-conversation dataset; we thank the original authors for their excellent work.

| Split | Samples | Percentage |
|---|---|---|
| Training | 11,217 | 95% |
| Validation | 591 | 5% |
| Total | 11,808 | 100% |
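
To reproduce the split locally, a rough sketch using the datasets library (this assumes the dataset is published as a single training split; the seed below is arbitrary, not the one used for training):

# Rough sketch of recreating a 95/5 train/validation split
# (assumes the Hub dataset exposes a single "train" split; the seed is arbitrary).
from datasets import load_dataset

ds = load_dataset("WayBob/Japanese_Photo_conversation_cleaned", split="train")
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # roughly 11,217 / 591 out of 11,808 total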

Training Configuration

Hardware & Infrastructure:

  • GPUs: 2× NVIDIA RTX 4090 24GB
  • Distributed Setup: Multi-node DDP (Distributed Data Parallel)
    • Master Node: 192.168.11.8
    • Worker Node: 192.168.11.10
    • 1 GPU per node
  • Framework: LLaMA-Factory

Hyperparameters:

method: qlora_4bit
quantization: 4-bit (BitsAndBytes)
lora_rank: 8
lora_alpha: 16
lora_target: all (all linear layers)
lora_dropout: 0.05
learning_rate: 1.0e-4
batch_size: 4 per device × 2 gradient accumulation × 2 GPUs = 16 effective
epochs: 3
optimizer: AdamW
lr_scheduler: cosine
warmup_ratio: 0.1
precision: bfloat16
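
For reference, the LoRA portion of this configuration expressed with the peft library would look roughly as follows. This is an illustrative sketch only; the actual run was driven by LLaMA-Factory's qwen3vl_8b_japanese_4bit.yaml, not by this code:

# Illustrative peft/bitsandbytes equivalent of the hyperparameters above
# (the real run used LLaMA-Factory's YAML config, not this snippet).
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit (BitsAndBytes) quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 training precision
)

lora_config = LoraConfig(
    r=8,                           # lora_rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",   # "all": apply LoRA to every linear layer
    task_type="CAUSAL_LM",
)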

Training Duration: ~2.85 hours (10,253 seconds)

Memory Usage: ~10-12GB VRAM per GPU

Training Results

| Metric | Value |
|---|---|
| Final Training Loss | 0.8665 |
| Final Validation Loss | 0.8579 |
| Training Throughput | 3.28 samples/second |
| Total Training Steps | 2,106 |
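
As a quick sanity check, the step count and throughput are consistent with the effective global batch size of 16 (4 per device × 2 accumulation × 2 GPUs) noted above:

# Consistency check for the reported training figures
# (assumes the effective global batch size of 16 noted above).
import math

samples, epochs, global_batch = 11_217, 3, 16
steps_per_epoch = math.ceil(samples / global_batch)
print(steps_per_epoch * epochs)             # 2106 -> matches "Total Training Steps"
print(round(samples * epochs / 10_253, 2))  # ~3.28 samples/second over 10,253 seconds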

Training Loss Curve: see training_loss.png

Validation Loss Curve: see training_eval_loss.png

Performance Evaluation

We conducted comprehensive comparisons between the base model and fine-tuned model on both training and validation sets.

Example 1: Coastal Scene (Training Set)

Question: この画像はどのような場所を映していますか? ("What kind of place does this image show?")

  • Base Model: この画像は日本、愛知県知多郡美浜町の「知多半岛」にある「知多海岸」(特に「知多海岸の浜」や「知多海岸の遊歩道」付近)を撮影したものです...
    ISSUE: Hallucinates specific location name with confidence
  • Fine-tuned: この画像は海岸沿いの風景を映しています。海、砂浜、そしてその背後に山々が見えます。
  • Ground Truth: この画像は、海岸の風景を映しています。波打ち際には砂利や石があり、奥には山と空が広がっています。右側には舗装された道が見え、その横には石が多く並んでいます。

Result: HALLUCINATION ELIMINATED - Model provides objective description without guessing specific locations.


Example 2: Indoor Building (Training Set)

Question: この画像はどのような場所で撮影されたものですか? ("What kind of place was this photo taken in?")

  • Base Model: この画像は東京都港区の虎ノ門・新宿エリアにある外務省・旧外務省庁舎...または「虎ノ門ヒルズ」の地下通路...または「虎ノ門ヒルズ」の地下通路...[repeats "虎ノ門ヒルズ" 30+ times]
    ISSUE: Infinite repetition loop
  • Fine-tuned: この画像は、商業施設やオフィスビルのような公共の室内で撮影されたものです。エスカレーターがあり、案内表示が見られます。
  • Ground Truth: この画像は大きなビルの内部で、エスカレーターや案内標識が見られる場所で撮影されたものです。公共の施設やオフィスビルのように見えます。

Result: CRITICAL BUG FIXED - Eliminates infinite loop, generates coherent response.


Example 3: Night Cityscape (Training Set)

Question: この画像に写っているのは昼間の景色ですか、夜の景色ですか? ("Is this image a daytime scene or a night scene?")

  • Base Model: この画像は夜の景色です。理由:天空は暗く、星や月が見えません...中央に見える高層塔(東京タワー)も夜にライトアップされているのが確認できます。
    ISSUE: Misidentifies the Tokyo Skytree as Tokyo Tower; over-explains a simple question
  • Fine-tuned: この画像は夜の景色です。
  • Ground Truth: この画像に写っているのは夜の景色です。

Result: ACCURATE & CONCISE - Correctly answers without misidentification or unnecessary explanation.


For detailed comparison with all 6 examples (3 training + 3 validation), see model_comparison_report.md.

Quantitative Improvements Summary

| Metric | Base Model | Fine-tuned Model |
|---|---|---|
| Location Hallucination | Frequent (50%+ of samples) | Eliminated (0%) |
| Generation Errors | Infinite loops in edge cases | Clean stops (100%) |
| Response Style | Verbose, encyclopedia-like | Concise, objective |
| Misidentification | Common (e.g., Tokyo Tower) | Rare |
| Output Consistency | Variable length/format | Consistent format |

Training Reproduction

Single-Node Training

llamafactory-cli train qwen3vl_8b_japanese_4bit.yaml

Multi-Node Distributed Training (Used in this model)

This model was trained using 2 nodes with the provided scripts.

On Master Node (192.168.11.8):

bash train_qwen3vl_8b_4bit_master.sh

On Worker Node (192.168.11.10):

bash train_qwen3vl_8b_4bit_worker.sh

The scripts are included in this repository with the exact configuration used for training.

Key environment variables for multi-node setup:

export FORCE_TORCHRUN=1
export NNODES=2
export NODE_RANK=0  # 0 for master, 1 for worker
export MASTER_ADDR=192.168.11.8
export MASTER_PORT=29500
export CUDA_VISIBLE_DEVICES=0

Model Files

Model Weights & Config

  • adapter_config.json - LoRA adapter configuration
  • adapter_model.safetensors - LoRA adapter weights (~84MB)
  • training_args.bin - Training arguments

Training Results

  • training_loss.png - Training loss curve
  • training_eval_loss.png - Validation loss curve
  • trainer_log.jsonl - Detailed training logs
  • all_results.json - Final training metrics
  • train_results.json - Training statistics
  • eval_results.json - Evaluation statistics

Training Configuration

  • qwen3vl_8b_japanese_4bit.yaml - LLaMA-Factory training configuration
  • train_qwen3vl_8b_4bit_master.sh - Multi-node training script (master node)
  • train_qwen3vl_8b_4bit_worker.sh - Multi-node training script (worker node)

Documentation & Examples

  • model_comparison_report.md - Detailed comparison report with 6 examples
  • inference_example.py - Simple inference example using LLaMA-Factory
  • sample_images/ - Sample images used in comparison report (6 images)

Limitations

  • Language: Primarily trained on Japanese; performance on other languages not guaranteed
  • Domain: Specialized for photo description; may not generalize to other vision tasks (e.g., OCR, diagram analysis)
  • Quantization: Designed for 4-bit quantization; full precision inference not tested
  • Output Style: Trained to produce concise descriptions; may not provide detailed analysis when needed
  • Context: 2048 token cutoff during training

Intended Use Cases

Recommended:

  • Japanese photo description and captioning
  • Visual question answering in Japanese
  • Content moderation (image understanding)
  • Accessibility applications (image-to-text for visually impaired)
  • Dataset annotation assistance

Not Recommended:

  • Medical image diagnosis
  • Fine-grained object detection (use specialized models)
  • OCR tasks (use dedicated OCR models)
  • Video understanding (trained on static images only)

Citation

@misc{qwen3vl-8b-qlora-japanese-photo,
  author = {WayBob},
  title = {Qwen3VL-8B QLora 4-bit Japanese Photo Conversation},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation}
}

Acknowledgements

Base Model:

  • Qwen/Qwen3-VL-8B-Instruct (Qwen team)

Training Datasets:

  • WayBob/Japanese_Photo_conversation_cleaned

Training Framework:

  • LLaMA-Factory

Method:

  • QLoRA (4-bit quantized LoRA fine-tuning)

Infrastructure:

  • Local multi-node setup with 2× NVIDIA RTX 4090 24GB GPUs

License

This model is licensed under Creative Commons Attribution 2.0 (CC-BY-2.0).

Key License Terms

  • Share: You can copy and redistribute the material in any medium or format
  • Adapt: You can remix, transform, and build upon the material for any purpose, even commercially
  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made
  • No Additional Restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits

Full License: See CC-BY-2.0 License for complete terms.

Contact
