Qwen3VL-8B QLora 4-bit - Japanese Photo Conversation

🎨 Vision-Language Model | 📸 Photo Description | 🇯🇵 Japanese Specialized


Built with Qwen3-VL | Fine-tuning: 4-bit QLoRA (84MB adapter) | Framework: LLaMA-Factory | Hardware: 2×RTX 4090 24GB (Multi-node)


A Japanese-specialized vision-language model fine-tuned from Qwen/Qwen3-VL-8B-Instruct using 4-bit QLoRA on multi-node distributed infrastructure.

Model Description

This model adapts the general-purpose Qwen3-VL-8B into a specialized Japanese photo-description system. Through 4-bit QLoRA fine-tuning on 11,808 Japanese photo-conversation pairs, it learns to produce concise, objective descriptions while significantly reducing the hallucinations common in the base model.

Key Improvements Over Base Model

  • Eliminated location hallucinations - Base model frequently guessed specific place names incorrectly
  • Fixed infinite generation loops - Base model got stuck repeating text in some cases
  • Concise objective descriptions - Matches human annotation style instead of encyclopedia-like responses
  • No more OCR over-reliance - Focuses on visual understanding rather than text reading
  • Consistent output format - Predictable response length and structure


Quick Start

What is This Model?

This is a LoRA adapter (not a full model). You need to:

  1. Load the base model: Qwen/Qwen3-VL-8B-Instruct
  2. Apply this LoRA adapter on top of it

Advantage: Only ~84MB download instead of ~8.7GB full model!

Installation

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .

Usage

from llamafactory.chat import ChatModel

# Initialize model with LoRA adapter
chat_model = ChatModel(args={
    "model_name_or_path": "Qwen/Qwen3-VL-8B-Instruct",
    "adapter_name_or_path": "WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation",
    "template": "qwen3_vl_nothink",
    "quantization_bit": 4,
    "trust_remote_code": True,
    "flash_attn": "fa2",  # Optional: enable flash attention for faster inference
    "infer_backend": "huggingface",
})

# Ask questions about images
messages = [{"role": "user", "content": "<image>\nこの画像には何が写っていますか?"}]  # "What is shown in this image?"
responses = chat_model.chat(messages=messages, images=["your_image.jpg"])
print(responses[0].response_text)

See inference_example.py for a complete working example with multiple question types.
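
If you would rather not go through LLaMA-Factory, the same two steps (load the 4-bit base model, then apply the adapter) can be sketched with transformers and peft. This is an untested sketch; it assumes a transformers version with Qwen3-VL support and that peft and bitsandbytes are installed:

# Minimal sketch: load the 4-bit base model and apply the LoRA adapter with peft.
# Assumes a recent transformers release with Qwen3-VL support, plus peft and bitsandbytes.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # match the 4-bit setting used in training
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Apply the ~84MB adapter on top of the quantized base model
model = PeftModel.from_pretrained(
    base_model,
    "WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")
# From here, inference follows the standard Qwen3-VL chat-template flow
# (processor.apply_chat_template + model.generate).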

Hardware Requirements

| Configuration | VRAM Required |
|---|---|
| 4-bit Quantization (as used in training) | ~10-12GB |

Recommended GPU: RTX 3090 / 4090 / A100 or equivalent with 12GB+ VRAM

Training Details

Base Model

  • Source: Qwen/Qwen3-VL-8B-Instruct
  • Parameters: 8.7 billion
  • Architecture: Qwen3-VL (Vision-Language)
  • Context Length: 262,144 tokens
  • Vision Encoder: ViT-based with spatial merge

Training Data

Primary Dataset: WayBob/Japanese_Photo_conversation_cleaned

This dataset was cleaned and organized from an existing upstream Japanese photo-conversation dataset; we thank the original authors for their excellent work.

| Split | Samples | Percentage |
|---|---|---|
| Training | 11,217 | 95% |
| Validation | 591 | 5% |
| Total | 11,808 | 100% |
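
To reproduce the split locally, a rough sketch using the datasets library (this assumes the dataset is published as a single training split; the seed below is arbitrary, not the one used for training):

# Rough sketch of recreating a 95/5 train/validation split
# (assumes the Hub dataset exposes a single "train" split; the seed is arbitrary).
from datasets import load_dataset

ds = load_dataset("WayBob/Japanese_Photo_conversation_cleaned", split="train")
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # roughly 11,217 / 591 out of 11,808 total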

Training Configuration

Hardware & Infrastructure:

  • GPUs: 2× NVIDIA RTX 4090 24GB
  • Distributed Setup: Multi-node DDP (Distributed Data Parallel)
    • Master Node: 192.168.11.8
    • Worker Node: 192.168.11.10
    • 1 GPU per node
  • Framework: LLaMA-Factory

Hyperparameters:

method: qlora_4bit
quantization: 4-bit (BitsAndBytes)
lora_rank: 8
lora_alpha: 16
lora_target: all (all linear layers)
lora_dropout: 0.05
learning_rate: 1.0e-4
batch_size: 4 per device × 2 gradient accumulation × 2 GPUs = 16 effective
epochs: 3
optimizer: AdamW
lr_scheduler: cosine
warmup_ratio: 0.1
precision: bfloat16
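
For reference, the LoRA portion of this configuration expressed with the peft library would look roughly as follows. This is an illustrative sketch only; the actual run was driven by LLaMA-Factory's qwen3vl_8b_japanese_4bit.yaml, not by this code:

# Illustrative peft/bitsandbytes equivalent of the hyperparameters above
# (the real run used LLaMA-Factory's YAML config, not this snippet).
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit (BitsAndBytes) quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 training precision
)

lora_config = LoraConfig(
    r=8,                           # lora_rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",   # "all": apply LoRA to every linear layer
    task_type="CAUSAL_LM",
)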

Training Duration: ~2.85 hours (10,253 seconds)

Memory Usage: ~10-12GB VRAM per GPU

Training Results

| Metric | Value |
|---|---|
| Final Training Loss | 0.8665 |
| Final Validation Loss | 0.8579 |
| Training Throughput | 3.28 samples/second |
| Total Training Steps | 2,106 |
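
As a quick sanity check, the step count and throughput are consistent with the effective global batch size of 16 (4 per device × 2 accumulation × 2 GPUs) noted above:

# Consistency check for the reported training figures
# (assumes the effective global batch size of 16 noted above).
import math

samples, epochs, global_batch = 11_217, 3, 16
steps_per_epoch = math.ceil(samples / global_batch)
print(steps_per_epoch * epochs)             # 2106 -> matches "Total Training Steps"
print(round(samples * epochs / 10_253, 2))  # ~3.28 samples/second over 10,253 seconds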

Training Loss Curve: see training_loss.png

Validation Loss Curve: see training_eval_loss.png

Performance Evaluation

We conducted comprehensive comparisons between the base model and fine-tuned model on both training and validation sets.

Example 1: Coastal Scene (Training Set)

Question: この画像はどのような場所を映していますか? ("What kind of place does this image show?")

  • Base Model: この画像は日本、愛知県知多郡美浜町の「知多半岛」にある「知多海岸」(特に「知多海岸の浜」や「知多海岸の遊歩道」付近)を撮影したものです...
    ISSUE: Hallucinates specific location name with confidence
  • Fine-tuned: この画像は海岸沿いの風景を映しています。海、砂浜、そしてその背後に山々が見えます。
  • Ground Truth: この画像は、海岸の風景を映しています。波打ち際には砂利や石があり、奥には山と空が広がっています。右側には舗装された道が見え、その横には石が多く並んでいます。

Result: HALLUCINATION ELIMINATED - Model provides objective description without guessing specific locations.


Example 2: Indoor Building (Training Set)

Question: この画像はどのような場所で撮影されたものですか? ("What kind of place was this photo taken in?")

  • Base Model: この画像は東京都港区の虎ノ門・新宿エリアにある外務省・旧外務省庁舎...または「虎ノ門ヒルズ」の地下通路...または「虎ノ門ヒルズ」の地下通路...[repeats "虎ノ門ヒルズ" 30+ times]
    ISSUE: Infinite repetition loop
  • Fine-tuned: この画像は、商業施設やオフィスビルのような公共の室内で撮影されたものです。エスカレーターがあり、案内表示が見られます。
  • Ground Truth: この画像は大きなビルの内部で、エスカレーターや案内標識が見られる場所で撮影されたものです。公共の施設やオフィスビルのように見えます。

Result: CRITICAL BUG FIXED - Eliminates infinite loop, generates coherent response.


Example 3: Night Cityscape (Training Set)

Question: この画像に写っているのは昼間の景色ですか、夜の景色ですか? ("Is this image a daytime scene or a night scene?")

  • Base Model: この画像は夜の景色です。理由:天空は暗く、星や月が見えません...中央に見える高層塔(東京タワー)も夜にライトアップされているのが確認できます。
    ISSUE: Misidentifies the Tokyo Skytree as Tokyo Tower; over-explains a simple question
  • Fine-tuned: この画像は夜の景色です。
  • Ground Truth: この画像に写っているのは夜の景色です。

Result: ACCURATE & CONCISE - Correctly answers without misidentification or unnecessary explanation.


For detailed comparison with all 6 examples (3 training + 3 validation), see model_comparison_report.md.

Quantitative Improvements Summary

| Metric | Base Model | Fine-tuned Model |
|---|---|---|
| Location Hallucination | Frequent (50%+ of samples) | Eliminated (0%) |
| Generation Errors | Infinite loops in edge cases | Clean stops (100%) |
| Response Style | Verbose, encyclopedia-like | Concise, objective |
| Misidentification | Common (e.g., Tokyo Tower) | Rare |
| Output Consistency | Variable length/format | Consistent format |

Training Reproduction

Single-Node Training

llamafactory-cli train qwen3vl_8b_japanese_4bit.yaml

Multi-Node Distributed Training (Used in this model)

This model was trained using 2 nodes with the provided scripts.

On Master Node (192.168.11.8):

bash train_qwen3vl_8b_4bit_master.sh

On Worker Node (192.168.11.10):

bash train_qwen3vl_8b_4bit_worker.sh

The scripts are included in this repository with the exact configuration used for training.

Key environment variables for multi-node setup:

export FORCE_TORCHRUN=1
export NNODES=2
export NODE_RANK=0  # 0 for master, 1 for worker
export MASTER_ADDR=192.168.11.8
export MASTER_PORT=29500
export CUDA_VISIBLE_DEVICES=0

Model Files

Model Weights & Config

  • adapter_config.json - LoRA adapter configuration
  • adapter_model.safetensors - LoRA adapter weights (~84MB)
  • training_args.bin - Training arguments

Training Results

  • training_loss.png - Training loss curve
  • training_eval_loss.png - Validation loss curve
  • trainer_log.jsonl - Detailed training logs
  • all_results.json - Final training metrics
  • train_results.json - Training statistics
  • eval_results.json - Evaluation statistics

Training Configuration

  • qwen3vl_8b_japanese_4bit.yaml - LLaMA-Factory training configuration
  • train_qwen3vl_8b_4bit_master.sh - Multi-node training script (master node)
  • train_qwen3vl_8b_4bit_worker.sh - Multi-node training script (worker node)

Documentation & Examples

  • model_comparison_report.md - Detailed comparison report with 6 examples
  • inference_example.py - Simple inference example using LLaMA-Factory
  • sample_images/ - Sample images used in comparison report (6 images)

Limitations

  • Language: Primarily trained on Japanese; performance on other languages not guaranteed
  • Domain: Specialized for photo description; may not generalize to other vision tasks (e.g., OCR, diagram analysis)
  • Quantization: Designed for 4-bit quantization; full precision inference not tested
  • Output Style: Trained to produce concise descriptions; may not provide detailed analysis when needed
  • Context: 2048 token cutoff during training

Intended Use Cases

Recommended:

  • Japanese photo description and captioning
  • Visual question answering in Japanese
  • Content moderation (image understanding)
  • Accessibility applications (image-to-text for visually impaired)
  • Dataset annotation assistance

Not Recommended:

  • Medical image diagnosis
  • Fine-grained object detection (use specialized models)
  • OCR tasks (use dedicated OCR models)
  • Video understanding (trained on static images only)

Citation

@misc{qwen3vl-8b-qlora-japanese-photo,
  author = {WayBob},
  title = {Qwen3VL-8B QLora 4-bit Japanese Photo Conversation},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/WayBob/Qwen3vl-8b-qlora-4bit-Japanese-photo-conversation}
}

Acknowledgements

Base Model:

  • Qwen/Qwen3-VL-8B-Instruct (Qwen team)

Training Datasets:

  • WayBob/Japanese_Photo_conversation_cleaned

Training Framework:

  • LLaMA-Factory

Method:

  • QLoRA (4-bit quantized LoRA fine-tuning)

Infrastructure:

  • Local multi-node setup with 2× NVIDIA RTX 4090 24GB GPUs

License

This model is licensed under Creative Commons Attribution 2.0 (CC-BY-2.0).

Key License Terms

  • Share: You can copy and redistribute the material in any medium or format
  • Adapt: You can remix, transform, and build upon the material for any purpose, even commercially
  • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made
  • No Additional Restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits

Full License: See CC-BY-2.0 License for complete terms.

Contact
