Phi-4-Multimodal-DPO

A fine-tuned version of microsoft/Phi-4-multimodal-instruct, trained with Direct Preference Optimization (DPO) to improve Mandarin-English code-switching speech recognition.

Evaluation Results (MER - Mixed Error Rate, lower is better)

Benchmark                Baseline MER   This Model MER   Relative Improvement
SEAME (Code-Switching)   0.5900         0.4992           15.4%
EMILIA                   0.7098         0.0740           89.6%
CS DIALOGUE              0.4961         0.1070           78.4%

Benchmark Descriptions

  • SEAME: English-Mandarin code-switching conversational speech (out-of-distribution test set)
  • EMILIA: Synthetic code-switching evaluation set
  • CS DIALOGUE: In-distribution evaluation set
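
MER scores Mandarin at the character level and English at the word level, then normalizes the Levenshtein distance between the mixed token sequences by the reference length. Below is a minimal sketch of that computation; the tokenization rule is a common convention for Mandarin-English code-switching ASR, not necessarily the exact scoring script behind the numbers above.

import re

def mixed_tokens(text):
    # Mandarin: one token per CJK character; English: whitespace-delimited words
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text.lower())

def edit_distance(ref, hyp):
    # Single-row Levenshtein DP over token sequences
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def mer(ref, hyp):
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / max(len(r), 1)

# One substitution (coffee -> tea) over 6 mixed tokens: MER = 1/6
print(f"MER = {mer('我 想 order 一杯 coffee', '我想 order 一杯 tea'):.4f}")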

Training Configuration

Parameter               Value
Base Model              microsoft/Phi-4-multimodal-instruct
Training Method         DPO (Direct Preference Optimization)
Learning Rate           5e-6
DPO Beta                0.05
Epochs                  1
Batch Size (per GPU)    1
Gradient Accumulation   4
Effective Batch Size    256
Optimizer               AdamW
LR Scheduler            Cosine
Warmup Ratio            0.1
Max Length              2048
DeepSpeed               ZeRO-2
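
With a per-GPU batch size of 1 and 4 gradient-accumulation steps, the effective batch size of 256 implies 64 GPUs. For reference, DPO fine-tunes the policy directly on preference pairs (here, presumably preferred vs. dispreferred transcriptions) against a frozen copy of the base model, with beta = 0.05 scaling the implicit reward. A minimal sketch of the objective follows; the log-probability tensors are placeholders, not the actual training code for this checkpoint.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.05):
    # Implicit rewards: beta-scaled log-ratio of policy to frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with per-sequence summed log-probs
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss)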

Usage

from transformers import AutoModelForCausalLM, AutoProcessor
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True
)

# Load audio
audio, sr = sf.read("your_audio.wav")

# Build the prompt with Phi-4's chat-format special tokens;
# <|audio_1|> marks where the audio clip is inserted
prompt = "<|user|><|audio_1|>Please transcribe this speech.<|end|><|assistant|>"

# Process and generate
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (everything after the prompt)
transcription = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(transcription)
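
If your audio is not already 16 kHz mono (the rate speech encoders in this family typically expect), resampling at load time is a safe default; a minimal sketch assuming librosa is installed:

import librosa

# Load and resample to 16 kHz mono in one step
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)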

Files

├── README.md
├── config.json
├── model-*.safetensors       # Model weights (~11.5 GB total)
├── tokenizer files
└── eval_results/
    ├── baseline_seame.json
    ├── baseline_emilia.json
    ├── baseline_csdialogue.json
    ├── trained_seame.json
    ├── trained_emilia.json
    └── trained_csdialogue.json
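
The baseline and trained results can be compared programmatically. A minimal sketch, assuming each JSON file exposes a top-level "mer" field (the actual schema is not documented here):

import json
from pathlib import Path

for name in ["seame", "emilia", "csdialogue"]:
    base = json.loads(Path(f"eval_results/baseline_{name}.json").read_text())
    trained = json.loads(Path(f"eval_results/trained_{name}.json").read_text())
    # "mer" is an assumed key; adjust to the actual schema of the JSON files
    b, t = base["mer"], trained["mer"]
    print(f"{name}: {b:.4f} -> {t:.4f} ({(b - t) / b:.1%} relative reduction)")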

License

This model inherits the license of its base model, microsoft/Phi-4-multimodal-instruct.
