Phi-4-Multimodal-DPO

A fine-tuned version of microsoft/Phi-4-multimodal-instruct, trained with Direct Preference Optimization (DPO) to improve Mandarin-English code-switching speech recognition.

Evaluation Results (MER - Mixed Error Rate, lower is better)

Benchmark                Baseline MER   This Model MER   Relative Improvement
SEAME (Code-Switching)   0.5900         0.4992           15.4%
EMILIA                   0.7098         0.0740           89.6%
CS DIALOGUE              0.4961         0.1070           78.4%

Benchmark Descriptions

  • SEAME: English-Mandarin code-switching conversational speech (out-of-distribution test set)
  • EMILIA: Synthetic code-switching evaluation set
  • CS DIALOGUE: In-distribution evaluation set
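
MER scores Mandarin at the character level and English at the word level, then normalizes the Levenshtein distance between the mixed token sequences by the reference length. Below is a minimal sketch of that computation; the tokenization rule is a common convention for Mandarin-English code-switching ASR, not necessarily the exact scoring script behind the numbers above.

import re

def mixed_tokens(text):
    # Mandarin: one token per CJK character; English: whitespace-delimited words
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text.lower())

def edit_distance(ref, hyp):
    # Single-row Levenshtein DP over token sequences
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def mer(ref, hyp):
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / max(len(r), 1)

# One substitution (coffee -> tea) over 6 mixed tokens: MER = 1/6
print(f"MER = {mer('我 想 order 一杯 coffee', '我想 order 一杯 tea'):.4f}")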

Training Configuration

Parameter               Value
Base Model              microsoft/Phi-4-multimodal-instruct
Training Method         DPO (Direct Preference Optimization)
Learning Rate           5e-6
DPO Beta                0.05
Epochs                  1
Batch Size (per GPU)    1
Gradient Accumulation   4
Effective Batch Size    256
Optimizer               AdamW
LR Scheduler            Cosine
Warmup Ratio            0.1
Max Length              2048
DeepSpeed               ZeRO-2
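
With a per-GPU batch size of 1 and 4 gradient-accumulation steps, the effective batch size of 256 implies 64 GPUs. For reference, DPO fine-tunes the policy directly on preference pairs (here, presumably preferred vs. dispreferred transcriptions) against a frozen copy of the base model, with beta = 0.05 scaling the implicit reward. A minimal sketch of the objective follows; the log-probability tensors are placeholders, not the actual training code for this checkpoint.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.05):
    # Implicit rewards: beta-scaled log-ratio of policy to frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with per-sequence summed log-probs
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss)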

Usage

from transformers import AutoModelForCausalLM, AutoProcessor
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True
)

# Load audio
audio, sr = sf.read("your_audio.wav")

# Build the prompt with Phi-4's chat-format special tokens;
# <|audio_1|> marks where the audio clip is inserted
prompt = "<|user|><|audio_1|>Please transcribe this speech.<|end|><|assistant|>"

# Process and generate
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (everything after the prompt)
transcription = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(transcription)
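
If your audio is not already 16 kHz mono (the rate speech encoders in this family typically expect), resampling at load time is a safe default; a minimal sketch assuming librosa is installed:

import librosa

# Load and resample to 16 kHz mono in one step
audio, sr = librosa.load("your_audio.wav", sr=16000, mono=True)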

Files

├── README.md
├── config.json
├── model-*.safetensors       # Model weights (~11.5 GB total)
├── tokenizer files
└── eval_results/
    ├── baseline_seame.json
    ├── baseline_emilia.json
    ├── baseline_csdialogue.json
    ├── trained_seame.json
    ├── trained_emilia.json
    └── trained_csdialogue.json
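
The baseline and trained results can be compared programmatically. A minimal sketch, assuming each JSON file exposes a top-level "mer" field (the actual schema is not documented here):

import json
from pathlib import Path

for name in ["seame", "emilia", "csdialogue"]:
    base = json.loads(Path(f"eval_results/baseline_{name}.json").read_text())
    trained = json.loads(Path(f"eval_results/trained_{name}.json").read_text())
    # "mer" is an assumed key; adjust to the actual schema of the JSON files
    b, t = base["mer"], trained["mer"]
    print(f"{name}: {b:.4f} -> {t:.4f} ({(b - t) / b:.1%} relative reduction)")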

License

This model inherits the license of its base model, microsoft/Phi-4-multimodal-instruct.
