# Phi-4-Multimodal-DPO
A fine-tuned version of `microsoft/Phi-4-multimodal-instruct`, trained with Direct Preference Optimization (DPO) to improve code-switching speech recognition.
## Evaluation Results

Metric: MER (Mixed Error Rate); lower is better.

| Benchmark | Baseline MER | This Model | Relative Improvement |
|---|---|---|---|
| SEAME (Code-Switching) | 0.5900 | 0.4992 | +15.4% |
| EMILIA | 0.7098 | 0.0740 | +89.6% |
| CS DIALOGUE | 0.4961 | 0.1070 | +78.4% |
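The improvement column reports the relative reduction in MER, i.e. the fraction of the baseline error that was removed. For the SEAME row:

```python
# SEAME figures from the table above
baseline_mer, trained_mer = 0.5900, 0.4992

# Relative improvement: fraction of baseline error eliminated, in percent
relative_improvement = (baseline_mer - trained_mer) / baseline_mer * 100
print(f"{relative_improvement:.1f}%")  # 15.4%
```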
## Benchmark Descriptions
- SEAME: English-Mandarin code-switching conversational speech (out-of-distribution test set)
- EMILIA: Synthetic code-switching evaluation set
- CS DIALOGUE: In-distribution evaluation set
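MER extends WER to mixed-language speech: a common convention (assumed here; the exact tokenization used by each benchmark's scoring script may differ) is to score Mandarin at the character level and English at the word level, then compute edit distance over the mixed token sequence. A minimal sketch:

```python
def mixed_tokens(text):
    """Split into Mandarin characters and English words (assumed convention)."""
    tokens, word = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)  # each Mandarin character is its own token
        elif ch.isspace():
            if word:
                tokens.append(word)
                word = ""
        else:
            word += ch  # accumulate an English word
    if word:
        tokens.append(word)
    return tokens

def edit_distance(a, b):
    """Levenshtein distance with a rolling 1-D DP row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def mer(reference, hypothesis):
    """Mixed Error Rate: edit distance over mixed tokens / reference length."""
    ref = mixed_tokens(reference)
    return edit_distance(ref, mixed_tokens(hypothesis)) / len(ref)

# One substitution out of four reference tokens -> MER = 0.25
print(mer("我 like 吃 apples", "我 like 吃 apple"))
```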
## Training Configuration
| Parameter | Value |
|---|---|
| Base Model | microsoft/Phi-4-multimodal-instruct |
| Training Method | DPO (Direct Preference Optimization) |
| Learning Rate | 5e-6 |
| DPO Beta | 0.05 |
| Epochs | 1 |
| Batch Size (per GPU) | 1 |
| Gradient Accumulation | 4 |
| Effective Batch Size | 256 |
| Optimizer | AdamW |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Max Length | 2048 |
| DeepSpeed | ZeRO-2 |
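For reference, the DPO objective parameterized by the β = 0.05 above can be sketched on scalar sequence log-probabilities (a minimal illustration of the loss only; the actual training used a full distributed trainer over paired transcriptions):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.05):
    """DPO loss for one preference pair of summed token log-probs."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)): approaches 0 as the policy's preference for
    # the chosen response grows, log(2) when the margin is zero.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A larger β penalizes divergence from the reference model more aggressively; the small β = 0.05 used here lets the policy move further from the base model's preferences.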
## Usage

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the fine-tuned model and processor
model = AutoModelForCausalLM.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Phi-4-multimodal-DPO",
    trust_remote_code=True,
)

# Load audio
audio, sr = sf.read("your_audio.wav")

# Build the Phi-4-multimodal chat prompt with an audio placeholder
prompt = "<|user|><|audio_1|>Please transcribe this speech.<|end|><|assistant|>"

# Process inputs and generate
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
transcription = processor.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(transcription)
```
## Files

```
├── README.md
├── config.json
├── model-*.safetensors   # Model weights (~11.5 GB total)
├── tokenizer files
└── eval_results/
    ├── baseline_seame.json
    ├── baseline_emilia.json
    ├── baseline_csdialogue.json
    ├── trained_seame.json
    ├── trained_emilia.json
    └── trained_csdialogue.json
```
## License

This model inherits the license of the base model, `microsoft/Phi-4-multimodal-instruct`.