I have a custom dataset and I want to fine-tune a Vision-Language Model (VLM) for open-vocabulary object detection. However, I’m a bit confused about the correct dataset format.
I know that most detection models (e.g., OmDet-Turbo, OWL-ViT) support the COCO-style JSON format, but I’m not sure how to adapt my dataset, especially when it includes extra metadata such as captions or attributes for each bounding box.
Here is a simplified example of my current dataset:
{
  "images": [
    {
      "id": 1,
      "file_name": "file_path/000003.bmp",
      "width": 930,
      "height": 930
    },
    ...
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_name": "Other",
      "bbox": [784.0, 807.0, 90.0, 110.0],
      "area": 9900.0,
      "iscrowd": 0,
      "caption": {
        "confidence": confidence_score,
        "ship-visibility": "visible",
        "ship-purpose": "cargo or transport",
        ...
      },
      "category_id": 26
    },
    ...
  ],
  "categories": [
    { "id": 1, "name": "category_name" },
    ...
  ]
}
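In case it matters, this is roughly how I was planning to strip the extra fields back down to plain COCO before training (just a sketch; the file paths are placeholders):

```python
import json

# Sketch: strip my extra per-box metadata down to a plain COCO annotation file.
# "my_dataset.json" and "coco_clean.json" are placeholder paths.
with open("my_dataset.json") as f:
    data = json.load(f)

clean_annotations = []
for ann in data["annotations"]:
    clean_annotations.append({
        "id": ann["id"],
        "image_id": ann["image_id"],
        "category_id": ann["category_id"],
        "bbox": ann["bbox"],      # [x, y, width, height] in COCO convention
        "area": ann["area"],
        "iscrowd": ann["iscrowd"],
        # "category_name" and the whole "caption" dict are dropped here
    })

clean = {
    "images": data["images"],
    "annotations": clean_annotations,
    "categories": data["categories"],
}

with open("coco_clean.json", "w") as f:
    json.dump(clean, f)
```

But I'm not sure whether throwing away the caption/attribute text is wasting useful supervision for an open-vocabulary model.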
My questions:
- Should I remove the extra caption field, or can VLMs actually use these text attributes during fine-tuning? (I've put a rough sketch of what I mean by using them as text queries below the questions.)
- For standard COCO format, I think the annotation should look like this:
{
  "id": 1,
  "image_id": 1,
  "category_id": 26,
  "bbox": [784.0, 807.0, 90.0, 110.0],
  "area": 9900.0,
  "iscrowd": 0
}
- Does anyone have a working fine-tuning example for open-vocabulary detection models like OmDet-Turbo or OWL-ViT?
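For context, here is roughly what I have working so far for zero-shot inference with OWL-ViT. The attribute-to-prompt mapping is just my guess at how the extra text could be used (the queries below are made up from my own annotations), and I don't know how to go from this to an actual fine-tuning loop on my COCO annotations:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("file_path/000003.bmp").convert("RGB")

# My guess: fold the per-box attributes into the text queries,
# e.g. "cargo or transport ship" instead of just the bare category name.
text_queries = [["ship", "cargo or transport ship", "other"]]

inputs = processor(text=text_queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# target_sizes expects (height, width); PIL's image.size is (width, height)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{text_queries[0][label]}: {score.item():.2f} at {box.tolist()}")
```

So inference with free-form queries works, but I'm stuck on how the training data (and the extra attributes) should be fed in for fine-tuning.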
Any guidance or example repositories would be greatly appreciated!
Thanks!