How to prepare dataset for fine-tuning a VLM (open-vocabulary detection, COCO format)?

I have a custom dataset and I want to fine-tune a Vision-Language Model (VLM) for open-vocabulary object detection. However, I’m a bit confused about the correct dataset format.

I know that most detection models (e.g., OmDet-Turbo, OWL-ViT) support the COCO-style JSON format, but I’m not sure how to adapt my dataset, especially when it includes extra metadata such as captions or attributes for each bounding box.

Here is a simplified example of my current dataset:

{
  "images": [
    {
      "id": 1,
      "file_name": "file_path/000003.bmp",
      "width": 930,
      "height": 930
    },...
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_name": "Other",
      "bbox": [784.0, 807.0, 90.0, 110.0],
      "area": 9900.0,
      "iscrowd": 0,
      "caption": {
        "confidence": confidence_score,
        "ship-visibility": "visible",
        "ship-purpose": "cargo or transport",
        ...
      },
      "category_id": 26
    },...
  ],
  "categories": [
    { "id": 1, "name": "category_name" },...
  ]
}

My questions:

  1. Should I remove the extra caption field? Or can VLMs use these text attributes during fine-tuning?

  2. For the standard COCO format, I think each annotation should look like the following. Is that correct?

{
  "id": 1,
  "image_id": 1,
  "category_id": 26,
  "bbox": [784.0, 807.0, 90.0, 110.0],
  "area": 9900.0,
  "iscrowd": 0
}

  3. Does anyone have a working fine-tuning example for open-vocabulary detection models like OmDet-Turbo or OWL-ViT?

Any guidance or example repositories would be greatly appreciated!

Thanks :raising_hands:

Hey, it looks like your dataset is already structured properly in COCO style, with the standard fields plus your extra captions and attributes. That's a good start.

Detection VLMs like OWL-ViT or OmDet-Turbo expect the standard fields such as bbox and category_id. Extra fields like caption or attributes won't break anything, but they'll simply be ignored unless your training pipeline specifically knows how to use them.

So the real issue isn't the formatting anymore, it's understanding what your training pipeline actually does with those fields.

The structure is fine. Look at your training script and see what it actually reads. If it doesn't use the custom fields, either strip them out or wire them into the model; a minimal stripping sketch is below.
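
This is only a sketch: the file names are placeholders for your own paths, and the key whitelist is the usual set for a COCO detection annotation.

import json

SRC = "annotations_extended.json"    # placeholder: your current file with caption/attribute fields
DST = "annotations_coco_clean.json"  # placeholder: the standard COCO output

# Keys a plain COCO detection annotation is expected to carry.
STANDARD_KEYS = {"id", "image_id", "category_id", "bbox", "area", "iscrowd", "segmentation"}

with open(SRC) as f:
    data = json.load(f)

# Drop every non-standard key (e.g. "caption", "category_name") from each annotation;
# "images" and "categories" are already standard, so they pass through unchanged.
data["annotations"] = [
    {k: v for k, v in ann.items() if k in STANDARD_KEYS}
    for ann in data["annotations"]
]

with open(DST, "w") as f:
    json.dump(data, f)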

Working with VLMs gets easier once you stop fiddling with the format and start tracing how the data is consumed. You're close, just refocus on what gets read and why.

Hey @onurulu17, welcome to the community! :tada:

I haven’t worked extensively with VLMs myself (more of a text LM person), but I wanted to chime in on your dataset format question since it’s a really good one.

Don’t remove that caption field! Those text attributes you have (like “ship-visibility: visible”, “ship-purpose: cargo or transport”) are actually really valuable for VLM training. Open-vocabulary detection models specifically benefit from rich text descriptions because they help the model:

  • Learn better region-text alignment
  • Understand fine-grained attributes beyond just category names
  • Generalize to new object types at inference time

The caption data can be used during training to align visual features with semantic descriptions, provided the pipeline reads it: it's not just about bbox prediction, but also about learning meaningful multi-modal representations.
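
If your training pipeline does consume per-region text, the attributes can be folded straight into the text each box is aligned with. A rough sketch of building those strings from your JSON (the file name and the description template are assumptions on my part; check what the specific model's data collator expects):

import json

# Placeholder file name; the "caption" keys mirror the example above.
with open("annotations_extended.json") as f:
    data = json.load(f)

id_to_name = {c["id"]: c["name"] for c in data["categories"]}

def box_to_text(ann):
    """Turn one annotation into a free-text description for region-text alignment."""
    parts = [ann.get("category_name") or id_to_name[ann["category_id"]]]
    # Fold string-valued attributes into the description; skip numeric fields like confidence.
    for key, value in ann.get("caption", {}).items():
        if isinstance(value, str):
            parts.append(f"{key.replace('-', ' ')}: {value}")
    return ", ".join(parts)

texts = [box_to_text(ann) for ann in data["annotations"]]
print(texts[0])  # e.g. "Other, ship visibility: visible, ship purpose: cargo or transport"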

For the format question: Many VLMs can handle extended COCO formats with additional fields. The key is checking what the specific model’s training script expects. I’d suggest:

  1. Keep your current format with the caption metadata
  2. Check the official training examples for OmDet-Turbo/OWL-ViT in their repos
  3. You might also want to create a "clean" COCO version as a backup, but try the extended format first (a quick way to check it still parses as valid COCO is sketched below)
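
Before touching any training code, you can confirm the extended file still parses as valid COCO. pycocotools only needs the standard keys to build its index, and extra keys on each annotation should just be carried along. A sketch, assuming pycocotools is installed and the file name is swapped for yours:

from pycocotools.coco import COCO  # pip install pycocotools

coco = COCO("annotations_extended.json")  # placeholder path

# The extra "caption" metadata stays attached to each annotation dict.
ann_ids = coco.getAnnIds(imgIds=[1])
for ann in coco.loadAnns(ann_ids):
    print(ann["category_id"], ann["bbox"], ann.get("caption"))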

For working examples, I’d recommend checking:

  • The official model repos on GitHub
  • Supplementary code linked from the papers
  • The vision community on Discord/Reddit, which may have more hands-on experience

Hope this helps point you in the right direction! Good luck with the fine-tuning :rocket:
