Hi everyone,
I’m building an AI application where a user can take a photo of their environment, and the AI should:
- Determine if the environment is “Good” or “Not Good”
- If “Not Good,” detect and localize the object(s) causing it
However, I’m facing several challenges since my dataset is highly diverse, which is expected with real-world environmental images.
What I’ve Tried
1. Vision-Language Models (Gemma3, LLaVA)
- I input the image and ask the VL model: “Is this Good or Not Good?” (see the sketch after this list)
- I also provide prompts describing what a Good image looks like vs. Not Good.
- Result:
  - The model can describe the image well.
  - But classification is terrible: it almost always says “Not Good,” even when the image is actually Good.
  - It feels like the model is “overly cautious,” or unable to map the descriptive rules to a binary decision.
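For context, this is roughly how I query the model. It's a minimal sketch using LLaVA through Hugging Face transformers; the model ID, the `rules` text, and the one-word-answer instruction are illustrative of my setup, not exact code:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative model choice; I have tried both Gemma3 and LLaVA variants.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def classify(image_path: str, rules: str) -> str:
    """Ask the VLM for a binary verdict, given my written rules."""
    image = Image.open(image_path)
    prompt = (
        f"USER: <image>\n{rules}\n"
        "Based on these rules, answer with exactly 'Good' or 'Not Good'.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()  # e.g. "Not Good"
```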
2. Image Classification (ConvNeXt)
- Built a binary classification dataset: Good / Not Good (training sketch below).
- Training loss and accuracy look good.
- Result:
  - Works in some cases (e.g., empty table = Good).
  - But fails in others (e.g., a full table that’s still acceptable gets classified as Not Good).
  - Seems to overfit to simple visual cues like clutter = bad.
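For reference, my training setup is essentially the following. This is a simplified sketch with torchvision; the `data/train` layout with `good/` and `not_good/` subfolders and all hyperparameters are just how I happen to run it:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing; data/train contains good/ and not_good/ subfolders.
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder("data/train", transform=tfm)
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

# Pretrained ConvNeXt-Tiny with the head swapped for a 2-class output.
model = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.DEFAULT)
model.classifier[2] = nn.Linear(model.classifier[2].in_features, 2)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for x, y in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```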
3. Object Detection (YOLO)
- Labeled Not Good examples with bounding boxes showing the issues.
- Trained YOLO to detect only Not Good objects (no detection = Good); setup sketch below.
- Result:
  - Very poor training accuracy.
  - I think the main problem is inconsistent bounding boxes: varied size, position, and coverage across images.
  - The dataset is too inconsistent for the model to learn clear patterns.
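The YOLO setup is basically the standard Ultralytics recipe (a sketch; `data.yaml`, the model size, and the confidence threshold are my own choices):

```python
from ultralytics import YOLO

# data.yaml lists the image/label paths and defines the "Not Good" issue classes.
model = YOLO("yolov8n.pt")  # pretrained weights as a starting point
model.train(data="data.yaml", epochs=100, imgsz=640)

# Inference rule: no detections above the threshold => environment is "Good".
results = model("photo.jpg", conf=0.25)
is_good = len(results[0].boxes) == 0
```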
My Challenges
- Data variability: “Not Good” situations can look very different.
- Subtlety in rules: some environments are “full” but still acceptable, which confuses binary classifiers.
What I’m Looking For
- Advice on which model architecture or processing pipeline to try, ideally with examples, so that both classification and detection work effectively.