Struggling to Build an AI That Can Detect “Not Good” vs “Good” Environment Photos

Hi everyone,

I’m building an AI application where a user can take a photo of their environment, and the AI should:

  1. Determine if the environment is “Good” or “Not Good”

  2. If “Not Good,” detect and localize the object(s) causing it

However, I’m facing several challenges since my dataset is highly diverse, which is expected with real-world environmental images.

What I’ve Tried

1. Vision-Language Models (Gemma3, LLaVA)

  • I input the image and ask the VL model: “Is this Good or Not Good?”

  • I also provide prompts describing what a Good image looks like vs. Not Good.

  • Result:

    • The model can describe the image well.

    • But classification is terrible — it almost always says “Not Good,” even when the image is actually Good.

    • It feels like the model is “overly cautious” or unable to map the description rules to a binary decision.
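
For reference, the VL setup described above can be sketched roughly as follows. This is only an illustration under assumptions: the prompt wording, the `VERDICT:` line convention, and the `ask_vlm` callable (a stand-in for whatever Gemma3 or LLaVA wrapper is in use) are hypothetical and not the exact code from my project.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt: ask for a description first, then one fixed verdict line.
PROMPT = (
    "You are checking a photo of an environment.\n"
    "Describe what you see, then finish with exactly one line:\n"
    "VERDICT: GOOD or VERDICT: NOT GOOD"
)


def parse_verdict(reply: str) -> Optional[str]:
    """Return 'Good', 'Not Good', or None if the reply has no verdict line."""
    matches = re.findall(
        r"VERDICT:\s*(NOT\s+GOOD|GOOD)", reply, flags=re.IGNORECASE
    )
    if not matches:
        return None
    # Use the last verdict in case the model repeats itself.
    last = matches[-1].upper()
    return "Not Good" if last.startswith("NOT") else "Good"


def classify(image_path: str, ask_vlm: Callable[[str, str], str]) -> Optional[str]:
    """ask_vlm(image_path, prompt) -> reply text; assumed wrapper, not a real API."""
    return parse_verdict(ask_vlm(image_path, PROMPT))
```

Returning `None` for unparseable replies keeps them separate from real “Not Good” answers, which makes it easier to count how often the model actually picks each label.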


2. Image Classification (ConvNeXt)

  • Built a binary classification dataset: Good / Not Good.

  • Training loss and accuracy look good.

  • Result:

    • Works in some cases (e.g., empty table = Good).

    • But fails in others (e.g., full table that’s still acceptable = classified as Not Good).

    • Seems to overfit to simple visual cues like clutter = bad.
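
One way to check the “clutter = bad” suspicion is to break held-out accuracy down by scene type instead of looking at a single overall number. A minimal sketch, assuming each held-out image carries a scene tag; the tags (`empty`, `full_ok`, `full_bad`) and the sample records are made up for illustration and are not real results.

```python
from collections import defaultdict


def accuracy_by_tag(records):
    """records: iterable of (tag, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for tag, truth, pred in records:
        total[tag] += 1
        correct[tag] += int(truth == pred)
    return {tag: correct[tag] / total[tag] for tag in total}


# Hypothetical held-out predictions, invented purely to show the call.
records = [
    ("empty", "Good", "Good"),
    ("empty", "Good", "Good"),
    ("full_ok", "Good", "Not Good"),
    ("full_ok", "Good", "Good"),
    ("full_bad", "Not Good", "Not Good"),
]
print(accuracy_by_tag(records))
```

If the accuracy for the “full but acceptable” group is much lower than for the others, that would support the idea that the classifier is keying on clutter rather than on the actual rule.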


3. Object Detection (YOLO)

  • Labeled Not Good examples with bounding boxes showing the issues.

  • Trained YOLO to only detect Not Good objects (no detection = Good).

  • Result:

    • Very poor training accuracy.

    • I think the main problem is inconsistent bounding boxes — varied size, position, and coverage across images.

    • The dataset is too inconsistent for the model to learn clear patterns.
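
The “no detection = Good” rule from this attempt can be written down as a small post-processing step. This is a sketch under assumptions: the detection tuple layout and the `min_conf` threshold are hypothetical choices, not the output format of any particular YOLO version.

```python
from typing import List, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max, confidence)
Detection = Tuple[float, float, float, float, float]


def image_verdict(
    detections: List[Detection], min_conf: float = 0.5
) -> Tuple[str, List[Detection]]:
    """Label the image 'Not Good' if any box survives the confidence cut."""
    kept = [d for d in detections if d[4] >= min_conf]
    label = "Not Good" if kept else "Good"
    # The surviving boxes double as the localization of the problem objects.
    return label, kept
```

With this rule every false-positive box flips a Good image to “Not Good,” so the threshold would need tuning on held-out images rather than being left at a default.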

My Challenges

  • Data variability: “Not Good” situations can look very different.

  • Subtlety in rules: Some environments are “full” but still acceptable, which confuses binary classifiers.

What I’m Looking For

  • Advice on which model architecture or processing pipeline I should try, along with examples, so that both classification and detection can work effectively.

I came across a similar use case, and the best solution I found was to set up two task-specific PEFT adapters instead.

The first adapter performs the binary classification task that gives you Good / Not Good, and the second adapter does whatever processing you need to point out why it’s bad.
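
The control flow of that two-adapter idea could look roughly like the sketch below. It is not my actual code: `classify` and `localize` are hypothetical callables standing in for the two adapters, and the stubs at the bottom exist only so the example runs. With the `peft` library, both adapters would typically sit on one shared base model and be switched by name, but that part is left out here.

```python
from typing import Any, Callable, Dict, List


def two_stage(
    image: Any,
    classify: Callable[[Any], str],
    localize: Callable[[Any], List[dict]],
) -> Dict[str, Any]:
    """Run the classifier first; call the localizer only for 'Not Good' images."""
    label = classify(image)
    if label == "Good":
        return {"label": label, "regions": []}
    return {"label": label, "regions": localize(image)}


# Stub stand-ins for the two adapters, for illustration only.
def fake_classify(image: str) -> str:
    return "Not Good" if "messy" in image else "Good"


def fake_localize(image: str) -> List[dict]:
    return [{"box": (10, 20, 110, 220), "reason": "placeholder"}]


print(two_stage("clean_desk.jpg", fake_classify, fake_localize))
print(two_stage("messy_desk.jpg", fake_classify, fake_localize))
```

Keeping the two steps separate means the second stage never has to decide whether an image is Good, so it can be trained only on “Not Good” examples.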
