Prompted Segmentation for Drywall QA

Text-conditioned binary mask prediction for construction defect detection

Python 3.11 PyTorch HuggingFace CLIPSeg Typst uv MIT

Methodology • Data Preparation • Results • Failure Cases • Quick Start • Full Report (PDF)


Feed a construction photo and a text prompt. Get a binary segmentation mask back.

Two tasks – crack detection and drywall taping/joint detection – both driven by natural language at inference time. Change the prompt, change what gets segmented. No class heads, no retraining.

Input:  image.jpg  +  "segment wall crack"
Output: image__segment_wall_crack.png   (binary mask, {0, 255})

1. Methodology

Model: CLIPSeg

We fine-tune CLIPSeg (Lüddecke & Ecker, CVPR 2022), a text-conditioned segmentation model built on CLIP. The entire CLIP backbone (149.6M params) stays frozen; only a lightweight 3-block transformer decoder with U-Net skip connections (1.13M params) is trained.
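
The freeze-the-decoder-only split described above can be sketched on a toy module. The real model is loaded from HuggingFace (`CIDAS/clipseg-rd64-refined`); the module names and sizes below are purely illustrative:

```python
import torch.nn as nn

# Toy stand-in for the CLIPSeg layout: a large "backbone" plus a small
# decoder head. The real model is transformers' CLIPSegForImageSegmentation;
# these names and layer sizes are illustrative only.
class ToySegModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(512, 512)   # stands in for the frozen CLIP encoder
        self.decoder = nn.Linear(512, 1)      # stands in for the trainable decoder

model = ToySegModel()

# Freeze everything outside the decoder.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("decoder")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
```

The same two sums, run on the real model, are where the 1,127,009 / 149,620,737 parameter counts in the table below come from.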

CLIPSeg architecture: frozen CLIP backbone + trainable decoder

The model takes an RGB image and a text prompt. The CLIP vision encoder (ViT-B/16) and text encoder independently produce embeddings. The decoder fuses these via cross-attention and generates logits at 352x352, which are thresholded at 0.5 to produce binary masks.

Why CLIPSeg over Grounded SAM, SEEM, X-Decoder?
| | CLIPSeg | Grounded SAM | SEEM | X-Decoder |
|---|---|---|---|---|
| Text-to-mask | Direct | Two-stage (text → bbox → mask) | Multi-modal | Yes |
| Small-data fine-tuning | Proven | Moderate | Difficult | Not ideal |
| Consumer GPU (Apple M4) | Yes | Decoder only | No | No |
| HuggingFace native | Yes | Yes | GitHub only | Limited |

CLIPSeg is the only architecture that gives direct text-to-mask conditioning without bounding box intermediates, fine-tunes reliably on small datasets, and runs on consumer hardware with mature HuggingFace support.

Training Configuration

| Parameter | Value |
|---|---|
| Base model | CIDAS/clipseg-rd64-refined |
| Trainable | 1,127,009 params (decoder only) |
| Frozen | 149,620,737 params (CLIP backbone) |
| Loss | BCEDiceLoss: 0.5 BCE + 0.5 Dice |
| Optimizer | AdamW (lr=1e-4, wd=1e-4) + CosineAnnealingLR |
| Early stopping | Patience 7 on val mIoU |
| Device | Apple M4 (MPS backend) |
| Wall time | 97.2 min (18 epochs, best at epoch 11) |
Why BCEDiceLoss instead of standard BCE?

Standard BCE alone fails on thin structures like cracks: the severe foreground/background imbalance means BCE happily predicts "all background" at low loss. Dice loss directly optimizes overlap, forcing the model to find crack pixels. The 50/50 blend gives gradient stability (BCE) and overlap-awareness (Dice).
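
A minimal sketch of such a blended loss on raw logits. The repo's `src/model/losses.py` is the authoritative implementation; the smoothing epsilon and reduction details here are assumptions:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, targets, eps=1e-6):
    """0.5 * BCE + 0.5 * Dice, computed on raw logits.
    Sketch only; the repo's BCEDiceLoss may differ in details."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    dice = (2 * intersection + eps) / (probs.sum() + targets.sum() + eps)
    return 0.5 * bce + 0.5 * (1 - dice)

# A confident, correct prediction drives both terms toward zero.
targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
logits = torch.tensor([[10.0, -10.0], [-10.0, 10.0]])
loss = bce_dice_loss(logits, targets)
```

Note how the Dice term is what penalizes the "all background" shortcut: an all-negative prediction keeps BCE small on imbalanced masks but sends `1 - dice` toward 1.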

Training Pipeline

Training loop: load pretrained → freeze backbone → train decoder → early stop → evaluate

Training converged at epoch 11 (val mIoU 0.1605); the next 7 epochs showed no improvement, and early stopping triggered at epoch 18.

All hyperparameters: configs/train_config.yaml


2. Data Preparation

Sources

Two datasets from Roboflow Universe, downloaded manually in COCO format:

| Dataset | Source | Images | Raw Annotation | Mask Strategy |
|---|---|---|---|---|
| Taping | drywall-join-detect | 1,186 | Bounding boxes only | Filled rectangles |
| Cracks | cracks-3ii36 | 5,369 | COCO polygons | Pixel-accurate binary masks via pycocotools |

Note: The cracks dataset had 0 generated Roboflow versions: the owner never created an exportable version, making API download impossible. The raw export was downloaded directly from the website.

Mask Rendering

  • Cracks: COCO polygon annotations rendered to pixel-accurate binary masks using pycocotools.mask. Some annotations had empty segmentation fields (edge case) β€” handled with try/except fallback to bounding box rendering.
  • Taping: Only bounding box annotations available. Filled rectangles used as mask approximations. This is a known limitation β€” the rectangles include substantial background, which affects training signal quality.

Prompt Augmentation

5 synonyms per class, randomly sampled each training iteration. This forces the decoder to learn semantic meaning from the text encoder rather than memorize exact strings:

| Class | Prompts |
|---|---|
| Cracks | "segment crack" · "segment wall crack" · "segment surface crack" · "segment drywall crack" · "segment fracture" |
| Taping | "segment taping area" · "segment joint tape" · "segment drywall seam" · "segment drywall joint" · "segment tape line" |

Pipeline

Data pipeline: Roboflow → inspect annotations → render masks → unified manifest → stratified split

Splits

Stratified by class (taping vs cracks), seed 42:

| Train | Validation | Test |
|---|---|---|
| 4,588 (70%) | 982 (15%) | 985 (15%) |
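
A seeded, per-class 70/15/15 split of this kind can be sketched as follows; `src/data/preprocess.py` is the authoritative implementation, and the exact shuffling details here are assumptions:

```python
import random

def stratified_split(samples, seed=42, val_frac=0.15, test_frac=0.15):
    """Shuffle-and-slice each class separately so the 70/15/15 ratio
    holds per class, not just overall. Illustrative sketch."""
    rng = random.Random(seed)
    by_class = {}
    for sample in samples:
        by_class.setdefault(sample["class"], []).append(sample)
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test

samples = ([{"class": "cracks", "id": i} for i in range(80)]
           + [{"class": "taping", "id": i} for i in range(20)])
train, val, test = stratified_split(samples)
```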

Preprocessing code: src/data/preprocess.py · Dataset class: src/data/dataset.py


3. Results

Best Predictions

The model's strongest predictions reach IoU 0.78 on both cracks and taping:

Best test-set predictions ranked by IoU: 3 cracks + 3 taping

Test-Set Metrics (985 samples)

| Class | mIoU | Dice | Samples |
|---|---|---|---|
| Taping | 0.1917 | 0.2780 | 179 |
| Cracks | 0.1639 | 0.2434 | 806 |
| Overall | 0.1689 | 0.2497 | 985 |
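
For reference, the per-sample metrics averaged into these numbers can be computed as below (a stdlib sketch on flattened binary masks):

```python
def iou_and_dice(pred, target):
    """IoU and Dice for flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return iou, dice

# Half-overlapping masks: |A ∩ B| = 1, |A ∪ B| = 3, so IoU = 1/3, Dice = 1/2.
iou, dice = iou_and_dice([1, 1, 0, 0], [0, 1, 1, 0])
```

Dice is always at least as large as IoU on the same pair of masks, which is why the Dice column sits above the mIoU column throughout the table.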

Taping outperforms cracks because filled-rectangle masks provide a stronger supervision signal (larger contiguous regions) compared to thin crack annotations where minor spatial offsets cause disproportionate IoU drops.

Inference

| Metric | Value |
|---|---|
| Avg inference time | 58.7 ms / image |
| Model size | 575.1 MB |
| Output format | PNG, single-channel {0, 255}, resized to original dimensions |
| Threshold | 0.5 (sigmoid → binary) |
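
The sigmoid-and-threshold step can be sketched on a flat list of raw logits; resizing back to the original image dimensions and PNG encoding happen separately in the repo:

```python
import math

def logits_to_mask(logits, threshold=0.5):
    """sigmoid → threshold at 0.5 → {0, 255}, matching the output format
    above. Operates on a flat list of raw decoder logits."""
    return [255 if 1 / (1 + math.exp(-z)) > threshold else 0 for z in logits]

mask = logits_to_mask([-2.0, 0.3, 4.1, -0.1])
# Only the positive logits clear sigmoid(z) > 0.5: mask == [0, 255, 255, 0]
```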

4. Failure Cases & Potential Solutions

Worst Predictions

The model's worst predictions (IoU near zero) reveal systematic failure patterns:

Failure cases: worst test-set predictions by IoU, 3 cracks + 3 taping

What's going wrong in these examples:

  • Cracks (rows 1–3): The model activates over broad wall regions instead of tracing the thin crack lines. Fine cracks disappear at 352x352 resolution, and the frozen CLIP backbone has no features for hairline construction defects. The predictions show the model "knows something is there" but can't localize it precisely.
  • Taping (rows 4–6): The model predicts large rectangular blobs that don't match the actual joint locations. This directly traces back to the filled-rectangle training masks β€” the model learned to predict rectangles because that's what it was supervised on.

Root Causes

| # | Factor | Impact |
|---|---|---|
| 1 | Coarse taping annotations | Source dataset has bounding boxes, not pixel masks. Filled rectangles include background → model over-predicts. |
| 2 | Thin crack IoU sensitivity | A 1px crack shifted 2px yields near-zero IoU despite visual similarity. Dominates the aggregate. |
| 3 | 352x352 resolution ceiling | CLIPSeg's fixed input size discards fine detail from high-res construction photos. |
| 4 | Frozen backbone domain gap | CLIP was trained on internet images, not construction imagery. Feature extraction cannot adapt. |
| 5 | Small decoder (1.13M params) | Limited capacity to learn construction-specific visual patterns. |
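
Factor 2 can be made concrete with a tiny worked example: a 1px-wide crack predicted just 2px from its true location overlaps the ground truth nowhere, so IoU collapses to zero even though the prediction looks visually correct:

```python
def iou(a, b):
    """IoU for flat binary masks (lists of 0/1)."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 1.0

width = 10
# A 1px-wide vertical crack, flattened to one image row: foreground at column 4.
gt = [1 if col == 4 else 0 for col in range(width)]
# The same crack predicted 2px to the right: near-identical to the eye,
# but the two masks share no foreground pixels at all.
pred = [1 if col == 6 else 0 for col in range(width)]

score = iou(pred, gt)   # 0.0 despite only a 2px offset
```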

Proposed Solutions

| Limitation | Solution | Expected Impact |
|---|---|---|
| Coarse taping masks | Use SAM/SAM2 to generate pixel-accurate masks from bounding boxes before training | High: directly fixes the supervision signal |
| Frozen backbone | Unfreeze last 2–3 ViT blocks with 10x lower learning rate for domain adaptation | High: lets the model learn construction-specific features |
| 352x352 resolution | Switch to SAM2 with text-prompt conditioning or a higher-res architecture | High: preserves fine crack detail |
| Small decoder | Add decoder blocks or increase hidden dimension (monitor overfitting) | Medium: more capacity, but risk of overfitting on small data |
| Thin-crack metric sensitivity | Use boundary IoU or distance-tolerant evaluation instead of standard IoU | Low: doesn't improve the model, but gives fairer measurement |

Repo Structure

Repository structure: color-coded by module with data flow arrows

File-by-file listing
| Path | Purpose |
|---|---|
| configs/train_config.yaml | All hyperparameters in one file |
| src/data/preprocess.py | Annotation inspection, mask rendering, stratified splits |
| src/data/dataset.py | PyTorch Dataset + CLIPSegProcessor collation |
| src/model/clipseg_wrapper.py | Model loading + backbone freezing |
| src/model/losses.py | BCEDiceLoss implementation |
| src/train.py | Training loop with early stopping + logging |
| src/evaluate.py | Test metrics, mask generation, visual comparisons |
| src/predict.py | Single-image CLI inference |
| src/best_predictions.py | Per-sample IoU scoring, best/worst prediction figures |
| reports/report.typ | Typst source → report.pdf |

Quick Start

Prerequisites: Python 3.11+, uv, Homebrew (macOS)

brew install graphviz plantuml typst d2
uv sync

1. Get the data

Download both datasets from Roboflow Universe in COCO format and place them under data/raw/:

data/raw/
├── taping/          # drywall-join-detect (COCO export)
│   ├── train/
│   └── valid/
└── cracks/          # cracks-3ii36 (COCO export)
    └── train/

2. Preprocess

uv run python -m src.data.preprocess

3. Train

uv run python -m src.train

4. Evaluate

uv run python -m src.evaluate

5. Predict on a single image

uv run python -m src.predict path/to/image.jpg "segment crack"

6. Build the report

d2 reports/diagrams/pipeline.d2 reports/diagrams/pipeline.png
plantuml -tpng reports/diagrams/training.puml
uv run python reports/diagrams/architecture.py
typst compile reports/report.typ reports/report.pdf

Reproducibility


Read the full report (PDF)
