Part of the Detecture monorepo — this is the evaluation component. The model under evaluation lives at ../Qwen2SAM_Detecture/; the data pipeline that produced its training set lives at ../DetectureMiner/.
Unified benchmark harness that runs four method families on four datasets under a strict, dataset-dependent fairness protocol. One command reproduces every row of the paper's main comparison table.
| Name | Pipeline | Role |
|---|---|---|
| Detecture (ours) | trained Qwen2SAM_Detecture (best.pt) → Bridge → SAM3 | our model |
| SAM3 | SAM3 text encoder → Semantic Seg Head | baseline — single-prompt |
| Grounding_SAM3 | Grounding-DINO boxes → SAM3 box-prompt → Semantic Seg Head | baseline — box-driven |
| SA2VA | ByteDance/Sa2VA-4B end-to-end VLM | baseline — big VLM |
A fifth row (TextureSAM) is included by copying the published paper numbers; we do not re-run that checkpoint.
The per-dataset "regime" controls which oracle hints (if any) each method
receives. It is enforced by the DISPATCH table inside
master_runner.py so no ad-hoc prompt can silently
leak K_GT back in.
| Dataset | $K_\text{GT}$ | Regime | How fairness is enforced |
|---|---|---|---|
| RWTD (253, natural 2-texture) | 2 | Oracle K=2 | Detecture + SA2VA: explicit "exactly 2" prompt. SAM3 + Grounding_SAM3: [m1, -m1] inverse-mask trick over Softmax, which is mathematically equivalent to a K=2 partition without needing a text prior that can express a count. |
| STLD (200, synthetic 2-texture) | 2 | Oracle K=2 | Same as RWTD. |
| DeTexture ADE20K (212, multi-texture) | 1–6 | Autonomous | No method is told $K_\text{GT}$; the number of regions follows from the `<p>entity</p>` tags the method's own caption produces. SAM3 is the one exception: its proposal decoder has no prior over how many masks to keep, so it gets a top-$K_\text{GT}$ concession (a SAM3-specific limitation that matches the TextureSAM paper's treatment). |
| CAID (3091, single-region) | 1 | K=1 (trivial) | Single-class route; the K-regime distinction is moot. |
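The `[m1, -m1]` equivalence claimed for RWTD and STLD can be checked numerically. The sketch below assumes `m1` is a per-pixel logit map for the first texture (the function name and the toy values are illustrative, not taken from the repository). Stacking `m1` with its negation and applying Softmax over the channel axis yields probabilities σ(2·m1) and 1 − σ(2·m1), so the argmax is a two-region partition split at m1 = 0.

```python
from __future__ import annotations

import numpy as np


def inverse_mask_partition(m1: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Softmax over the stack [m1, -m1]; returns (probabilities, labels)."""
    logits = np.stack([m1, -m1], axis=0)                  # shape (2, H, W)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=0, keepdims=True)
    labels = probs.argmax(axis=0)                         # 0 where m1 >= 0, else 1
    return probs, labels


# Toy logit map (assumed values, for illustration only).
m1 = np.array([[2.0, -1.0], [0.5, -3.0]])
probs, labels = inverse_mask_partition(m1)

# Channel 0 of the Softmax equals sigmoid(2 * m1), so every pixel lands in
# exactly one of two regions without any count-aware text prompt.
sigmoid_2m = 1.0 / (1.0 + np.exp(-2.0 * m1))
assert np.allclose(probs[0], sigmoid_2m)
```

Every pixel receives exactly one of two labels, which is what makes this a K=2 partition rather than two independent binary masks.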
Every method's output is scored through the same metrics_utils.py
scorer — Softmax + static dustbin + scipy.optimize.linear_sum_assignment
on 1 − IoU + ARI on the predicted partition — so mIoU and ARI are
directly comparable across methods and datasets.
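As a rough illustration of the scoring steps named above, here is a minimal sketch using NumPy and `scipy.optimize.linear_sum_assignment`. It is not the code in metrics_utils.py: the function names, the dustbin logit of 0.0, and the choice to average over ground-truth regions (so unmatched regions count as 0) are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def partition_from_logits(logits: np.ndarray, dustbin_logit: float = 0.0) -> np.ndarray:
    """Softmax over K mask logits plus a constant dustbin channel, then argmax.

    Returns an (H, W) label map; label K marks dustbin pixels.
    The dustbin value 0.0 is an assumption for this sketch.
    """
    k, h, w = logits.shape
    dustbin = np.full((1, h, w), dustbin_logit, dtype=logits.dtype)
    stacked = np.concatenate([logits, dustbin], axis=0)
    stacked = stacked - stacked.max(axis=0, keepdims=True)  # stable softmax
    exp = np.exp(stacked)
    probs = exp / exp.sum(axis=0, keepdims=True)
    return probs.argmax(axis=0)


def hungarian_miou(pred: np.ndarray, gt: np.ndarray, k_pred: int, k_gt: int) -> float:
    """Match predicted to GT regions by minimizing 1 - IoU, then average IoU."""
    iou = np.zeros((k_gt, k_pred))
    for g in range(k_gt):
        for p in range(k_pred):
            inter = np.logical_and(gt == g, pred == p).sum()
            union = np.logical_or(gt == g, pred == p).sum()
            iou[g, p] = inter / union if union > 0 else 0.0
    rows, cols = linear_sum_assignment(1.0 - iou)
    # Assumption: GT regions left unmatched contribute 0 to the mean.
    return float(iou[rows, cols].sum() / k_gt)


def adjusted_rand_index(a: np.ndarray, b: np.ndarray) -> float:
    """Adjusted Rand Index of two label maps, from their contingency table."""
    _, ai = np.unique(a.ravel(), return_inverse=True)
    _, bi = np.unique(b.ravel(), return_inverse=True)
    cont = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(cont, (ai, bi), 1)

    def comb2(x):
        return x * (x - 1) / 2.0

    sum_ij = comb2(cont).sum()
    sum_a = comb2(cont.sum(axis=1)).sum()
    sum_b = comb2(cont.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(a.size)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0
    return float((sum_ij - expected) / (max_index - expected))


# Toy case: the prediction is correct but its channel order is swapped.
gt = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
right = np.where(gt == 1, 4.0, -4.0)
logits = np.stack([right, -right])   # channel 0 fires on GT region 1
pred = partition_from_logits(logits)
miou = hungarian_miou(pred, gt, k_pred=2, k_gt=2)
ari = adjusted_rand_index(pred, gt)
```

Both scores are invariant to the label permutation, which is why outputs from methods with arbitrary channel order can share one scorer.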
```bash
cd ~/Detecture
pip install -r requirements.txt   # root requirements.txt

# SAM3: clone separately and point to it (not pip-installable).
git clone https://github.com/facebookresearch/sam3.git ~/sam3
pip install -e ~/sam3
export SAM3_ROOT=~/sam3           # optional; defaults to ~/sam3

# For SA2VA: install a stub flash_attn (no real kernels) so the model loads.
# Minimal stub: a package directory named `flash_attn` with the names listed
# in eval_vlm_end2end.py (index_first_axis, pad_input, unpad_input, etc.).
# Details inline in that file.
```

HuggingFace weights download on first run:

- `Qwen/Qwen3-VL-8B-Instruct`
- `ByteDance/Sa2VA-4B`, pinned to revision `b5ffed22` (eval_vlm_end2end.py)
- `IDEA-Research/grounding-dino-tiny`
- SAM3 via `build_sam3_image_model(load_from_HF=True)`
All four datasets use a unified metadata JSON. Paths are read from datasets_config.yaml; the defaults expect:
| Dataset key | Metadata path |
|---|---|
| `CAID` | `~/datasets/CAID/metadata.json` |
| `RWTD` | `~/datasets/RWTD/metadata.json` |
| `STLD` | `~/datasets/STLD/metadata.json` |
| `ADE20k_Detecture` | `~/datasets/ADE20k_Detecture/metadata.json` |
Override the root with DETECTURE_DATASETS_ROOT=/some/where or edit the
config. See the root README for Hugging Face
download instructions.
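The default layout and the environment override can be pictured with a short sketch. It ignores datasets_config.yaml entirely and hard-codes the default paths from the table, and the helper names `datasets_root` and `metadata_path` are hypothetical, so treat it as an illustration of the lookup order rather than the repository's loader.

```python
import os
from pathlib import Path

# Dataset keys as listed in the table of default metadata paths.
DATASET_KEYS = ("CAID", "RWTD", "STLD", "ADE20k_Detecture")


def datasets_root() -> Path:
    """Return DETECTURE_DATASETS_ROOT if set, otherwise ~/datasets."""
    override = os.environ.get("DETECTURE_DATASETS_ROOT")
    if override:
        return Path(override).expanduser()
    return Path.home() / "datasets"


def metadata_path(dataset_key: str) -> Path:
    """Resolve <root>/<dataset_key>/metadata.json for a known dataset key."""
    if dataset_key not in DATASET_KEYS:
        raise KeyError(f"unknown dataset key: {dataset_key}")
    return datasets_root() / dataset_key / "metadata.json"
```

With `DETECTURE_DATASETS_ROOT=/some/where`, `metadata_path("RWTD")` would resolve to `/some/where/RWTD/metadata.json`; without it, the path falls under the home directory.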
Unified metadata entry:

```json
{
  "image_path": "/abs/path/image.jpg",
  "id": "sample_id",
  "textures": [
    {"description": "…", "mask_path": "/abs/path/mask_0.png"},
    {"description": "…", "mask_path": "/abs/path/mask_1.png"}
  ]
}
```

```bash
# Everything (4 methods × 4 datasets = 16 cells)
python master_runner.py

# One method, all datasets
python master_runner.py --model detecture

# One dataset, all methods (smoke)
python master_runner.py --dataset RWTD --limit 10

# One specific cell
python master_runner.py --model sa2va --dataset STLD --limit 10

# Preview commands without spawning anything
python master_runner.py --dry-run
```

Each (model × dataset) cell spawns as a subprocess: clean GPU state per model (Detecture alone occupies ~16 GB in bf16 and cannot co-reside with a second backbone).
Outputs land under results/<model>/<dataset>/zero_shot_results.json.
The runner also writes a combined
results/paper_benchmark_summary.json with the dispatch metadata and
per-cell summaries.
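The cell expansion and per-cell subprocess pattern could look roughly like the sketch below. It is not master_runner.py: the child script `run_cell.py`, its `--regime` and `--out` flags, the regime labels, and the CLI spellings of the SAM3 and Grounding_SAM3 models are hypothetical. It also does not model the SAM3 top-$K_\text{GT}$ concession on ADE20K. The point it illustrates is that the regime is looked up from the dataset alone, so a caller cannot pass a different hint per method.

```python
from __future__ import annotations

import itertools
import shlex
import subprocess
import sys
from pathlib import Path

# Illustrative tables; the repository's REGIMES / DISPATCH may differ.
MODELS = ("detecture", "sam3", "grounding_sam3", "sa2va")
REGIMES = {
    "RWTD": "oracle_k2",
    "STLD": "oracle_k2",
    "ADE20k_Detecture": "autonomous",
    "CAID": "k1",
}


def expand_cells(model: str | None = None, dataset: str | None = None) -> list[tuple[str, str]]:
    """Expand optional --model / --dataset filters into (model, dataset) cells."""
    models = [model] if model else list(MODELS)
    datasets = [dataset] if dataset else list(REGIMES)
    return list(itertools.product(models, datasets))


def output_path(model: str, dataset: str) -> Path:
    """Per-cell result file: results/<model>/<dataset>/zero_shot_results.json."""
    return Path("results") / model / dataset / "zero_shot_results.json"


def cell_command(model: str, dataset: str, limit: int | None = None) -> list[str]:
    """Build the child command; `run_cell.py` and its flags are hypothetical."""
    cmd = [
        sys.executable, "run_cell.py",
        "--model", model,
        "--dataset", dataset,
        "--regime", REGIMES[dataset],   # fixed by dataset, never by the caller
        "--out", str(output_path(model, dataset)),
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


def run_cells(cells, limit: int | None = None, dry_run: bool = False) -> list[dict]:
    """Run (or just print) one subprocess per cell and record what was run."""
    summaries = []
    for model, dataset in cells:
        cmd = cell_command(model, dataset, limit)
        if dry_run:
            print(shlex.join(cmd))
        else:
            # One process per cell, so GPU memory is released between models.
            subprocess.run(cmd, check=True)
        summaries.append({
            "model": model,
            "dataset": dataset,
            "regime": REGIMES[dataset],
            "command": shlex.join(cmd),
        })
    return summaries
```

Calling `run_cells(expand_cells(dataset="RWTD"), limit=10, dry_run=True)` prints four commands and returns four summary records without spawning anything.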
With best.pt + SAM3 + Grounding-DINO-tiny + SA2VA-4B (rev b5ffed22):
| Method | RWTD mIoU/ARI | STLD mIoU/ARI | ADE20K mIoU/ARI | CAID mIoU/ARI |
|---|---|---|---|---|
| SAM3 | 0.6337 / 0.4427 | 0.5042 / 0.1378 | 0.3194 / 0.3219 | 0.9006 / 0.8169 |
| Grounding_SAM3 | 0.4640 / 0.1959 | 0.4489 / 0.0539 | 0.4518 / 0.4794 | 0.6217 / 0.4043 |
| SA2VA | 0.3561 / 0.5593 | 0.3739 / 0.7140 | 0.7141 / 0.7011 | 0.7986 / 0.7151 |
| TextureSAM (ref) | 0.4684 / 0.6163 | 0.4677 / 0.6849 | 0.4798 / 0.3566 | 0.6691 / 0.5080 |
| Detecture (ours) | 0.8162 / 0.6895 | 0.7441 / 0.6062 | 0.7419 / 0.7138 | 0.7450 / 0.5883 |
Detecture leads mIoU on every multi-texture route. CAID (single "water surface" class) is where SAM3 wins honestly — noted in the paper as the expected cost of generality.
Aggregate + paper-ready CSV / LaTeX:
```bash
python aggregate_results.py --csv results/summary.csv --latex results/summary.tex
```

| Script | Role |
|---|---|
| `master_runner.py` | canonical entry point: dispatches every (method × dataset) cell through REGIMES + DISPATCH |
| `eval_sam3_vanilla.py` | SAM3 text-encoder baseline |
| `eval_grounded_sam3.py` | Grounding-DINO → SAM3 box-prompt baseline |
| `eval_vlm_end2end.py` | Sa2VA (and future VLM backends) end-to-end |
| `eval_qwen2sam_zs.py` | legacy Qwen→SAM3 zero-shot shim, kept for ablation only |
| `steelman/*.py` | K=2 inverse-trick variants used on RWTD/STLD |
| `metrics_utils.py` | shared Softmax+dustbin+Hungarian+ARI scorer (every cell routes through this) |
| `aggregate_results.py` | per-cell JSONs → Markdown / CSV / LaTeX table |
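To make the aggregation step concrete, here is a minimal sketch of collecting per-cell JSONs into a CSV. The schema of zero_shot_results.json is not documented here, so the top-level `miou` and `ari` keys are assumptions, as are the function names; only the `results/<model>/<dataset>/zero_shot_results.json` layout comes from this README.

```python
from __future__ import annotations

import csv
import json
from pathlib import Path

FIELDS = ["model", "dataset", "miou", "ari"]


def collect_cells(results_root: Path) -> list[dict]:
    """Gather one row per results/<model>/<dataset>/zero_shot_results.json."""
    rows = []
    for path in sorted(results_root.glob("*/*/zero_shot_results.json")):
        data = json.loads(path.read_text())
        rows.append({
            "model": path.parts[-3],
            "dataset": path.parts[-2],
            "miou": data["miou"],   # assumed key name
            "ari": data["ari"],     # assumed key name
        })
    return rows


def write_csv(rows: list[dict], out_path: Path) -> None:
    """Write the collected rows as a flat CSV with a header line."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

The glob only matches files two directories below the results root, so a combined summary file at the root itself is not picked up as a cell.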
- All CLI defaults resolve through `Path.home()` / environment variables; no absolute paths anywhere in the codebase.
- Sa2VA's HF `revision` is pinned to `b5ffed22` in eval_vlm_end2end.py so upstream modeling changes don't silently alter numbers.
- Each subprocess logs the exact command it ran into the per-cell JSON `summary` block for traceability.