Six Models Enter, One Problem Wins: A Head-to-Head Segmentation Showdown
When I started this research, the question seemed straightforward: which model is best for detecting oil palm trees from drone imagery? After training six models across eight resolution levels and generating hundreds of evaluation metrics, the answer turned out to be: it depends — but in a very specific, operationally useful way.
Context / Problem
Instance segmentation for agricultural monitoring requires balancing three competing demands: detection accuracy (finding every tree), segmentation quality (drawing accurate masks around each crown), and computational efficiency (running fast enough to be deployable). No single model maximises all three simultaneously. The goal was to understand exactly where each model sits in this trade-off space — and what that means for practical deployment.
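The two accuracy axes are measured differently: F1 scores detection (did we find the tree?), IoU scores the mask (how well did we outline it?). A minimal sketch of both, assuming greedy one-to-one matching at an IoU threshold of 0.5; the exact matching strategy used in the study may differ:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union else 0.0

def detection_f1(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy matching: a prediction is a true positive if it
    overlaps an unmatched ground-truth crown with IoU >= thresh."""
    matched = set()
    tp = 0
    for p in pred_masks:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gt_masks):
            if j in matched:
                continue
            iou = mask_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_thresh:
            matched.add(best_j)
            tp += 1
    fp = len(pred_masks) - tp
    fn = len(gt_masks) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

A model can score high on one axis and low on the other: a detector that finds every crown but draws sloppy masks has high F1 and mediocre IoU, which is exactly the split the results tables below expose.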
The Six Contenders
YOLOv8l-seg
Released in 2023, YOLOv8 introduced anchor-free detection with a decoupled head separating classification from localisation. The segmentation variant adds a lightweight ProtoNet branch that generates mask coefficients. It has 45.9M parameters and 105.0 GFLOPs — the heaviest of the YOLO family tested here.
YOLOv11l-seg
YOLOv11 advances YOLOv8 with two key architectural additions: the C3k2 block (multi-scale convolution with improved gradient flow) and C2PSA (convolutional module with parallel spatial attention). At 27.6M parameters and 65.9 GFLOPs, it is 40% lighter than YOLOv8l while outperforming it on this dataset.
YOLOv11-Ablation
A diagnostic variant of YOLOv11 with identical architecture, trained with a modified augmentation strategy to isolate the effect of the C2PSA attention mechanism on multi-GSD robustness. It performed unexpectedly well, as the fine-resolution results below show.

Mask R-CNN (ResNet-50)
The two-stage baseline. Mask R-CNN first generates region proposals, then classifies and segments each proposal separately. It has 43.9M parameters and 133.9 GFLOPs. It is the most computationally expensive model tested, but also the most studied architecture for instance segmentation in agriculture.
Hybrid-v8-SAM and Hybrid-v11-SAM
These pipelines use a YOLO detector to generate bounding box prompts, which are then passed to SAM 2.1 (Segment Anything Model, Meta AI) for high-quality mask generation. SAM 2.1 uses a hierarchical Hiera encoder with a memory mechanism. The combined pipeline has 251.6M parameters and 2,565.9 GFLOPs. (See B4 for the full paradox story.)
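The detect-then-prompt flow can be sketched framework-agnostically. Here `detect_fn` and `sam_fn` are hypothetical stand-ins for the YOLO and SAM 2.1 calls, not the actual Ultralytics API:

```python
def hybrid_segment(image, detect_fn, sam_fn, conf_thresh=0.25):
    """Stage 1: the detector proposes boxes. Stage 2: each surviving
    box becomes a prompt for SAM, which returns a refined mask."""
    results = []
    for box, score in detect_fn(image):       # ((x1, y1, x2, y2), conf)
        if score < conf_thresh:
            continue                          # drop low-confidence boxes
        mask = sam_fn(image, box)             # box-prompted segmentation
        results.append((box, score, mask))
    return results
```

The design choice worth noting: SAM never decides *what* is a tree, only *where its boundary is*. Detection quality is therefore capped by the YOLO stage, while mask quality is delegated entirely to SAM.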
Training Setup
| Parameter | Value |
|---|---|
| Hardware | NVIDIA A100-SXM4, Google Colab Pro+ |
| Framework | PyTorch 2.9.0 + Ultralytics 8.3.251 |
| Optimizer | AdamW, lr=0.002, momentum=0.937 |
| Max epochs | 200 (early stopping: patience=30 on mAP@50-95) |
| Batch size | 16 |
| Image size | 640×640 px |
| Initialisation | COCO-pretrained weights |
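Under the Ultralytics API, the table above maps roughly onto a single `train` call. This is a configuration sketch, not the study's exact script: the dataset YAML name is a placeholder, and the authoritative configs live in the companion repository.

```python
from ultralytics import YOLO

model = YOLO("yolo11l-seg.pt")      # COCO-pretrained segmentation weights
model.train(
    data="oil_palm.yaml",           # placeholder dataset config
    epochs=200,
    patience=30,                    # early stopping window
    batch=16,
    imgsz=640,
    optimizer="AdamW",
    lr0=0.002,
    momentum=0.937,
)
```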
YOLOv11l-seg ran the full 200 epochs (1.06 hours) and achieved MaskAP@50 of 0.880. YOLOv8l-seg early-stopped at epoch 62 (0.68 hours) with MaskAP@50 of 0.859 — converged faster but to a lower ceiling. YOLOv11 weights are 55.9 MB vs 92.3 MB for YOLOv8 — 40% lighter.
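The "40% lighter" figure is plain arithmetic on the numbers above; a quick check:

```python
# Checkpoint size: YOLOv11l-seg (55.9 MB) vs YOLOv8l-seg (92.3 MB)
size_reduction = (1 - 55.9 / 92.3) * 100    # ~39%, rounded to 40% in text

# Parameter count: 27.6M vs 45.9M
param_reduction = (1 - 27.6 / 45.9) * 100   # ~40% fewer parameters

print(round(size_reduction), round(param_reduction))
```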
Results: Overall Mean (All 8 GSDs)
| Model | Mean F1 | Mean IoU | F1 Std Dev |
|---|---|---|---|
| Mask R-CNN | 0.727 | 0.751 | 0.093 |
| Hybrid-v11-SAM | 0.680 | 0.696 | 0.119 |
| YOLOv11-ablation | 0.667 | — | 0.134 |
| YOLOv11l-seg | 0.646 | 0.769 | 0.151 |
| Hybrid-v8-SAM | 0.635 | 0.692 | 0.128 |
| YOLOv8l-seg | 0.616 | 0.751 | 0.142 |
Mask R-CNN leads on mean F1 and stability (lowest std=0.093). It is the most consistent model across the full GSD range.
YOLOv11l-seg leads on mean IoU (0.769). When a tree is detected, YOLOv11 draws more accurate masks than any other model.
The weight-accuracy trade-off is decisive. YOLOv11l-seg is 40% lighter than YOLOv8l-seg and consistently outperforms it on both F1 and IoU. For new deployments, v8's only remaining edge is raw inference speed; for accuracy-driven work there is little reason to prefer it over v11.
Results: Fine Resolution (0.03–0.04m GSD)
| Model | F1 @ 0.04m | IoU @ 0.04m |
|---|---|---|
| Mask R-CNN | 0.847 | 0.871 |
| YOLOv11-ablation | 0.832 | 0.882 |
| YOLOv11l-seg | 0.815 | 0.895 |
| YOLOv8l-seg | 0.804 | 0.867 |
At 0.04m specifically, YOLOv11-ablation achieves the highest F1 (0.832). The ablation variant's modified augmentation schedule appears to be particularly beneficial at this resolution. For rapid census work at the recommended operating point, it is the strongest single-model choice.
Efficiency Profile
| Model | Latency (ms/image) | FPS | Memory (MB) |
|---|---|---|---|
| YOLOv8l-seg | 12.4 | 80.8 | 2,229 |
| YOLOv11l-seg | 18.3 | 54.6 | 2,341 |
| YOLOv11-ablation | 18.9 | 52.8 | 2,445 |
| Mask R-CNN | 23.5 | 42.5 | 2,798 |
| Hybrid-v11-SAM | 418.3 | 2.4 | 4,341 |
YOLOv11l-seg at 18.3 ms is comfortably real-time on GPU hardware. The hybrid SAM pipeline, at 418 ms and 2.4 FPS, is unsuitable for any real-time application.
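FPS in the table is effectively the reciprocal of single-image latency. A sanity check, assuming batch size 1:

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second for single-image inference."""
    return 1000.0 / latency_ms

for name, ms in [("YOLOv11l-seg", 18.3), ("Hybrid-v11-SAM", 418.3)]:
    print(f"{name}: {fps_from_latency(ms):.1f} FPS")
```

This is why the hybrid pipeline cannot be rescued by batching alone: a 418 ms per-image cost is two orders of magnitude slower than the single-stage YOLO models.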
Notes
Why does Mask R-CNN perform so consistently? Two-stage detectors are inherently more stable across resolution variation because the region proposal network is less sensitive to absolute pixel scale than single-stage anchor-free detectors.
How were models compared fairly? SAM was used in zero-shot mode — not fine-tuned on oil palm data. This represents the realistic deployment scenario. Fine-tuning would likely improve results but removes the “foundation model out-of-the-box” argument.
All training configs, result CSVs, and PR curve plots are in the companion repository: github.com/Sai21112000/oil-palm-instance-segmentation.
Part 3 of 6 in the Oil Palm AI series. Next: B4 — The Hybrid Paradox