Model Integrations

SAHI exposes the object detection frameworks listed below through a unified API. Load a model once with AutoDetectionModel.from_pretrained(), then pass it to any SAHI function: sliced prediction, batch inference, the CLI, and so on.

Ultralytics (YOLO)

Supports Ultralytics YOLO26, YOLO11, YOLOv8, and the other Ultralytics model variants, including segmentation and oriented bounding box (OBB) models.

pip install ultralytics

from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model_path="yolo26n.pt",
    confidence_threshold=0.25,
    device="cuda:0",  # or "cpu"
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
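
To make the slicing parameters concrete, the sketch below computes slice windows for a given image size, slice size, and overlap ratio. It is a simplified illustration rather than SAHI's actual slicing code: the helper name slice_boxes and the 1024x1024 image size are assumptions made for this example.

```python
def slice_boxes(image_w, image_h, slice_w=512, slice_h=512,
                overlap_w_ratio=0.2, overlap_h_ratio=0.2):
    """Return (x_min, y_min, x_max, y_max) windows that cover the image.

    Hypothetical helper for illustration only. Windows that would run past
    the image edge are shifted back so every slice keeps its full size.
    """
    # The stride is the slice size minus the overlapping pixels.
    step_x = slice_w - int(slice_w * overlap_w_ratio)
    step_y = slice_h - int(slice_h * overlap_h_ratio)

    boxes = []
    y = 0
    while True:
        y_max = min(y + slice_h, image_h)
        y_min = max(0, y_max - slice_h)
        x = 0
        while True:
            x_max = min(x + slice_w, image_w)
            x_min = max(0, x_max - slice_w)
            boxes.append((x_min, y_min, x_max, y_max))
            if x_max >= image_w:
                break
            x += step_x
        if y_max >= image_h:
            break
        y += step_y
    return boxes


boxes = slice_boxes(1024, 1024)
print(len(boxes))   # 9 slices (3 columns x 3 rows)
print(boxes[:3])    # [(0, 0, 512, 512), (410, 0, 922, 512), (512, 0, 1024, 512)]
```

With a 512-pixel slice and a 0.2 overlap ratio, neighbouring windows share 102 pixels, so the stride is 410 pixels. Larger overlaps reduce the chance of cutting an object at a slice border, at the cost of more slices to process.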

Ultralytics models also support native GPU batch inference for faster processing of multiple slices:

result = get_sliced_prediction(
    "large_image.jpg",
    detection_model,
    slice_height=640,
    slice_width=640,
    batch_size=8,  # process 8 slices at once
)
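
The effect of batch_size can be pictured as grouping the list of slices into chunks, with one forward pass per chunk. The sketch below is only a conceptual illustration: the chunked helper and the slice count of 20 are assumptions, and it does not reproduce how SAHI schedules batches internally.

```python
import math


def chunked(items, batch_size):
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


slice_ids = list(range(20))  # stand-in for 20 image slices (assumed count)
batches = list(chunked(slice_ids, batch_size=8))

print([len(batch) for batch in batches])   # [8, 8, 4]
print(math.ceil(len(slice_ids) / 8))       # 3 forward passes instead of 20
```

A larger batch_size means fewer forward passes but more GPU memory per pass, so the value is usually chosen to fit the available memory.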

YOLOE

Supports YOLOE models for prompt-free and open-vocabulary detection.

pip install ultralytics

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yoloe",
    model_path="yoloe-v8l-seg.pt",
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

YOLO-World (Zero-Shot)

Open-vocabulary detection -- detect objects by text description without retraining.

pip install ultralytics

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolo-world",
    model_path="yolov8s-worldv2.pt",
    confidence_threshold=0.1,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

YOLOv5

Classic YOLOv5 models via the yolov5 pip package.

pip install yolov5

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov5",
    model_path="yolov5s.pt",
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

HuggingFace Transformers

Use object detection and zero-shot object detection models from the HuggingFace Hub (DETR, Deformable DETR, DETA, GroundingDINO, etc.).

pip install transformers timm

detection_model = AutoDetectionModel.from_pretrained(
    model_type="huggingface",
    model_path="facebook/detr-resnet-50",
    confidence_threshold=0.3,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

GroundingDINO models require text-conditioned inference. Use text_labels when the target categories are known, so SAHI can assign stable category ids to those labels. Additional grounded phrases returned by the processor are appended as new categories.

detection_model = AutoDetectionModel.from_pretrained(
    model_type="huggingface",
    model_path="IDEA-Research/grounding-dino-tiny",
    confidence_threshold=0.25,
    text_threshold=0.20,
    text_labels=["car", "truck", "person"],
    device="cuda:0",
)

Zero-shot parameters

In addition to the common parameters, zero-shot (GroundingDINO) models accept:

Parameter	Type	Description
`text_labels`	list[str]	Fixed categories to detect, e.g. `["car", "truck"]`. Each gets a stable category id; phrases outside this list are dropped
`text_prompt`	str	Free-form prompt (e.g. `"a car. a truck."`) used when `text_labels` is not set; returned phrases become categories dynamically
`text_threshold`	float	Minimum score for matching a box to a text token (default: 0.25)
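
The two ways of naming categories are related: a list of labels can be written as a single period-separated prompt in the style of the text_prompt example, and a fixed list gives each label a predictable id. The sketch below illustrates both ideas. labels_to_prompt and stable_category_ids are hypothetical helpers, not SAHI functions, and the assumption that ids follow list order is made for this example only; SAHI's internal prompt construction may differ.

```python
def labels_to_prompt(labels):
    """Join labels into a lowercase prompt where each phrase ends with a period."""
    return " ".join(f"{label.strip().lower()}." for label in labels)


def stable_category_ids(labels):
    """Assign ids by list position so they stay the same from image to image.

    Assumption for illustration: ids follow the order of the label list.
    """
    return {index: label for index, label in enumerate(labels)}


labels = ["car", "truck", "person"]
print(labels_to_prompt(labels))      # car. truck. person.
print(stable_category_ids(labels))   # {0: 'car', 1: 'truck', 2: 'person'}
```

Fixing the label list up front keeps category ids comparable across images, which matters when results are merged or exported to a dataset format.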

HuggingFace Segmentation

Run segmentation models from the HuggingFace Hub. SAHI returns each segment as an ObjectPrediction with a polygon mask, so sliced inference and postprocessing work the same as for detection.

Architecture	`instance`	`semantic`	`panoptic`
MaskFormer	✅	✅	✅
Mask2Former	✅	✅	✅
OneFormer	✅	✅	✅

The available heads depend on the checkpoint (e.g. facebook/mask2former-swin-tiny-coco-instance is instance-only). OneFormer selects the head at inference time, so a single checkpoint serves all three.

pip install transformers timm

from sahi import AutoDetectionModel
from sahi.models.huggingface_segmentation import SegmentationType
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="huggingface_segmentation",
    model_path="facebook/mask2former-swin-tiny-coco-instance",
    confidence_threshold=0.5,
    device="cuda:0",
    segmentation_type=SegmentationType.INSTANCE_SEGMENTATION,
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

Switch segmentation_type to SEMANTIC_SEGMENTATION or PANOPTIC_SEGMENTATION to use the matching head. Note that semantic segmentation merges every instance of a class into a single mask, so one ObjectPrediction is returned per class rather than per instance.
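
The difference between per-instance and per-class output can be shown with a toy example. The sketch below merges instance masks that share a class into one mask per class, mimicking the shape of a semantic result. The masks are made-up pixel sets and merge_by_class is a hypothetical helper; this is not how the model or SAHI computes semantic masks.

```python
from collections import defaultdict

# Toy instance masks as (class name, set of (row, col) pixels). Values are made up.
instances = [
    ("car", {(0, 0), (0, 1)}),
    ("car", {(5, 5)}),
    ("person", {(2, 2), (2, 3), (3, 3)}),
]


def merge_by_class(instance_masks):
    """Union all masks that share a class name into a single mask per class."""
    merged = defaultdict(set)
    for name, pixels in instance_masks:
        merged[name] |= pixels
    return dict(merged)


semantic = merge_by_class(instances)
print(len(instances), "instance masks ->", len(semantic), "semantic masks")
print(sorted(semantic["car"]))   # [(0, 0), (0, 1), (5, 5)]
```

Here two separate cars collapse into one "car" mask, which is why object counts cannot be recovered from semantic output.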

Segmentation parameters

In addition to the common parameters, this model accepts:

Parameter	Type	Description
`segmentation_type`	`SegmentationType`	`INSTANCE_SEGMENTATION` (default), `SEMANTIC_SEGMENTATION`, or `PANOPTIC_SEGMENTATION`
`min_segment_area`	int	Drop segments smaller than this many pixels (default: 100)
`overlap_mask_area_threshold`	float	Overlap area threshold used to merge or discard small disconnected parts within each binary mask (default: 0.8)
`label_ids_to_fuse`	list[int]	Panoptic only -- fuse all instances of these labels into one segment
`token`	str	HuggingFace access token for gated/private models (falls back to `$HF_TOKEN`)

RT-DETR

RT-DETR (Real-Time Detection Transformer) models for accurate real-time detection.

pip install transformers timm

detection_model = AutoDetectionModel.from_pretrained(
    model_type="rtdetr",
    model_path="PekingU/rtdetr_r50vd",
    confidence_threshold=0.3,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

TorchVision

Use built-in TorchVision detection models (Faster R-CNN, RetinaNet, FCOS, SSD, etc.).

pip install torch torchvision

detection_model = AutoDetectionModel.from_pretrained(
    model_type="torchvision",
    model_path="fasterrcnn_resnet50_fpn",
    confidence_threshold=0.3,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

MMDetection

Supports the full MMDetection model zoo (300+ models).

pip install mmdet mmcv mmengine

detection_model = AutoDetectionModel.from_pretrained(
    model_type="mmdet",
    model_path="path/to/checkpoint.pth",
    config_path="path/to/config.py",
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

Detectron2

Use Facebook's Detectron2 models for detection and instance segmentation.

pip install 'git+https://github.com/facebookresearch/detectron2.git'

detection_model = AutoDetectionModel.from_pretrained(
    model_type="detectron2",
    model_path="path/to/model_final.pth",
    config_path="path/to/config.yaml",
    confidence_threshold=0.25,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

Roboflow (RF-DETR)

Use Roboflow's RF-DETR models for detection and segmentation.

pip install rfdetr

detection_model = AutoDetectionModel.from_pretrained(
    model_type="roboflow",
    model_path="rfdetr-base",
    confidence_threshold=0.3,
    device="cuda:0",
)

result = get_sliced_prediction(
    "image.jpg",
    detection_model,
    slice_height=512,
    slice_width=512,
)

Common Parameters

All models accept these parameters in AutoDetectionModel.from_pretrained():

Parameter	Type	Description
`model_type`	str	Framework name (see sections above)
`model_path`	str	Path to weights file or model name
`config_path`	str	Config file path (MMDetection, Detectron2)
`confidence_threshold`	float	Minimum score to keep a detection (default: 0.25)
`device`	str	`"cpu"`, `"cuda:0"`, `"mps"`, etc.
`category_mapping`	dict	Map category IDs to names: `{0: "car", 1: "person"}`
`category_remapping`	dict	Remap category names after inference
`image_size`	int	Override model input resolution
`load_at_init`	bool	Load weights immediately (default: True)
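
Because these parameters are shared, switching frameworks mostly means changing model_type, model_path, and (where needed) config_path. The sketch below shows one way to organize that. COMMON, BACKENDS, build_kwargs, and load_model are hypothetical names introduced for this example; only AutoDetectionModel.from_pretrained() and the parameter names come from the sections above.

```python
# Settings shared by every backend (values taken from the examples above).
COMMON = {"confidence_threshold": 0.25, "device": "cpu"}

# Per-framework settings: only model_type, model_path, and config_path differ.
BACKENDS = {
    "ultralytics": {"model_type": "ultralytics", "model_path": "yolo26n.pt"},
    "torchvision": {"model_type": "torchvision", "model_path": "fasterrcnn_resnet50_fpn"},
    "mmdet": {
        "model_type": "mmdet",
        "model_path": "path/to/checkpoint.pth",
        "config_path": "path/to/config.py",
    },
}


def build_kwargs(backend, **overrides):
    """Merge common, backend-specific, and caller-supplied settings."""
    return {**COMMON, **BACKENDS[backend], **overrides}


def load_model(backend, **overrides):
    """Load a SAHI model. Requires sahi and the backend's package to be installed."""
    # Imported lazily so build_kwargs() can be used without sahi installed.
    from sahi import AutoDetectionModel

    return AutoDetectionModel.from_pretrained(**build_kwargs(backend, **overrides))


print(build_kwargs("mmdet", device="cuda:0"))
```

Calling load_model("ultralytics") and load_model("mmdet", device="cuda:0") would then return models that can be passed to the same SAHI functions.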

Using a Pre-loaded Model

If you already have a model instance, pass it directly instead of a path:

from sahi import AutoDetectionModel
from ultralytics import YOLO

yolo_model = YOLO("yolo26n.pt")
# ... customize the model ...

detection_model = AutoDetectionModel.from_pretrained(
    model_type="ultralytics",
    model=yolo_model,
    confidence_threshold=0.25,
    device="cuda:0",
)

Next Steps

How Sliced Inference Works -- Understand the algorithm
Prediction Utilities -- Advanced prediction options
Interactive Notebooks -- Hands-on examples for each framework
