Milan Ghimire

Instance Segmentation with YOLOv8

A detailed engineering project showcasing system design, data processing, and production-ready implementation.

Overview

This project does something detection alone cannot: instead of drawing a box around an object, it traces the object's exact outline, pixel by pixel. That is instance segmentation - and with Ultralytics YOLOv8 it takes surprisingly little code.

The script reads a video frame by frame, runs YOLOv8's segmentation model on each frame, pulls out the mask for every detected object, turns that mask into a contour, and draws the outline in green with a stable tracking ID floating above it. The result is a live silhouette that follows the object's real shape as it moves.

This is the natural next step after the Computer Vision with OpenCV & YOLOv8 project, which covers detection, counting and the OpenCV basics this builds on.

A YOLOv8 segmentation frame: a dog is outlined with a green contour and labelled with a tracking ID, inside a garden scene

Instance segmentation in action - the model finds the dog, traces its exact outline (not just a box) and assigns it a tracking ID that stays stable across frames. This is real output from running the script below.

A few core concepts first

Before the code, the three ideas this project rests on.

Detection vs segmentation - box vs mask

  • A bounding box is the simple answer to where: a rectangle (x1, y1, x2, y2) around the object. Detection gives you boxes.
  • A mask is the far richer answer to which pixels: a per-pixel map of the object's shape, as precise as the model's prediction allows. Segmentation gives you masks. The difference is "there's a dog roughly here" versus "these exact pixels are the dog".

To get masks instead of just boxes, we load the segmentation weights (yolov8n-seg.pt) rather than the plain detection weights.
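
To make the box-versus-mask difference concrete, here is a minimal NumPy sketch on a hand-made toy mask (no model involved). The L-shaped mask and the helper name mask_to_box are made up for illustration; the point is only that a box counts many pixels that are not the object.

```python
import numpy as np

# A toy 6x8 "mask": 1 where the object is, 0 for background.
# The object is an L-shape, so it does not fill its bounding box.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:5, 2:4] = 1   # vertical bar of the L
mask[4, 2:7] = 1     # horizontal foot of the L


def mask_to_box(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Return the tightest (x1, y1, x2, y2) box around the non-zero pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


x1, y1, x2, y2 = mask_to_box(mask)
box_area = (x2 - x1) * (y2 - y1)   # every pixel inside the rectangle
mask_area = int(mask.sum())        # only the pixels that belong to the object

print("box:", (x1, y1, x2, y2))
print("box pixels:", box_area, "mask pixels:", mask_area)
```

Here the box covers 20 pixels but only 11 of them belong to the object. That gap is the information a detector throws away and a segmentation model keeps.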

Mask → contour

YOLOv8 hands back each mask as a grid of 0s and 1s (background vs object). To draw that as a clean outline we first scale it to a black-and-white image (0 or 255), then use OpenCV's findContours to extract the contour - the list of points that traces the boundary of the white region - and stroke that contour in green. So the pipeline is: mask (0/1 grid) → black-and-white image → contour → green outline.

Tracking with a persistent ID

Plain detection treats every frame independently, so it has no idea the dog in frame 1 is the same dog in frame 50. Tracking (model.track(..., persist=True)) adds memory: it assigns each object an ID and keeps it stable across frames. That is why each outline below carries a steady ID: label instead of a number that flickers every frame.
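
To show what "memory across frames" means, here is a deliberately naive tracker that matches boxes between frames by overlap (IoU). This is a toy, not what Ultralytics runs: to my knowledge its built-in trackers (BoT-SORT by default, ByteTrack as an option) add motion prediction and handle lost tracks far more carefully. ToyTracker, iou and the threshold value are all made up for this sketch.

```python
from itertools import count


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class ToyTracker:
    """Greedy IoU matcher: illustrative only, not the Ultralytics tracker."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}          # track ID -> last known box
        self._next_id = count(1)  # IDs start at 1 and are never reused

    def update(self, boxes):
        """Assign an ID to each box in this frame, reusing IDs where boxes overlap."""
        assigned = {}
        unmatched = dict(self.tracks)
        for box in boxes:
            # Best-overlapping track that has not been claimed yet this frame.
            best_id = max(unmatched, key=lambda t: iou(unmatched[t], box), default=None)
            if best_id is not None and iou(unmatched[best_id], box) >= self.iou_threshold:
                del unmatched[best_id]
            else:
                best_id = next(self._next_id)
            assigned[best_id] = box
        self.tracks = assigned  # tracks with no match this frame are dropped
        return list(assigned)


tracker = ToyTracker()
frame_1 = [(10, 10, 50, 50)]
frame_2 = [(14, 12, 54, 52)]                         # same object, moved slightly
frame_3 = [(18, 14, 58, 54), (200, 200, 240, 240)]   # a second object appears

for boxes in (frame_1, frame_2, frame_3):
    print(tracker.update(boxes))
```

The first object keeps ID 1 through all three frames, and the newcomer in the third frame gets ID 2. That stability is the behaviour persist=True provides, just with a much more robust matcher behind it.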

The model is pre-trained on the COCO dataset (80 classes). Here I filter to classes=[16], which is COCO's dog class, so only dogs are segmented and everything else is ignored.

The full script

The complete program - read a video, segment each frame, and draw the tracked outlines. The walkthrough underneath takes it line by line.

import cv2
import os
import numpy as np
from ultralytics import YOLO

# Load the pre-trained YOLOv8 nano segmentation weights.
model = YOLO("yolov8n-seg.pt")

# Resolve the video path relative to this script, not the working directory.
VIDEO_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "test-agri.mp4")

cap = cv2.VideoCapture(VIDEO_PATH)

while True:
    ret, frame = cap.read()
    if not ret:
        break  # end of video, or the frame could not be read
    # Segment and track dogs only (COCO class 16); persist=True keeps IDs across frames.
    results = model.track(frame, classes=[16], persist=True)
    for r in results:
        annoted_frame = frame.copy()
        # Skip frames with no masks or no tracking IDs yet.
        if r.masks is not None and r.boxes is not None and r.boxes.id is not None:
            masks = r.masks.data.cpu().numpy()
            boxes = r.boxes.data.cpu().numpy()
            ids = r.boxes.id.cpu().numpy()
            for i, mask in enumerate(masks):
                dog_id = int(ids[i])
                # 0/1 mask -> 0/255 image, scaled to (width, height) of the frame.
                mask_resized = cv2.resize(mask.astype(np.uint8) * 255, (frame.shape[1], frame.shape[0]))
                x1, y1, x2, y2 = boxes[i][:4].astype(int)  # bounding box corners, used for the label
                contours, _ = cv2.findContours(mask_resized, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
                cv2.drawContours(annoted_frame, contours, -1, (0, 255, 0), thickness=2)
                cv2.putText(annoted_frame, f"ID: {dog_id}", (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow("Segmentation", annoted_frame)
    # Quit when the user presses q.
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Line by line

Setup

  • import cv2, os, numpy as np and from ultralytics import YOLO bring in OpenCV (frames, drawing), os (build the video path), NumPy (the array type behind every mask) and YOLOv8.
  • YOLO("yolov8n-seg.pt") loads the segmentation weights - same YOLOv8 family as detection, but this one predicts masks, not just boxes. The -seg in the filename is the whole difference.
  • os.path.join(os.path.dirname(os.path.abspath(__file__)), "test-agri.mp4") builds the video path relative to the script file, so it runs from any working directory.
  • cv2.VideoCapture(VIDEO_PATH) opens the video for frame-by-frame reading.

The frame loop

  • ret, frame = cap.read() grabs the next frame; ret says whether it worked. if not ret: break exits cleanly when the video ends.
  • model.track(frame, classes=[16], persist=True) is the core call. It segments the frame, keeps only dogs (classes=[16]), and persist=True makes the tracker remember objects between frames so their IDs stay stable.
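  • for r in results: iterates over the returned list. model.track returns one result object per input image, and we pass a single frame, so the loop body runs once per frame.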
  • annoted_frame = frame.copy() draws on a fresh copy of the frame each time, so old outlines don't smear across into later frames.
  • if r.masks is not None and r.boxes is not None and r.boxes.id is not None: is a three-part guard - masks, boxes and tracking IDs must all exist before we touch them. On frames with no dog in view, or before the tracker has assigned an ID, r.masks or r.boxes.id is None, and calling .cpu() on it would crash the script.

Per-object drawing

  • masks = r.masks.data.cpu().numpy() etc. copy the masks, boxes and IDs out of PyTorch tensors (which may live on the GPU) into plain NumPy arrays we can loop over.
  • for i, mask in enumerate(masks): walks each detected object; enumerate gives both the index i and the mask.
  • dog_id = int(ids[i]) reads this object's stable tracking ID.
  • mask.astype(np.uint8) * 255 turns the mask's 0/1 values into a proper black-and-white image (0 or 255), and cv2.resize(..., (frame.shape[1], frame.shape[0])) scales it up to the full frame size, given as (width, height) - the masks come back at the model's inference resolution, which is usually lower than the video's.
  • x1, y1, x2, y2 = boxes[i][:4].astype(int) reads the bounding-box corners, used only to place the text label.
  • cv2.findContours(mask_resized, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) traces the outer outline of the white mask. RETR_EXTERNAL keeps only outer contours and ignores holes or shapes nested inside them, so a mask split into several blobs yields one contour per blob; CHAIN_APPROX_SIMPLE compresses straight runs of points into just their endpoints.
  • cv2.drawContours(annoted_frame, contours, -1, (0,255,0), thickness=2) strokes every contour in green - -1 means "draw all of them", and (0,255,0) is green in OpenCV's BGR colour order.
  • cv2.putText(annoted_frame, f"ID: {dog_id}", (x1, y1-10), ...) writes the tracking ID just above the object.

Showing and cleanup

  • cv2.imshow("Segmentation", annoted_frame) displays the annotated frame.
  • if cv2.waitKey(1) & 0xFF == ord('q'): break waits 1 ms for a key and quits on q - the standard way to make a real-time OpenCV window closable.
  • cap.release() and cv2.destroyAllWindows() free the video and close the window when the loop ends.
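
The & 0xFF in the quit check is easy to gloss over. It keeps only the lowest 8 bits of the value waitKey returns, which matters because some platforms reportedly set extra high bits on the key code. The helper below isolates that logic in plain Python; is_quit_key is a hypothetical name used only for this illustration.

```python
def is_quit_key(key_code: int, quit_char: str = "q") -> bool:
    """Return True if a cv2.waitKey() return value corresponds to quit_char.

    is_quit_key is a hypothetical helper, shown only to isolate the bit mask.
    """
    # Keep only the lowest 8 bits, which hold the key's ASCII code.
    return (key_code & 0xFF) == ord(quit_char)


print(is_quit_key(ord("q")))             # True: plain 'q'
print(is_quit_key(0x100000 | ord("q")))  # True: 'q' with extra high bits set
print(is_quit_key(-1))                   # False: waitKey's "no key pressed" value
```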

Key insight

Segmentation is detection's richer cousin: swap yolov8n.pt for yolov8n-seg.pt, read r.masks instead of just r.boxes, and you go from "a dog is roughly here" to "these exact pixels are the dog". The only real handwork is turning the raw mask into something drawable - resize it to full frame size, then findContours + drawContours - and pairing it with track(persist=True) so each silhouette carries a stable ID.

A practical note: the model is only as good as its training data. On the COCO classes it works out of the box, but the masks can wobble on unusual scenes or partially hidden objects. For domain-specific work (crops, tools, medical imagery) the next step is training YOLOv8-seg on your own labelled masks.

Tech stack

  • Python 3.12
  • Ultralytics YOLOv8 - yolov8n-seg.pt pre-trained segmentation weights
  • OpenCV (cv2) - video I/O, findContours, drawContours, putText, display
  • NumPy - the array type behind every mask
  • Built-in tracker (model.track(persist=True)) - persistent per-object IDs
  • COCO pre-trained classes - dog (16) used here

Reference