Robotics From Zero
Module: See The World

Object Detection Basics

What object detection does, how bounding boxes work, what machine learning contributes, and practical tips for using detectors in robotics.



Sensors give you raw data — pixels, points, distances. Object detection turns that data into understanding: "There's a person at (2m, 0.5m), a chair at (1m, -1m), and a cup on the table."

This is the bridge between sensing and decision-making.

What Is Object Detection?

Object detection answers two questions simultaneously:

  1. What is in the image? (classification)
  2. Where is it? (localization)

The output is a list of detections, each containing:

  • Class: "person", "car", "dog", "cup"
  • Bounding box: Rectangle around the object (x, y, width, height)
  • Confidence score: How certain the detector is (0.0 to 1.0)
Detection Output Structure
# A single frame's detections
detections = [
  {
    "class": "person",
    "confidence": 0.92,
    "bbox": {"x": 120, "y": 50, "width": 80, "height": 200}
  },
  {
    "class": "cup",
    "confidence": 0.78,
    "bbox": {"x": 450, "y": 300, "width": 40, "height": 60}
  },
  {
    "class": "dog",
    "confidence": 0.65,
    "bbox": {"x": 200, "y": 180, "width": 150, "height": 140}
  }
]
 
# You can filter by confidence:
high_confidence = [d for d in detections if d["confidence"] > 0.80]
# Result: only the person (0.92) passes; the cup (0.78) and dog (0.65) are filtered out.
# To keep the cup as well, lower the threshold, e.g. to 0.75.
Detection output — bounding boxes with class labels and confidence scores around detected objects
Each detection includes a class label, bounding box, and confidence score. Different colors distinguish object classes.

Bounding Boxes

A bounding box is the smallest rectangle that fully contains an object. It's defined by:

  • Top-left corner: (x, y) in pixel coordinates
  • Size: width and height in pixels

Some formats use:

  • (x_min, y_min, x_max, y_max) — two corners
  • (x_center, y_center, width, height) — center and size
Note

Bounding boxes are axis-aligned — they can't rotate. For a tilted object (like a book lying diagonally), the bbox includes empty space around it. For tighter fits, use rotated bounding boxes or segmentation masks.
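Converting between the box formats listed above is a common chore when mixing libraries. A minimal sketch (helper names are my own, not from any particular library):

```python
def xywh_to_corners(x, y, w, h):
    """(x_min, y_min, width, height) -> (x_min, y_min, x_max, y_max)."""
    return x, y, x + w, y + h

def corners_to_center(x_min, y_min, x_max, y_max):
    """Two-corner format -> (x_center, y_center, width, height)."""
    w = x_max - x_min
    h = y_max - y_min
    return x_min + w / 2, y_min + h / 2, w, h

# Example: the "person" box from the detections above
print(xywh_to_corners(120, 50, 80, 200))   # corners
print(corners_to_center(120, 50, 200, 250))  # center form
```

Getting a format mixed up (e.g. treating `x_max` as a width) is a classic source of boxes that are silently wrong, so it pays to convert explicitly at the boundary of each library.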

Visualizing Detections

Drawing Bounding Boxes
import cv2
 
# Draw each detection on the image
for det in detections:
    b = det["bbox"]
    x, y, w, h = b["x"], b["y"], b["width"], b["height"]
    confidence = det["confidence"]
    label = f'{det["class"]}: {confidence:.2f}'
 
    # Draw rectangle
    cv2.rectangle(image, (x, y), (x+w, y+h), color=(0, 255, 0), thickness=2)
 
    # Draw label
    cv2.putText(image, label, (x, y-10), cv2.FONT_HERSHEY_SIMPLEX,
                0.5, (0, 255, 0), 2)
 
cv2.imshow("Detections", image)
cv2.waitKey(0)

What Machine Learning Does

You could write rules for detection:

  • "If there's a red circle, it's a stop sign"
  • "If there's a face-shaped blob, it's a person"

But this breaks down fast. What about stop signs in shadows? People wearing masks? Objects partially hidden?

Machine learning learns patterns from thousands of labeled examples:

  • Train on 10,000 images of cars from every angle, lighting, weather
  • The model learns "car-ness" — shapes, textures, contexts
  • It generalizes to new images it's never seen
YOLO grid approach — image divided into grid cells, each predicting bounding boxes for objects whose centers fall within
YOLO divides the image into an S×S grid. Each cell predicts bounding boxes for objects whose centers land in that cell — all in a single forward pass.
Feature pyramid — multi-scale detection with large features detecting big objects and small features detecting small objects
Feature pyramids detect objects at multiple scales: high-res layers find small objects, low-res layers find large objects.
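The grid assignment in the YOLO caption comes down to a little arithmetic: divide the box center's normalized position by the cell size. A hypothetical helper (the function name and clamping are mine):

```python
def grid_cell(cx, cy, img_w, img_h, S=7):
    """Map an object's center (in pixels) to the SxS grid cell
    responsible for predicting its box, YOLO-style."""
    # Normalize to [0, 1), scale by S, clamp so cx == img_w stays in range
    col = min(S - 1, int(cx / img_w * S))
    row = min(S - 1, int(cy / img_h * S))
    return row, col

# An object centered in the middle of a 640x480 frame
print(grid_cell(320, 240, 640, 480))  # -> (3, 3) with the default 7x7 grid
```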

Popular Object Detection Models

| Model | Speed | Accuracy | Use Case |
| --- | --- | --- | --- |
| YOLO (You Only Look Once) | 🚀 Fast | Good | Real-time robotics (30+ FPS) |
| SSD (Single Shot Detector) | Fast | Good | Mobile/embedded devices |
| Faster R-CNN | Slow | Excellent | High-precision tasks (inspection, medical) |
| EfficientDet | Medium | Excellent | Balanced accuracy/speed |
Tip

For robots, YOLO is the go-to choice. It's fast enough for real-time video (60+ FPS on a GPU, 10+ FPS on CPU), accurate enough for most tasks, and has tons of pre-trained models available.

Confidence Scores and Thresholds

Every detection has a confidence score — the model's certainty that the detection is correct.

  • 0.95: Very confident (almost certainly correct)
  • 0.70: Moderately confident (probably correct)
  • 0.30: Low confidence (might be a false positive)

You set a threshold to filter detections:

  • High threshold (0.8+): Fewer false positives, but might miss real objects
  • Low threshold (0.5): Catch more objects, but more false alarms
Confidence threshold comparison — high threshold shows fewer but correct detections, low threshold shows more detections including false positives
High threshold (left): fewer but reliable detections. Low threshold (right): catches everything but introduces false positives.

The right threshold depends on your task:

  • Picking objects: High threshold (you don't want to grasp empty air)
  • Obstacle avoidance: Lower threshold (better to slow down for a false alarm than crash)
Tuning Confidence Thresholds
# Conservative: only trust very confident detections
confident_detections = [d for d in detections if d["confidence"] > 0.85]
 
# Aggressive: accept uncertain detections
all_detections = [d for d in detections if d["confidence"] > 0.50]
 
# Class-specific thresholds (people are critical, decorations aren't)
filtered = []
for d in detections:
    if d["class"] == "person" and d["confidence"] > 0.70:
        filtered.append(d)
    elif d["class"] == "cup" and d["confidence"] > 0.60:
        filtered.append(d)
    # ... other classes

Common Pitfalls

1. False Positives

The model detects something that isn't there:

  • Shadow looks like a person
  • Reflection in glass triggers a detection
  • Pattern on a shirt mistaken for a logo

Fix: Increase confidence threshold, use temporal filtering (require detection in 3+ consecutive frames).
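Temporal filtering can be sketched with a per-class streak counter — a deliberately minimal version that only tracks classes, not individual boxes (a real tracker would also match boxes across frames by overlap):

```python
from collections import defaultdict

class TemporalFilter:
    """Report a detection only after its class has appeared in
    `min_frames` consecutive frames. A minimal per-class sketch."""

    def __init__(self, min_frames=3):
        self.min_frames = min_frames
        self.streaks = defaultdict(int)  # class -> consecutive-frame count

    def update(self, detections):
        seen = {d["class"] for d in detections}
        # A class absent from this frame has its streak broken
        for cls in list(self.streaks):
            if cls not in seen:
                self.streaks[cls] = 0
        for cls in seen:
            self.streaks[cls] += 1
        return [d for d in detections
                if self.streaks[d["class"]] >= self.min_frames]
```

A shadow that flickers into a "person" for one frame never reaches the three-frame streak, so it is suppressed, while a real person standing in view passes after a short delay.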

2. False Negatives

The model misses real objects:

  • Object partially hidden
  • Unusual angle or lighting
  • Object not in the training data

Fix: Lower threshold, add more training data, use multiple camera angles.

3. Duplicate Detections

Same object detected twice with overlapping bboxes.

Fix: Apply Non-Maximum Suppression (NMS) — merge overlapping boxes, keep only the highest-confidence one.
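Greedy NMS is short enough to write out. A sketch assuming boxes in (x_min, y_min, x_max, y_max) form — in practice you would use your framework's built-in version (e.g. torchvision's `nms`), which is vectorized:

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    any remaining box that overlaps it too much. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two heavily overlapping boxes plus one separate box
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]: the 0.8 duplicate is suppressed
```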

What's Next?

Detection gives you objects in the image, but robots need 3D positions to interact with the world. The next lesson covers sensor fusion — combining camera detections with depth data (from LiDAR, stereo, etc.) to localize objects in 3D space.

Got questions? Join the community

Discuss this lesson, get help, and connect with other learners on r/softwarerobotics.
