Robotics From Zero
Module: See The World

Depth Perception

How robots measure depth — stereo vision, structured light, depth from LiDAR, and the trade-offs between different depth sensing methods.

10 min read

Depth Perception

A 2D camera flattens the world — you can see what is in the scene, but not how far away it is. Depth perception solves this. There are three main approaches: stereo vision, structured light, and time-of-flight (LiDAR). Let's understand how each works and when to use them.

Stereo Vision: Two Cameras, One Brain

Humans have two eyes spaced apart. Your brain compares the two images and figures out depth through triangulation. Stereo cameras do the same.

Stereo vision geometry — two cameras separated by a baseline, with triangulation lines meeting at a 3D point
Stereo vision uses two cameras separated by a baseline. By comparing pixel positions, depth is recovered through triangulation.

How It Works

  1. Capture two images from cameras separated by a known distance (the baseline)
  2. Find matching points — the same object pixel in both images
  3. Measure disparity — how much the pixel shifted between left and right images
  4. Calculate depth using geometry:
depth = (focal_length * baseline) / disparity

The bigger the disparity (larger shift), the closer the object.

Stereo Depth Estimation
# Simplified stereo matching
left_image = capture_left_camera()
right_image = capture_right_camera()
 
# For each pixel in the left image, find the matching pixel in the right
for y in range(height):
    for x in range(width):
        # Search along the same horizontal line (epipolar constraint)
        disparity = find_match(left_image[y, x], right_image[y, :])
 
        # Compute depth
        if disparity > 0:
            depth[y, x] = (focal_length * baseline) / disparity
        else:
            depth[y, x] = INVALID  # No match found
Disparity map — grayscale image where brighter pixels indicate larger disparity (closer objects)
A disparity map encodes depth: bright pixels are close (large shift between cameras), dark pixels are far away.

Challenges

  • Texture-less surfaces: A blank white wall has no unique features to match
  • Occlusions: Objects visible in one camera but not the other
  • Computational cost: Matching millions of pixels in real-time is expensive
  • Calibration: Cameras must be precisely aligned and calibrated together
Note

Stereo works best outdoors or in textured environments. It struggles in featureless areas (long hallways, empty rooms) where there's nothing unique to match between images.

Structured Light: Project a Pattern

Instead of relying on natural scene texture, structured light sensors project a known pattern (dots, lines, or grids) onto the scene. The camera sees how the pattern deforms, revealing depth.

Structured light sensor — infrared projector emits dot pattern, camera captures deformed pattern to compute depth
Structured light projects a known IR pattern onto the scene. A camera reads how the pattern deforms to calculate depth.

How It Works

  1. Project an infrared pattern (invisible to humans)
  2. Capture it with an IR camera
  3. Compare observed pattern to the known pattern
  4. Calculate depth from pattern distortion

Popular sensors: Intel RealSense, Kinect (original), iPhone Face ID.

Advantages

  • Works on textureless surfaces — the projected pattern provides the texture
  • Dense depth maps — every pixel gets a depth value
  • Fast computation — pattern matching is simpler than stereo correspondence

Disadvantages

  • Limited range — typically 0.5m to 5m (pattern gets too faint at distance)
  • Fails in sunlight — bright IR light overwhelms the projected pattern
  • Interference — multiple structured-light sensors in the same room interfere with each other
Warning

Don't use structured light sensors outdoors in daylight. The sun emits strong infrared light that washes out the projected pattern, causing depth readings to fail.

Time-of-Flight (ToF): Measure Light Speed

Time-of-flight sensor — emitter sends modulated light, sensor measures phase shift to compute distance
ToF sensors emit modulated light and measure the phase shift of the reflection to determine distance per pixel.

Some depth cameras (and all LiDAR) use time-of-flight: emit light, measure how long it takes to return.

How It Works (Depth Camera Version)

  • Emit modulated infrared light from an array of emitters
  • Capture the reflection with a sensor array
  • Measure phase shift to determine distance (continuous-wave ToF)
  • Output a depth image — one depth per pixel

This is essentially a "flash LiDAR" — instead of scanning a single beam, the whole scene is illuminated at once.

Advantages

  • Works in any lighting (active sensor, like LiDAR)
  • Fast — entire frame measured simultaneously
  • No correspondence problem — each pixel measures its own depth

Disadvantages

  • Lower resolution — typically 640×480 or less
  • Multipath errors — reflections off shiny surfaces confuse the sensor
  • Limited range — usually < 10m
Depth Image Structure
# A depth image looks like a grayscale image, but pixel values are distances
depth_image = {
  "width": 640,
  "height": 480,
  "encoding": "16UC1",  # 16-bit unsigned int, 1 channel
  "data": [
    1500,  # pixel (0,0): 1500mm = 1.5m away
    1520,  # pixel (0,1): 1.52m away
    0,     # pixel (0,2): invalid / no reading
    # ... 640 * 480 = 307,200 depth values
  ]
}
 
# Convert depth image to 3D point cloud:
for y in range(height):
    for x in range(width):
        z = depth_image[y, x] / 1000.0  # mm to meters
        if z > 0:  # valid depth
            # Use camera intrinsics to unproject
            X = (x - cx) * z / fx
            Y = (y - cy) * z / fy
            Z = z
            points.append((X, Y, Z))
Depth sensor comparison — stereo, structured light, ToF, and LiDAR side by side with range, accuracy, and use case labels
Each depth sensing method has different strengths: range, accuracy, lighting tolerance, and cost all vary significantly.

Comparing Depth Sensors

MethodRangeAccuracyTexture-less?Sunlight OK?ResolutionCost
Stereo1-50m±1-5%High$
Structured Light0.5-5m±1mmMedium$$
ToF Camera0.5-10m±1cm⚠️Low$$
LiDAR (2D)0.1-30m±1-3cmSparse$$$
LiDAR (3D)1-200m±2cmMedium$$$$
Tip

For indoor robots: structured light or ToF cameras work great (short range, dense depth, cheap). For outdoor robots: stereo or LiDAR (sunlight-resistant, longer range). For warehouse robots navigating aisles: 2D LiDAR is perfect (fast, cheap, sufficient for obstacle detection).

What's Next?

Now you can measure 3D geometry — but how do you recognize what you're looking at? The next lesson introduces object detection — finding and classifying objects in images using machine learning.

Got questions? Join the community

Discuss this lesson, get help, and connect with other learners on r/softwarerobotics.

Join r/softwarerobotics

Further Reading

Related Lessons

Discussion

Sign in to join the discussion.