Robotics From Zero

What Is SLAM in Robotics? A Beginner's Guide to Simultaneous Localization and Mapping

Understand SLAM (Simultaneous Localization and Mapping) — the technique that lets robots build maps and know where they are at the same time. Covers EKF-SLAM, particle filters, graph-based SLAM, and visual SLAM.

Robotics From Zero Team · 14 min read

What Is SLAM in Robotics?

Imagine you are blindfolded and dropped into an unfamiliar building. You need to figure out two things at once: where you are, and what the building looks like. You cannot solve one before the other -- you need to explore to build a mental map, but you need a map to know where you have explored. This is the fundamental problem that SLAM solves for robots.

SLAM stands for Simultaneous Localization and Mapping. It is one of the most important capabilities in autonomous robotics. Every self-driving car navigating city streets, every warehouse robot ferrying packages, every drone surveying a construction site, and every robotic vacuum cleaning your living room relies on some form of SLAM.

This post explains how SLAM works, what algorithms power it, what sensors feed it, and where it shows up in the real world.

The Chicken-and-Egg Problem

SLAM exists because of a circular dependency:

  • To localize (figure out where the robot is), you need a map to compare your sensor readings against.
  • To map (build a representation of the environment), you need to know where the robot was when it took each measurement.

Neither localization nor mapping can be solved perfectly without the other. If your position estimate is wrong, the map you build will be distorted. If your map is wrong, your position estimate will drift.

This is sometimes called the chicken-and-egg problem of robotics. SLAM algorithms tackle it by solving both problems together, iteratively refining the map and the robot's position as new sensor data arrives.

The mathematical framing is this: at each time step, the robot takes an action (moves forward, turns), receives sensor observations (laser scans, camera frames), and must estimate its own trajectory (sequence of poses) and the map (positions of landmarks or occupancy of cells) that best explain all the data it has seen so far.
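In probabilistic notation, this is often written as estimating the joint posterior over the trajectory and the map. The formulation below is a standard textbook sketch, simplified here for reference:

```latex
% x_{0:T}: robot poses, m: map, z_{1:T}: observations, u_{1:T}: controls
p(x_{0:T}, m \mid z_{1:T}, u_{1:T})
  \;\propto\; p(x_0)\prod_{t=1}^{T}
  \underbrace{p(x_t \mid x_{t-1}, u_t)}_{\text{motion model}}
  \,\underbrace{p(z_t \mid x_t, m)}_{\text{observation model}}
```

Each algorithm family covered below is, in effect, a different strategy for approximating this posterior.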

What Sensors Drive SLAM?

SLAM is not tied to any single sensor. Different sensor modalities lead to different flavors of SLAM, each with distinct strengths.

Lidar (Light Detection and Ranging): Lidar sensors emit laser pulses and measure the time they take to bounce back. They produce precise distance measurements in 2D or 3D. Lidar-based SLAM is the gold standard for accuracy in structured indoor environments. A 2D lidar spinning at 10 Hz gives you a clean 360-degree scan of distances to walls, furniture, and obstacles. 3D lidars, common on self-driving cars, produce dense point clouds.

Cameras: Monocular, stereo, and RGB-D cameras provide rich visual information. Visual SLAM extracts features (corners, edges, texture patches) from images and tracks them across frames to estimate motion and build a sparse or dense 3D map. Cameras are cheap and lightweight, but they struggle in low-texture environments (blank walls, dark rooms) and are sensitive to lighting changes.

IMU (Inertial Measurement Unit): An IMU measures acceleration and angular velocity. On its own, an IMU drifts rapidly -- double-integrating noisy acceleration data produces position estimates that diverge within seconds. But fused with lidar or camera data, an IMU provides critical short-term motion estimates that bridge the gaps between slower sensor updates.
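To see how quickly double integration diverges, here is a small simulation of a stationary robot whose accelerometer reads pure noise. The noise level and rates are made-up illustrative numbers, not tied to any particular IMU:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.01        # 100 Hz IMU
sigma = 0.05     # accelerometer noise std (m/s^2), illustrative
n_steps = 1000   # 10 seconds of data
n_trials = 500   # repeat to estimate the error spread

# True acceleration is zero; the robot never moves. Any position
# estimate therefore comes purely from double-integrated noise.
accel = rng.normal(0.0, sigma, size=(n_trials, n_steps))
vel = np.cumsum(accel * dt, axis=1)   # first integration
pos = np.cumsum(vel * dt, axis=1)     # second integration

# The spread of the position error grows roughly as t^(3/2)
std_1s = pos[:, 99].std()
std_10s = pos[:, -1].std()
print(f"position std after 1 s:  {std_1s:.4f} m")
print(f"position std after 10 s: {std_10s:.4f} m")
```

Over ten seconds the error spread grows by more than an order of magnitude, which is why an IMU is only useful when another sensor periodically corrects it.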

Wheel Odometry: Encoders on a robot's wheels count rotations, giving a simple estimate of how far the robot has moved. Like the IMU, wheel odometry drifts over time due to wheel slip, uneven terrain, and measurement noise. It serves as a useful motion prior that SLAM algorithms refine.

Most production SLAM systems fuse multiple sensors. A typical setup on a mobile robot might combine a 2D lidar for accurate range data, wheel odometry for motion prediction, and an IMU for orientation -- each compensating for the others' weaknesses.
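A minimal illustration of why fusion helps is a one-dimensional Kalman filter that predicts with noisy odometry and corrects with a noisy range measurement. The variances and measurements here are made up for illustration:

```python
import numpy as np

# State: 1D position estimate x with variance P
x, P = 0.0, 0.0   # start exactly at the origin
Q = 0.1           # odometry (process) noise variance per step
R = 0.5           # range sensor noise variance

def predict(x, P, u):
    """Dead-reckon one step of commanded motion u: uncertainty grows."""
    return x + u, P + Q

def update(x, P, z):
    """Fuse a direct position measurement z: uncertainty shrinks."""
    K = P / (P + R)                    # Kalman gain
    return x + K * (z - x), (1 - K) * P

# Ten steps of odometry-only prediction: variance grows linearly
for _ in range(10):
    x, P = predict(x, P, u=1.0)
P_before = P                           # ~10 * Q = 1.0

# A single range fix pulls the variance back down
x, P = update(x, P, z=10.2)
P_after = P
print(P_before, P_after)
```

Odometry alone only ever accumulates uncertainty; one absolute measurement cuts it by more than half here, which is the whole argument for multi-sensor SLAM in miniature.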

The Main SLAM Approaches

Over the past three decades, researchers have developed several families of SLAM algorithms. Here are the four you will encounter most often.

EKF-SLAM (Extended Kalman Filter)

EKF-SLAM was one of the earliest formulations. It models the robot's pose and all landmark positions as a single large Gaussian distribution, then updates this distribution every time the robot moves or observes a landmark.

The state vector contains the robot's position (x, y, heading) and the positions of all known landmarks. The covariance matrix captures the uncertainty in each of these estimates and, critically, the correlations between them. When the robot re-observes a known landmark, the update reduces uncertainty not just for that landmark but for everything correlated with it.
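The correlation effect can be seen in a two-variable toy update: measuring only the landmark also shrinks the robot's uncertainty, because the covariance matrix couples the two. The numbers below are illustrative:

```python
import numpy as np

# Joint covariance of [robot_x, landmark_x]: both uncertain, and
# correlated because the landmark was first observed from the robot.
P = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# Directly observe the landmark: H selects the landmark entry.
H = np.array([[0.0, 1.0]])
R = np.array([[0.1]])                  # measurement noise variance

# Standard Kalman update
S = H @ P @ H.T + R
K = P @ H.T @ np.linalg.inv(S)
P_new = (np.eye(2) - K @ H) @ P

robot_var_before = P[0, 0]
robot_var_after = P_new[0, 0]
print(robot_var_before, robot_var_after)
```

The robot's own variance drops from 1.0 to roughly 0.42 even though the measurement said nothing about the robot directly: the correlation carried the information over.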

Strengths: Mathematically elegant. Produces consistent estimates when assumptions hold.

Weaknesses: The covariance matrix grows as O(n^2) where n is the number of landmarks. For environments with thousands of landmarks, this becomes computationally prohibitive. It also assumes Gaussian noise and linear motion/observation models, which are approximations.

Here is a simplified pseudocode sketch of the EKF-SLAM update cycle:

def ekf_slam_step(state, covariance, control, observations, known_landmarks):
    # Predict: apply motion model, inflating uncertainty
    state, covariance = predict(state, covariance, control)

    for obs in observations:
        landmark_id = obs.id
        measured_range = obs.range
        measured_bearing = obs.bearing

        if landmark_id not in known_landmarks:
            # Initialize new landmark from observation
            state = add_landmark(state, obs)
            covariance = expand_covariance(covariance, obs)
            known_landmarks.add(landmark_id)
        else:
            # Compute expected observation from the current estimate
            expected = expected_observation(state, landmark_id)

            # Innovation: difference between measured and expected
            # (the bearing difference should be wrapped to [-pi, pi])
            innovation = [measured_range - expected.range,
                          measured_bearing - expected.bearing]

            # Compute Kalman gain via the linearized observation model
            H = observation_jacobian(state, landmark_id)
            S = H @ covariance @ H.T + observation_noise
            K = covariance @ H.T @ np.linalg.inv(S)

            # Update state and covariance
            state = state + K @ innovation
            covariance = (np.eye(len(state)) - K @ H) @ covariance

    return state, covariance

Each observation tightens the joint estimate of the robot's pose and the map.

Particle Filter SLAM (FastSLAM)

FastSLAM takes a different approach. Instead of maintaining a single Gaussian, it uses a particle filter -- a set of particles, each representing a possible robot trajectory. Each particle carries its own map (typically a set of independent EKF landmark estimators or an occupancy grid).

The key insight is that if you knew the robot's exact trajectory, mapping would be easy -- you would just project all observations into a global frame. FastSLAM exploits this by sampling many candidate trajectories and weighting them by how well they explain the observations.

Strengths: Handles non-Gaussian distributions, scales better than EKF-SLAM for large maps (O(n log n) per step), and naturally handles multi-modal ambiguity (the robot might be in one of several places).

Weaknesses: Particle depletion -- over time, a few particles dominate and diversity is lost, which can cause the filter to converge to a wrong solution. Using more particles mitigates this, but every additional particle costs more computation.
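Depletion is usually monitored with the effective sample size (ESS), resampling only when it drops below a threshold. Here is a sketch of that core mechanic; the skewed weights are synthetic stand-ins for real observation likelihoods, and `low_variance_resample` is a common systematic resampler:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Skewed weights: a few particles explain the data far better than
# the rest (synthetic exponential decay for illustration).
weights = np.exp(-np.arange(n) / 5.0)
weights /= weights.sum()

# Effective sample size: n for uniform weights, approaches 1 as a
# single particle dominates. A low ESS signals particle depletion.
ess = 1.0 / np.sum(weights ** 2)

def low_variance_resample(weights, rng):
    """Systematic (low-variance) resampling: one random offset, then
    n evenly spaced pointers swept through the cumulative weights."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return np.minimum(idx, n - 1)   # guard against float round-off

if ess < n / 2:   # common resampling threshold
    indices = low_variance_resample(weights, rng)
    weights = np.full(n, 1.0 / n)   # resampled particles share weight
```

Resampling restores diversity where the posterior has mass, but done too often it discards it, which is exactly the trade-off behind the ESS threshold.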

Graph-Based SLAM

Graph-based SLAM (also called GraphSLAM or pose-graph SLAM) is the dominant approach in modern systems. It formulates SLAM as an optimization problem over a graph:

  • Nodes represent robot poses at different time steps (and optionally landmark positions).
  • Edges represent constraints between nodes: odometry constraints (from motion), observation constraints (from seeing landmarks), and loop closure constraints (from recognizing a previously visited place).

The algorithm builds this graph as the robot moves, then solves a large nonlinear least-squares optimization to find the set of poses and landmark positions that best satisfies all constraints simultaneously.
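A toy one-dimensional version shows the effect. Odometry with a small bias drifts the chain of poses, and a single accurate loop closure constraint (say, from scan matching against the first scan) pulls the whole trajectory back. Numbers are illustrative; `least_squares` from scipy serves as the solver:

```python
import numpy as np
from scipy.optimize import least_squares

# Five 1D poses. Odometry reports each step as 1.05 m, but the true
# step is 1.0 m, so dead reckoning overshoots by 0.2 m total.
odom = [1.05, 1.05, 1.05, 1.05]
loop = (0, 4, 4.0)   # loop closure: pose 4 is 4.0 m from pose 0

def residuals(x):
    res = [x[0] - 0.0]   # anchor the first pose (removes gauge freedom)
    # Odometry constraints: consecutive poses differ by measured step
    for i, step in enumerate(odom):
        res.append((x[i + 1] - x[i]) - step)
    # Loop closure constraint, weighted higher because it is trusted more
    i, j, rel = loop
    res.append(5.0 * ((x[j] - x[i]) - rel))
    return np.array(res)

x0 = np.concatenate([[0.0], np.cumsum(odom)])   # dead-reckoned guess
result = least_squares(residuals, x0)
print(x0[4], result.x[4])   # last pose before vs. after optimization
```

The optimizer does not just fix the last pose; it distributes the 0.2 m error across all four odometry edges, which is what "retroactively correcting the trajectory" means in practice.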

Strengths: Extremely accurate. Can incorporate loop closures that dramatically reduce drift. Scales well with sparse optimization techniques (the graph is sparse because each pose is connected only to nearby poses and landmarks). Handles large environments.

Weaknesses: The optimization is a batch process (though incremental solvers like iSAM2 mitigate this). Detecting loop closures reliably is its own challenging sub-problem.

Graph-based SLAM is the backbone of most production mapping systems today.

Visual SLAM (vSLAM)

Visual SLAM is not a single algorithm but a family of SLAM systems that use cameras as their primary sensor. The most influential systems include ORB-SLAM, LSD-SLAM, and DSO (Direct Sparse Odometry).

Visual SLAM systems generally fall into two categories:

  • Feature-based methods (like ORB-SLAM) detect and match keypoints (ORB, SIFT, etc.) across frames, triangulate their 3D positions, and optimize a pose graph. They produce sparse point cloud maps.
  • Direct methods (like LSD-SLAM, DSO) work directly on pixel intensities rather than extracted features. They minimize photometric error across frames and can produce semi-dense or dense maps.
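To give a flavor of the direct approach, here is a toy one-dimensional version: estimate the camera shift between two "images" (intensity rows) by minimizing photometric error, with no feature extraction at all. Real systems optimize sub-pixel shifts over full images; this sketch just brute-forces integer displacements over synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1D "image": a smoothed random intensity profile
intensity = np.convolve(rng.random(220), np.ones(8) / 8, mode="same")

true_shift = 7
ref = intensity[50:150]                            # reference frame
cur = intensity[50 + true_shift:150 + true_shift]  # camera moved right

def photometric_error(d, window=80):
    """Sum of squared intensity differences if the camera moved by d."""
    return np.sum((cur[:window] - ref[d:d + window]) ** 2)

# Brute-force search over candidate displacements
est_shift = min(range(0, 16), key=photometric_error)
print(est_shift)   # recovers the true shift of 7
```

Note what this needs to work: intensity variation. On a blank wall every candidate shift gives the same error, which is exactly the low-texture failure mode described above.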

Recent years have seen a surge of learning-based visual SLAM approaches that use deep neural networks for depth estimation, feature extraction, or even end-to-end pose estimation. These methods show promise in difficult environments (low texture, dynamic scenes) but often lag behind classical methods in accuracy and robustness on standard benchmarks.

Strengths of visual SLAM: Cameras are inexpensive and ubiquitous. Rich semantic information is available (you can recognize objects, not just measure distances). Works well outdoors where lidar range may be insufficient.

Weaknesses: Sensitive to illumination changes, motion blur, and featureless environments. Scale ambiguity with monocular cameras (you cannot tell if an object is small and close or large and far away without additional information).

Common Challenges in SLAM

Even with decades of research, SLAM is not a solved problem. Several challenges remain active areas of work.

Loop Closure

When a robot returns to a place it has visited before, it must recognize this event -- called loop closure -- and use it to correct accumulated drift. If the robot has been driving for 10 minutes and its position estimate has drifted by 2 meters, detecting that it is back at the starting point lets it retroactively correct the entire trajectory.

Loop closure detection is essentially a place recognition problem. In lidar SLAM, scan matching techniques compare current scans against historical ones. In visual SLAM, bag-of-words models or learned descriptors compare current images against a database of past keyframes. False positives (incorrectly detecting a loop closure) can catastrophically corrupt the map, so high precision is essential.
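In miniature, the bag-of-words idea reduces each image to a histogram over a fixed visual vocabulary and compares histograms with cosine similarity. The word IDs below are synthetic stand-ins for quantized ORB/SIFT descriptors:

```python
import numpy as np

def bow_histogram(word_ids, vocab_size=5):
    """L2-normalized histogram over a visual vocabulary."""
    h = np.bincount(word_ids, minlength=vocab_size).astype(float)
    return h / np.linalg.norm(h)

def cosine_similarity(a, b):
    return float(a @ b)   # both histograms are already unit length

# Two views of the same place share most visual words...
place_a_view1 = bow_histogram(np.array([0, 0, 1, 3, 3, 3, 4]))
place_a_view2 = bow_histogram(np.array([0, 1, 1, 3, 3, 4, 4]))
# ...while a different place has a different word distribution.
place_b = bow_histogram(np.array([2, 2, 2, 2, 1, 0]))

same = cosine_similarity(place_a_view1, place_a_view2)
different = cosine_similarity(place_a_view1, place_b)
print(same, different)
```

A candidate match above some similarity threshold would then be verified geometrically (e.g. by estimating a relative pose) before being added as a loop closure edge, precisely because false positives are so damaging.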

Dynamic Environments

Most SLAM algorithms assume the environment is static. People walking through a room, doors opening and closing, cars driving past -- these all violate that assumption. Dynamic objects can corrupt the map (a person gets "baked into" the map as a wall) or confuse localization (the robot matches against landmarks that have moved).

Modern systems handle this with outlier rejection (discarding observations that do not match the expected static environment), semantic segmentation (identifying and filtering out known dynamic object classes like people and cars), or explicit dynamic object tracking.

Computational Cost

SLAM must run in real time on hardware that a robot can actually carry. A warehouse robot cannot lug around a server rack. This creates constant tension between algorithmic sophistication and computational feasibility.

Graph-based SLAM optimizations can involve thousands of variables. Visual SLAM must process high-resolution images at frame rate. 3D lidar SLAM must handle hundreds of thousands of points per scan. Efficient data structures (kd-trees for nearest-neighbor search, sparse matrix solvers for graph optimization), careful engineering, and hardware acceleration (GPU-based feature extraction, FPGA-based scan matching) all play a role.
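As a concrete example of those data structures, scipy's `cKDTree` answers nearest-neighbor queries in roughly logarithmic rather than linear time, which is what makes per-point correspondence search in ICP feasible on large scans:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)

# A synthetic 2D "scan" of 100,000 points in a 50 m x 50 m area
scan = rng.random((100_000, 2)) * 50.0

tree = cKDTree(scan)   # built once per scan, O(n log n)

# Nearest scan point for each query, ~O(log n) per query
queries = np.array([[10.0, 10.0], [25.0, 40.0]])
distances, indices = tree.query(queries)

# Brute force for comparison: the O(n) scan we are avoiding
brute = np.linalg.norm(scan - queries[0], axis=1).argmin()
```

An ICP iteration needs one such lookup per point in the new scan, so replacing the brute-force scan with a tree query is often the difference between real time and not.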

Sensor Degeneracy

Certain environments cause specific sensors to fail. A long, featureless corridor gives lidar very little information about progress along its length (the scans all look the same). A room with white walls gives a camera nothing to track. Open outdoor environments can push lidar beyond its effective range. Robust SLAM systems must detect these degenerate cases and rely more heavily on other sensor modalities.
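Degeneracy can often be detected from scan geometry itself. In 2D, the eigenvalues of the scatter matrix of scan-point normal directions reveal which translation directions the scan actually constrains. A sketch, assuming unit normals are already available (real systems estimate them from local point neighborhoods):

```python
import numpy as np

def constraint_eigenvalues(normals):
    """Eigenvalues of the normal scatter matrix sum(n n^T), ascending.
    A near-zero eigenvalue means translation along the corresponding
    eigenvector is unconstrained by scan matching."""
    return np.linalg.eigvalsh(normals.T @ normals)

# Corridor: both walls run along x, so every normal points along +/-y.
corridor = np.array([[0.0, 1.0]] * 50 + [[0.0, -1.0]] * 50)

# Ordinary room: walls in both directions contribute both normal families.
room = np.array([[0.0, 1.0]] * 25 + [[0.0, -1.0]] * 25 +
                [[1.0, 0.0]] * 25 + [[-1.0, 0.0]] * 25)

corridor_eigs = constraint_eigenvalues(corridor)
room_eigs = constraint_eigenvalues(room)
print(corridor_eigs)   # smallest eigenvalue ~0: x-translation unobservable
print(room_eigs)       # both directions well constrained
```

When the smallest eigenvalue falls below a threshold, a robust system would down-weight scan matching along that direction and lean on odometry or the IMU instead.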

Real-World Applications

SLAM is not just an academic exercise. It is deployed at massive scale across industries.

Self-driving cars use lidar SLAM and visual SLAM to build and maintain HD maps of roads. Companies like Waymo and Cruise run SLAM pipelines on multi-sensor rigs (lidar, cameras, IMU, GPS) to produce centimeter-accurate maps, then use those maps for real-time localization during autonomous driving.

Warehouse robots from companies like Amazon Robotics and Locus Robotics use 2D lidar SLAM to navigate warehouse floors. They build a map during a setup phase, then localize against it in real time to shuttle goods between shelves and packing stations.

Drones performing building inspection, agricultural surveying, or search-and-rescue use visual-inertial SLAM (camera plus IMU) because weight constraints rule out heavy lidar sensors. The resulting 3D maps are used for photogrammetric reconstruction and mission planning.

Robotic vacuum cleaners like the iRobot Roomba (newer models) and Roborock use SLAM with either a small lidar sensor or a camera to build a floor plan of your home. This lets them clean efficiently in straight lines instead of bouncing randomly off walls.

Augmented reality devices like the Microsoft HoloLens and Apple Vision Pro use visual-inertial SLAM to understand the geometry of the room and place virtual objects that stay anchored to real-world surfaces as you move your head.

A Simple Lidar SLAM Pipeline in Python

To make this concrete, here is a simplified 2D lidar SLAM pipeline using scan matching and pose graph optimization. This is heavily simplified, but it illustrates the core data flow.

import numpy as np
from scipy.optimize import least_squares
 
class SimpleSLAM:
    def __init__(self):
        self.poses = [np.array([0.0, 0.0, 0.0])]  # x, y, theta
        self.scans = []
        self.odometry_edges = []
        self.loop_closure_edges = []
 
    def add_scan(self, scan, odometry):
        """Process a new lidar scan with odometry estimate."""
        # Predict new pose from odometry
        prev_pose = self.poses[-1]
        predicted_pose = self.apply_motion(prev_pose, odometry)
 
        # Refine pose by matching scan against previous scan
        if len(self.scans) > 0:
            correction = self.scan_match(self.scans[-1], scan,
                                         prev_pose, predicted_pose)
            refined_pose = predicted_pose + correction
        else:
            refined_pose = predicted_pose
 
        self.poses.append(refined_pose)
        self.scans.append(scan)
        self.odometry_edges.append((len(self.poses) - 2,
                                    len(self.poses) - 1,
                                    odometry))
 
        # Check for loop closure
        loop = self.detect_loop_closure(scan, refined_pose)
        if loop is not None:
            self.loop_closure_edges.append(loop)
            self.optimize_pose_graph()
 
    def apply_motion(self, pose, odometry):
        """Apply odometry motion model to a pose."""
        dx, dy, dtheta = odometry
        theta = pose[2]
        return np.array([
            pose[0] + dx * np.cos(theta) - dy * np.sin(theta),
            pose[1] + dx * np.sin(theta) + dy * np.cos(theta),
            pose[2] + dtheta
        ])
 
    def scan_match(self, ref_scan, new_scan, ref_pose, init_pose):
        """Align new_scan against ref_scan using ICP-style matching.
           Returns a correction to init_pose."""
        # In practice, this would run Iterative Closest Point (ICP)
        # or correlative scan matching. Simplified here.
        correction = np.zeros(3)
        for iteration in range(10):
            correspondences = self.find_closest_points(
                ref_scan, new_scan, ref_pose, init_pose + correction)
            delta = self.solve_alignment(correspondences)
            correction += delta
            if np.linalg.norm(delta) < 1e-4:
                break
        return correction
 
    def detect_loop_closure(self, scan, pose):
        """Check if current scan matches any historical scan.
           Skips the 20 most recent scans to avoid trivial matches.
           Note: scan i was taken at pose i + 1, since poses[0] is
           the initial pose recorded before any scan."""
        for i, (hist_scan, hist_pose) in enumerate(
                zip(self.scans[:-20], self.poses[1:-20])):
            distance = np.linalg.norm(pose[:2] - hist_pose[:2])
            if distance < 3.0:  # Only check nearby poses
                match_score = self.compute_scan_similarity(
                    hist_scan, scan, hist_pose, pose)
                if match_score > 0.85:
                    relative = self.compute_relative_pose(
                        hist_pose, pose, hist_scan, scan)
                    return (i + 1, len(self.poses) - 1, relative)
        return None
 
    def optimize_pose_graph(self):
        """Run pose graph optimization over all constraints."""
        # Flatten poses into the optimization variable
        x0 = np.concatenate(self.poses)
        anchor = self.poses[0].copy()

        def residuals(x):
            poses = x.reshape(-1, 3)
            # Anchor the first pose so the optimum is unique
            # (otherwise the whole graph could shift freely)
            res = list(poses[0] - anchor)
            # Odometry constraints
            for i, j, odom in self.odometry_edges:
                predicted = self.apply_motion(poses[i], odom)
                res.extend(poses[j] - predicted)
            # Loop closure constraints (weighted higher)
            for i, j, relative in self.loop_closure_edges:
                predicted = self.apply_motion(poses[i], relative)
                res.extend(3.0 * (poses[j] - predicted))
            return np.array(res)

        result = least_squares(residuals, x0)
        optimized = result.x.reshape(-1, 3)
        self.poses = [optimized[i] for i in range(len(optimized))]

This shows the three core stages: motion prediction from odometry, scan-to-scan alignment to refine the prediction, and pose graph optimization to correct drift when loop closures are detected.

How to Choose a SLAM Algorithm

If you are building a robot and need to pick a SLAM approach, here are some practical guidelines:

  • Indoor, 2D, structured environment (offices, warehouses): 2D lidar SLAM with a pose graph backend. Mature libraries like Cartographer and SLAM Toolbox work well out of the box.
  • Indoor, 3D mapping (multi-floor buildings, complex geometry): 3D lidar SLAM with LOAM or LIO-SAM. Combine lidar with IMU for robustness.
  • Outdoor, large scale (self-driving, surveying): Multi-sensor fusion with lidar, cameras, IMU, and GPS. Graph-based backends with robust loop closure.
  • Weight/cost-constrained (drones, small robots): Visual-inertial SLAM with a stereo or RGB-D camera and IMU. ORB-SLAM3 is a strong baseline.
  • AR/VR applications: Visual-inertial odometry (VIO) with dense reconstruction. Apple ARKit and Google ARCore use this internally.

Try It Yourself

SLAM is one of those topics that clicks much faster when you see it working. Head to our SLAM and Navigation module in the lessons section to work through interactive exercises. You will build a simple particle filter localizer, watch a robot close loops and snap its map into place, and experiment with different sensor configurations to see how they affect map quality.

The best way to understand SLAM is to break it -- give a robot bad odometry and watch the map warp, then add loop closures and watch it heal.
