Code Structure

This page summarizes the SLAM-MER architecture following Section 3 of Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling.

SLAM-MER Pipeline

Figure: the SLAM-MER pipeline (Figure 2 in Section 3 of Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling).

The system is organized around:

A map representation
A covisibility graph
A localization front end, a covisibility-graph back end, and optional loop detection

Map Representation

Given an input stream of monocular frames, the map is built online as:

K: keyframes
P^w: 3D map-points in world coordinates
C: 3D cells grouping map-points spatially

Each keyframe stores:

A camera pose, with optional Sim(3)-style scale handling in graph optimization
2D keypoints and descriptors
Local 3D points (from depth inference when a keyframe is created)

Each map-point stores:

A 3D world position
Descriptors from observing keyframes
Observation relations with keyframes

The 3D cells support efficient spatial queries by grouping nearby map-points.
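
As a rough illustration of the structures listed above, the following Python sketch stores keyframes, map-points, and a uniform grid of cells. The class names, field names, and the fixed cubic `cell_size` are assumptions made for this example and are not taken from the SLAM-MER code.

```python
from dataclasses import dataclass, field
from math import floor

CellKey = tuple[int, int, int]


@dataclass
class Keyframe:
    pose: tuple                                   # camera pose (representation left open here)
    keypoints: list = field(default_factory=list)     # 2D keypoints
    descriptors: list = field(default_factory=list)   # one descriptor per keypoint
    local_points: list = field(default_factory=list)  # local 3D points from depth inference


@dataclass
class MapPoint:
    position: tuple[float, float, float]          # 3D world position
    descriptors: list = field(default_factory=list)   # descriptors from observing keyframes
    observers: set = field(default_factory=set)       # ids of observing keyframes


@dataclass
class SparseMap:
    cell_size: float = 1.0                        # assumed edge length of a cubic cell
    keyframes: dict = field(default_factory=dict)
    points: dict = field(default_factory=dict)
    cells: dict = field(default_factory=dict)     # CellKey -> set of map-point ids

    def cell_key(self, position) -> CellKey:
        # Quantize a world position to the integer index of its cell.
        return tuple(floor(c / self.cell_size) for c in position)

    def add_point(self, point_id: int, point: MapPoint) -> None:
        self.points[point_id] = point
        self.cells.setdefault(self.cell_key(point.position), set()).add(point_id)

    def points_in_cell(self, key: CellKey) -> set:
        return self.cells.get(key, set())
```

With a grid like this, a spatial query only needs to visit the handful of cells near the camera instead of scanning every map-point.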

Covisibility Graph

The covisibility graph contains nodes and constraints over keyframes and map-points:

Keyframe-keyframe edges: relative-pose constraints
Keyframe-point edges: 3D-2D projection and 3D-3D local-to-world distance constraints

This graph is the core structure used by the optimization back end.
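
To make the two keyframe-point constraint types concrete, here is a minimal sketch of the residuals they could correspond to. It assumes a pinhole camera with a single focal length and a world-to-camera pose given as a rotation matrix `R` and translation `t`; these conventions and all function names are illustrative assumptions, not the project's actual factor definitions.

```python
from math import dist


def to_camera(R, t, p_world):
    """Apply an (assumed) world-to-camera rigid transform: R @ p + t."""
    return tuple(
        sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)
    )


def reprojection_residual(R, t, p_world, keypoint, focal, center):
    """3D-2D constraint: projected map-point minus the observed 2D keypoint."""
    x, y, z = to_camera(R, t, p_world)
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return (u - keypoint[0], v - keypoint[1])


def local_to_world_distance(R, t, p_world, p_local):
    """3D-3D constraint: distance between the map-point expressed in the
    keyframe frame and the keyframe's own local 3D point."""
    return dist(to_camera(R, t, p_world), p_local)
```

An optimizer would drive both quantities toward zero while adjusting keyframe poses and map-point positions.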

Pipeline Workflow

The localization front end runs for each incoming frame. Keyframe creation can run asynchronously for monocular frames, while the covisibility graph runs background edge-insertion and optimization threads. When visual place localization is enabled, loop detection also runs in a background thread.

Localization Module

For each incoming frame:

Extract 2D keypoints and descriptors.
Query map-points (temporal + spatial).
Build 3D-2D correspondences.
Estimate absolute pose.
Decide whether to create a new keyframe.

When a keyframe is created, local 3D points from RGB-D input or depth inference are added to the map and corresponding graph constraints are inserted.

For monocular input, depth inference is requested during keyframe creation, not for every frame. The current implementation gathers the current image plus a small set of recent frames or keyframes for the depth-inference call.
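
The keyframe-only depth request can be sketched as follows. The buffer size, the `RecentFrames` class, and the `infer_depth` callable are hypothetical stand-ins; the sketch only shows that depth inference is skipped for ordinary frames and receives the current image plus a few recent ones when a keyframe is created.

```python
from collections import deque


class RecentFrames:
    """Keeps a small, bounded history of recent images (size is an assumption)."""

    def __init__(self, max_recent: int = 3):
        self._recent = deque(maxlen=max_recent)

    def push(self, image) -> None:
        self._recent.append(image)

    def batch_for(self, current) -> list:
        # Recent images first, the current image last.
        return [*self._recent, current]


def process_frame(image, is_keyframe: bool, history: RecentFrames, infer_depth):
    """Request depth only when a keyframe is being created.

    `infer_depth` is a hypothetical callable wrapping the depth model.
    """
    depth = infer_depth(history.batch_for(image)) if is_keyframe else None
    history.push(image)
    return depth
```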

Adjustment Module

The adjustment module consumes queued observation and odometry edges and incrementally optimizes keyframe poses and map-point positions using a factor-graph back end (iSAM2, as provided by GTSAM). In SLAM and SLAM_WITHOUT_LOOP modes, a background optimization thread updates the map from the graph.

Loop closure constraints are also fused through this same graph update flow.
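
The queue-and-background-thread pattern described above might look like the sketch below. It uses only the Python standard library; `GraphWorker` and the `update` callback are hypothetical names, and the callback stands in for whatever incremental solver call the real back end makes.

```python
import queue
import threading


class GraphWorker:
    """Drains queued edges in a background thread and hands them to `update`
    in batches. `update` is a hypothetical callback standing in for an
    incremental factor-graph update."""

    _STOP = None  # sentinel placed on the queue to end the thread

    def __init__(self, update):
        self._edges = queue.Queue()
        self._update = update
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add_edge(self, edge) -> None:
        self._edges.put(edge)

    def stop(self) -> None:
        self._edges.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            edge = self._edges.get()          # block until work arrives
            if edge is self._STOP:
                return
            batch = [edge]
            while True:                        # collect whatever else is queued
                try:
                    nxt = self._edges.get_nowait()
                except queue.Empty:
                    break
                if nxt is self._STOP:
                    self._update(batch)
                    return
                batch.append(nxt)
            self._update(batch)
```

Because observation, odometry, and loop edges all pass through the same queue in this sketch, loop constraints need no separate code path, which mirrors the design described in this section.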

Modular Design

The pipeline is modular by construction:

2D feature extractors and descriptors can be replaced
Depth inference models can be replaced
Image retrieval methods can be replaced
The same workflow can be adapted to RGB-D inputs
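
One way to express these replaceable components is with structural interfaces. The protocol names and method signatures below are assumptions chosen for illustration; the actual project may define its extension points differently.

```python
from dataclasses import dataclass
from typing import Any, Protocol, Sequence, runtime_checkable


@runtime_checkable
class FeatureExtractor(Protocol):
    def extract(self, image: Any) -> tuple[Sequence, Sequence]:
        """Return (2D keypoints, descriptors) for one image."""
        ...


@runtime_checkable
class DepthModel(Protocol):
    def infer(self, images: Sequence[Any]) -> Any:
        """Return local 3D points or a depth map for the last image."""
        ...


@runtime_checkable
class ImageRetriever(Protocol):
    def query(self, image: Any, top_k: int) -> list:
        """Return ids of the keyframes most similar to the image."""
        ...


@dataclass
class Pipeline:
    # Any object with matching methods can be plugged in.
    extractor: FeatureExtractor
    depth: DepthModel
    retriever: ImageRetriever
```

Swapping a component then means passing a different object to `Pipeline` without touching the rest of the workflow.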

Query Map-points

SLAM-MER queries map-points using two complementary strategies:

Temporal query (Q_T): map-points linked to keyframes associated with the last localized frames in the buffer
Spatial query (Q_S): map-points from the visible 3D cells closest to the current camera pose

The final query set is:

Q_3D = Q_T ∪ Q_S

Descriptors from Q_3D are matched against current frame descriptors to obtain robust 3D-2D correspondences.
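
The two queries and their union can be sketched with plain sets. In this simplified version the cells passed in are assumed to be already filtered for visibility, cells are ranked by the distance from their center to the camera position, and `max_cells` is a made-up parameter; none of these details come from the source.

```python
from math import dist


def temporal_query(recent_keyframes, keyframe_points):
    """Q_T: map-points linked to keyframes of the last localized frames."""
    result = set()
    for kf_id in recent_keyframes:
        result |= keyframe_points.get(kf_id, set())
    return result


def spatial_query(camera_position, cell_centers, cell_points, max_cells=2):
    """Q_S: map-points from the cells closest to the camera.

    Assumes `cell_centers` only contains cells that are visible.
    """
    nearest = sorted(
        cell_centers, key=lambda key: dist(cell_centers[key], camera_position)
    )[:max_cells]
    result = set()
    for key in nearest:
        result |= cell_points.get(key, set())
    return result


def query_map_points(recent_keyframes, keyframe_points,
                     camera_position, cell_centers, cell_points, max_cells=2):
    """Q_3D = Q_T ∪ Q_S."""
    return (temporal_query(recent_keyframes, keyframe_points)
            | spatial_query(camera_position, cell_centers, cell_points, max_cells))
```

Using a set union means a map-point found by both strategies is matched only once.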

Pose Estimation

Pose estimation is performed from 3D-2D correspondences with robust outlier rejection (RANSAC). During early frames, the focal length is estimated jointly with the pose; once the estimate has stabilized, the focal length is fixed at the running estimate to speed up processing.
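
The switch from joint estimation to a fixed focal length could be organized as in the sketch below. The stabilization rule used here (the last `window` estimates agree within a relative `tolerance`, and the median is then frozen) is an assumption for illustration; the source does not state how stabilization is detected.

```python
from statistics import median


class FocalEstimator:
    """Tracks per-frame focal estimates and freezes the value once stable.

    The window size and tolerance are assumed values, not project defaults.
    """

    def __init__(self, window: int = 5, tolerance: float = 0.02):
        self.window = window
        self.tolerance = tolerance
        self.samples = []
        self.fixed = None  # set once the estimate has stabilized

    def update(self, focal: float) -> float:
        """Feed one jointly estimated focal length; return the running estimate."""
        if self.fixed is not None:
            return self.fixed          # later estimates are ignored
        self.samples.append(focal)
        recent = self.samples[-self.window:]
        running = median(recent)
        spread = max(recent) - min(recent)
        if len(recent) == self.window and spread <= self.tolerance * running:
            self.fixed = running
        return running
```

Once `fixed` is set, the pose solver can be called with a known focal length, which leaves fewer unknowns per frame.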

Keyframe Creation

A new keyframe is created when at least one of the following criteria is met:

Too few tracked points after pose estimation
Poor spatial spread of tracked/inlier keypoints in the image
Significant change in the distribution of tracked map-points across keyframes, checked once the configured minimum spacing between keyframes has been reached

The third criterion is measured by the KL divergence between consecutive histograms of tracked points over keyframes, which helps trigger new keyframes when loops are being formed.
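
A minimal sketch of the three criteria follows. Each histogram is assumed to count, per keyframe, how many currently tracked map-points that keyframe observes, with both histograms padded to the same length. The thresholds, the `spread` score in [0, 1], and the additive smoothing are illustrative assumptions.

```python
from math import log


def _normalize(counts, eps):
    # Additive smoothing keeps every bin strictly positive.
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two histograms given as raw counts."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    p = _normalize(p_counts, eps)
    q = _normalize(q_counts, eps)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))


def should_create_keyframe(num_tracked, spread, prev_hist, curr_hist,
                           frames_since_keyframe, *, min_tracked=50,
                           min_spread=0.3, min_spacing=5, kl_threshold=0.5):
    """Return True if any of the three criteria fires (thresholds are assumed)."""
    if num_tracked < min_tracked:          # too few tracked points
        return True
    if spread < min_spread:                # poor spatial spread in the image
        return True
    # Distribution change, checked only after the minimum keyframe spacing.
    return (frames_since_keyframe >= min_spacing
            and kl_divergence(curr_hist, prev_hist) > kl_threshold)
```

When the camera revisits an old area, tracked points shift from recent keyframes to much older ones, so the histogram changes sharply even if the number of tracked points stays high.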

Loop Closure

Loop closure has two stages:

Loop detection retrieves candidate keyframes with image-level descriptors and validates candidates geometrically with pose estimation
Map fusion adds loop constraints to the graph and merges duplicated map-points when correspondences indicate the same scene structure

No special standalone optimization stage is required; the continuous graph adjustment thread handles the update.
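
The detection stage (retrieve, then validate) can be sketched as below. Cosine similarity over image-level descriptors, the `top_k` and `min_score` parameters, and the `validate` callable are assumptions standing in for the project's retrieval method and geometric check.

```python
from math import sqrt


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_loop(query_descriptor, keyframe_descriptors, validate,
                top_k=3, min_score=0.8):
    """Return (keyframe_id, relative_pose) for the first validated candidate.

    `validate(keyframe_id)` is a hypothetical geometric check that returns a
    pose on success and None on failure.
    """
    scores = {kf_id: cosine(query_descriptor, desc)
              for kf_id, desc in keyframe_descriptors.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    for kf_id in ranked[:top_k]:
        if scores[kf_id] < min_score:
            break                      # remaining candidates score even lower
        pose = validate(kf_id)
        if pose is not None:
            return kf_id, pose         # caller adds this as a loop constraint
    return None
```

A returned pair would then be queued as an ordinary graph edge, so the background adjustment thread absorbs it like any other constraint.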

Relocalization

When localization fails for buffered frames (for example, after long occlusion or kidnapping), relocalization is triggered:

Compute image-level descriptor for the current frame.
Retrieve similar keyframe candidates.
Validate candidates with geometric pose estimation.

After a valid relocalization, a new keyframe is created and normal localization continues.
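
The trigger and the three steps above can be sketched as follows. The trigger rule (every frame in a full buffer failed to localize) is one possible reading of the description, and `describe`, `retrieve`, and `validate` are hypothetical callables for the descriptor, retrieval, and geometric check.

```python
from collections import deque


class LocalizationMonitor:
    """Signals relocalization when all buffered frames failed (assumed rule)."""

    def __init__(self, buffer_size: int = 5):
        self._results = deque(maxlen=buffer_size)

    def record(self, success: bool) -> bool:
        """Store one localization outcome; return True if relocalization is due."""
        self._results.append(success)
        full = len(self._results) == self._results.maxlen
        return full and not any(self._results)


def relocalize(image, describe, retrieve, validate):
    """Run the three relocalization steps with caller-supplied components."""
    descriptor = describe(image)               # 1. image-level descriptor
    for kf_id in retrieve(descriptor):         # 2. similar keyframe candidates
        pose = validate(kf_id, image)          # 3. geometric pose check
        if pose is not None:
            return kf_id, pose
    return None
```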

Dense Reconstruction

Although the map representation is sparse (keyframes plus sparse map-points), dense reconstruction can be generated from keyframe-local 3D predictions and their optimized poses.
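
A dense point cloud can be assembled by moving each keyframe's local 3D points into the world frame with its optimized pose, as in this sketch. It assumes camera-to-world poses given as a rotation matrix `R` and translation `t`, and it ignores any per-keyframe scale; both are simplifying assumptions.

```python
def to_world(R, t, p_local):
    """Apply an (assumed) camera-to-world rigid transform: R @ p + t."""
    return tuple(
        sum(R[i][j] * p_local[j] for j in range(3)) + t[i] for i in range(3)
    )


def dense_point_cloud(keyframes):
    """Merge keyframe-local points into one world-frame point list.

    `keyframes` is an iterable of (R, t, local_points) triples, where the
    pose is the optimized one from the graph back end.
    """
    cloud = []
    for R, t, local_points in keyframes:
        cloud.extend(to_world(R, t, p) for p in local_points)
    return cloud
```

Because the poses come from the continuously optimized graph, regenerating the cloud after a loop closure picks up the corrected trajectory without re-running depth inference.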