CS6476 Final Project

Dense Mapping using Feature Matching and Superpixel Clustering

Mandy Xie, Shicong Ma, Gerry Chen

Dec 4, 2019


Abstract

One of the fundamental tasks for robot autonomous navigation is to perceive and digitize the surrounding 3D environment [1]. We replicate the results of [2] to produce a semi-dense, surfel-based reconstruction using superpixels.

Introduction

One of the fundamental tasks for robot autonomous navigation is to perceive and digitize the surrounding 3D environment [1]. To be usable in mobile robot applications, the mapping system needs to quickly and densely recover the environment in order to provide sufficient information for navigation.

Unlike other 3D reconstruction methods that reconstruct the environment as a 3D point cloud, we extract surfels from superpixels computed on intensity and depth images and construct a surfel cloud. This approach, introduced by [2], can greatly reduce the memory burden of a mapping system when applied to large-scale missions. More importantly, outliers and noise from low-quality depth maps can be reduced based on the extracted superpixels.

The goal of our project is to reproduce the results of Wang et al. [2], namely implementing superpixel extraction, surfel initialization, and surfel fusion to generate a surfel-based reconstruction given camera poses from a sparse SLAM implementation. The input to our system is an RGB-D video stream with accompanying camera poses, and the output is a surfel cloud map of the environment, similar to Figures 4b or 8 of the original paper [2].

Implementation

The idea behind dense mapping is to first obtain per-frame camera poses, then reconstruct the dense map from those poses and the surfels extracted from each frame.

  1. Select an RGB-D dataset [1], [3], [4]

  2. Read pose information from the dataset / Use a sparse SLAM system (VINS [5]/ORB-SLAM2 [6]) to estimate camera poses

  3. Run code from [7] directly to confirm functionality and set benchmark/expectations.

  4. (Implementation) – Single-frame superpixel extraction from RGB-D images using a k-means approach adapted from SLIC [8] (Section IV.D of [2])

  5. (Implementation) – Single-frame surfel generation based on the extracted superpixels (Section IV.E of [2])

  6. (Implementation) – Surfel fusion and surfel cloud update (Section IV.G of [2])

Dataset

We started with the kt3 sequence of the ICL-NUIM dataset [1]. RGB images and depth maps were extracted, and examples are shown below.

Figure: RGB and depth images from the ICL-NUIM dataset


Run Existing Code

The code released with the paper [7] was run to ensure that its results could be reproduced. Below are some results of running the code for dense reconstruction on images from the KITTI dataset [4], confirming that it can indeed produce dense reconstructions.

Figure: Dense reconstructions produced by the original code [7] on the KITTI dataset

Superpixel Extraction

We have completed single-frame superpixel generation. The results are shown below.

Figure: Superpixel annotations on the RGB and depth images

We follow the standard implementation as described in the paper:

  1. Initialize superpixel seeds - Superpixel seeds are initialized on a grid of predefined size
  2. Update superpixels
    1. Pixels are assigned to their nearest superpixel
    2. Superpixel properties (x, y, size, intensity, depth) are updated accordingly.
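
As a hedged illustration of this assignment/update cycle, a minimal single-iteration sketch in Python/NumPy is given below. The feature layout (x, y, intensity, depth), the distance weights, and the 2×grid-size search window are assumptions in the spirit of SLIC [8], not code taken verbatim from our implementation.

```python
import numpy as np

def update_superpixels(features, seeds, grid_size,
                       w_space=1.0, w_color=1.0, w_depth=1.0):
    """One SLIC-style k-means iteration (illustrative sketch).

    features: (H, W, 4) array holding (x, y, intensity, depth) per pixel.
    seeds:    (K, 4) array of superpixel centers in the same feature space.
    grid_size: initial seed spacing; limits the assignment search window.
    Returns the updated seeds and the (H, W) label image.
    """
    H, W, _ = features.shape
    labels = np.zeros((H, W), dtype=int)
    best_dist = np.full((H, W), np.inf)

    # Assignment step: each seed competes only for pixels within a
    # 2*grid_size window around it, as in SLIC.
    for k, (sx, sy, si, sd) in enumerate(seeds):
        x0, x1 = int(max(sx - 2 * grid_size, 0)), int(min(sx + 2 * grid_size, W))
        y0, y1 = int(max(sy - 2 * grid_size, 0)), int(min(sy + 2 * grid_size, H))
        win = features[y0:y1, x0:x1]
        dist = (w_space * ((win[..., 0] - sx) ** 2 + (win[..., 1] - sy) ** 2)
                + w_color * (win[..., 2] - si) ** 2
                + w_depth * (win[..., 3] - sd) ** 2)
        better = dist < best_dist[y0:y1, x0:x1]
        best_dist[y0:y1, x0:x1][better] = dist[better]
        labels[y0:y1, x0:x1][better] = k

    # Update step: each seed becomes the mean of its assigned pixels.
    new_seeds = seeds.astype(float)
    for k in range(len(seeds)):
        assigned = features[labels == k]
        if len(assigned):
            new_seeds[k] = assigned.mean(axis=0)
    return new_seeds, labels
```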

The number of times that step 2 is repeated depends on the image. For example, the ICL-NUIM dataset images require only about 10 iterations to stabilize, but the KITTI dataset images require roughly 25. One metric that could potentially be used in the future as a stopping criterion is the sum of distances traveled by the superpixel centers from one iteration to the next: the iteration is complete when sufficiently small changes occur between consecutive iterations.
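
For reference, such a stopping criterion could be computed as the total movement of the seed centers between iterations; the sketch below assumes the same seed layout as above and is only an illustration of the metric, not a tuned criterion.

```python
import numpy as np

def seed_displacement(prev_seeds, new_seeds):
    """Total distance (in pixels) that the superpixel centers moved in the
    last iteration; only the spatial (x, y) components are used."""
    return np.linalg.norm(new_seeds[:, :2] - prev_seeds[:, :2], axis=1).sum()

# Hypothetical stopping rule: iterate until the centers barely move, e.g.
#   while seed_displacement(prev_seeds, seeds) > tol: ...
```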

Surfel Generation and Fusion

Surfels are modeled from the superpixels extracted from the intensity and depth images, following the method described in the paper:

  1. Surfel Initialization:
    Initialize a surfel for each superpixel cluster that has enough assigned pixels, using a set of reasonable initial values.
  2. Surfel Fusion:
    Fuse extracted local surfels with newly initialized surfels if they have similar depth and normals. Transform fused local surfels into the global frame, and remove those that are updated fewer than 5 times.

For the surfel initialization, we transform each superpixel into a surfel according to the correspondence equations given in [2]. An example of a single frame with each superpixel processed in this manner is shown below.

Figure: Single frame surfel reconstruction
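
The exact correspondence equations are given in [2]; as a rough, hedged sketch of the idea, the superpixel center (u, v) with its mean depth can be back-projected through the pinhole model (with assumed intrinsics fx, fy, cx, cy), the superpixel footprint gives an approximate radius, and a plane fit over the member pixels' 3D points gives the normal. The field names and the radius approximation below are ours, not the paper's.

```python
import numpy as np

def init_surfel(u, v, depth, intensity, size_px, fx, fy, cx, cy):
    """Back-project a superpixel center into a candidate surfel (illustrative).

    The position is the pinhole back-projection of the superpixel center,
    the radius is the pixel footprint scaled to metric units by depth, and
    the normal is filled in separately by a plane fit (see fit_normal).
    """
    position = depth * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    radius = 0.5 * size_px * depth / fx  # rough metric footprint of the superpixel
    return {"position": position, "intensity": intensity,
            "radius": radius, "update_count": 1}

def fit_normal(points):
    """Normal of the best-fit plane through the superpixel's 3D points
    (smallest right singular vector of the centered point set), oriented
    to face the camera at the origin (camera looks along +z)."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    return -normal if normal[2] > 0 else normal
```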

To process additional frames, each new surfel from a superpixel in the frame to be added must first be checked amongst all existing surfels to see if it is similar enough to be fused. If not, then a new surfel is initialized. An example of a cloud consisting of a few frames containing fused surfels is shown below.
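
A hedged sketch of this fusion check and update is given below; the Euclidean-distance proxy for "similar depth", the normal-angle threshold, and the observation-count weighting are illustrative choices, not the exact criteria of [2].

```python
import numpy as np

def try_fuse(existing, candidate, dist_tol=0.05, angle_tol_deg=30.0):
    """Fuse `candidate` into `existing` in place if they are close in space and
    have similar normals; return False so the caller adds a new surfel otherwise.

    Surfels are dicts with 'position', 'normal', 'intensity', 'update_count';
    the thresholds are illustrative defaults, not values from [2].
    """
    close = np.linalg.norm(existing["position"] - candidate["position"]) < dist_tol
    cos_angle = np.clip(np.dot(existing["normal"], candidate["normal"]), -1.0, 1.0)
    aligned = np.degrees(np.arccos(cos_angle)) < angle_tol_deg
    if not (close and aligned):
        return False

    # Weighted average, weighting each surfel by how often it has been observed.
    w_e, w_c = existing["update_count"], candidate["update_count"]
    total = w_e + w_c
    for key in ("position", "normal", "intensity"):
        existing[key] = (w_e * existing[key] + w_c * candidate[key]) / total
    existing["normal"] /= np.linalg.norm(existing["normal"])
    existing["update_count"] = total
    return True
```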

Figure: Multi-frame surfel fusion result

We can see that the reconstruction is much better filled in (denser) and also has better detail. For example, the filing cabinet achieves much better resolution when multiple frames are combined. This is partly because multiple surfels can occupy the same location in space if they have different normal directions.

We have now recreated the dense reconstruction surfel cloud results from [2] that we set out to achieve.

Experiments and results

A number of parameters can be tuned for surfel generation and a few will be discussed:

  1. K-means iterations for superpixel extraction
  2. Number of frames used in surfel cloud generation
  3. Superpixel size used in initialization of superpixel seeds
  4. Outlier removal criteria

We then finish with a qualitative description of the resulting surfel clouds.

K-means Iterations

As mentioned earlier and shown in both the figure above and the figure below, the process of extracting superpixels is an iterative one based on k-means. This means that the number of iterations to run k-means affects how closely the superpixels will converge to the optimal superpixel assignments. The animated figures illustrate how the superpixel assignments change as more iterations are run. We found that, for the ICL-NUIM dataset, roughly 10 iterations were sufficient while roughly 25 were required for the KITTI dataset. We follow the lead of the reference paper [2] and consider the number of iterations a human-tuned parameter instead of setting a convergence stopping criterion.

Figure: Superpixel extraction over multiple iterations (annotations on the RGB and depth images)

Number of Frames

The number of frames used to generate a surfel cloud significantly affects the result. This is because the pose estimates that we read in were not completely accurate, so errors accrue with more frames, causing inconsistencies in the surfel cloud. At the same time, too few frames result in sparser clouds with more gaps. Shown below are examples of surfel clouds generated with 1, 3, 25, and 50 frames.

Figure: Effect of the number of frames used on surfel cloud result

Superpixel Size

The generated surfel cloud varies as we change parameters such as the size of the superpixels and, consequently, the size of the surfels. The number of superpixels in our implementation does not change during the k-means process, so the size of the initialized superpixels also affects the general size scale of the final superpixels. Similarly, surfel size depends on superpixel size because surfels are initialized from superpixels, so the superpixel initialization density also affects the final sizes of the surfels. Shown below are examples of surfel clouds generated with initial superpixel sizes of 50x50, 25x25, 12x12, and 9x9. We see that overly large superpixels lose detail, while overly small superpixels yield a sparser cloud.

Figure: Effect of superpixel initialization size on surfel cloud

Outlier Removal

Some surfels are poorly conditioned due to factors such as an oblique viewpoint, a small parent superpixel, visibility in only a few frames, or distance from the camera. Several checks exist in our code to eliminate obvious outliers. One example is removing surfels which do not appear in many frames. Surfels which appear in multiple frames get “fused”, and we keep track of how many times a given surfel has been fused. The animation below compares a raw surfel cloud with one in which surfels fused fewer than 3 times are removed. We notice that including these “outlier” surfels produces a more complete cloud, but at the expense of extra noise. For example, there is a cluster of surfels to the right of the filing cabinet which are not oriented correctly.

Figure: Effect of outlier removal on surfel cloud
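
A minimal sketch of this fuse-count filter, assuming each surfel carries an 'update_count' field recording how many times it has been fused (the threshold of 3 matches the comparison above):

```python
def remove_outlier_surfels(surfels, min_fuse_count=3):
    """Keep only surfels observed (fused) at least `min_fuse_count` times;
    rarely observed surfels are treated as likely outliers."""
    return [s for s in surfels if s["update_count"] >= min_fuse_count]
```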

Qualitative Results

The superpixel-segmented images below from the KITTI [4] dataset demonstrate that the superpixels are indeed segmenting properly, as they tend to “hug” regions of similar color and depth.

Figure: Superpixel annotations on RGB and depth images from the KITTI dataset

Converting the superpixels into surfels appears correct based on:

  1. The locations of the surfels clearly trace out the room shape (see reference image)
  2. The orientations of the surfels are aligned with their neighbors to form planar surfaces
  3. The orientations of the surfels match the room shape
  4. The orientations of the surfels near corners are slightly “curved”, indicating that fusing and averaging are working properly
  5. The overlap and non-redundancy of surfels from multiple frames indicate that fusion is working properly, only adding new surfels when they are sufficiently different from nearby surfels.

We can observe these qualities in the RGB image and example surfel cloud below.

Figure: Surfel cloud from 25 frames which matches expectations

Conclusion and Future Work

We successfully recreated the results of [2] by creating a surfel cloud given RGBD images and camera poses. Furthermore, we investigated and reported the effects of various parameters on the resulting surfel clouds and discussed qualitative results from our reconstructions.

During this project, several difficulties were faced, such as

Future directions include

References

[1] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM,” in 2014 IEEE International Conference on Robotics and Automation (ICRA), 2014, pp. 1524–1531.

[2] K. Wang, F. Gao, and S. Shen, “Real-time scalable dense surfel mapping,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 6919–6925.

[3] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.

[4] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[5] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.

[6] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.

[7] https://github.com/HKUST-Aerial-Robotics/DenseSurfelMapping.

[8] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.