Dense 3D scene reconstruction is in high demand today for view synthesis, navigation, and autonomous driving. A practical reconstruction system inputs multi-view scans of the target using RGB-D cameras, LiDARs, or monocular cameras, computes sensor poses, and outputs scene reconstructions. These algorithms are computationally expensive and memory-intensive due to the presence of 3D data. Thus, it is essential to exploit sparsity adequately to reduce memory footprint, increase efficiency, and improve accuracy.
In this thesis, I will develop practical systems for fast and high-quality scene reconstruction. First, I will introduce a highly efficient hierarchical reconstruction system that serves as a foundational pipeline for integrating diverse pose estimation and scene reconstruction modules. Next, I will focus on the global registration of point clouds by learning deep features and their matches. Equipped with sparse convolutional networks, these studies define the state-of-the-art at the scene scale in both supervised and self-supervised setups. They are applied to reconstruction systems to produce globally consistent poses.
I will then shift to the topic of scene representation and reconstruction, introducing a modern engine, ASH, for parallel spatial hashing in the era of tensors and auto-differentiation. I will elaborate on the details of building this efficient and user-friendly engine from the ground up and discuss a series of downstream applications. These applications include real-time dense RGB-D SLAM, large-scale surface reconstruction from LiDAR scans, and fast scene reconstruction from monocular data. While achieving accuracy comparable to or better than state-of-the-art methods, these applications run 2-10 times faster and require less development effort.
@phdthesis{Dong-2023-136767,
author = {Wei Dong},
title = {Large Scale Dense 3D Reconstruction via Sparse Representations},
year = {2023},
month = {May},
address = {Pittsburgh, PA},
school = {Carnegie Mellon University},
number = {CMU-RI-TR-23-29},
keywords = {SLAM, Spatial Hashing, 3D reconstruction, Differentiable Rendering},
}
We combine classical TSDF fusion and differentiable volume rendering to reconstruct scenes from monocular images. Without an MLP, the method is fast, and its results are comparable to the state of the art (SOTA).
Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLP), however, significantly limits speed in training and rendering.
In this work, we propose to directly use signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces’ spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels.
To apply this representation to monocular scene reconstruction, we develop a scale optimization algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Conditional Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects.
Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.
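The geometric initialization above boils down to aligning a relative monocular depth prior to metric scale. A hedged illustration of this idea, as a standard least-squares scale-and-shift fit on synthetic data (not necessarily the paper's exact formulation):

```python
import numpy as np

# Hypothetical sketch: align a monocular (relative) depth map to metric depth
# at sparse reference points via a least-squares scale and shift.
rng = np.random.default_rng(0)
d_metric = rng.uniform(0.5, 5.0, size=100)  # e.g. sparse metric depths
d_mono = 0.4 * d_metric + 0.1               # monocular prior, unknown scale/shift

# Solve min_{s,t} || s * d_mono + t - d_metric ||^2 in closed form.
A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)
(s, t), *_ = np.linalg.lstsq(A, d_metric, rcond=None)
aligned = s * d_mono + t
assert np.allclose(aligned, d_metric)
```

Once depth maps agree on a common metric scale, they can be fused directly into the sparse voxel block grid as an initialization for volume rendering.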
@inproceedings{dong2023fast,
title={Fast Monocular Scene Reconstruction with Global-Sparse Local-Dense Grids},
author={Dong, Wei and Choy, Chris and Loop, Charles and Zhu, Yuke and Litany, Or and Anandkumar, Anima},
booktitle={CVPR},
year={2023}
}
We present ASH, a modern and high-performance framework for parallel spatial hashing on GPU.
Compared to existing GPU hash map implementations, ASH achieves higher performance, supports richer functionality, and requires fewer lines of code (LoC) when used for implementing spatially varying operations from volumetric geometry reconstruction to differentiable appearance reconstruction.
Unlike existing GPU hash maps, the ASH framework provides a versatile tensor interface, hiding low-level details from the users. In addition, by decoupling the internal hashing data structures and key-value data in buffers, we offer direct access to spatially varying data via indices, enabling seamless integration to modern libraries such as PyTorch.
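The decoupled design can be sketched in plain Python (this is an illustrative toy, not the ASH API): the hash structure stores only indices into separate key/value buffers, so values remain addressable as ordinary tensors.

```python
import numpy as np

# Minimal sketch of a decoupled hash map: the hashing structure maps keys to
# stable indices, while keys and values live in flat, tensor-like buffers.
class DecoupledHashMap:
    def __init__(self, capacity, value_dim):
        self.index = {}                                # hash structure: key -> buffer index
        self.keys = np.zeros((capacity, 3), dtype=np.int64)
        self.values = np.zeros((capacity, value_dim))  # value buffer, e.g. per-voxel data
        self.size = 0

    def activate(self, key):
        """Insert a key and return its stable buffer index."""
        k = tuple(key)
        if k not in self.index:
            self.index[k] = self.size
            self.keys[self.size] = key
            self.size += 1
        return self.index[k]

    def find(self, key):
        """Return the buffer index of a key, or -1 if absent."""
        return self.index.get(tuple(key), -1)

m = DecoupledHashMap(capacity=1024, value_dim=8)
i = m.activate([1, 2, 3])
m.values[i] = 0.5  # direct index-based access to spatially varying data
```

Because the value buffer is just an array indexed by integers, it can be handed to external tensor libraries without copying, which is the property that enables the PyTorch integration mentioned above.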
We first profile our hash map against state-of-the-art GPU hash maps on synthetic data to show the performance gain from this architecture. We then show that ASH consistently achieves higher performance with fewer LoC on a variety of large-scale 3D perception tasks by showcasing several applications.
ASH and its example applications are open sourced in Open3D.
@article{dong2021ash,
title={ASH: A Modern Framework for Parallel Spatial Hashing in 3D Perception},
author={Dong, Wei and Lao, Yixing and Kaess, Michael and Koltun, Vladlen},
journal={PAMI},
year={2023}
}
We start by revisiting a factor graph and its corresponding Jacobian. A factor graph induces a linear system \(Ax = b\), where each row of \(A\) corresponds to a factor (measurement) and each column to a variable.
In large sparse SLAM setups, we usually factorize \(A\) with QR. Partial QR is applied iteratively to submatrices, with column permutations.
The elimination ordering matters: a bad ordering causes fill-in and leads to a dense matrix to solve.
Heuristically, we prefer to eliminate low-degree vertices first, since eliminating a high-degree vertex is computationally dense.
Reordering by (approximate) degree is the idea behind COLAMD.
[TODO] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.124.1109&rep=rep1&type=pdf
A graph can be partitioned into (maximal) cliques.
Fact 1
Fact 2
We can form a clique graph from a graph by connecting its maximal cliques.
Fact 3
Trees are good for elimination and factorization, especially when the cliques are small.
A clique tree depicts the global structure
Each clique depicts local distributions
Each clique corresponds to a dense upper-triangular matrix after reducing the separator columns.
Incremental update
Up to now, we have seen the connection between graph structure and sparse matrix factorization.
Heuristically, we know that local topology (vertex degree) matters; there could be more fundamental insights.
The incidence matrix is a \(V \times E\) matrix in which each column encodes a (weighted) edge connecting two vertices.
Sound familiar? A Jacobian is \(E \times V\): each row encodes a measurement connecting two variables.
There are even more similarities.
Another derivation is \(L = MM^\top\), where \(M\) is the incidence matrix.
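This identity is easy to verify numerically; a minimal sketch on a 3-vertex triangle graph:

```python
import numpy as np

# Toy graph: 3 vertices, edges (0,1), (1,2), (0,2).
edges = [(0, 1), (1, 2), (0, 2)]
V, E = 3, len(edges)

# Incidence matrix M: V x E, each column encodes one edge as a +1/-1 pair.
M = np.zeros((V, E))
for j, (u, v) in enumerate(edges):
    M[u, j], M[v, j] = 1.0, -1.0

# Graph Laplacian via degrees and adjacency...
A = np.zeros((V, V))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A

# ...matches the incidence-matrix product L = M M^T.
assert np.allclose(L, M @ M.T)
```

The sign convention per column is arbitrary: flipping a column's signs leaves \(MM^\top\) unchanged.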
If we anchor one variable with a prior, the Jacobian becomes the transpose of the reduced incidence matrix \(\mathbf{M}^r\):
\[z = \mathbf{M}^{r, \top} x + \epsilon.\]Extending to an arbitrary dimension \(d\), it becomes
\(z = (\mathbf{M}^r \otimes \mathbf{I}_d)^\top x + \epsilon\), where the Kronecker product expands each \(1 \times 1\) entry to a \(d \times d\) block.
Then, treating the inverse variances as a diagonal weight matrix \(\mathbf{W}\), the information matrix is given by
\[\Lambda = (\mathbf{M}^r \otimes \mathbf{I}_d) (\mathbf{W} \otimes \mathbf{I}_d) (\mathbf{M}^r \otimes \mathbf{I}_d)^\top = (\mathbf{M}^r \mathbf{W} \mathbf{M}^{r \top}) \otimes \mathbf{I}_d = \mathbf{L}^r_w \otimes \mathbf{I}_d.\]Here \(\mathbf{L}^r_w\) is the weighted reduced Laplacian, where the weight per edge is given by the inverse measurement variance.
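The Kronecker identity can be checked numerically; a small sketch with a 3-vertex chain, the first vertex anchored, and \(d = 2\):

```python
import numpy as np

# Chain graph 0-1-2 with per-edge weights (inverse variances); anchor vertex 0.
edges = [(0, 1), (1, 2)]
w = np.array([2.0, 0.5])
V, E, d = 3, 2, 2

# Incidence matrix, then drop the anchored row to get the reduced version.
M = np.zeros((V, E))
for j, (u, v) in enumerate(edges):
    M[u, j], M[v, j] = 1.0, -1.0
Mr = M[1:, :]
W = np.diag(w)

# Stacked d-dimensional Jacobian: each scalar entry expands to a d x d block.
J = np.kron(Mr, np.eye(d)).T

# Information matrix equals the weighted reduced Laplacian, Kronecker-expanded.
Lam = J.T @ np.kron(W, np.eye(d)) @ J
Lrw = Mr @ W @ Mr.T
assert np.allclose(Lam, np.kron(Lrw, np.eye(d)))
```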
This is a close association between the information matrix and the graph Laplacian:
\[\Lambda = \mathbf{L}^r_w \otimes \mathbf{I}_d.\]So the algebraic connectivity (the second-smallest Laplacian eigenvalue) defines a bound on the estimation variance.
There are extensions to less trivial 2D/3D SLAM settings, with corresponding corollaries. In a nutshell, graph topology and estimation uncertainty are connected via the Laplacian and the Fisher information.
With this empirical connection established, we now drop the covariance and consider only the topology, i.e., the Laplacian.
We are interested in edge selection (e.g., loop closure selection in a pose graph)
We know this can be computed in batch.
Can we do it incrementally, by adding one edge \(m_e\) (one column of the incidence matrix)?
So we can use a greedy algorithm: at every step, add the edge with the maximum gain, then update the Laplacian.
Other solutions include convex relaxation. The non-convex version selects edges with Boolean indicators \(\pi_j\):
\[L = L_{init} + \sum_j \pi_j L_{e_j} = \mathbf{M} \mathbf{W}^\pi \mathbf{M}^\top.\]Its convex relaxation (relaxing \(\pi_j\) to \([0, 1]\)) can be solved with Lagrange multipliers.
A common workflow is to use Mitsuba jointly with Open3D, which provides non-physically-based rendering for interactive viewpoint selection.
Using the snippet
import open3d as o3d
mesh = o3d.io.read_triangle_mesh('/path/to/mesh.ply')
mesh.compute_triangle_normals()
o3d.visualization.draw_geometries([mesh])
an interactive window will pop up. Move your mouse until you find a good viewpoint, then press 'P'. Open3D will take a decent screenshot along with a JSON file of the form
{
"class_name" : "PinholeCameraParameters",
"extrinsic" :
[
0.8390546152869085,
0.54133390210883348,
0.054267476386522975,
0.0,
-0.3175183711744502,
0.40625158739394251,
0.85682071152991279,
0.0,
0.44177985075426657,
-0.73615029319258285,
0.51275072822962653,
0.0,
1.075349911825306,
0.86201671957262604,
9.441486541912365,
1.0
],
"intrinsic" :
{
"height" : 1012,
"intrinsic_matrix" :
[
876.41770862985197,
0.0,
0.0,
0.0,
876.41770862985197,
0.0,
959.5,
505.5,
1.0
],
"width" : 1920
},
"version_major" : 1,
"version_minor" : 0
}
Now copy the extrinsic matrix (note that in the JSON it is stored in column-major order) into a minimal Mitsuba configuration file, and you are almost ready to render.
<scene version="2.0.0">
<shape type="ply">
<string name="filename" value="/path/to/mesh.ply"/>
</shape>
<integrator type="path">
<integer name="max_depth" value="8"/>
</integrator>
<default name="spp" value="256"/>
<emitter id="light_0" type="constant">
<spectrum name="radiance" value="2.0"/>
</emitter>
<sensor type="perspective">
<transform name="to_world">
<scale x="-1"/>
<scale y="-1"/>
<matrix value=" 0.83905462 0.5413339 0.05426748 -1.88128183 -0.31751837 0.40625159 0.85682071 -8.09841352 0.44177985 -0.73615029 0.51275073 -4.68162316 0. 0. 0. 1. "/>
</transform>
<float name="fov" value="60"/>
<sampler type="independent">
<integer name="sample_count" value="$spp"/>
</sampler>
<film type="hdrfilm">
<integer name="width" value="512"/>
<integer name="height" value="512"/>
<rfilter type="box"/>
</film>
</sensor>
</scene>
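Note that the matrix in the config is not the extrinsic itself but its inverse (camera-to-world rather than world-to-camera), applied after the two axis flips. A small sketch that derives it from the JSON values above:

```python
import numpy as np

# The 16 extrinsic values from the JSON above, in column-major order.
extrinsic_cm = [
    0.8390546152869085, 0.54133390210883348, 0.054267476386522975, 0.0,
    -0.3175183711744502, 0.40625158739394251, 0.85682071152991279, 0.0,
    0.44177985075426657, -0.73615029319258285, 0.51275072822962653, 0.0,
    1.075349911825306, 0.86201671957262604, 9.441486541912365, 1.0,
]

# Reshape in Fortran (column-major) order to get the world-to-camera matrix.
extrinsic = np.array(extrinsic_cm).reshape(4, 4, order='F')

# Mitsuba's <matrix> expects the camera-to-world transform, i.e. the inverse,
# flattened row-major.
to_world = np.linalg.inv(extrinsic)
print(' '.join(f'{x:.8f}' for x in to_world.flatten()))
```

The printed 16 numbers can be pasted directly into the `<matrix value="...">` element.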
In practice, first use a small number (like 16) for spp (samples per pixel) and a small image resolution (like 128 x 128), and check whether the output looks reasonable. Adjust the perspective in the config, or the extrinsics back in Open3D interactively, until a decent thumbnail is available. Then increase spp and the resolution to render the final image.
As of now (October 2021), only perspective cameras are supported. If you want to play with fancy orthographic rendering, you may want to build contributed plugins. This is helpful for rendering larger rooms, but the parameter tuning is less intuitive and depends more on manual adjustment.
We present self-supervised geometric perception (SGP), the first general framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels (e.g., camera poses, rigid transformations).
Our first contribution is to formulate geometric perception as an optimization problem that jointly optimizes the feature descriptor and the geometric models given a large corpus of visual measurements (e.g., images, point clouds). Under this optimization formulation, we show that two important streams of research in vision, namely robust model fitting and deep feature learning, correspond to optimizing one block of the unknown variables while fixing the other block.
This analysis naturally leads to our second contribution: the SGP algorithm, which performs alternating minimization to solve the joint optimization. SGP iteratively executes two meta-algorithms: robust model fitting, which generates geometric pseudo-labels from the current feature descriptor, and deep feature learning, which retrains the descriptor under the noisy supervision of these pseudo-labels.
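The two-block structure can be illustrated with a toy alternating minimization (purely illustrative; these are not SGP's actual components):

```python
import numpy as np

# Toy alternating minimization: fit y = a * b * x by optimizing one block
# (a or b) in closed form while fixing the other, then swapping.
rng = np.random.default_rng(0)
x = rng.uniform(1, 2, 50)
y = 6.0 * x  # ground truth: a * b = 6

a, b = 1.0, 1.0
for _ in range(20):
    # Fix b, solve the scalar least squares for a, then fix a and solve for b.
    a = (y @ (b * x)) / ((b * x) @ (b * x))
    b = (y @ (a * x)) / ((a * x) @ (a * x))
print(a * b)  # the product converges to the true coefficient
```

In SGP, one block plays the role of the geometric models (fit robustly) and the other the feature descriptor (trained by gradient descent), but the alternation pattern is the same.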
As a third contribution, we apply SGP to two perception problems on large-scale real datasets, namely relative camera pose estimation on MegaDepth and point cloud registration on 3DMatch. We demonstrate that SGP achieves state-of-the-art performance that is on-par or superior to the supervised oracles trained using ground-truth labels.
@inproceedings{yang2021sgp,
author = {Yang, Heng and Dong, Wei and Carlone, Luca and Koltun, Vladlen},
title = {Self-Supervised Geometric Perception},
booktitle = {CVPR},
month = {June},
year = {2021},
pages = {14350-14361}
}
We present Deep Global Registration, a differentiable framework for pairwise registration of real-world 3D scans.
Deep Global Registration is based on three modules: a 6-dimensional convolutional network for correspondence confidence prediction, a differentiable Weighted Procrustes algorithm for closed-form pose estimation, and a robust gradient-based SE(3) optimizer for pose refinement.
Experiments demonstrate that our approach outperforms state-of-the-art methods, both learning-based and classical, on real-world data.
@inproceedings{choy2020deep,
title={Deep Global Registration},
author={Choy, Christopher and Dong, Wei and Koltun, Vladlen},
booktitle={CVPR},
year={2020}
}
We propose a fast and accurate 3D reconstruction system that takes a sequence of RGB-D frames and produces a globally consistent camera trajectory and a dense 3D geometry.
We redesign core modules of a state-of-the-art offline reconstruction pipeline as GPU-accelerated components to maximally exploit the power of the GPU.
Therefore, while reproducing the results of the high-fidelity offline reconstruction system, our system runs more than 10 times faster on average. Nearly 10 Hz can be achieved in medium-size indoor scenes, making our offline system comparable even to online Simultaneous Localization and Mapping (SLAM) systems in terms of speed.
Experimental results show that our system produces more accurate results than several state-of-the-art online systems. The system is open source at https://github.com/theNded/Open3D.
@inproceedings{dong2019gpu,
title={GPU accelerated robust scene reconstruction},
author={Dong, Wei and Park, Jaesik and Yang, Yi and Kaess, Michael},
booktitle={IROS},
pages={7863--7870},
year={2019},
organization={IEEE}
}
We aim to recover a high resolution texture representation of objects observed from multiple view points under varying lighting conditions.
For many applications, the lighting conditions need to be changed, which requires a texture decomposition into shading and albedo components. Both texture super-resolution and intrinsic texture decomposition have been studied separately in the literature, yet no method has investigated how the two can be combined. We propose a framework for joint texture map super-resolution and intrinsic decomposition. To this end, we define shading and albedo maps of the 3D object as the intrinsic properties of its texture and introduce an image formation model to describe the physics of the image generation.
Our approach accounts for surface geometry and camera calibration errors and is also applicable to spatio-temporal sequences. Our method achieves state-of-the-art results on a variety of datasets.
@inproceedings{tsiminaki2019joint,
title={Joint Multi-view Texture Super-resolution and Intrinsic Decomposition},
author={Tsiminaki, Vagia and Dong, Wei and Oswald, Martin R and Pollefeys, Marc},
booktitle={BMVC},
pages={15},
year={2019}
}
We propose a novel 3D spatial representation for data fusion and scene reconstruction. Probabilistic Signed Distance Function (Probabilistic SDF, PSDF) is proposed to depict uncertainties in the 3D space. It is modeled by a joint distribution describing SDF value and its inlier probability, reflecting input data quality and surface geometry.
A hybrid data structure involving voxel, surfel, and mesh is designed to fully exploit the advantages of various prevalent 3D representations. Connected by PSDF, these components reasonably cooperate in a consistent framework. Given sequential depth measurements, PSDF can be incrementally refined with less ad hoc parametric Bayesian updating. Supported by PSDF and the efficient 3D data representation, high-quality surfaces can be extracted on-the-fly, and in return contribute to reliable data fusion using the geometry information.
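For context, a hedged sketch of the classic weighted-average TSDF update that parametric Bayesian updates of this kind generalize (the function and constants here are illustrative, not the PSDF model itself):

```python
# Classic Curless-Levoy style fusion: each voxel keeps a running weighted
# average of observed signed distances, capped at a maximum weight.
def fuse(tsdf, weight, d_new, w_new=1.0, w_max=64.0):
    tsdf = (tsdf * weight + d_new * w_new) / (weight + w_new)
    weight = min(weight + w_new, w_max)
    return tsdf, weight

t, w = 0.0, 0.0
for d in [0.2, 0.1, 0.15]:  # sequential depth observations for one voxel
    t, w = fuse(t, w, d)
print(t, w)  # running weighted mean of the observations and the total weight
```

PSDF replaces this fixed-weight averaging with a joint distribution over the SDF value and an inlier probability, so unreliable measurements are down-weighted in a principled way.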
Experiments demonstrate that our system reconstructs scenes with higher model quality and lower redundancy, and runs faster than existing online mesh generation systems.
@inproceedings{dong2018psdf,
title={PSDF fusion: Probabilistic signed distance function for on-the-fly 3D data fusion and scene reconstruction},
author={Dong, Wei and Wang, Qiuyuan and Wang, Xin and Zha, Hongbin},
booktitle={ECCV},
pages={701--717},
year={2018}
}