Introduction
Given a 3D model of a box, we want to reconstruct the model from multiple images that capture it from different angles. First, we build a list of 2D-3D correspondences per image. After that, we estimate the camera pose per image. Then, we project the 2D points from each image onto the 3D model (using the estimated camera pose) to reconstruct the 3D model. Eventually, we shall use the constructed model to detect the object under varying poses and occlusions in different images using SIFT-based feature matching.
During this tutorial, I will assume a basic knowledge of camera intrinsics, extrinsics, world camera pose, Perspective-n-Point (PnP) pose estimation, and Random Sample Consensus (RANSAC). In this section, I will briefly touch on each of these methods to get you through this tutorial.
Camera Model
For any given point $\mathbf M$ (in the real world), we need to express that point from the camera's perspective, as a point $\mathbf M_\text{cam}$, where:

$$\mathbf M_\text{cam} = \mathbf R \mathbf M + \mathbf t$$

$\mathbf M_\text{cam}$ is the point translated (i.e. shifted) and rotated into the camera's perspective. $\mathbf M$ is the actual point in the real world. $\mathbf R$ is a $3 \times 3$ matrix that represents the "rotation". $\mathbf t$ is a vector that represents the "translation" (i.e. shift).
The above expression can be written in homogeneous coordinates as follows:

$$\mathbf M_\text{cam} = [\mathbf R \mid \mathbf t]\, \tilde{\mathbf M}, \qquad \tilde{\mathbf M} = \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

where $[\mathbf R \mid \mathbf t]$ is a $3 \times 4$ rotation-and-translation matrix. The above operation is now a linear transformation rather than an affine (geometrical) one.
In order to project the 3D $\mathbf M_\text{cam} = (X_\text{cam}, Y_\text{cam}, Z_\text{cam})^\top$ coordinates onto the image plane, we obtain $\mathbf m = (x, y)^\top$, where

$$x = f\,\frac{X_\text{cam}}{Z_\text{cam}}, \qquad y = f\,\frac{Y_\text{cam}}{Z_\text{cam}}$$

and $f$ is the focal length. We could rewrite that in homogeneous coordinates as:

$$\tilde{\mathbf m} \sim \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \mathbf M_\text{cam}$$
In order to move from the image-plane projection to an image (pixel) coordinate system ($u$ and $v$), we need to know the number of pixels per unit of measurement along each axis ($k_u$, $k_v$).

Let $k_u$ and $k_v$ be the pixels per unit of measurement along the X- and Y-axis, and $(u_0, v_0)$ be the projection of the camera center in the image coordinate system. This gives the calibration matrix:

$$\mathbf K = \begin{bmatrix} f k_u & 0 & u_0 \\ 0 & f k_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$
We could squeeze all translation ($\mathbf t$), rotation ($\mathbf R$) and calibration ($\mathbf K$) into a projection matrix ($\mathbf P$):

$$\mathbf P = \mathbf K\, [\mathbf R \mid \mathbf t], \qquad \tilde{\mathbf m} \sim \mathbf P \tilde{\mathbf M}$$
RANSAC

RANSAC is an outlier-rejection method that can be used with various regression methods. It defines the quality of a regression fit by the number of inliers. Inliers/outliers are determined by comparing distances against a pre-defined threshold. More details are explained in the RANSAC Wikipedia article.
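To make the idea concrete, here is a minimal, generic RANSAC sketch in Matlab; fitModel, residuals, numIters, minSamples, and threshold are hypothetical placeholders, not code from this project:

```matlab
% Generic RANSAC loop (illustrative sketch; fitModel/residuals are
% hypothetical stand-ins for a concrete regression method).
bestInliers = [];
bestModel   = [];
for iter = 1:numIters
    % 1. Fit a model to a minimal random sample of the data.
    idx   = randperm(size(data, 1), minSamples);
    model = fitModel(data(idx, :));
    % 2. Count how many points agree with that model.
    d       = residuals(model, data);      % one distance per data point
    inliers = find(d < threshold);
    % 3. Keep the model with the most inliers so far.
    if numel(inliers) > numel(bestInliers)
        bestInliers = inliers;
        bestModel   = model;
    end
end
```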
Camera Pose Estimation
We used Matlab to manually create the 2D-3D correspondences between at least 6 points in the image and the box. Those points were usually the corners of the box. After that, we estimated the camera pose for each image using the following Matlab code:
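A minimal sketch of that step, assuming imagePoints holds the N-by-2 clicked pixel coordinates, worldPoints the corresponding N-by-3 box corners, and cameraParams the known intrinsics:

```matlab
% Estimate the camera pose from the manual 2D-3D correspondences.
% estimateWorldCameraPose runs a PnP solver inside a RANSAC-style loop.
[worldOrientation, worldLocation] = estimateWorldCameraPose( ...
    imagePoints, worldPoints, cameraParams);
```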
SIFT Feature Extraction
I used the VLFeat [1] implementation in Matlab to extract the SIFT features as follows:
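A minimal sketch (the image file name is hypothetical); vl_sift expects a single-precision grayscale image:

```matlab
% Load the image and convert it to the single-precision grayscale
% format that VLFeat's vl_sift expects.
I     = imread('box_view_1.jpg');     % hypothetical file name
Igray = single(rgb2gray(I));
% f is a 4-by-N matrix of frames [x; y; scale; orientation];
% d is the corresponding 128-by-N matrix of SIFT descriptors.
[f, d] = vl_sift(Igray);
```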
3D Model Construction
In order to project the 2D points onto the 3D model, we perform the following steps:

1. Project each given 2D point into 3D using the estimated camera pose, and shoot a ray from the center of the camera passing through the projected 3D point.
2. Intersect the ray with each face of the box and select the face that meets the following criteria:
   - The ray intersects the plane in which the face lies.
   - The plane faces the camera; at any point, there are two faces with which our ray will intersect: one on the side of the box that faces the camera, and the other on the back of the box, thus not facing the camera.
   - The intersection point lies on that face (and, thus, on the box), not just on the plane.

A face is defined as a triangle here. And the box has 12 triangles (2 lying on each side of the box; a box has 6 sides, right?). A minimal sketch of such a mesh follows.
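Assuming the box dimensions w, h, and d were measured by hand, one possible mesh (the vertex ordering here is illustrative, not the project's actual one):

```matlab
% 8 corners of a w-by-h-by-d box, and 12 triangles (two per side).
% Each row of F indexes three rows of V, forming one triangular face.
V = [0 0 0;  w 0 0;  w h 0;  0 h 0; ...
     0 0 d;  w 0 d;  w h d;  0 h d];
F = [1 2 3; 1 3 4; ...   % bottom
     5 6 7; 5 7 8; ...   % top
     1 2 6; 1 6 5; ...   % front
     2 3 7; 2 7 6; ...   % right
     3 4 8; 3 8 7; ...   % back
     4 1 5; 4 5 8];      % left
```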
Step 1: Shoot Ray
In order to project the 2D points into world coordinates (i.e. in 3D), we need the projection matrix. In order to get the projection matrix, we need the calibration matrix, rotation matrix, and translation vector. The calibration matrix is known a priori. The rotation matrix and translation vector are obtained from the estimated pose. Another workaround is to use the worldOrientation and worldLocation given by the estimateWorldCameraPose Matlab function to evaluate the 3D point directly through the following equation:

$$\mathbf M = (\tilde{\mathbf m}\, \mathbf K^{-1})\, \mathbf R_w + \mathbf t_w$$

where $\mathbf R_w$ is the world orientation, $\mathbf t_w$ is the world location, and $\tilde{\mathbf m} = (u, v, 1)$ is the image point in homogeneous form (following Matlab's row-vector convention). The resulting $\mathbf M$ is a point on the ray, which starts at the camera center $\mathbf t_w$ and passes through $\mathbf M$.
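A minimal sketch of this step, assuming K holds the calibration matrix and Rw, tw come from estimateWorldCameraPose (the variable names are mine):

```matlab
% Back-project the pixel (u, v) into world space and build the ray.
% Matlab's row-vector convention: points are 1-by-3 rows.
mTilde = [u, v, 1];
M      = (mTilde / K) * Rw + tw;            % mTilde / K == mTilde * inv(K)
origin = tw;                                 % camera center in world coordinates
rayDir = (M - origin) / norm(M - origin);    % unit direction of the ray
```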
Step 2: Find Ray-Face/Triangle Intersection Point
Sub-Step 2.1: Intersect Ray with Plane of Triangle/Face
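A minimal sketch, assuming the triangle vertices A, B, C and the ray origin/rayDir from the previous step (all 1-by-3 rows):

```matlab
% Intersect the ray with the plane spanned by triangle (A, B, C).
n     = cross(B - A, C - A);            % plane normal (not normalized)
denom = dot(rayDir, n);
if abs(denom) > 1e-9                    % otherwise the ray is parallel to the plane
    t = dot(A - origin, n) / denom;     % signed distance along the ray
    P = origin + t * rayDir;            % the intersection point
end
```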
Sub-Step 2.2: Choose Plane Intersection That Faces The Camera
We could do that by checking whether the angle between the normal of the plane and the direction from the intersection point back towards the camera is less than 90 degrees, as follows:
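A sketch of that check, reusing n, P, and origin from the previous sub-step:

```matlab
% A face points at the camera if its outward normal makes an angle of
% less than 90 degrees with the direction from the face back to the camera.
% (Assumes consistently outward-pointing normals, i.e. consistent winding.)
toCamera    = origin - P;               % intersection point -> camera center
facesCamera = dot(n, toCamera) > 0;     % cos(angle) > 0  <=>  angle < 90 degrees
```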
Sub-Step 2.3: Intersection Point Should Lie on that Triangle/Face
Given that the point lies in the same plane as the triangle, a simple check that the area of the triangle equals the sum of all sub-triangles created by the new point should do the trick. Formally, assume that a triangle has three vertices $\mathbf A$, $\mathbf B$ and $\mathbf C$, and the intersection point is $\mathbf P$. The following equality should hold:

$$\operatorname{Area}(\mathbf A, \mathbf B, \mathbf C) = \operatorname{Area}(\mathbf P, \mathbf A, \mathbf B) + \operatorname{Area}(\mathbf P, \mathbf B, \mathbf C) + \operatorname{Area}(\mathbf P, \mathbf C, \mathbf A)$$
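In code, using the cross product for the triangle area and a small tolerance for floating-point error (a sketch, with variable names of my choosing):

```matlab
% Area of a 3D triangle via the cross product of two edge vectors.
triArea = @(X, Y, Z) 0.5 * norm(cross(Y - X, Z - X));
% P lies inside the triangle iff the sub-triangle areas sum to the total.
subSum     = triArea(P, A, B) + triArea(P, B, C) + triArea(P, C, A);
inTriangle = abs(subSum - triArea(A, B, C)) < 1e-6;   % numerical tolerance
```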
Step 3: Put it all together!
Final Pipeline: 3D Model Construction
Now, we have all the components to loop through all of the images (here, 8 images) and reconstruct the model.
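A sketch of that loop, assuming hypothetical helpers shootRay and intersectBox that wrap Steps 1 and 2, per-image correspondences in imagePoints{i}, and hypothetical file names:

```matlab
% Reconstruct the textured model by accumulating [X Y Z R G B] rows.
model = [];
for i = 1:8
    I = imread(sprintf('box_view_%d.jpg', i));        % hypothetical file names
    [Rw, tw] = estimateWorldCameraPose( ...
        imagePoints{i}, worldPoints, cameraParams);   % pose for this image
    f = vl_sift(single(rgb2gray(I)));                 % SIFT frames [x; y; s; o]
    for k = 1:size(f, 2)
        [origin, rayDir] = shootRay(f(1:2, k)', K, Rw, tw);   % Step 1
        [hit, P] = intersectBox(origin, rayDir, V, F);        % Step 2
        if hit
            rgb = impixel(I, round(f(1, k)), round(f(2, k))); % pixel color
            model(end + 1, :) = [P, rgb];             %#ok<AGROW>
        end
    end
end
```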
Now, we plot the camera world locations and the box with the SIFT features on it. However, we don't plot the SIFT descriptors themselves; we plot the RGB values at the feature locations instead.
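A plotting sketch, assuming the model matrix accumulated above and the last pose Rw, tw:

```matlab
% Plot the reconstructed points, colored by their sampled RGB values,
% together with an estimated camera pose.
figure; hold on; axis equal; grid on;
scatter3(model(:, 1), model(:, 2), model(:, 3), ...
         8, model(:, 4:6) / 255, 'filled');
plotCamera('Location', tw, 'Orientation', Rw, 'Size', 0.05);
```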
3D Model Constructed from 8 Images from Different Camera Poses
References
[1] A. Vedaldi and B. Fulkerson, "VLFeat: An Open and Portable Library of Computer Vision Algorithms," 2008. [Online]. Available: http://www.vlfeat.org/