# 2D-3D Correspondences-based Model Construction

24 December 2017
Yehya Abouelnaga¹, Balamurugan Thambiraja¹, Kamel Guerda¹˒²
¹ Technical University of Munich
² Université Pierre et Marie Curie

## Introduction

Given a 3D model of a box, we want to reconstruct the model from multiple images that capture the box from different angles. First, we build a list of 2D-3D correspondences per image. After that, we estimate the camera pose per image. Then, we project the 2D points from each image onto the 3D model (using the estimated camera pose) to reconstruct the 3D model. Eventually, we shall use the constructed model to detect the object under varying poses and occlusions in different images using SIFT-based feature matching.

Note that this is a graduate class project conducted in the "Tracking and Detection in Computer Vision" class at the Technical University of Munich (TUM), taught by Dr. Slobodan Ilic.

## Background

Throughout this tutorial, I assume a basic knowledge of camera intrinsics and extrinsics, world camera pose, Perspective-n-Point (PnP) pose estimation and Random Sample Consensus (RANSAC). In this section, I will briefly touch on each of these to get you through this tutorial.

### Camera Model

For any given point $\mathbf{M}$ (in the real world), we need to express that point in the camera’s coordinate frame, as a point $\mathbf{M}_\text{cam}$, where:
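Assuming the standard rigid-body convention, which matches the symbol definitions that follow, the relation reads:

$$\mathbf{M}_\text{cam} = \mathbf{R}\,\mathbf{M} + \mathbf{T}$$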

$\mathbf{M}_\text{cam}$ is the point rotated and translated (i.e. shifted) into the camera’s frame. $\mathbf{M}$ is the actual point in the real world. $\mathbf{R}$ is a $3 \times 3$ matrix that represents the rotation. $\mathbf{T}$ is a $3 \times 1$ vector that represents the translation (i.e. shift).

The above expression can be written in homogeneous coordinates as follows:
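Appending a 1 to $\mathbf{M}$ folds the translation into a single matrix product, so the homogeneous form is presumably:

$$\mathbf{M}_\text{cam} = (\mathbf{R} \, \mid \, \mathbf{T}) \begin{pmatrix} \mathbf{M} \\ 1 \end{pmatrix}$$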

where $(\mathbf{R} \, \mid \, \mathbf{T} )$ is the $3 \times 4$ matrix formed by placing the rotation and the translation side by side. In homogeneous coordinates, the operation above is a single linear map (one matrix product) rather than an affine one (a matrix product plus a shift).

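In order to project the 3D point $\mathbf M_\text{cam}$ onto the image plane, we obtain the 2D point $\mathbf m$ by perspective division, i.e. by dividing by the depth of the point along the camera’s optical axis.

We can also write that projection as a matrix product in homogeneous coordinates.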
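As a sketch under the usual pinhole assumptions, write $\mathbf M_\text{cam} = (X, Y, Z)^\top$ and let $f$ be the focal length (both symbols are introduced here for illustration). The projection $\mathbf m = (m_X, m_Y)^\top$ would then be:

$$m_X = f\,\frac{X}{Z}, \qquad m_Y = f\,\frac{Y}{Z}$$

and, in homogeneous coordinates (defined up to the scale factor $Z$):

$$Z \begin{pmatrix} m_X \\ m_Y \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$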

In order to move from the image-plane projection to pixel coordinates ($m_X \rightarrow m_u$ and $m_Y \rightarrow m_v$), we need to know the number of pixels per unit of length along each axis ($k$, in pixels/mm).

Let $k_u , k_v$ be the number of pixels per unit of length along the X- and Y-axis, and $(u_0, v_0)$ be the projection of the camera center $C$ in the image coordinate system.
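Under the common zero-skew assumption, and with $f$ denoting the focal length (a symbol assumed here), the change to pixel coordinates and the resulting calibration matrix $\mathbf K$ would be:

$$m_u = k_u\, m_X + u_0, \qquad m_v = k_v\, m_Y + v_0$$

$$\mathbf K = \begin{pmatrix} f k_u & 0 & u_0 \\ 0 & f k_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}$$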

We can fold the translation ($\mathbf T$), rotation ($\mathbf R$) and calibration ($\mathbf K$) into a single projection matrix ($\mathbf P$).
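In the usual notation this composition is $\mathbf P = \mathbf K\,(\mathbf R \mid \mathbf T)$. The following is a minimal Python sketch of it (Python rather than the Matlab used in the project, so that it runs without any toolbox). The function names `projection_matrix` and `project`, and the numeric values of `K`, `R` and `T`, are made up for illustration.

```python
def projection_matrix(K, R, T):
    """Build the 3x4 matrix P = K (R | T) from plain nested lists."""
    P = []
    for i in range(3):
        # First three columns: K times R.
        row = [sum(K[i][k] * R[k][j] for k in range(3)) for j in range(3)]
        # Fourth column: K times T.
        row.append(sum(K[i][k] * T[k] for k in range(3)))
        P.append(row)
    return P


def project(P, M):
    """Project a 3D world point M to pixel coordinates (u, v)."""
    X = list(M) + [1.0]  # homogeneous world point
    wu, wv, w = (sum(P[i][j] * X[j] for j in range(4)) for i in range(3))
    return wu / w, wv / w  # perspective division


# Illustrative values only: 800 px focal length, principal point (320, 240),
# camera axes aligned with the world, box 5 units in front of the camera.
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = [0.0, 0.0, 5.0]

P = projection_matrix(K, R, T)
pixel = project(P, (1.0, 0.5, 0.0))
```

Keeping `P` as one matrix means a world point reaches the image with a single product followed by one division.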

### Perspective-n-Point Algorithm (PnP)

The Wikipedia explanation of the PnP algorithm is fairly simple to follow. You can also refer to Prof. Steven LaValle’s lecture on the same topic for a better understanding.
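For orientation, one common way to state the problem (not necessarily how any particular solver works internally) is as a reprojection-error minimization over $n$ known 2D-3D pairs $(\mathbf m_i, \mathbf M_i)$, where $\pi$ denotes division by the third coordinate:

$$(\hat{\mathbf R}, \hat{\mathbf T}) = \arg\min_{\mathbf R,\, \mathbf T} \sum_{i=1}^{n} \left\| \mathbf m_i - \pi\!\left(\mathbf K\,(\mathbf R\,\mathbf M_i + \mathbf T)\right) \right\|^2$$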

### Random Sample Consensus (RANSAC)

RANSAC is an outlier-rejection method that can be combined with various model-fitting methods (regression, for example). It repeatedly fits a model to a small random sample of the data and scores that model by its number of inliers. A data point counts as an inlier if its distance to the model is below a pre-defined threshold, and as an outlier otherwise. More details are explained in the RANSAC Wikipedia article.
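To make the idea concrete, here is a minimal Python sketch of RANSAC on a toy problem, fitting a 2D line rather than a camera pose. The function names, the threshold, the iteration count and the data are all made up for illustration.

```python
import random


def fit_line(p, q):
    """Line through two points as (slope, intercept); None if vertical."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def ransac_line(points, threshold=0.5, iterations=200, seed=0):
    """Return the line with the most inliers, together with those inliers."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Fit a candidate model to a minimal random sample (2 points).
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        slope, intercept = model
        # 2. Inliers are the points whose vertical distance is within the threshold.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        # 3. Keep the candidate supported by the most inliers.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# Ten points on y = 2x + 1 plus three gross outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(2.0, 30.0), (7.0, -12.0), (4.0, 50.0)]

model, inliers = ransac_line(points)
```

For pose estimation the same loop applies, with the line replaced by a camera pose fitted to a minimal set of 2D-3D correspondences and the vertical distance replaced by the reprojection error.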

## Camera Pose Estimation

We used Matlab to manually create the 2D-3D correspondences between at least 6 points in each image and the box. Those points were usually the corners of the box. After that, we estimated the camera pose for each image with Matlab’s `estimateWorldCameraPose` function.

## SIFT Feature Extraction

I used the VLFeat [1] implementation in Matlab to extract the SIFT features.

## 3D Model Construction

In order to project the 2D points onto the 3D model, we need the following steps:

1. Back-project each 2D point into 3D using the camera pose, and shoot a ray from the camera center through the back-projected 3D point.
2. Intersect the ray with each face of the box and select the face that meets the following criteria:
    1. The ray intersects the plane in which the face lies.
    2. The plane faces the camera. In general, the ray intersects two sides of the box: one on the side that faces the camera and one on the back of the box, which does not face the camera.
    3. The intersection point lies on the face itself (and thus on the box), not just on its plane.

A face is defined as a triangle here. So the box has 12 faces: 2 triangles on each of its 6 sides.

### Step 1: Shoot Ray

In order to project the 2D points into world coordinates (i.e. into 3D), we need the projection matrix. In order to get the projection matrix, we need the calibration matrix, rotation matrix, and translation vector. The calibration matrix is known a priori. The rotation matrix and translation vector are obtained from the estimated pose. Another workaround is to use the `worldOrientation` and `worldLocation` returned by the `estimateWorldCameraPose` Matlab function to evaluate the 3D point directly through the following equation:

where $\mathbf O$ is the world orientation, and $\mathbf L$ is the world location.
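As a sketch of this workaround, assume Matlab’s row-vector convention, in which a camera-frame point $\mathbf x_c$ maps to the world as $\mathbf x_c\,\mathbf O + \mathbf L$ (worth checking against the Matlab documentation). With zero skew and focal lengths in pixels $f_x, f_y$ (symbols assumed here), a pixel $(u, v)$ would back-project to:

$$\mathbf M = \mathbf x_c\,\mathbf O + \mathbf L, \qquad \mathbf x_c = \left(\frac{u - u_0}{f_x},\; \frac{v - v_0}{f_y},\; 1\right)$$

The Python below (not the project’s Matlab) turns that into a ray. The name `pixel_to_ray` and all numeric values are hypothetical.

```python
def pixel_to_ray(pixel, fx, fy, u0, v0, orientation, location):
    """Return (origin, unit direction) of the viewing ray through a pixel.

    Assumes zero skew and the row-vector convention world = camera * O + L.
    """
    u, v = pixel
    # Undo the calibration: direction of the pixel in the camera frame.
    d_cam = ((u - u0) / fx, (v - v0) / fy, 1.0)
    # Rotate into the world frame (row vector times 3x3 matrix).
    d_world = [sum(d_cam[i] * orientation[i][j] for i in range(3))
               for j in range(3)]
    length = sum(c * c for c in d_world) ** 0.5
    # The ray starts at the camera center, i.e. the world location L.
    return tuple(location), tuple(c / length for c in d_world)


# Illustrative setup: camera 5 units behind the origin, axes aligned with the world.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = pixel_to_ray((320.0, 240.0), 800.0, 800.0, 320.0, 240.0,
                                 identity, (0.0, 0.0, -5.0))
_, diagonal = pixel_to_ray((1120.0, 240.0), 800.0, 800.0, 320.0, 240.0,
                           identity, (0.0, 0.0, -5.0))
```

Every point on the ray is `origin + t * direction` for some `t >= 0`, which is the form the intersection step expects.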

### Step 2: Find Ray-Face/Triangle Intersection Point

#### Sub-Step 2.1: Intersect Ray with Plane of Triangle/Face

Here, I am using the Plane Line Intersection Matlab code by Nassim Khaled to check for intersection and get the intersection point.
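For readers without that code at hand, the computation can be sketched in a few lines of Python. This is an independent illustration, not Nassim Khaled’s implementation, and the function name `intersect_ray_plane` is made up.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_ray_plane(origin, direction, p1, p2, p3, eps=1e-12):
    """Intersect origin + t * direction (t >= 0) with the plane through p1, p2, p3.

    Returns the intersection point, or None if there is none.
    """
    normal = cross(sub(p2, p1), sub(p3, p1))
    denom = dot(normal, direction)
    if abs(denom) < eps:
        return None  # ray is parallel to the plane
    t = dot(normal, sub(p1, origin)) / denom
    if t < 0:
        return None  # plane lies behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))


# Plane z = 0, ray travelling along +z from z = -2.
hit = intersect_ray_plane((0.2, 0.3, -2.0), (0.0, 0.0, 1.0),
                          (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
miss = intersect_ray_plane((0.2, 0.3, -2.0), (1.0, 0.0, 0.0),
                           (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```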

#### Sub-Step 2.2: Choose Plane Intersection That Faces The Camera

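We could do that by checking whether the angle between the outward normal of the plane and the direction from the intersection point back to the camera is less than 90 degrees (equivalently, the dot product of the outward normal and the ray direction is negative) as follows: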
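A minimal Python sketch of that test follows. It assumes the box is convex, so the outward normal can be found by orienting the triangle normal away from the box center. The names `outward_normal` and `faces_camera` are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def outward_normal(p1, p2, p3, box_center):
    """Triangle normal, flipped if needed so that it points away from the box center."""
    normal = cross(sub(p2, p1), sub(p3, p1))
    if dot(normal, sub(p1, box_center)) < 0:
        normal = tuple(-c for c in normal)
    return normal


def faces_camera(normal, ray_direction):
    """True if the face's outward normal opposes the viewing ray."""
    return dot(normal, ray_direction) < 0


# Unit cube viewed along +z: the z = 0 side is visible, the z = 1 side is not.
center = (0.5, 0.5, 0.5)
ray = (0.0, 0.0, 1.0)
front = outward_normal((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), center)
back = outward_normal((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0), center)
front_visible = faces_camera(front, ray)
back_visible = faces_camera(back, ray)
```

Orienting normals via the box center avoids relying on a consistent vertex winding order in the model file.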

#### Sub-Step 2.3: Intersection Point Should Lie on that Triangle/Face

Given that the point lies in the same plane as the triangle, a simple check does the trick: the area of the triangle must equal the sum of the areas of the three sub-triangles that the new point forms with the triangle’s edges. Formally, assume that a triangle has three vertices $P_1$, $P_2$ and $P_3$, and the intersection point is $I$. The following equality should hold:
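Written out from the description above, the equality is:

$$\operatorname{Area}(P_1 P_2 P_3) = \operatorname{Area}(I P_1 P_2) + \operatorname{Area}(I P_2 P_3) + \operatorname{Area}(I P_3 P_1)$$

A minimal Python sketch of the check is given below. With floating-point numbers the two sides rarely match exactly, so the sketch compares them up to a relative tolerance. The tolerance value and the function names are assumptions.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    n = cross(sub(b, a), sub(c, a))
    return 0.5 * (n[0] ** 2 + n[1] ** 2 + n[2] ** 2) ** 0.5


def point_in_triangle(point, p1, p2, p3, rel_tol=1e-9):
    """True if a point in the triangle's plane lies inside it or on its border."""
    total = triangle_area(p1, p2, p3)
    parts = (triangle_area(point, p1, p2)
             + triangle_area(point, p2, p3)
             + triangle_area(point, p3, p1))
    # For a point outside the triangle the sub-areas add up to more than the total.
    return abs(total - parts) <= rel_tol * total


p1, p2, p3 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
inside = point_in_triangle((0.25, 0.25, 0.0), p1, p2, p3)
outside = point_in_triangle((1.0, 1.0, 0.0), p1, p2, p3)
```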

## Final Pipeline: 3D Model Construction

Now, we have all components to loop through all of the images (here, those are 8 images) and reconstruct the model.
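The loop can be sketched as follows, in Python rather than the project’s Matlab. The sketch assumes an axis-aligned box with one corner at the origin, and it takes the per-image keypoint rays as given; in the real pipeline they would come from the SIFT keypoints and the estimated poses. The descriptor labels, the function names and the data are all hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return dot(a, a) ** 0.5

def box_triangles(size):
    """Split each of the 6 sides of an origin-anchored box into 2 triangles."""
    triangles = []
    for axis in range(3):
        a, b = [i for i in range(3) if i != axis]
        for value in (0.0, size[axis]):
            corners = []
            for ca, cb in ((0, 0), (1, 0), (1, 1), (0, 1)):
                p = [0.0, 0.0, 0.0]
                p[axis], p[a], p[b] = value, ca * size[a], cb * size[b]
                corners.append(tuple(p))
            triangles.append((corners[0], corners[1], corners[2]))
            triangles.append((corners[0], corners[2], corners[3]))
    return triangles

def hit_box(origin, direction, size, tol=1e-9):
    """First camera-facing triangle hit by the ray, or None if the ray misses."""
    center = tuple(s / 2.0 for s in size)
    for p1, p2, p3 in box_triangles(size):
        normal = cross(sub(p2, p1), sub(p3, p1))
        if dot(normal, sub(p1, center)) < 0:
            normal = tuple(-c for c in normal)  # make the normal point outward
        denom = dot(normal, direction)
        if denom >= 0:
            continue  # parallel to the plane, or the plane faces away
        t = dot(normal, sub(p1, origin)) / denom
        if t < 0:
            continue  # plane is behind the camera
        hit = tuple(o + t * d for o, d in zip(origin, direction))
        total = norm(normal)  # twice the triangle area
        parts = sum(norm(cross(sub(u, hit), sub(v, hit)))
                    for u, v in ((p1, p2), (p2, p3), (p3, p1)))
        if abs(total - parts) <= tol * total:
            return hit  # the point lies on the triangle itself
    return None

def reconstruct(images, size):
    """Collect (3D point, descriptor) pairs over all images."""
    model = []
    for rays in images:
        for origin, direction, descriptor in rays:
            hit = hit_box(origin, direction, size)
            if hit is not None:
                model.append((hit, descriptor))
    return model

# Two hypothetical images; the second ray of the first image misses the box.
images = [
    [((0.5, 0.25, -3.0), (0.0, 0.0, 1.0), "kp-a"),
     ((5.0, 5.0, -3.0), (0.0, 0.0, 1.0), "kp-miss")],
    [((3.0, 0.75, 0.25), (-1.0, 0.0, 0.0), "kp-b")],
]
model = reconstruct(images, (1.0, 1.0, 1.0))
```

Because the box is convex, at most one camera-facing side can contain the hit point, so returning the first match is enough here.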

Now, we plot the camera world locations and the box with the SIFT features on it. However, we don’t plot the SIFT features themselves; we plot RGB values instead.

Figure: 3D model constructed from 8 images taken from different camera poses.

## References

[1] A. Vedaldi and B. Fulkerson, “VLFeat: An Open and Portable Library of Computer Vision Algorithms,” 2008. [Online]. Available at: http://www.vlfeat.org/