3D vision

University of Texas Deep Learning Lecture

This video covers supervised learning only. The main tasks are: 1. inferring the shape of objects (depth estimation), and 2. object classification.

Summary: Four Key Considerations

  1. Data structure

    As summarized below, the methods differ entirely depending on which data structure is used to represent 3D, even for classification problems. Each has its own strengths.

  2. Distance definition

    How to compare 3D data. Choose carefully among the Chamfer distance, Euclidean distance, F1 score, etc. Do not use IoU.

  3. Camera specifications

    Canonical coordinates: Define the coordinate axes based on the orientation of the target object rather than the camera angle (e.g., the direction the person in the image is facing is +Z). If learned successfully, this can solve issues related to camera angles, but there is a high risk of overfitting.

    View coordinates: Define coordinate axes relative to the camera; the direction the Kinect faces is +Z. Easy to implement.

  4. Dataset

    ShapeNet: Synthetic CAD data that you can interact with, but since there are no backgrounds, it is not suitable for training realistic inference. For some reason, chairs, cars, and airplanes are overrepresented.

    Pix3D: Contains real RGB + D data and can also be used for segmentation training. Weaknesses include a small dataset size and only one object annotated per image.

3D Data Structures

Ways to represent 3D data:

  1. Depth Map (H x W)

    A matrix of the distance between each pixel and the camera (an RGB-D image, often called 2.5D). Powerless against occluded objects.

  2. Voxel Grid (V x V x V)

    A 3D matrix that indicates whether an object exists at each corresponding position, similar to Minecraft. High resolution is required and memory consumption is intensive, so improved versions such as octrees and Nested Shape Layers exist.

  3. Implicit surface (R^3 -> {0, 1})

    A function that takes coordinates as variables and returns whether an object exists at those coordinates. During inference, it returns the probability that an object exists at the coordinates and predicts the color. Since it is a function, it can be trained with a simple approach: sample coordinates from a single 3D data point and learn whether the output is correct.

  4. PointCloud (P x 3)

    A set of coordinates. Captures the shape of objects while keeping memory consumption low, but has the weakness that the object surface is bumpy. Often used during training due to its simplicity, but rendering the bumpy surface requires additional processing when representing objects.

  5. Mesh

    A graph connecting V points to form triangles. By increasing the number of points, objects can be represented in a more realistic form (the surface becomes progressively smoother). Textures can be applied to the triangular faces.
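Of the structures above, the implicit surface is the easiest to sketch in code. Below is a toy occupancy function for a unit sphere; it is an illustrative stand-in for what a learned network would compute, not something from the lecture itself:

```python
import numpy as np

def occupancy(point):
    """Toy implicit surface for a unit sphere centred at the origin:
    returns 1 if the point is inside the object, 0 otherwise.
    A learned implicit surface replaces this hand-written test with
    a neural network that maps (x, y, z) to an occupancy probability."""
    return 1 if np.dot(point, point) <= 1.0 else 0
```

Training then amounts to sampling coordinates, querying the function, and checking the output against the ground-truth shape.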

Tasks

1. Predicting Depth Maps

A task of learning from 4-channel RGB + depth map images and predicting the depth map from the RGB channels alone. Initially, CNNs were used directly, but there was a problem: it is unclear whether a small object in the RGB image appears small due to perspective (it is far away) or because it is actually small and close. The countermeasure was to use a scale-invariant RMS error as the loss function.

In this loss, even if the network misestimates the absolute distance to the scene, the error is computed as 0 as long as the predicted depths are correct up to a common scale factor. More precisely, with the predicted depth map y and the ground truth t:

s * y = t, where s is a scalar

If a scalar s satisfying this exists, the prediction is considered correct. The idea is presumably that it is acceptable as long as the relative depths within the image are correct.
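One way to realize this "correct up to a scalar" idea is the common log-space formulation (the variance of the log-ratios between prediction and ground truth); the exact form used in the lecture may differ, but a minimal sketch looks like this:

```python
import numpy as np

def scale_invariant_error(pred, gt):
    """Scale-invariant error in log space: multiplying the prediction
    by any positive constant shifts every log-difference equally, and
    subtracting the mean cancels that shift, so the error is unchanged."""
    d = np.log(pred) - np.log(gt)
    return (d ** 2).mean() - d.mean() ** 2
```

A prediction that equals the ground truth times any positive scalar therefore scores exactly 0.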

1-2. Predicting Surface Normals

A task of estimating the normal vector for each pixel. This accurately captures the orientation and twist of objects.

2. Classify Voxel Grid

A task of classifying what object a voxel grid represents. 3D CNNs are used to process voxels: just as 2D filters slide over an image, cubic 3D filters slide through the voxel grid. As the layers go deeper, the number of channels increases while the spatial resolution shrinks, and finally the features are processed by FC layers.
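A minimal sketch of the 3D filtering step, assuming a single-channel voxel grid and one filter (the loops are kept explicit for clarity rather than speed):

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution of a (V, V, V) voxel grid with a (k, k, k)
    filter: the 3D analogue of sliding a 2D filter over an image."""
    V, k = volume.shape[0], kernel.shape[0]
    out = V - k + 1
    result = np.zeros((out, out, out))
    for z in range(out):
        for y in range(out):
            for x in range(out):
                # Elementwise product of the filter with one k^3 window
                result[z, y, x] = np.sum(volume[z:z+k, y:y+k, x:x+k] * kernel)
    return result
```

Note the output is itself a (smaller) 3D volume, which is why stacking such layers is so memory-intensive compared to the 2D case.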

2-2. Generating Voxel

A task of predicting a voxel grid from an RGB image. Initially, 2D CNNs were used to extract features from the image, followed by upsampling with 3D CNNs. Because the computational cost of 3D feature maps was enormous, the Voxel Tube method was developed to replace the 3D upsampling with 2D CNNs: it predicts an H x W feature map and treats the channel dimension at each pixel as depth. This approach is faster, but while the relative positions within the object are correct, the entire object may be shifted (it captures the shape of the object but may misjudge the distance from the sensor).

3. Classify PointCloud Inputs

Since the order of points does not matter in a point cloud, each point can be fed independently into a shared MLP. The per-point features are then aggregated with MaxPool (which is also order-invariant) and reduced by an affine layer to a vector of class scores.
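The order-invariance can be demonstrated with a tiny sketch; the weight shapes and the 5-class head below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: a per-point MLP (3 -> 16) and a classifier head
# (16 -> 5 classes); the sizes are chosen only for illustration.
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 5))

def classify_point_cloud(points):
    """PointNet-style sketch: shared per-point MLP, order-invariant
    max pooling over points, then an affine layer to class scores."""
    h = np.maximum(points @ W1, 0)   # same MLP applied to every point
    g = h.max(axis=0)                # max pool over the point dimension
    return g @ W2                    # affine layer -> class scores
```

Shuffling the input points leaves the class scores unchanged, since the max over points ignores ordering.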

3-1. Predicting PointCloud Outputs

To predict point clouds, a loss function that compares two point clouds (the output and the ground truth) is needed. This is the Chamfer loss; in code:

import numpy as np

def chamfer_loss(Y, T):
    # Y = predicted point cloud (P x 3), T = ground-truth point cloud (Q x 3)
    # Pairwise squared distances between every point in Y and every point in T
    dists = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)
    # For each point in Y, squared distance to its closest neighbour in T
    sum_A = dists.min(axis=1).sum()
    # Do the same for each point in T against Y
    sum_B = dists.min(axis=0).sum()
    return sum_A + sum_B

Adjust the network to minimize the Chamfer loss.

4. Predicting Meshes

A task of creating a mesh of an object from an RGB image. The approach repeatedly deforms a clay-like initial mesh, bringing it slightly closer to the target with each step. Graph convolution is the layer used to process meshes: for each vertex, learned weights are applied to the vertex itself and to its neighbours along the edges emanating from it, producing a new vertex feature, which in turn moves the points. By performing this for each filter and progressively increasing the number of vertices, the 3D model becomes more detailed.
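A one-step sketch of such a graph convolution, assuming a dense 0/1 adjacency matrix over the mesh edges and two hypothetical weight matrices (one for the vertex itself, one for its neighbours):

```python
import numpy as np

def graph_conv(features, adjacency, W_self, W_neigh):
    """One graph-convolution step: each vertex's new feature combines
    its own feature with the sum of its neighbours' features.
    features:  (V, F) per-vertex features (e.g. 3D positions)
    adjacency: (V, V) 0/1 matrix; adjacency[i, j] = 1 iff edge i-j exists."""
    return features @ W_self + (adjacency @ features) @ W_neigh
```

With identity weights this simply adds each vertex's neighbour sum to its own feature; learned weights let the network decide how to move each vertex.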

Since the output mesh needs to be compared with the target, a new loss has been prepared. Initially, several points were sampled from the mesh surface and the Chamfer loss was computed.

The problem was that the Chamfer loss is sensitive to outliers, since a single distant nearest neighbour contributes a large squared distance, so computing it directly sometimes resulted in excessively large losses. Therefore, a method that compares the sampled points using an F1 score has been adopted.
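A sketch of such an F1 comparison between sampled point sets; the 0.1 distance threshold is an illustrative assumption:

```python
import numpy as np

def f1_score(pred, gt, threshold=0.1):
    """F1 between point sets sampled from two surfaces.
    Precision: fraction of predicted points within `threshold` of some
    ground-truth point; recall: the reverse. Outliers only lower the
    matched fraction instead of blowing up a squared distance."""
    d = np.sqrt(((pred[:, None, :] - gt[None, :, :]) ** 2).sum(axis=-1))
    precision = (d.min(axis=1) < threshold).mean()
    recall = (d.min(axis=0) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```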

Strong Models

Mesh R-CNN is reportedly recommended. Its structure uses Mask R-CNN for 2D segmentation, and then creates meshes for each object using the method described in the Predicting Meshes section.

In fact, mesh creation through graph convolution is highly dependent on the initial state of the mesh; to mitigate this, the mesh is initialized from a predicted voxel grid using the method described in the Generating Voxel section.

[Bounding_box, Category_label, Instance_segment] = Mask_RCNN(single_RGB_image)

initial_mesh = Voxel_prediction(Bounding_box, Category_label, Instance_segment)

while not good:
    initial_mesh = Graph_Conv(initial_mesh)

# More precisely, the above algorithm is performed for each bounding box.