In the following, we highlight the popular 2.5D and 3D data representations used to represent and analyze scenes.
Point Cloud
A ‘point cloud’ is a collection of data points in 3D space. Taken together, these points describe the geometry of an individual object or of a complete scene. Range scanners (typically laser-based, e.g., LiDAR) are commonly used to capture 3D point clouds of objects or scenes.
Voxel
A voxel (volumetric element) is the 3D counterpart of a pixel (picture element) in a 2D image. Voxelization is the process of converting a continuous geometric object into a set of discrete voxels that best approximate the object. A voxel can be considered as a cubic volume representing a unit sample on a uniformly spaced 3D grid. Usually, a voxel value is mapped to either 0 or 1, where 0 indicates an empty voxel while 1 indicates the presence of range points inside the voxel.
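As an illustration of the binary occupancy mapping described above, the following is a minimal sketch that voxelizes a small point cloud. The function names `voxelize` and `occupancy`, the grid origin, and the voxel size are assumptions made for this example, and a sparse set of indices stands in for a dense 3D array.

```python
from math import floor


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Return the set of integer voxel indices (i, j, k) that contain a point."""
    occupied = set()
    for x, y, z in points:
        index = (
            floor((x - origin[0]) / voxel_size),
            floor((y - origin[1]) / voxel_size),
            floor((z - origin[2]) / voxel_size),
        )
        occupied.add(index)
    return occupied


def occupancy(occupied, index):
    """Binary voxel value: 1 if any point falls inside the voxel, else 0."""
    return 1 if index in occupied else 0


# Hypothetical point cloud; the first two points share a voxel.
points = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.2, 0.1, 0.9), (2.7, 2.2, 0.1)]
grid = voxelize(points, voxel_size=1.0)
```

Storing only the occupied indices keeps the sketch short. A dense array would make empty voxels explicit at the cost of memory that grows cubically with the resolution.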
Mesh
A mesh encodes the geometry of a 3D object as a combination of vertices, edges, and faces. A mesh that represents the surface of a 3D object using polygon-shaped faces (e.g., triangles or quadrilaterals) is termed a ‘polygon mesh.’ A mesh may contain arbitrary polygons, but a ‘regular mesh’ is composed of only a single type of polygon. A commonly used mesh is the triangular mesh, which is composed entirely of triangle-shaped faces. In contrast to polygon meshes, ‘volumetric meshes’ represent the interior volume of the object in addition to its surface.
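A polygon mesh is often stored as a vertex list plus a face list that indexes into it, with the edges derived from the faces. The sketch below uses that indexed layout for a tetrahedron. The layout and the helper names `mesh_edges` and `is_single_polygon_type` are assumptions made for illustration.

```python
def mesh_edges(faces):
    """Collect the undirected edges implied by a list of polygonal faces."""
    edges = set()
    for face in faces:
        n = len(face)
        for i in range(n):
            a, b = face[i], face[(i + 1) % n]
            edges.add((min(a, b), max(a, b)))  # store each edge once
    return edges


def is_single_polygon_type(faces):
    """True if every face has the same number of vertices."""
    return len({len(face) for face in faces}) == 1


# A tetrahedron as a triangular mesh: 4 vertices, 4 triangular faces.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
edges = mesh_edges(faces)

# Euler characteristic V - E + F, which equals 2 for a closed surface
# without holes such as this one.
euler = len(vertices) - len(edges) + len(faces)
```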
Depth Channel and Encodings
A depth channel in a 2.5D representation stores the estimated distance of each pixel from the viewer. This raw data has been used to obtain more meaningful encodings such as HHA. Specifically, this geocentric embedding encodes the depth image using three channels for each pixel: horizontal disparity, height above the ground, and the angle between the local surface normal and the gravity direction.
Octree
An octree is a voxelized representation of a 3D shape that provides high compactness. The underlying data structure is a tree in which each internal node has eight children. The idea is to divide the 3D occupancy of an object recursively into smaller regions, so that empty regions and regions of similar voxels are represented by larger voxels. An octree of an object is obtained by a hierarchical process as follows: start by considering the 3D object occupancy as a single block and divide it into eight octants. Octants that only partially contain the object are then divided further. This process continues until a minimum allowed size is reached. The octants can be labeled based on the object occupancy.
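The recursive subdivision described above can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes the occupancy is given as a set of integer voxel indices inside a cube whose side length is a power of two, and the labels `"empty"`, `"full"`, and `"partial"` are names chosen for this example.

```python
def build_octree(occupied, origin, size, min_size=1):
    """Return 'empty', 'full', 'partial', or a list of eight child nodes."""
    x0, y0, z0 = origin
    cells = [
        (x, y, z)
        for x in range(x0, x0 + size)
        for y in range(y0, y0 + size)
        for z in range(z0, z0 + size)
    ]
    filled = sum(1 for cell in cells if cell in occupied)
    if filled == 0:
        return "empty"      # homogeneous block: no subdivision needed
    if filled == len(cells):
        return "full"       # homogeneous block: no subdivision needed
    if size <= min_size:
        return "partial"    # minimum allowed size reached
    half = size // 2
    return [
        build_octree(occupied, (x0 + dx, y0 + dy, z0 + dz), half, min_size)
        for dx in (0, half)
        for dy in (0, half)
        for dz in (0, half)
    ]


def count_leaves(node):
    """Number of labeled octants stored in the tree."""
    if isinstance(node, list):
        return sum(count_leaves(child) for child in node)
    return 1


# 4x4x4 grid: one fully occupied 2x2x2 corner plus a single far voxel.
occupied = {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
occupied.add((3, 3, 3))
tree = build_octree(occupied, origin=(0, 0, 0), size=4)
leaves = count_leaves(tree)
```

In this example only the octant containing the isolated voxel is split a second time, so the tree stores far fewer labeled octants than the 64 voxels of the dense grid.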
Stixels
The idea of stixels is to bridge the gap between pixel-level and object-level information, reducing the number of elements needed to describe a scene from the full set of pixels to a few hundred. In the stixel representation, a 3D scene is represented by vertically oriented rectangles of a certain height. Such a representation is especially useful for traffic scenes, but it is limited in its capability to encode generic 3D scenes.
Truncated Signed Distance Function
The truncated signed distance function (TSDF) is another volumetric representation of a 3D scene. Instead of mapping a voxel to 0 or 1, each voxel in the 3D grid is mapped to the signed distance to the nearest surface, truncated to a fixed maximum magnitude. The signed distance is negative if the voxel lies within the shape and positive otherwise. Reconstruction approaches based on RGB-D cameras (e.g., Kinect) commonly use the TSDF representation and fuse the TSDFs from multiple frames to obtain a complete 3D model.
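To make the sign convention and the truncation concrete, the sketch below evaluates a TSDF for an analytic sphere on a small grid. The sphere, the grid parameters, and the function names are assumptions for illustration. The `fuse` helper shows a weighted running average, which is one common way to merge per-frame TSDF values and is used here as an assumption rather than as the method of any particular system.

```python
from math import sqrt


def sphere_tsdf(center, radius, grid_size, voxel_size, truncation):
    """Map each voxel index to the truncated signed distance to a sphere."""
    tsdf = {}
    for i in range(grid_size):
        for j in range(grid_size):
            for k in range(grid_size):
                # Evaluate the distance at the voxel center.
                p = ((i + 0.5) * voxel_size,
                     (j + 0.5) * voxel_size,
                     (k + 0.5) * voxel_size)
                dist = sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
                signed = dist - radius  # negative inside, positive outside
                tsdf[(i, j, k)] = max(-truncation, min(truncation, signed))
    return tsdf


def fuse(d_old, w_old, d_new, w_new=1.0):
    """Weighted running average of two TSDF observations for one voxel."""
    w = w_old + w_new
    return (w_old * d_old + w_new * d_new) / w, w


tsdf = sphere_tsdf(center=(4.0, 4.0, 4.0), radius=2.0,
                   grid_size=8, voxel_size=1.0, truncation=1.0)
fused_value, fused_weight = fuse(0.2, 1.0, 0.4, 1.0)
```

Voxels deep inside the sphere saturate at the negative truncation value and voxels far outside saturate at the positive one, so only the narrow band around the surface carries distinct distances.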
Constructive Solid Geometry
Constructive solid geometry (CSG) is a building-block technique in which simple objects such as cubes, spheres, cones, and cylinders are combined using a set of operations such as union, intersection, and difference (subtraction) to model complex objects. A CSG model is represented as a binary tree with primitive shapes at its leaves and combination operations at its internal nodes. This representation is often used for CAD models in computer vision and graphics.
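A CSG tree can be illustrated with a point-membership test in which primitives form the leaves and Boolean operations form the internal nodes. The class names `Sphere`, `Box`, and `Op` and the `contains` interface below are assumptions made for this sketch, not part of any specific CAD system.

```python
class Sphere:
    """Leaf primitive: solid ball."""

    def __init__(self, center, radius):
        self.center, self.radius = center, radius

    def contains(self, p):
        return sum((a - b) ** 2 for a, b in zip(p, self.center)) <= self.radius ** 2


class Box:
    """Leaf primitive: axis-aligned box between corners lo and hi."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def contains(self, p):
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


class Op:
    """Internal node: combines exactly two subtrees with a Boolean operation."""

    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right

    def contains(self, p):
        a, b = self.left.contains(p), self.right.contains(p)
        if self.kind == "union":
            return a or b
        if self.kind == "intersection":
            return a and b
        if self.kind == "difference":
            return a and not b
        raise ValueError(f"unknown operation: {self.kind}")


# A cube with a spherical cavity carved out of its center.
cube = Box((-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
cavity = Sphere((0.0, 0.0, 0.0), 0.5)
hollow_cube = Op("difference", cube, cavity)
```

Calling `hollow_cube.contains(p)` evaluates the tree from the root down to the primitives, so more complex shapes can be built by nesting further `Op` nodes.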