Paper Daily: Indoor Scene Understanding in 2.5/3D for Autonomous Agents: A Survey

Core techniques

Convolutional Neural Networks

Recurrent Neural Networks

Encoder-Decoder Architectures

A closely related variant of the autoencoder is the variational autoencoder (VAE), which introduces constraints on the latent representation learned by the encoder.

Encoder-decoder style networks have been used in combination with both convolutional and recurrent designs, and have been applied to a variety of scene understanding tasks. The strength of these approaches lies in learning a highly compact latent representation from the data, which is useful for dimensionality reduction and can either be employed directly as discriminative features or transformed by a decoder to generate the desired output. In some cases, however, the encoding step causes an irreversible loss of information, which makes it challenging to recover the desired output.
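As a minimal sketch of this design (the fully-connected layers, dimensions, and PyTorch framing below are illustrative choices, not taken from the survey):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress an input to a compact latent code, then reconstruct it."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),          # compact latent representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),           # reconstruction of the input
        )

    def forward(self, x):
        z = self.encoder(x)   # usable directly as discriminative features
        return self.decoder(z), z

model = AutoEncoder()
x = torch.randn(16, 784)                         # dummy batch of flattened inputs
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)          # reconstruction objective
```

The latent code z can serve as a compact feature vector, while the decoder output illustrates the generative use of the same representation; a VAE would additionally regularize z, e.g., with a KL-divergence term toward a prior.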

Markov Random Field

A Markov Random Field (MRF) is a class of undirected probabilistic models defined over arbitrary graphs. The graph structure is composed of nodes (e.g., individual pixels or super-pixels in an image) interconnected by a set of edges (e.g., connections between neighboring pixels). Each node represents a random variable that satisfies the Markov property, i.e., it is conditionally independent of all other variables given its neighbors. Learning an MRF involves estimating a generative model, i.e., the joint probability distribution [katex]P(X, Y)[/katex] over the input (data, [katex]X[/katex]) and output (prediction, [katex]Y[/katex]) variables. For several problems, such as classification and regression, it is more convenient to directly model the conditional distribution [katex]P(Y|X)[/katex] using the training data. The resulting discriminative Conditional Random Field (CRF) models often provide more accurate predictions.

Both MRF and CRF models are ideally suited for structured prediction tasks, where the predicted outputs have inter-dependent patterns rather than, e.g., a single category label as in classification. Scene understanding tasks often involve structured prediction. These models allow the incorporation of context while making local predictions: the context can be encoded in the model through pairwise potentials and clique potentials (defined over groups of random variables). This results in more informed and coherent predictions that respect the mutual relationships between labels in the output space. However, training and inference in many such model instantiations are intractable, which makes their application challenging.
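To make the role of the potentials concrete, here is a small NumPy sketch of the energy of a grid MRF with unary terms and Potts pairwise terms; the grid size, costs, and smoothness weight are hypothetical.

```python
import numpy as np

H, W, L = 4, 4, 2                      # toy 4x4 grid, binary labels
rng = np.random.default_rng(0)
unary = rng.random((H, W, L))          # hypothetical per-pixel label costs
beta = 0.5                             # assumed smoothness weight

def energy(labels):
    """Sum of unary costs plus Potts penalties for disagreeing neighbors."""
    e = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    e += beta * (labels[:, 1:] != labels[:, :-1]).sum()   # horizontal edges
    e += beta * (labels[1:, :] != labels[:-1, :]).sum()   # vertical edges
    return e

# Context-free labeling (per-pixel argmin) vs. a maximally smooth labeling;
# MAP inference (e.g., graph cuts, belief propagation) would search for
# the labeling that minimizes this energy.
print(energy(unary.argmin(axis=-1)))
print(energy(np.zeros((H, W), dtype=int)))
```

The pairwise terms are what couple neighboring predictions: lowering the energy requires labelings that are both locally plausible (unary terms) and spatially coherent (pairwise terms).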

Sparse Coding

Sparse coding is an unsupervised method for finding a set of basis vectors such that an input vector [katex]x[/katex] can be represented as a sparse linear combination of them. The set of basis vectors is called a ‘dictionary’ ([katex]D[/katex]), which is typically learned over the training data. Given the dictionary, a sparse coefficient vector [katex]\alpha[/katex] is computed such that the input [katex]x[/katex] can be accurately reconstructed from [katex]D[/katex] and [katex]\alpha[/katex]. Sparse coding can be seen as decomposing a non-linear input into a sparse combination of linear basis vectors. When the basis vectors are large in number or the feature dimensionality is high, the optimization required to compute [katex]D[/katex] and [katex]\alpha[/katex] can be computationally expensive.
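As a concrete sketch, scikit-learn's DictionaryLearning implements this decomposition; the data and dictionary size below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))      # 100 signals of dimension 64

# Learn a dictionary D and sparse codes alpha such that X ~ alpha @ D,
# with an L1 penalty encouraging most entries of alpha to be zero.
dico = DictionaryLearning(n_components=128, alpha=1.0,
                          transform_algorithm='lasso_lars', random_state=0)
codes = dico.fit_transform(X)           # sparse coefficient vectors alpha
D = dico.components_                    # learned dictionary (128 x 64)

recon = codes @ D                       # reconstruction from the sparse codes
print(np.mean((X - recon) ** 2))        # reconstruction error
print(np.mean(codes != 0))              # fraction of non-zero coefficients
```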

Decision Forests

A decision tree is a supervised algorithm that classifies data based on a tree-structured hierarchy of rules learned over the training set. Each internal node represents a test on an attribute (a true/false question), while each leaf node represents a decision on a class label. To build a decision tree, we start with a root node that receives all the training data and split the data into subsets according to the test question. The mixing, or uncertainty, at a single node can be quantified by a metric called ‘Gini impurity’, which is minimized by choosing splits that maximize information gain. Decision trees can quickly overfit the training data, which can be remedied by using random forests. A random forest builds an ensemble of decision trees, each trained on a random selection of the data, and produces class labels by aggregating the votes of its many trees.
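A short scikit-learn sketch of the idea, using synthetic features in place of real image descriptors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic features stand in for, e.g., per-pixel or per-region descriptors.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample, with a random feature subset
# considered at every split (criterion='gini' minimizes Gini impurity);
# the ensemble's majority vote counteracts single-tree overfitting.
forest = RandomForestClassifier(n_estimators=100, criterion='gini',
                                random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))
```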

Support Vector Machines

When classifying n-dimensional data points, the goal is not only to separate the data into a certain number of categories but to find a dividing boundary that offers the maximum possible separation between classes. The support vector machine (SVM) offers such a solution. SVM is a supervised method that separates the data with a linear hyperplane, the maximum-margin hyperplane, which offers the maximum separation between any pair of classes. SVMs can also learn nonlinear classification boundaries using the kernel trick: the nonlinearly separable low-dimensional data are implicitly projected into a high-dimensional space where they become linearly separable. The linear maximum-margin hyperplane found in that space corresponds to a nonlinear decision boundary when mapped back to the original low-dimensional space.
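The kernel trick is easy to see on data that no straight line can separate; the dataset and parameters below are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in 2-D.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
# The RBF kernel implicitly maps the data to a high-dimensional space
# where a maximum-margin hyperplane exists; in the original 2-D space
# the resulting decision boundary is nonlinear.
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)

print(linear_svm.score(X, y))   # limited by the linear boundary
print(rbf_svm.score(X, y))      # near-perfect on this toy data
```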
