Paper Daily: Indoor Scene Understanding in 2.5/3D for Autonomous Agents: A Survey

With the availability of low-cost and compact 2.5/3D visual sensing devices, the computer vision community is experiencing a growing interest in visual scene understanding of indoor environments. This survey paper provides a comprehensive background to this research topic. We begin with a historical perspective, followed by popular 3D data representations and a comparative analysis of available datasets.


Scene Understanding: “To analyze a scene by considering the geometric and semantic context of its contents and the intrinsic relationships between them.”

Visual scene understanding can be broadly divided into two categories based on the input media: static (for an image) and dynamic (for a video) understanding. This survey specifically attends to static scene understanding of 2.5/3D visual data for indoor scenes.

While highly significant, 3D scene understanding is also remarkably challenging. The difficulty stems from complex interactions between objects, heavy occlusions, cluttered indoor environments, major changes in appearance, viewpoint, and scale across different scenes, and the inherent ambiguity in the limited information provided by a static scene.

A Brief History of 3D Scene Analysis

There is a fundamental difference in the way a machine and a human perceive visual content. An image or a video is, in essence, a tensor with numeric values representing color (e.g., r, g, and b channels) or location (e.g., x, y, and z coordinates) information. An obvious way of processing such information is to compute local features representing color and texture characteristics. To this end, a number of local feature descriptors have been designed over the years to faithfully encode visual information.
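To make the "tensor of numbers" view concrete, here is a minimal sketch in Python with NumPy. It is an illustration only, not a method from the survey: the array sizes, the random data, and the helper `patch_color_histogram` are all hypothetical, and a per-channel color histogram is just one very simple stand-in for the local feature descriptors mentioned above.

```python
import numpy as np

# Hypothetical data: random values stand in for real sensor output.
rng = np.random.default_rng(0)

# A color image is a tensor of shape (height, width, 3) holding r, g, b values.
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# A point cloud is an (N, 3) array holding x, y, z coordinates.
points = rng.random((10, 3))


def patch_color_histogram(img, row, col, size=2, bins=4):
    """Illustrative local color feature (hypothetical helper).

    Builds one intensity histogram per channel over a size x size patch
    whose top-left corner is (row, col), concatenates the histograms,
    and normalizes the result so its entries sum to 1.
    """
    patch = img[row:row + size, col:col + size]
    hists = [
        np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
        for c in range(img.shape[-1])
    ]
    feat = np.concatenate(hists).astype(np.float64)
    return feat / feat.sum()


# Descriptor for the 2x2 patch in the top-left corner of the image.
descriptor = patch_color_histogram(image, 0, 0)
```

The descriptor here has 3 channels times 4 bins, so 12 entries. Real descriptors also encode texture and gradient structure, which this sketch leaves out.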

Representation is a key element of understanding the 3D world around us. In the early days of computer vision, researchers favored parts-based representations for object description and scene understanding.

While the initial systems developed for scene analysis bear notable ideas and insights, they lack generalizability to new scenes. This was mainly due to handcrafted rules and brittle logic-based pipelines.
