According to recent research published in the IEEE Transactions on Pattern Analysis and Machine Intelligence, Michael Ying Yang, a researcher at the Faculty of Geo-Information Science and Earth Observation (ITC) of the University of Twente, has developed a new method for generating scene graphs from images: structured descriptions of the objects in a scene and the relationships between them. While generative AI programs excel at generating images of single objects, creating complete scenes has remained challenging. Yang’s scene graphs provide a blueprint that can aid in generating more accurate and comprehensive images.
Humans possess a remarkable ability to perceive and define relationships between objects. For example, we can easily recognize that a chair is placed on the floor, or that a dog is walking on the street. However, AI models struggle with this task. Enhancing a computer’s capacity to detect and comprehend visual relationships is crucial not only for image generation but also for improving the perception capabilities of autonomous vehicles and robots.
Michael Ying Yang serves as an assistant professor in the Scene Understanding Group of the Faculty of Geo-Information Science and Earth Observation (ITC).
From two-stage to single-stage
Existing methods can already build a graph of the semantic content of an image, but they suffer from slow processing. These methods typically follow a two-stage approach: in the first stage, all objects within a scene are detected; in the second stage, a separate neural network examines every possible pair of objects and assigns it the appropriate relationship label.
The drawback of this two-stage method is that the number of candidate connections grows quadratically as the number of objects increases, since every object may relate to every other object. Yang’s model introduces a novel approach that streamlines the process: instead of going through multiple stages, it accomplishes the task in a single step, automatically predicting subjects, objects, and their relationships simultaneously.
According to Yang, this advancement significantly improves efficiency and eliminates the need for exhaustive connection evaluations.
The one-stage method analyzes the visual features of objects within the scene and focuses on the key details that determine their relationships. The model identifies the significant regions where objects interact or connect with one another. By employing these techniques, and using a relatively small amount of training data, it can recognize the most crucial relationships between different objects.
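The idea of scoring interactions from visual features and keeping only the strongest ones can be sketched as follows. This is a deliberately simplified toy, not the published model: each object is represented by a hypothetical feature vector, and a plain dot product stands in for the learned attention that highlights interacting regions.

```python
from itertools import permutations

def dot(a, b):
    """Dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_relationships(features, k=2):
    """Score every ordered (subject, object) pair and keep the k
    strongest, mimicking how a one-stage model concentrates on the
    key interactions instead of exhaustively labelling all pairs."""
    scores = {
        (s, o): dot(features[s], features[o])
        for s, o in permutations(features, 2)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical features: the man and the bat "look related",
# the bench does not.
features = {
    "man":          [0.9, 0.8, 0.1],
    "baseball bat": [0.8, 0.9, 0.0],
    "bench":        [0.1, 0.0, 0.2],
}
print(top_relationships(features))
```

In the real model the pairwise score is learned rather than a fixed dot product, but the selection principle, rank the candidate interactions and attend to the strongest, is the same.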
After identifying these relationships, the model’s final task is to generate a description that accurately reflects how the objects are connected. For instance, Yang explains that the model can detect in a given example image that the man is highly likely to be interacting with the baseball bat. Subsequently, the model is trained to generate the most probable relationship description, such as “man swings baseball bat.”
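The final step, turning a detected pair plus a predicted predicate into a phrase like “man swings baseball bat”, can be sketched as below. The predicate scores are invented for illustration; in the actual model they would come from the trained network.

```python
def most_probable_relation(subject, obj, predicate_scores):
    """Pick the highest-scoring predicate for a (subject, object)
    pair and render it as a relationship description."""
    predicate = max(predicate_scores, key=predicate_scores.get)
    return f"{subject} {predicate} {obj}"

# Hypothetical scores a predicate classifier might output:
scores = {"swings": 0.81, "holds": 0.12, "throws": 0.07}
print(most_probable_relation("man", "baseball bat", scores))
# → man swings baseball bat
```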
Source: University of Twente