Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a system called the Masked Generative Encoder (MAGE) that combines image recognition and generation capabilities. MAGE is trained to understand the content of images and fill in missing parts, resulting in accurate image identification and the creation of realistic new images.
Unlike traditional methods that work with raw pixels, MAGE converts images into “semantic tokens,” representing compact and abstracted versions of image sections. These tokens act like mini jigsaw puzzle pieces, forming an abstracted image that can be used for complex processing tasks while retaining the information from the original image. MAGE employs a technique called “masked token modeling” where it randomly hides tokens and trains a neural network to fill in the gaps. This allows MAGE to learn both image recognition patterns and image generation.
What sets MAGE apart is its flexibility in pre-training. It can train for image generation or recognition tasks, depending on the masking strategy employed during pre-training. MAGE operates in the “token space” rather than the “pixel space,” resulting in high-quality image generation and semantically rich image representations.
MAGE offers various applications, such as object identification, swift learning from minimal examples, conditional image generation based on user-defined criteria, image editing, and enhancing existing images. It excels in few-shot learning and achieved impressive results on large datasets like ImageNet with only a small number of labeled examples.
The validation of MAGE’s performance has been remarkable, breaking records in image generation and achieving high accuracy in image recognition tasks. However, the researchers acknowledge that MAGE is still a work in progress. They aim to explore methods to compress images without losing important details and test MAGE on larger datasets.
According to Huisheng Wang from Google, MAGE is a groundbreaking system that combines image generation and recognition tasks in a single framework, inspiring future research in computer vision.
The research findings have been published on the arXiv preprint server.