A collaborative team of researchers from MIT, the MIT-IBM Watson AI Lab, IBM Research, and other institutions has made significant progress in analyzing unlabeled audio and visual data to improve the performance of machine learning models. The new technique combines two self-supervised learning architectures, contrastive learning and masked data modeling, to advance machine learning tasks such as speech recognition and object detection without the need for annotation. By leveraging self-supervised learning, the researchers aim to replicate how humans comprehend and perceive the world around us.
Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL), explains, “A larger portion of human knowledge is learned in a self-supervised way because we don’t always receive supervision signals, and we want to enable the machine learning model to possess the same capability.” Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab, adds, “Self-supervised learning often serves as the foundation of an initial model as it can learn from vast amounts of unlabeled data. Subsequently, classical supervised learning or reinforcement learning can be employed to fine-tune the model for specific purposes.”
The technique, known as the contrastive audio-visual masked autoencoder (CAV-MAE), employs a neural network capable of learning meaningful latent representations from acoustic and visual data, mapping them into a high-dimensional space. The network is trained on large YouTube datasets containing 10-second clips of audio and video. The researchers emphasize that their approach outperforms previous methods by explicitly modeling the relationships between audio and visual data.
The research team includes Yuan Gong, Jim Glass, graduate students Andrew Rouditchenko and Alexander H. Liu from MIT, David Harwath from the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne, who is also affiliated with Goethe University Frankfurt. They recently presented their method at the International Conference on Learning Representations.
A joint and coordinated approach
The collaborative approach taken by the researchers involved combining two distinct methods in the CAV-MAE model: masked data modeling and contrastive learning. Masked data modeling involves taking a video and its corresponding audio waveform, converting the audio to a spectrogram, and masking 75% of both data types. The model is then trained to reconstruct the missing information based on the masked inputs, using the difference between the reconstructed prediction and the original audio-visual combination as a training signal.
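The masking-and-reconstruction signal described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the patch shapes, the zero-prediction stand-in for a decoder, and the function names are all assumptions made for clarity.

```python
# Sketch of the masked-data-modeling signal: hide 75% of spectrogram
# patches, then score a reconstruction only on the hidden patches.
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.75):
    """Randomly hide a fraction of patches; return the visible ones and the mask."""
    n = patches.shape[0]
    n_masked = int(n * mask_ratio)
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True
    return patches[~mask], mask

def reconstruction_loss(prediction, target, mask):
    """Mean squared error, computed only on the masked patches."""
    return float(np.mean((prediction[mask] - target[mask]) ** 2))

# A fake spectrogram from a 10-second clip, split into 64 patches of 16 values.
spectrogram_patches = rng.standard_normal((64, 16))
visible, mask = mask_patches(spectrogram_patches)

# A real model would encode `visible` and decode a prediction for every patch;
# here a zero prediction stands in to show how the training signal is computed.
prediction = np.zeros_like(spectrogram_patches)
loss = reconstruction_loss(prediction, spectrogram_patches, mask)
```

The key design point mirrored here is that the loss is taken only over the masked 75%, so the model is rewarded for inferring missing content rather than copying visible input.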
Contrastive learning, on the other hand, aims to map similar representations close to each other. For example, the model learns to associate different video and audio clips of parrots with one another, and to distinguish them from pairs of video and audio of guitars being played. The method passes audio-visual pairs through separate modality encoders, keeps the audio and visual components separate within the joint encoder, and then applies pooling and a contrastive loss to identify the relevant parts of each modality.
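A common way to realize this objective is a symmetric cross-entropy over pairwise similarities, in which each audio clip should match its own video more strongly than any other in the batch. The sketch below, in plain NumPy, illustrates that idea; the function name, temperature value, and embedding sizes are assumptions, not taken from the paper.

```python
# Sketch of a symmetric contrastive loss over a batch of audio-video pairs:
# row i of each matrix is one matched pair.
import numpy as np

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Matched pairs should score higher than every mismatched pair in the batch."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = (a @ v.T) / temperature       # pairwise similarity matrix
    n = len(logits)

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the audio-to-video and video-to-audio directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 32))
video = audio + 0.1 * rng.standard_normal((8, 32))   # well-aligned pairs
aligned = contrastive_loss(audio, video)
shuffled = contrastive_loss(audio, video[::-1])      # deliberately mismatched
```

As expected of a contrastive objective, the loss is low when matched pairs are close in the embedding space and high when the pairing is scrambled.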
The combination of these two techniques in the CAV-MAE model leads to improved performance. By leveraging multiple forward data streams with masking, modality-specific encoders, and layer normalization, the researchers found that CAV-MAE outperformed other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks. These tasks involved identifying sounds or actions within data and searching for missing audio or visual components in a query pair.
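Conceptually, the combined objective is a weighted sum of the two self-supervised signals. The tiny sketch below shows only that weighting; the coefficient is illustrative and not the value used by the authors.

```python
# Hypothetical combination of CAV-MAE's two training signals into one objective;
# the weight is an illustrative placeholder, not the paper's actual coefficient.
def combined_objective(contrastive_loss, reconstruction_loss, weight=0.01):
    """Weighted sum of the contrastive and masked-reconstruction losses."""
    return contrastive_loss + weight * reconstruction_loss

total = combined_objective(0.5, 2.0)
```

A single scalar objective like this is what lets both signals backpropagate through the shared, modality-specific encoders at once.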
The researchers tested CAV-MAE against other methods on standard datasets and found that it achieved better performance on event classification and matched or outperformed models that require greater computational resources. Additionally, incorporating multi-modal data during pre-training improved the fine-tuning of single-modality representations and performance on audio-only event classification tasks. This demonstrates the value of multi-modal information in enhancing the model's understanding and performance, similar to how humans benefit from multiple sensory inputs.
The CAV-MAE model stands out for its ability to perform both classification and retrieval tasks, which is not commonly seen in other models. This approach has been well-received for its elegant combination of contrastive and reconstruction loss. Many audio-visual learning frameworks now utilize a combination of contrastive loss and masked autoencoder, thanks to the insights provided by this work. Overall, the CAV-MAE model exhibits strong performance across various tasks and showcases the potential of self-supervised learning techniques in advancing machine learning capabilities.
Bringing self-supervised audio-visual learning into our world
The researchers view their development of the contrastive audio-visual masked autoencoder (CAV-MAE) as a significant milestone and a step forward for applications that rely on audio-visual fusion in a multi-modal context. They envision its potential use in various domains such as sports, education, entertainment, motor vehicles, and public safety, particularly in action recognition tasks. While currently limited to audio-visual data, the researchers recognize the broader trend of machine learning moving towards multi-modal learning. Humans perceive the world through multiple modalities, including touch and smell, and the goal is to mimic this multi-modal perception in AI systems. Therefore, the researchers believe that this method could be extended to explore and incorporate other unexplored modalities in the future.
As machine learning models become increasingly pervasive in our daily lives, techniques like CAV-MAE hold great value in advancing the capabilities of these models and enabling them to better understand and interpret multi-modal information.