Researchers have developed a new methodology called “Patch-to-Cluster attention” (PaCa) that addresses the challenges faced by vision transformers (ViTs) in object identification and classification. ViTs, which use transformer architecture, are powerful AI models trained on visual inputs to detect and categorize objects in images. However, ViTs require significant computational power and lack transparency in decision-making.
To overcome these challenges, the PaCa methodology uses clustering to improve object identification and reduce computational demands. Clustering groups image patches based on similarities in the data, which lets the transformer architecture focus on objects more effectively. Because each patch attends to a small, fixed number of clusters rather than to every other patch, the computational cost of attention grows linearly with the number of patches instead of quadratically, a significant reduction in complexity. Clustering also improves model interpretability: examining which features the model treats as important when forming clusters offers insight into its decision-making process.
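The patch-to-cluster idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the single-head layout, random weight matrices, and cluster count below are illustrative assumptions. The key point is that queries come from all n patches while keys and values come from only m cluster tokens, so the attention matrix is n-by-m rather than n-by-n.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def paca_attention(x, w_cluster, w_q, w_k, w_v):
    """Sketch of patch-to-cluster attention (single head).

    x: (n, d) patch embeddings; m clusters with m << n.
    Cost of the attention matrix is O(n * m) instead of O(n^2).
    """
    # Soft-assign each patch to one of m clusters; normalizing over
    # the patch axis makes each cluster a weighted average of patches.
    assign = softmax(x @ w_cluster, axis=0)          # (n, m)
    z = assign.T @ x                                 # (m, d) cluster tokens
    q = x @ w_q                                      # queries from patches (n, d)
    k = z @ w_k                                      # keys from clusters   (m, d)
    v = z @ w_v                                      # values from clusters (m, d)
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))    # (n, m), not (n, n)
    return attn @ v                                  # (n, d)

rng = np.random.default_rng(0)
n, m, d = 196, 16, 32                                # e.g. 14x14 patches, 16 clusters
x = rng.standard_normal((n, d))
w_c = rng.standard_normal((d, m))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = paca_attention(x, w_c, w_q, w_k, w_v)
print(out.shape)  # (196, 32)
```

With 196 patches and 16 clusters, the attention matrix holds 196 × 16 entries instead of 196 × 196, and that gap widens as image resolution grows, which is where the linear-versus-quadratic saving comes from.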
Comprehensive testing of PaCa was conducted, comparing it to state-of-the-art ViTs such as Swin and PVT. The results showed that PaCa outperformed these models on object classification, identification, and segmentation tasks. PaCa was also more efficient, running faster than the other ViTs.
The researchers’ next objective is to scale up PaCa by training it on larger foundational datasets. The paper detailing the PaCa methodology, titled “PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers,” will be presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition in Vancouver, Canada. The paper is available on the arXiv preprint server.
Source: North Carolina State University