Scientists at Pacific Northwest National Laboratory (PNNL) have drawn an analogy between increasing traffic congestion in the Seattle area and the growing congestion on high-performance computing (HPC) systems. In a paper published in The Next Wave, the National Security Agency’s review of emerging technologies, the scientists argue that more complex workloads, particularly AI model training, are causing bottlenecks in HPC systems.
According to Sinan Aksoy, a senior data scientist and team leader at PNNL specializing in graph theory and complex networks, the congestion in HPC systems can be addressed by rethinking the network infrastructure. In HPC systems, numerous individual computer servers, called nodes, function as a single supercomputer, with the network topology determining the arrangement of nodes and the links between them.
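To make the idea of a topology concrete, the following Python sketch (not PNNL's code) models a small HPC interconnect as a graph, with servers as nodes and cables as links; the 3-D torus shape and the networkx library are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch only: model a small HPC interconnect as a graph.
# The 3-D torus shape and the networkx library are assumptions for this example,
# not the topology or tooling described in the PNNL paper.
import networkx as nx

# Each server (node) sits at a coordinate; each cable (link) is an edge.
torus = nx.grid_graph(dim=[4, 4, 4], periodic=True)  # 64 nodes in a 3-D torus

print(torus.number_of_nodes())   # 64 servers
print(torus.number_of_edges())   # 192 links
print(torus.degree[(0, 0, 0)])   # every node has 6 neighbors
```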
HPC congestion arises when data exchange between nodes becomes concentrated on a single link, resulting in a bottleneck. The researchers, including Roberto Gioiosa, a computer scientist in PNNL's HPC group, and Stephen Young, a mathematician in PNNL's computational math group, note that bottlenecks are more common in modern HPC systems than when those systems were originally designed, a consequence of how HPC usage patterns have evolved over time.
Gioiosa notes that technologies such as Facebook, big data, and large AI models have changed how people use computing, driving up demand and congestion on HPC systems.
Big tech expands
During the 1990s, the computer technology industry grew rapidly, disrupting the Seattle area's economy and reshaping where people live and work. As a result, traffic patterns in the region became less structured, more congested, and less predictable, particularly along the east-west axis, which is constrained by two bridges across Lake Washington.
Drawing a parallel, the researchers at PNNL suggest that traditional HPC network topologies resemble the road network of the Seattle area. These topologies were originally optimized for physics simulations, such as modeling molecular interactions or regional climate systems, rather than for modern AI workloads.
In physics simulations, calculations on one server influence the calculations on neighboring servers. Consequently, network topologies are designed to facilitate efficient data exchange among adjacent servers. For instance, in a simulation of a regional climate system, one server might focus on modeling the climate over Seattle while another server focuses on the climate over the waters of the Puget Sound to the west of Seattle.
Young explains that the Puget Sound climate model primarily needs to communicate with the Seattle model, rather than with distant locations like New York City. Therefore, it makes sense to connect the Puget Sound computer and the Seattle computer in close proximity to optimize their communication.
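This nearest-neighbor pattern can be shown with a minimal sketch: each patch of a simulated grid is updated only from its adjacent patches, so on an HPC system each exchange would cross only a link between neighboring nodes. The grid size and update rule below are made up for illustration.

```python
# Illustrative sketch of the nearest-neighbor ("halo exchange") pattern that
# physics simulations rely on; the grid size and update rule are invented.
import numpy as np

temps = np.random.rand(8, 8)  # e.g., climate patches: Seattle, Puget Sound, ...

def diffusion_step(grid):
    """Update each patch using only its four adjacent patches."""
    up    = np.roll(grid,  1, axis=0)
    down  = np.roll(grid, -1, axis=0)
    left  = np.roll(grid,  1, axis=1)
    right = np.roll(grid, -1, axis=1)
    return 0.25 * (up + down + left + right)

# If each patch lived on a different server, every communication in this step
# would travel only between neighboring nodes of the network.
temps = diffusion_step(temps)
```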
However, the communication patterns in data analytics and AI applications are irregular and unpredictable: calculations on one server may need to interact with computations on a distant computer within the same facility. Running these workloads on traditional HPC networks is like navigating rush hour traffic in the greater Seattle region while participating in a scavenger hunt, according to Gioiosa.
Network expansion
The research team at PNNL has proposed a solution to alleviate HPC bottlenecks using graph theory, the branch of mathematics that studies networks of points and the connections between them.
Young and Aksoy specialize in expanders, a class of graphs that spread network traffic effectively by offering many routes to any point in the network. According to Aksoy, their network, known as SpectralFly, exhibits perfect mathematical symmetry: each node is connected to the same number of other nodes, and the connections from every node follow the same pattern throughout the entire network.
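A small stand-in graph can illustrate both properties. The circulant graph below is not the SpectralFly construction, just an assumed example of a graph where every node has the same degree and the same connection pattern, and where expander quality shows up as a large gap between the top two eigenvalues of the adjacency matrix.

```python
# Illustrative stand-in only: a circulant graph whose nodes all have the same
# degree and the same connection pattern. This is NOT the SpectralFly
# construction, just a small symmetric graph to show the idea.
import networkx as nx
import numpy as np

n, offsets = 32, [1, 3, 7]           # assumed parameters for the example
G = nx.circulant_graph(n, offsets)   # node i links to i±1, i±3, i±7 (mod n)

degrees = {d for _, d in G.degree()}
print(degrees)                        # {6}: every node has identical degree

# Expander quality is reflected in the spectral gap of the adjacency matrix:
# a large gap means traffic spreads out instead of piling onto one link.
eigs = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(G)))[::-1]
print(eigs[0] - eigs[1])              # the spectral gap
```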
This symmetrical structure provides multiple identical routes for information to traverse between any two nodes, making it easier for computer programmers to route data through the network. Aksoy compares this feature to navigating a city where the directions from any neighborhood to all other neighborhoods remain the same, regardless of the starting point.
He further explains that this characteristic significantly reduces the computational complexity involved in determining how to route information across the network. With the consistent roadmap provided by SpectralFly, the routing process becomes less computationally expensive for computer systems.
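A rough sketch of why that is: in a perfectly symmetric graph like the circulant example above, the same recipe of hops works from any starting node, so a router does not need to recompute directions for each source. The fixed list of offsets below is hand-picked for illustration and is not a real routing algorithm.

```python
# Illustrative sketch of why symmetry simplifies routing: in a circulant graph
# on 32 nodes with offsets 1, 3, and 7, the *same* recipe of hops works from
# any starting node. Hand-picked example, not a real routing algorithm.
def route(start, hops, n=32):
    """Follow a fixed list of hop offsets; only the start point changes."""
    path = [start]
    for h in hops:
        path.append((path[-1] + h) % n)
    return path

recipe = [7, 3, 1]                    # "go +7, then +3, then +1"
print(route(0, recipe))               # [0, 7, 10, 11]
print(route(20, recipe))              # [20, 27, 30, 31] -- same shape of route
```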
Simulation results
To assess the performance of their SpectralFly network, the PNNL research team conducted simulations using various workloads, ranging from traditional physics-based simulations to AI model training. They compared the results of SpectralFly with those obtained from other HPC network topologies.
The findings revealed that SpectralFly outperformed other network topologies when dealing with modern AI workloads. Additionally, it achieved comparable performance on traditional workloads. This suggests that SpectralFly could serve as a hybrid topology, allowing users to conduct both traditional scientific simulations and AI tasks on the same HPC system.
The goal of the research team is to bridge the gap between the traditional and emerging worlds of computing, enabling scientists to leverage the capabilities of both traditional scientific simulations and AI-driven big data analysis. Roberto Gioiosa emphasized the importance of merging these two domains, allowing researchers to harness the power of HPC for scientific endeavors as well as advancements in AI and big data applications.