The radical network redesign that led AWS to forge a more resilient cloud

How a Slack shout-out, a dusted-off academic theory, and a spaghetti monster led an AWS team to crack an elusive code—and deliver greater reliability and performance for customers.

One afternoon in 2023, Seshadhri Comandur, an Amazon Scholar and professor at the University of California, Santa Cruz, casually answered a message on a company Slack thread that would pull him into a quest to solve one of the most stubborn puzzles in the data center industry.

The message came from Ratul Mahajan, a fellow Amazon Scholar, data center networking expert, and professor at the University of Washington: “Looking for someone with expertise in graph theory and routing.”

Comandur, a mathematician who works on algorithms and networks in the abstract, and who by his own admission knew “nothing” about data centers, replied: “Yeah, I know something about that.”

In that moment, Mahajan found the specialist math whiz he'd been looking for, and Comandur found the thing researchers live for: a chance to put theory into practice. Together with a third collaborator, AWS principal applied scientist Giacomo Bernardi, they would lead AWS to become the first company to build a flat data center network at scale, using an approach inspired by random graph theory, an idea that had been gathering dust in academia for decades.

Their achievement, outlined in a recent technical report ‘RNG: Flat Datacenter Networks at Scale’, is a breakthrough that will deliver greater reliability and performance for AWS customers, save billions of dollars in hardware, and lower CO2 emissions across a growing number of grids where the company operates.

But what is random graph theory? And how did Bernardi, Comandur, Mahajan and their team crack a code that had stumped the industry for years?

A dusted-off theory

The story starts with Bernardi, or rather, his obsession with routers—the specialized devices that conduct traffic in a data center. Traditionally, routers are connected in tree-like hierarchies. While this arrangement works effectively, it can also present choke points where data gets congested.

Bernardi was convinced there must be a better way to organize things. He thought that connecting routers in a flat, but still deeply ordered structure, could distribute load and remove single points of failure. He’d sketched a design inspired by Penrose tiling, a geometric configuration that uses a few simple shapes to cover a surface in a pattern that never truly repeats.

Working with Mahajan, he’d been trying to put the Penrose design into practice, but they were stuck. No practical application they could come up with was compatible with the scale at which AWS operates.

Refusing to give up, they turned instead to another concept that had been much discussed in academia but was generally regarded as even more of an impossibility. What if, instead of stacking routers in layers, they connected them in a flat arrangement guided by the mathematics of randomness?

The scientific argument for building a network this way had been accumulating for decades. It suggested that a flat network, where routers link directly to one another rather than through layers of hierarchy, could be faster and more resilient. But nobody had demonstrated how it could be applied within the physical constraints of an actual data center.

“It was typical for academia,” said Bernardi. “Everybody's excited, but then the real world hits.” Putting random graph theory into practice presented three seemingly insurmountable problems: 1) how to physically connect millions of randomly assigned fiber optic cables without creating an unmanageable tangle; 2) how to route data through a network with no fixed structure to guide it; 3) how to mathematically prove whether the whole thing would actually function before committing the time and money to build it.

And so the idea of applying random graph theory to a data center network remained just an idea, until Bernardi, Mahajan, Comandur, and a group of Amazon networking experts, optical engineers, and data center designers began their determined journey to bring it to life.

Tree-based network

Traditional data center networks are built in tree-like hierarchies, where data passes through routers in a prescribed order. It’s a logical approach, but one that can lead to bottlenecks.

Random graph theory

The mathematical study of what happens when you build a network by making connections randomly.

Random graph network

Arranging connections at random can make data transfer more efficient by removing bottlenecks and more resilient by eliminating single points of failure.
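
To make the idea concrete, here is a minimal sketch of a random network, assuming a toy model in which every router gets the same number of links and ports are paired at random. The function names, the 64-router size, and the four links per router are illustrative choices, not details of the AWS design.

```python
import random
from collections import deque


def random_regular_graph(n, degree, seed=0):
    """Wire n routers with `degree` links each by pairing ports at random."""
    if degree >= n or (n * degree) % 2:
        raise ValueError("need degree < n and an even number of ports")
    rng = random.Random(seed)
    while True:  # retry until there are no self-links or doubled links
        ports = [router for router in range(n) for _ in range(degree)]
        rng.shuffle(ports)
        adj = {router: set() for router in range(n)}
        for a, b in zip(ports[0::2], ports[1::2]):
            if a == b or b in adj[a]:
                break  # bad pairing, start over with a fresh shuffle
            adj[a].add(b)
            adj[b].add(a)
        else:
            return adj


def hop_counts(adj, source):
    """Breadth-first search: fewest hops from `source` to each reachable router."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical toy network: 64 routers, 4 links each.
network = random_regular_graph(n=64, degree=4, seed=1)
hops = hop_counts(network, source=0)
print(f"reachable routers: {len(hops)}, farthest is {max(hops.values())} hops away")
```

Even with only a handful of links per router, a network wired this way tends to keep every router within a few hops of every other, which is the property that makes the flat design attractive.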

From theory to practice

Understanding academic math is one thing. Putting it into practice in an AWS data center is a whole different challenge.

Rather than trying to manually wrangle millions of individual fiber optic connections, the AWS team pursued a design that would produce the random connections automatically.

Controlling the chaos

The first problem was glaring: how to avoid a gigantic hairball of wires. Modern data centers contain millions of individual fiber optic connections linking servers and routers. A single campus can contain hundreds of miles of cabling. You could attempt to build a random graph network manually, taking all that fiber and connecting each router to another specific router at random. But as Bernardi put it: “It's a terrible idea.”

That’s because randomness, to be useful at data center scale, has to be the same kind of random, every time, everywhere. If done by hand, it would produce a different network every time, and a network you can't replicate is a network you can't reliably build, test, or maintain.

So the team set about designing a piece of hardware that would enable random connectivity without being actually random. They called it the ShuffleBox, a sealed enclosure with no power supply, which deterministically shuffles the connections internally. This shuffling, when paired with quasi-random connectivity between ShuffleBoxes, produced the random graphs they wanted.
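
The notion of "random connectivity without being actually random" can be illustrated with a seeded shuffle. This is only a sketch of the principle: the real ShuffleBox wiring is not reproduced here, and the port count, seed, and function name below are made up for the example.

```python
import random


def make_shuffle(num_ports, seed):
    """Return a fixed port-to-port mapping; the same seed always gives the same mapping."""
    rng = random.Random(seed)
    mapping = list(range(num_ports))
    rng.shuffle(mapping)
    return tuple(mapping)


# Two boxes built from the same (hypothetical) design are wired identically.
box_a = make_shuffle(num_ports=16, seed=2023)
box_b = make_shuffle(num_ports=16, seed=2023)

print(box_a == box_b)                      # identical wiring in both boxes
print(sorted(box_a) == list(range(16)))    # every output port is used exactly once
```

The mapping looks scrambled, but it is fixed at design time, so every box can be manufactured, installed, and tested in exactly the same way.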

The design had to be simple enough to produce at scale and straightforward enough for technicians to install consistently. They knew what they needed to do, but of course, the ‘how’ proved sticky. “We were attempting to design it for months, but could never quite get there,” Bernardi said.

That was until Comandur gave Bernardi a mysterious equation and asked him to run a massive simulation to find the eight numbers that satisfied it. The resulting digits Bernardi provided a few days later turned out to be the exact formula for arranging the fiber optic wiring inside each ShuffleBox. Or, in other words, the key to making random connectivity standardized and deployable worldwide.

Routing through randomness

The second problem relates to rules, and specifically, how to rewrite them. In a traditional data center network, routers are arranged in strict hierarchical tiers, like a corporate org chart. For data to travel from one server to another, it must pass through prescribed layers in a specific order, which under heavy loads can create bottlenecks. Think of it as trying to contact a senior manager in a bureaucracy, where instead of going to that person directly, policy dictates you first must go through your manager, then your manager’s manager, and so on.

Applying random graph theory would mean connecting those same routers without any fixed structure, giving data far more available paths at any given moment, allowing traffic to be distributed more naturally and more quickly across the whole system.

It sounds easy, but data can only travel along the paths routers choose for it, and routers make those choices according to routing protocols. In a traditional hierarchical network, finding the right route is relatively straightforward, as the structure itself provides a map. But in a random graph network, the best path becomes far harder to determine.

The breakthrough came with a routing protocol specifically for random graphs. The team called it Spraypoint, because, as Bernardi said: “The source router, where data starts, sprays the data to all its neighboring routers. Then there's a second phase called pointing, where waypoint routers direct data to its final destination.”

Spraypoint defies conventional networking logic. Rather than using only shortest paths between two points, it spreads data across hundreds of paths simultaneously.

“Shortest paths aren't always the best option,” said Comandur. “Sometimes you need to take a slightly longer route, but then you have many different options available, which dramatically reduces the risk of congestion.”
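
The two phases Bernardi describes can be sketched in a few lines. What follows is a toy reading of the quote, not the Spraypoint protocol itself: the six-router network, the function names, and the choice to have each waypoint use a plain breadth-first shortest path are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical six-router network; each router has three neighbors.
ADJ = {
    0: [1, 2, 3],
    1: [0, 2, 4],
    2: [0, 1, 5],
    3: [0, 4, 5],
    4: [1, 3, 5],
    5: [2, 3, 4],
}


def shortest_path(adj, start, goal, blocked=frozenset()):
    """Breadth-first search from start to goal, never entering a blocked router."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent and nxt not in blocked:
                parent[nxt] = node
                queue.append(nxt)
    return None


def spray_then_point(adj, source, destination):
    """Spray: one path per neighbor of the source. Point: each neighbor heads to the destination."""
    paths = []
    for waypoint in adj[source]:
        onward = shortest_path(adj, waypoint, destination, blocked={source})
        if onward is not None:
            paths.append([source] + onward)
    return paths


for path in spray_then_point(ADJ, source=0, destination=5):
    print(path)
```

In this toy network the source ends up with three usable paths to the destination, one of which is a hop longer than the others. That extra hop is the trade Comandur describes: a slightly longer route in exchange for more options.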

VP of Network Engineering Matt Rehder (left) led his team to implement the new approach as smoothly and easily as possible, with existing routers, fiber cables, optical modules, and transceivers.

Producing the proof

The final problem was perhaps the most consequential: How do you prove a random network will function before you commit to building it?

Before Comandur got involved in the project, Bernardi and Mahajan had already been investigating if random graphs could work at scale. They leaned heavily on cloud services like Amazon Elastic Compute Cloud (EC2), which allows users to scale up and down massive amounts of compute in an instant, to build giant software simulators to stress-test their ideas. In total, Bernardi estimates they used around 530 compute processing years (equivalent to running a single processor for half a millennium) across hundreds of thousands of failure scenarios.
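
A stripped-down version of such a failure study might look like the sketch below. It is an assumption-laden toy, not the team's simulator: the 20-router network is a simple ring with extra chords rather than a random graph, and the only question asked is whether every router can still reach every other after some links fail.

```python
import random
from collections import deque


def circulant_links(n, offsets):
    """Links of a ring-like toy network: router i connects to i + k for each offset k."""
    return sorted({tuple(sorted((i, (i + k) % n))) for i in range(n) for k in offsets})


def is_connected(n, links):
    """True if every router can reach every other router over the given links."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == n


def survival_rate(n, links, failures, trials, seed=0):
    """Fraction of random failure scenarios in which the network stays connected."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        failed = set(rng.sample(links, failures))
        alive = [link for link in links if link not in failed]
        survived += is_connected(n, alive)
    return survived / trials


links = circulant_links(n=20, offsets=(1, 5))  # hypothetical 20-router network, 40 links
for failures in (0, 5, 10, 20):
    rate = survival_rate(20, links, failures, trials=200)
    print(f"{failures:2d} failed links -> connected in {rate:.0%} of scenarios")
```

A production study would also measure congestion and throughput, not just connectivity, and would run against far larger networks, which is where the hundreds of compute years go.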

The results were consistently encouraging. But they fell short of proof that a random graph network would function at the magnitude required by AWS. They needed someone to literally discover new mathematical formulas that would provide the theoretical foundation for what the simulations were already showing.

“We started with experiments, observed results, and then asked: ‘But why does this work?’” Bernardi said. “It’s really the reverse of what scientists are supposed to do.”

Coming from different parts of Amazon's business—and different countries—Ratul Mahajan, Giacomo Bernardi, and Seshadhri Comandur led an effort to rethink how information moves through data centers.

With Comandur's help, they could finally move from observation to proof. Their simulations might have shown random graphs could work, but they didn’t predict how well, or indeed how far, they would hold up under real load. How many thousands of routers could the design scale to before it broke down? It required mathematical modeling that could predict behavior across any scenario, at any scale, before a single cable was connected.
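
The team's own model is not reproduced here, but a classical result gives a feel for the scaling question. The Moore bound limits how many routers n can sit within k hops of one another when each router has d links. The figure of 32 links per router below is an assumed number for illustration, not one taken from the report.

```latex
\[
  n \;\le\; 1 + d \sum_{i=0}^{k-1} (d-1)^{i}
\]
\[
  d = 32:\qquad
  k = 2:\; 1 + 32 + 32\cdot 31 = 1025, \qquad
  k = 3:\; 1025 + 32\cdot 31^{2} = 31777
\]
```

Under that assumption, no wiring scheme of any kind could keep more than 1,025 routers within two hops of each other, while three hops leaves room for tens of thousands. Random graphs are known to come close to this kind of limit, which is part of their appeal at data center scale.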

Comandur provided not only confirmation that the idea was sound, but the mathematical language to describe exactly why, and a model precise enough to give the team (and eventually the rest of AWS) the confidence to commit to building it for real.

That confidence needed to be extraordinary, because when it comes to live customer data, there’s no room for experimentation. The network must work. The ultimate proof of concept would be the first production data center built using the new design, in Ireland.

To prove that the routing would work, the team built the random graph the hard way, by hand, without ShuffleBoxes. Working for several weeks, they wired up individual fiber cables in exactly the kind of crisscrossing jungle the ShuffleBox had been invented to avoid. “We still look at the pictures and feel the horror,” said Bernardi. But aesthetics aside, there was no doubt about the underlying design. The real-life test worked, and most importantly, in exactly the way the models had predicted.

The efficiency

By enabling more dynamic data connections between servers, the random graph theory-based architecture lets data move faster and more efficiently. The result is reduced hardware and power needs, and fewer CO2 emissions.

Up to 1/3 faster

In testing across most real-world traffic conditions, the new design moved data up to a third faster than the hierarchical structures it replaces.

40% reduction

AWS expects to reduce electricity consumption for network equipment using the new design by 40% compared to its previous architecture, lowering CO2 emissions across a growing number of grids where it operates.

Fewer points of failure

The ultimate goal of the AWS network is to be invisible. “You turn it on, it functions,” said Matt Rehder, the company’s VP of network engineering. “It's not something we want our customers to think about at all.” A flat network, it turns out, is an even more powerful way to ensure that.

Earlier on in the process, the team had realized that although what they were proposing was a radical rethinking of the data center network, it would be far too disruptive, and fundamentally risky, to suggest designing a whole swathe of new devices and components. They had to make the implementation as smooth and easy as possible, and they needed to work with the existing routers, fiber cables, optical modules and transceivers. The only new elements would be the Spraypoint protocol and the ShuffleBox.

The new architecture delivers meaningfully fewer network devices between any two communicating servers, and fewer devices means fewer potential points of failure. It also means billions of dollars in cost savings and a network that can route around problems more dynamically, using more of its available capacity at any given moment.

In testing across most real-world traffic conditions, the new design moved data roughly a third faster than the hierarchical structures it replaces. The result is a network that is not just more efficient, but more reliable, and more powerful, for the customers who depend on it.

And the efficiency gains extend further. Significantly fewer networking devices means less power consumption, power that can instead be directed toward more compute capacity for customers. AWS expects to reduce electricity consumption for network equipment using the new design by 40% compared to its previous architecture, lowering CO2 emissions across a growing number of grids where it operates.

The company began rolling out the new network design in Spain and Germany in 2025, and will implement it across the majority of its data centers globally in 2026.

For Comandur, it's a story he looks forward to sharing with his students. Proof that the gap between abstract academia and real life can be bridged. That a puzzle can sit unsolved for a decade, not because the solution doesn't exist, but because the people who hold the different pieces of it haven't found each other yet.

Illustrations by Chris Gash for Amazon News; Photos by Josh Edelson and Noah Berger