
Overcoming the AI Data Center to Data Center Bottlenecks to Release Innovation 

Gene Walker

Hyperscalers depend on high-speed communications to process the volumes of data required for Large Language Models and GPU-to-GPU processing.  The primary bottlenecks in AI data center to data center communications typically lie in network bandwidth limitations: transferring these large volumes of data can quickly overwhelm network capacity, leading to delays and reduced training speed.  This is often compounded by legacy network architectures, circuit speeds, and the need for low-latency connections to support real-time processing and distributed computing across multiple data centers.


Hyperscalers use a variety of tools and approaches to optimize their operations at various layers of the Open Systems Interconnection (OSI) model.  For this discussion, we will focus on layers 1-4 (Physical, Data Link, Network, and Transport).  AI data centers are engineered to take advantage of GPUs, CPUs, memory, server clusters, load balancers, power, software, architectures, and many other tools that can be leveraged within the data center, but the one thing these hyperscalers are limited by is the bandwidth bottleneck between geographically dispersed data centers.



In order to support enormous datasets and training models, AI data centers need to leverage resources that are not only geographically dispersed but often of different types.  Key areas where bottlenecks can occur in AI data center communications include:


  • Network infrastructure: 

    • Traditional network architectures and media are insufficient to support the growing demand for network capacity between data centers, especially when dealing with massive datasets for AI training.  

    • Outdated network protocols are not optimized for large data transfers, which requires new strategies and protocols.  

    • Congestion on shared network paths due to high traffic from multiple AI workloads.  This is driving strategies at all 4 layers of the OSI model.  


  • Data processing: 

    • Inefficient data compression techniques lead to larger data sizes for transfer.  This is forcing the entire strategy for supporting GPU-to-GPU processing to change, prompting a discussion around the architecture and network management services required to handle the high volume.  

    • Excessive data pre-processing or transformation before transfer adds latency.  Current protocols add an excessive amount of management overhead to the large volumes of data being transferred, forcing a new paradigm for payloads. 


  • Hardware limitations: 

    • Network interface cards (NICs) with limited throughput.  The GPUs and servers push out data at a rate the NICs can’t keep up with, compounded by legacy approaches to how limited ports and services are used. 

    • Insufficient storage capacity in data centers, causing delays in data retrieval. Each cluster is capable of processing much more data, but feeding the system is limited based on where the data is located and how fast it can be accessed. 


  • Latency issues: 

    • Geographical distance between data centers causes significant network latency.  This is a factor of both architecture and fiber (Gigabit Interface Converters are limited but increasing in speed to handle the demand, and fiber is being improved to further reduce latency and loss). 

    • Network hops between data centers add latency to data transfers; every interface and conversion adds overhead.  The hyperscalers are working with transport companies to secure point-to-point fiber circuits, but that alone is not sufficient for AI to efficiently grow and morph to meet data processing needs.  To do that, AI will need high-speed fiber interconnections that are themselves managed by AI, along with new fiber techniques such as “Multimodal Fiber” and “Hollow-Core Fiber”. 


How to mitigate these bottlenecks: 


  • Upgrade network infrastructure: 

Hyperscalers like Meta, Amazon, Google, and Microsoft are investing in high-bandwidth network connections with dedicated paths for AI data transfers.  Recent announcements from these companies indicate a high level of collaboration with network hardware and software vendors as well as the transport providers. 


  • Optimize data compression: 

Utilize advanced data compression techniques to reduce data size during transfer, such as Huffman coding, run-length encoding, Deflate (a combination of LZSS and Huffman coding), dimensionality reduction techniques like Principal Component Analysis (PCA), quantization, pruning, knowledge distillation, and low-rank factorization; all of these can significantly shrink data size while preserving the essential information for AI model processing.  
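
As a rough illustration of how two of these techniques combine (a minimal sketch, not any hyperscaler's production pipeline; the function names are invented here), the snippet below quantizes a block of float32 model weights to float16 and then applies Deflate-style compression via Python's zlib before transfer:

```python
# Illustrative sketch only: precision-reduction quantization (float32 -> float16)
# followed by lossless Deflate compression (zlib). How much the lossless step
# saves depends on how much redundancy the data actually contains.
import zlib
import numpy as np

def compress_for_transfer(weights: np.ndarray) -> bytes:
    """Quantize to float16, then Deflate-compress the raw bytes."""
    quantized = weights.astype(np.float16)               # lossy: halves the payload
    return zlib.compress(quantized.tobytes(), level=9)   # lossless on top

def decompress_after_transfer(payload: bytes, shape) -> np.ndarray:
    raw = zlib.decompress(payload)
    return np.frombuffer(raw, dtype=np.float16).reshape(shape).astype(np.float32)

if __name__ == "__main__":
    w = np.random.randn(1024, 1024).astype(np.float32)   # ~4 MB of weights
    payload = compress_for_transfer(w)
    restored = decompress_after_transfer(payload, w.shape)
    print(f"original: {w.nbytes/1e6:.1f} MB, on the wire: {len(payload)/1e6:.1f} MB")
    print(f"max quantization error: {np.abs(w - restored).max():.4f}")
```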


  • Implement distributed processing: 

Distribute AI workloads across multiple data centers to alleviate network congestion.  Hyperscalers like Meta use a dedicated backend network specifically designed for distributed training.  It includes features like topology-aware job scheduling to minimize cross-zone traffic, leverages technologies like RDMA over Converged Ethernet (RoCE) for high-bandwidth data transfer, and carefully designs the network fabric to optimize data flow between AI clusters located in different data centers, effectively alleviating congestion by distributing workloads across geographically dispersed locations. 
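
For a sense of what "topology-aware" placement means in practice, here is a deliberately simplified sketch (the zone names, GPU counts, and greedy policy are invented for illustration and are not Meta's scheduler): it tries to keep a job's workers in one zone so gradient traffic stays on the local fabric rather than crossing between data centers.

```python
# Toy topology-aware placement: prefer packing a job into a single zone so that
# its traffic stays on the local fabric; spill across zones only when necessary.
from dataclasses import dataclass

@dataclass
class Zone:
    name: str
    free_gpus: int

def place_job(zones: list[Zone], gpus_needed: int) -> dict[str, int]:
    """Greedy placement: try the zone with the most free GPUs first so the whole
    job fits in one zone when possible; spilling adds cross-zone traffic."""
    ordered = sorted(zones, key=lambda z: z.free_gpus, reverse=True)
    placement: dict[str, int] = {}
    remaining = gpus_needed
    for zone in ordered:
        if remaining == 0:
            break
        take = min(zone.free_gpus, remaining)
        if take:
            placement[zone.name] = take
            zone.free_gpus -= take
            remaining -= take
    if remaining:
        raise RuntimeError("not enough free GPUs across all zones")
    return placement

if __name__ == "__main__":
    zones = [Zone("dc-east-1", 96), Zone("dc-east-2", 512), Zone("dc-west-1", 256)]
    print(place_job(zones, 384))   # fits entirely in dc-east-2: no cross-zone hops
```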


  • Utilize specialized hardware: 

Employ high-performance network adapters and storage systems designed for large data transfers.  Multiplexers/demultiplexers continue to improve the ability to combine and separate light signals of different wavelengths within a single fiber to increase throughput.  Think of it as light passing through a prism and being broken out into various colors/frequencies, with each frequency range (channel) able to transmit data independently of the others. 
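
A quick back-of-envelope calculation shows why this matters (the channel count and per-wavelength rate below are typical, representative figures, not tied to a specific vendor): a fiber's capacity scales with the number of wavelengths times the rate each wavelength carries.

```python
# Back-of-envelope DWDM math: each wavelength ("color" out of the prism) carries
# its own independent signal, so capacity = channels x per-channel rate.
def dwdm_capacity_tbps(channels: int, gbps_per_channel: float) -> float:
    return channels * gbps_per_channel / 1000.0

if __name__ == "__main__":
    # e.g. a C-band system with 96 channels at 400 Gbps per wavelength
    print(f"{dwdm_capacity_tbps(96, 400):.1f} Tbps on a single fiber")
    # the same fiber carrying one unmultiplexed 400G signal
    print(f"{dwdm_capacity_tbps(1, 400):.1f} Tbps without multiplexing")
```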


  • Consider network topology: 

Design a network architecture that minimizes latency and maximizes data throughput between data centers.  Instead of a single point-to-point fiber circuit (limited in its ability to scale or extend with data processing requirements), most hyperscalers are employing a topology that is manageable (AI network managed), with the ability to turn on “dark fiber” as demand increases. 
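
Conceptually, the "turn on dark fiber as demand increases" loop can be sketched as a simple threshold policy (the thresholds and function below are invented for illustration; real capacity management is far more involved):

```python
# Conceptual sketch only: light an additional dark-fiber pair when sustained
# utilization on the active paths crosses a high-water mark, and return a pair
# to dark when demand falls back below a low-water mark.
def plan_capacity(active_pairs: int, dark_pairs: int, utilization: float) -> int:
    """Return how many fiber pairs should be lit given current utilization."""
    HIGH_WATER, LOW_WATER = 0.80, 0.40       # invented thresholds
    if utilization > HIGH_WATER and dark_pairs > 0:
        return active_pairs + 1              # turn up a dark pair before congestion
    if utilization < LOW_WATER and active_pairs > 1:
        return active_pairs - 1              # park a pair back to dark
    return active_pairs

if __name__ == "__main__":
    print(plan_capacity(active_pairs=2, dark_pairs=6, utilization=0.85))  # -> 3
    print(plan_capacity(active_pairs=3, dark_pairs=5, utilization=0.30))  # -> 2
```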

 

So, what is being done to address the transport bottlenecks? 



Let’s dive a little deeper into the fiber medium.  Light travels through a fiber optic cable at roughly two-thirds the speed of light in a vacuum; it is slowed by the refractive index of the glass material in the cable.  This works out to around 206,856,796 meters per second (about 207,000 km/s), roughly 31% slower than the speed of light in a vacuum.  


Breaking it down, the key points about light in fiber optic cable: 


  • Slower than in vacuum: Light travels slower through glass (fiber optic cable) than through empty space.  

  • Refractive index: The speed reduction is due to the refractive index of the glass material.  

  • Typical speed: Around 206,856,796 meters per second (about 69% of the speed in a vacuum), or roughly 128,534 miles per second (see the short propagation-delay calculation below). 
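
To put those numbers to work, here is a short propagation-delay calculation (assuming a typical refractive index of about 1.45 for standard single-mode fiber; the distances are illustrative):

```python
# Quick propagation-delay check using the figures above: speed in glass is
# roughly c / n, with n ~ 1.45 for a silica core.
C_VACUUM_M_S = 299_792_458
REFRACTIVE_INDEX = 1.45   # typical for standard single-mode fiber

def one_way_latency_ms(distance_km: float) -> float:
    speed = C_VACUUM_M_S / REFRACTIVE_INDEX          # ~207,000 km/s in the fiber
    return distance_km * 1000 / speed * 1000         # metres / (m/s) -> s -> ms

if __name__ == "__main__":
    for km in (100, 1000, 4000):                     # metro, regional, coast-to-coast
        print(f"{km:>5} km of fiber: {one_way_latency_ms(km):5.2f} ms one-way (propagation only)")
```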


For comparison: 
  • Dial-up modems started at 110 baud in the late 1950s with the Bell 101, and for those who first started out on 2400 bps dial-up modems, the delay was painfully slow even though it provided a new way of connecting.  As speeds increased, home internet got faster as well (thank goodness for FiOS and other connections). 

  • Copper typically tops out at around 10 Gbps (for example, 10GBASE-T Ethernet over short runs).

  • Fiber is the medium of choice, but over the distances needed the optical power level drops, so the signal has to be amplified at intervals along the route. 

  • With recent innovations in transport, we can expect home broadband connections to reach up to 50 Gbps by 2030. 

  • Demand for bandwidth is growing – currently, at around 30% year-on-year on transatlantic fiber optic cables. 

  • Transmission speeds are being further investigated and developed through solutions such as “Hollow-Core fiber”. 

  • A "fiber optic cable with a vacuum core".  Hollow-core is a specialized type of fiber optic cable where the central core is essentially empty, creating a vacuum inside, allowing light to travel with minimal disruption from the material within the core, potentially leading to faster transmission speeds compared to traditional fiber optic cables with a solid glass core; this concept is often called a "hollow-core fiber" and is primarily used in research applications due to the technical challenges involved in manufacturing and maintaining such a cable.  


Here are the key points about vacuum core fiber optic cables: 


  • Functionality: 

    • Light is guided through the hollow core by utilizing specific structures within the cladding that enable light to be confined within the vacuum space, minimizing refractive index variations and maximizing transmission speed.  

  • Applications: 

    • These cables are primarily used in research areas where extremely high data transmission speeds are required, such as high-precision metrology, advanced laser systems, and experiments studying light propagation in vacuum-like conditions.  

  • Challenges: 

    • Manufacturing: It is difficult to create the vacuum-like seal needed, because each cable is a different length and has its own distance and bend-radius requirements.  Back in the early fiber days, cutting and polishing were challenges that we successfully dealt with. 

    • Maintaining vacuum integrity: Sealing the hollow core to prevent air leaks is a significant technical challenge being worked on in lab environments and will require further innovation to be reliable.  

    • Bend limitations: Bending the cable can disrupt the light-guiding mechanism, making it difficult to use in situations where flexibility is needed.  

    • Coupling losses: Connecting a vacuum core fiber to standard fiber optic components can introduce significant signal loss, not to mention speed loss due to the refractive index mentioned above (the latency benefit of a near-1.0 core index is quantified in the short comparison below).  
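
The latency upside is easy to quantify (illustrative only, assuming an effective refractive index near 1.0 for the air/vacuum core versus about 1.45 for solid silica):

```python
# Why hollow-core matters for latency: light in a near-empty core travels at
# close to c, while light in solid silica travels at roughly c / 1.45.
C = 299_792_458  # m/s

def one_way_ms(distance_km: float, refractive_index: float) -> float:
    return distance_km * 1000 / (C / refractive_index) * 1000

if __name__ == "__main__":
    span_km = 1000
    solid = one_way_ms(span_km, 1.45)   # solid-core single-mode fiber
    hollow = one_way_ms(span_km, 1.0)   # idealized hollow core
    print(f"solid-core : {solid:.2f} ms over {span_km} km")
    print(f"hollow-core: {hollow:.2f} ms over {span_km} km "
          f"({(1 - hollow/solid)*100:.0f}% lower propagation delay)")
```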


According to current information, Corning's most advanced fiber optic cable is considered to be its "SMF-28 ULL" (Ultra Low Loss) fiber, the lowest-loss terrestrial-grade fiber available, allowing for extended network reach and high data rates across demanding networks; it is often used in submarine and core network applications.  Transport providers have secured deals with Corning to acquire a guaranteed percentage of that fiber over the next five years.  This has motivated the hyperscalers to secure their own deals with those transport companies in order to access fiber with those improved performance characteristics.  


Key points about Corning's SMF-28 ULL fiber: 


  • High capacity: Enables high-bandwidth transmission over long distances due to its ultra-low-loss properties (see the rough link-budget sketch after this list).  

  • Scalability: Designed to support future network upgrades with increasing data demands.  

  • Wide compatibility: Can be integrated with existing network infrastructure while offering improved performance.  
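
A rough link-budget sketch shows why lower loss per kilometer matters (the attenuation and amplifier-budget figures below are representative assumptions, not values from Corning's datasheet): the lower the loss, the farther a signal can travel before it needs amplification.

```python
# Rough link-budget sketch with assumed, representative numbers: lower loss per
# km stretches the distance between amplifier sites on a long-haul route.
def max_span_km(amplifier_budget_db: float, loss_db_per_km: float) -> float:
    return amplifier_budget_db / loss_db_per_km

if __name__ == "__main__":
    BUDGET_DB = 20.0                              # assumed optical budget between amplifiers
    standard = max_span_km(BUDGET_DB, 0.20)       # typical standard single-mode fiber
    ultra_low = max_span_km(BUDGET_DB, 0.17)      # representative ultra-low-loss fiber
    print(f"standard fiber : ~{standard:.0f} km between amplifiers")
    print(f"ultra-low-loss : ~{ultra_low:.0f} km between amplifiers")
```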


Other notable Corning fiber optic cable features: 
  • RocketRibbon cables: 

    • High-density ribbon cables ideal for tight spaces with quick installation capabilities.  

  • ClearCurve multimode fiber: 

    • Designed for high bandwidth data centers and local area networks with excellent macrobend performance.  

  • Ribbon cable technology: 

    • Offers high fiber counts and simplified splicing process for efficient network deployment.  

 

Why this Matters 

The speed of data transfer and processing is at the core of AI’s evolution. As AI models become more complex and multimodal, integrating text, images, video, and voice, the need for faster and more efficient networks grows exponentially. 


The advancements in fiber optics and network infrastructure are not just about solving current bottlenecks—they are shaping the future. From immersive gaming and virtual assistants to breakthroughs in healthcare and education, the potential of AI depends on how well we can overcome these technical challenges. 


In the same way that dial-up internet and early gaming consoles laid the groundwork for today’s hyper-connected world, today’s innovations in fiber and network technology are setting the stage for the next wave of AI-driven progress.  Who knew that the Atari 2600 would lead to the growth of online gaming and virtual reality products like the Meta Quest?  As our minds consumed each new technological feature, we wanted more.  These technologies are likely to become more immersive, integrated into daily life, and used for more than just entertainment.  Things we once couldn’t imagine outside of “Rosie” on “The Jetsons”, such as virtual assistants and virtual reality in education and healthcare, will become part of many aspects of our lives, further shaped by data, speed, and AI.  

 

As AI processing continues to evolve, leveraging LLMs and data lakes, we should expect our ideas and needs to morph as well.  Multimodal systems are integrating LLMs, images, video, voice, and other data sources to generate products and information that will help us navigate the advances being developed in every product and service in our lives.  One of the limiting factors today is speed, and innovations in fiber networks will be instrumental in making these possibilities achievable. 


Let’s embrace the transformation, because the possibilities are limitless. 

 


