The Rise of IP Fabrics and the Corresponding Novel Routing Paradigms – Part 2

IEEE CTN

Written By:

Jordan Head and Tony Przygienda, Juniper Networks

Published: 29 Jul 2022

CTN Issue: July 2022

A note from the editor:

We hope you enjoyed reading part 1 of this series of articles focused on IP fabrics. In part 1 the authors provided background and foundational information to help understand the state of routing in IP fabrics. In this article, the authors provide a new perspective on how to address the challenges that this technology faces and introduce a new routing protocol, Routing in Fat Trees (RIFT), currently being standardized in the IETF (Internet Engineering Task Force).

RIFT is a dynamic routing protocol that is tailored for use in Clos, Fat-Tree, and other anisotropic topologies and attempts to provide highly robust, zero touch provisioning for what is a rather complex system. As they write in the article: “One of the gnarliest dragons operators have had to battle is what it takes to deploy IP fabrics, and very few make it out unscathed.” Let’s see how this new protocol allows the operators to slay the dragons of IP Fabric deployment and management.

Yingzhen Qu, CTN Editor

A Day in the Life of an IP Fabric with RIFT

Jordan Head

Senior Resident Engineer

Juniper Networks

Tony Przygienda

Distinguished Engineer

Juniper Networks

“But I need it now!”
Introduction

Businesses like internet retailers, streaming services, social media platforms, and others strive to deliver services near-instantaneously because it’s what their customers have come to expect. Downtime is measured in seconds and by the potential revenue loss for each of those seconds. There are several key factors that could cause or contribute to said downtime. The SAMS framework defines this quite well for us in that a network should be: Scalable, Available, Manageable, and Secure. Consequently, there is enormous pressure on infrastructure teams to consider these factors, be it the service’s overall design, the network or server infrastructure, or even utilities.

Obviously, nothing comes for free. Protecting the revenue streams requires significant capital investment and operational expenditure. As the adage goes, “Good, fast, cheap. Pick two”, but in our business we simply don’t have the luxury of only picking two. Having all three is imperative in the modern world of IP networks when those networks provide the globe’s critical infrastructure.

Modern IP fabrics are one of the resulting innovations stemming from those mounting pressures, but they are also fraught with challenges of their own. And, while IP fabrics may be a newer concept, the routing protocols they first utilized were not. Link State protocols like OSPF [2] and IS-IS [3] and Distance Vector protocols like BGP [1] have been around for decades but were not initially designed to address the unique challenges of today’s IP fabrics [8]. The protocols have been modified significantly to better solve these challenges, but in many cases they still leave much to be desired.

Additionally, most of these solutions require sophisticated staff to deploy and operate something viable. This is much easier for businesses like cloud providers where the network is sold as a core component of the product, as they likely already employ a contingent of technological experts. Conversely, for commercial enterprises that don’t have such a fundamental reliance on the network, addressing the technical challenges is significantly more difficult. Often, this means that operating costs skyrocket as those businesses will typically rely on the expertise of network vendors or consultants.

Over the last 10-15 years network operators have slain many of these dragons or at least learned how to avoid being turned to ash. At a certain point it just made sense to exploit this knowledge and experience to design and build a radically new protocol to address the unique problem set. RIFT (Routing in Fat Trees) [4] is that protocol.

Measure Twice, Cut Once.
Planning and Deploying a Typical IP Fabric

IP fabrics are composed of an underlay and an overlay. The overlay is where the revenue generating services live and the underlay provides the reachability for the overlay. To use a real-world analogy, think of an underlay as the electrical wiring in a house and the overlay as the various appliances that use the electricity.

One of the gnarliest dragons operators have had to battle is what it takes to deploy IP fabrics and very few make it out unscathed.

Both underlays and overlays run routing protocols that distribute, process, and maintain the reachability information, traditionally OSPF, IS-IS, BGP or some combination thereof. The deployment of each protocol has specific parameters that require careful planning to ensure that nodes in the IP fabric are configured correctly. The planning process must consider the correct scope of each parameter relative to the network: it might be specific to the port, node, fabric, location within the fabric, or all of the above. Then, even if an engineer plans perfectly, someone (or something) must apply the configuration in a way that reflects the intended network architecture. Additionally, for the protocols to negotiate the configured parameters, the cabling connecting the various devices must precisely match the previously mentioned scopes.

Figure 1 depicts an example of a relatively small EBGP-based IP fabric. For it to become fully functional, the following parameters and scopes are considered and configured in the correct context on the devices:

7 unique BGP ASNs identify each node within the topology’s hierarchy (e.g., ASN 65401).
10 unique IPv4/IPv6 addresses for loopback interfaces that identify each node (e.g., 10.50.0.3/32).
32 unique IPv4/IPv6 addresses for ports which are signified with dots on each connection between the nodes.

This scale might not seem so daunting but let’s now consider a topology where the top stage has 12 nodes and lower stages each have 24 nodes. That translates to:

31 unique BGP ASNs.
60 unique IPv4/IPv6 addresses for loopback interfaces.
1,728 unique IPv4/IPv6 addresses for ports.
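
The loopback and port figures for the larger topology can be reproduced with a little arithmetic. The sketch below assumes a three-stage fabric in which every node is cabled to every node in the adjacent stage, with one address per link end; the article does not spell out the wiring, so this is an assumption, and the function name is purely illustrative. The ASN count depends on how ASNs are shared between nodes and is left out.

```python
def fabric_address_counts(stage_sizes):
    """Count loopback and port addresses for a layered fabric.

    stage_sizes lists the number of nodes per stage, top stage first.
    Assumption (not from the article): adjacent stages are fully meshed
    and each end of a link needs its own address.
    """
    loopbacks = sum(stage_sizes)
    # Links between each pair of adjacent stages in a full mesh.
    links = sum(upper * lower for upper, lower in zip(stage_sizes, stage_sizes[1:]))
    return {"loopbacks": loopbacks, "links": links, "port_addresses": 2 * links}


# 12 nodes in the top stage, 24 in each of the two lower stages.
print(fabric_address_counts([12, 24, 24]))
# {'loopbacks': 60, 'links': 864, 'port_addresses': 1728}
```

Doubling any stage size roughly doubles the number of port addresses to plan, which is why the manual approach degrades so quickly.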

The scale of both scenarios is realistic and exemplifies just how quickly the operational complexity raises its head, even without considering the expected human error in the process. One way to ameliorate this problem is with automated tooling, but that is easier said than done. Developing robust automation requires knowledge of routing protocols, vendor-specific implementations and platforms, and programming to be synthesized into functional tool(s) that address all possible deployment scenarios and of course the tool must be maintained. As an example, what would happen when a new vendor is introduced to the network and the configuration syntax changes? Any change in the pieces of the solution or requirements brings the risk of introducing software defects.

RIFT in its very basic design completely eliminates this challenge by including native ZTP (Zero Touch Provisioning) functionality in the protocol itself. The other challenges described in [8] are also addressed by RIFT, but here we will simply focus on RIFT’s fundamentals in order to understand its different capabilities.

The Word of the Day is ‘Anisotropic’.
Deploying an IP Fabric

Anisotropic is defined as: “Having a physical property that has a different value when measured in different directions.” It basically means that something has a “sense of direction”.

Just about the most fundamental property of RIFT is that it is anisotropic in nature. RIFT requires an understanding of which nodes are at the top of the fabric and optionally which nodes are at the bottom. The topological properties of modern IP fabrics align perfectly with this or, more accurately, RIFT exploits this fundamental attribute in its architecture. RIFT sorts the nodes in the topology into different Levels, with the highest comprising the aptly named ToF (Top-of-Fabric) nodes and the lowest comprising Leaf nodes; nodes in between are considered spine nodes. Figure 2 illustrates this concept with a simple example of a RIFT IP fabric.

Traditional protocols do not have this inherent hierarchical understanding, which is limiting because a sorted topology allows for many algorithms that simplify operational challenges or discover problems.

Figure 2: RIFT: IP Fabric with Fundamental Structures

In other words, if RIFT knows where the ToF nodes are it can derive the location of other nodes in the fabric and their expected cabling as well. This is the basic idea behind RIFT’s ZTP mechanism. ZTP is beautiful in its simplicity: it only shares a partial requirement with its cousin protocols in that the cabling between the appropriate nodes needs to be correct, but the port-specific requirement is removed. Then, all that is needed is to tell the ToF nodes that they are, in fact, ToF nodes. This anchors the topological sort and ZTP process. No more planning for thousands of potential variables to be applied in just the right place; everything is self-driven. Even miscabled connections are automatically pruned to ensure that the IP fabric adheres to the appropriate topological structure.

Prior to diving deeper into how ZTP works, it’s important to know that RIFT automatically derives IPv6 Link Local addresses [6] on the interfaces to facilitate protocol communication and forwarding in general.

Figure 3 shows a simple example of how the ZTP process works:

Mark the ToF nodes with a flag indicating that they are at the top of the fabric (i.e. the highest level).
ToF1-1 will send an offer to S1-1 indicating its level is 10.
S1-1 will compare all received offers, select the highest level, and consider that offer valid, and derive its level as 9 (10-1).
S1-1 will in turn send an offer to L1-1 indicating its level is 9.
L1-1 will then perform the same process and derive its level as 8 (9-1).

This process would continue until the leaf nodes (i.e., the bottom of the fabric) are reached, by which time the nodes will have functional adjacencies and are capable of exchanging reachability information.
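
The offer comparison in the steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the protocol's actual state machine: it models only "take the highest offered level and subtract one", and the function name, the data structures, and the use of 10 as the ToF level (taken from the example) are assumptions made for the sketch.

```python
from collections import deque


def derive_levels(links, tof_nodes, tof_level=10):
    """Derive node levels by propagating offers away from the ToF nodes.

    links: iterable of (node_a, node_b) pairs describing the cabling.
    tof_nodes: nodes explicitly flagged as Top-of-Fabric.
    Each other node adopts (highest offered level - 1).
    """
    tof_nodes = set(tof_nodes)
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    levels = {node: tof_level for node in tof_nodes}
    queue = deque(tof_nodes)
    while queue:
        node = queue.popleft()
        derived = levels[node] - 1  # what a peer derives from this node's offer
        for peer in neighbors.get(node, ()):
            # Keep the result of the highest offer seen so far; never go below 0.
            if peer not in tof_nodes and derived > levels.get(peer, -1):
                levels[peer] = derived
                queue.append(peer)
    return levels


links = [("ToF1-1", "S1-1"), ("S1-1", "L1-1")]
print(derive_levels(links, ["ToF1-1"]))
# {'ToF1-1': 10, 'S1-1': 9, 'L1-1': 8}
```

Note that the only per-fabric input is the set of ToF nodes; every other level falls out of the cabling.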

It seems to be quite a leap to go from such complex and tedious deployments to something so straightforward as “just plug it in”, and perhaps some might believe that this is too-good-to-be-true. But it’s not, it’s almost as if Merlin helped us fight this dragon himself.

Link-State North, Distance Vector South.
Operating RIFT in a Stable IP Fabric

In a perfect world, networks remain stable and never break. And for that matter, unicorns dance by streams in the moonlight. Obviously, at least for one of those things it is not always the case, but we digress. Let us first explore that perfect world before diving into what happens when things get turned upside down by failures.

RIFT’s sense of direction also extends to other fundamental aspects of the protocol. Perhaps one of the most baffling facts at first glance is that RIFT operates like a Link-State protocol in the northbound direction and a Distance Vector protocol in the southbound direction. What that means in a generic sense is that nodes will advertise their full topology and all the specific prefixes in the northbound direction, which provides nodes above them with a full view of reachability information below. Under normal conditions, however, nodes higher up will only advertise a default route in the southbound direction. After all, if traffic is most commonly going to move from leaf to leaf (i.e., East/West), leaf nodes generally don’t benefit from having specific routing information from other leaf nodes as they can rely on their northern neighbors to have it.

The north and south directionality also describes the routing information, its movement, and computation. Once adjacencies are formed, routing information is exchanged with Topology Information Elements or TIEs. There are various types, but we will focus only on Node TIEs that describe a node’s adjacencies and capabilities and Prefix TIEs that describe the specific routing information. In either case, nodes behave differently depending upon the TIE’s direction (i.e., North or South) and to signify that, we describe them as such. For example, Node N-TIE or Prefix S-TIE (there is no concept of East TIEs or West TIEs).

Figure 4 illustrates an example of how nodes maintain various Prefix N-TIEs. Take note that nodes only hold state for the current level and any levels below it. And, while no Node-TIEs are explicitly shown, they are maintained in the same manner.
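
To make the "current level and below" rule concrete, here is a minimal sketch that computes which originators' Prefix N-TIEs each node would hold in a small tree. The topology, node names, and prefixes are hypothetical, and the model is deliberately reduced to "northbound information is held by the originator and everything above it"; real flooding rules are more detailed.

```python
def prefix_n_ties_held(southbound, prefixes):
    """Return, per node, the set of originators whose Prefix N-TIEs it holds.

    southbound: dict mapping each node to a list of its southern neighbors.
    prefixes: dict mapping originating nodes to the prefix they advertise.
    A node holds its own Prefix N-TIE plus those of every node below it.
    """
    held = {}

    def visit(node):
        if node not in held:
            result = {node} if node in prefixes else set()
            for child in southbound.get(node, ()):
                result |= visit(child)
            held[node] = result
        return held[node]

    for node in set(southbound) | set(prefixes):
        visit(node)
    return held


# Hypothetical fabric: one ToF, two spines, four leaves.
southbound = {"ToF1": ["S1", "S2"], "S1": ["L1", "L2"], "S2": ["L3", "L4"]}
prefixes = {"L1": "10.0.1.0/24", "L2": "10.0.2.0/24",
            "L3": "10.0.3.0/24", "L4": "10.0.4.0/24"}

held = prefix_n_ties_held(southbound, prefixes)
for node in ["ToF1", "S1", "S2", "L1"]:
    print(node, sorted(held[node]))
```

In this sketch the ToF node holds four Prefix N-TIEs, each spine holds two, and each leaf holds only its own, so state shrinks at every step down the fabric.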

Conversely, if a Link State protocol were used, every single node would be required to maintain all prefix and adjacency state in addition to flooding that state across the entire fabric as necessary. Similar considerations are applied when using Distance Vector protocols except that all of the specific reachability information is not disseminated through the fabric automatically, but with carefully crafted routing policy that must be configured on relevant nodes.

Figure 5 for example, illustrates how OSPF would maintain the same state.

The reasoning behind this design approach is so that nodes only receive TIEs that contain required or beneficial information. This is coupled with tightly bound flooding scopes to ensure that TIEs are only distributed to nodes that need them. Finally, Shortest Path Computation or SPF is also run to consider direction (i.e., N-SPF and S-SPF) so that changes to northbound topology information don’t trigger computation over southbound topology information and vice versa, N-SPF computes over N-TIEs and S-SPF computes over S-TIEs. These factors ultimately distill into behaviors that limit the amount of unnecessary flooding and route computation, both of which drastically reduce state and convergence times. Or in other words, any entropy is kept to its minimally viable blast radius to increase the fabric’s stability.

Another beneficial by-product of this is that it enables the use of cheaper hardware with less CPU and memory resources for devices at progressively lower Levels in the fabric. Power consumption also goes down as a result. Other protocols require greater amounts of flooding, state, computation, and time to make efficient routing decisions. By comparison, RIFT only requires the practical minimum for traffic to take the ideal paths.

Link State and Distance Vector protocols also make it difficult to perform weighted ECMP (Equal Cost Multi-Path) load balancing. The shortest path first algorithm used by OSPF and IS-IS (Dijkstra) is simply not capable of doing this. BGP, on the other hand, could do this, but significant enhancement would be required for it to have awareness of the interface speeds, and it would still only be able to make locally significant decisions based on that information. Even with this type of advanced implementation, BGP would still lack the guarantee of a loop-free path across the fabric.

RIFT natively distributes traffic across multiple paths more efficiently. It does this by factoring in the total available bandwidth towards northbound neighbors. If a node upstream has slower interfaces, less capacity, or is experiencing a failure, traffic will be distributed across available paths in a weighted fashion.
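
As a rough illustration of weighting by bandwidth, the sketch below splits traffic across northbound next hops in proportion to the bandwidth available toward each of them. The proportional rule, the function name, and the neighbor names are assumptions made for the example; the exact balancing algorithm is defined by the RIFT specification and is not reproduced here.

```python
def northbound_weights(next_hops):
    """Split traffic across northbound next hops in proportion to bandwidth.

    next_hops: dict mapping neighbor name -> available bandwidth toward it,
    in any consistent unit (e.g. Gbit/s). Neighbors with no usable
    bandwidth, such as those behind a failed link, receive no traffic.
    """
    total = sum(bw for bw in next_hops.values() if bw > 0)
    if total <= 0:
        return {}
    return {hop: bw / total for hop, bw in next_hops.items() if bw > 0}


# Two healthy 100G uplinks and one spine reachable only at 50G.
for hop, share in northbound_weights({"S1": 100, "S2": 100, "S3": 50}).items():
    print(f"{hop}: {share:.0%}")
# S1: 40%
# S2: 40%
# S3: 20%
```

A plain ECMP split would send one third of the traffic to each spine and overload the slower path.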

“To infinity and beyond!”
Scaling RIFT for Now and Forever

This discussion on state and multipathing gives us the perfect opportunity to segue and discuss how RIFT can scale far beyond traditional Link-State and Distance Vector protocols.

More and more services are choosing to perform Routing on the Host (RotH) by joining the overlay network. While this can be very desirable, it reduces the scale of the devices in the fabric as they must now maintain, compute, and exchange that reachability information. RIFT allows hosts to join the underlay network directly with very little effect on scale, in fact, RIFT can in theory scale to infinity because servers are most often only maintaining the default route.

Multihoming is also another major challenge for traditional RotH architectures as it typically requires the use of proprietary or fragile technologies such as Multi-Chassis Link Aggregation (MC-LAG), Spanning Tree Protocol variants (STP), or Virtual Chassis (VC). Some of these technologies may also require additional hardware costs and/or software licensing costs. However, all of them are Layer 2 technologies and therefore expose a security gap between the server and the ToR switch.

All in all, having servers that perform RotH functions join the RIFT underlay directly addresses these challenges and means that they also benefit from other RIFT-native features, such as ZTP.

Reducing Mean Time to Innocence
Operating RIFT in an Unstable IP Fabric

Now we’ve come full circle back to the requirement that everything must remain stable and be capable of delivering services almost instantly. Network engineers are rarely commended when the network is stable, but when there’s an outage, they certainly become quite popular in all the wrong ways. Failures happen in any network, fans stop spinning, the power goes out, a rat chews through that one critical cable. It’s important to consider the failure’s blast radius and how quickly it can be mitigated. In short, the network should fail really, really well. Remember that neither Link State nor Distance Vector protocols were designed for dense IP fabrics, despite the myriad of enhancements that attempt to remedy that fact. RIFT is designed for IP fabrics and the manner in which it fails is too.

A blackhole describes a situation in which certain destination(s) appear reachable to some nodes but not to others; traffic that arrives at the nodes lacking reachability is ultimately lost. It is a significant issue that can occur with typical failures and the resulting suboptimal route convergence.

RIFT mitigates blackholes through a process known as “route disaggregation” [7]. It does this by advertising a more specific prefix than the fabric’s default route which attracts the affected traffic onto the correct path. It comes in two flavors, positive and negative. Positive disaggregation is used to advertise the reachability of a specific prefix and negative disaggregation is used to advertise a lack of reachability to a specific prefix. It should be noted that both forms of disaggregation are ultimately solving the blackhole problem, but negative disaggregation is only required when multiplane fabrics are in use. The nuances of multiplane fabrics are highly complex and indeed merit their own papers [5], therefore we will not discuss negative disaggregation in any further detail. One point that warrants mention is that neither flavor of disaggregation necessitates any specialized silicon. RIFT works on the simplest networking silicon supporting the longest prefix match on IP addresses and optionally some mechanism to assign weights to multipath components.
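
Positive disaggregation relies on nothing more than longest prefix match, which the following sketch illustrates with Python's standard `ipaddress` module. The failure scenario, node names, and routing-table layout are hypothetical; only the idea that a more specific prefix overrides the default route comes from the text above.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hops of the longest matching prefix for destination."""
    address = ipaddress.ip_address(destination)
    matches = [
        (network, hops)
        for network, hops in routes.items()
        if network.version == address.version and address in network
    ]
    if not matches:
        return None
    _, hops = max(matches, key=lambda item: item[0].prefixlen)
    return hops


# Under normal conditions a leaf holds only a default route via both spines.
leaf_routes = {ipaddress.ip_network("0.0.0.0/0"): ["S1", "S2"]}

# Hypothetical failure: S1 loses its link to the leaf that owns 10.50.0.3/32.
# S2 can still reach it, so S2 advertises the more specific prefix southbound.
leaf_routes[ipaddress.ip_network("10.50.0.3/32")] = ["S2"]

print(lookup(leaf_routes, "10.50.0.3"))  # ['S2']
print(lookup(leaf_routes, "10.50.0.9"))  # ['S1', 'S2']
```

Traffic for the affected destination is steered away from S1, while everything else keeps using the default route over both spines.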

“There is no security on this earth, only opportunity.”
Securing a RIFT IP Fabric

RIFT supports a security model that allows for varying levels of risk-tolerance. Like any other protocol however, security is a delicate balancing act between said security and network manageability. In some ways security is a matter of perspective (or trust if you prefer). Security-conscious operators will find it more difficult to leverage RIFT’s ZTP capabilities thereby relying more and more on external provisioning functions. The inverse is also true, for those that want ease of provisioning, security will inherently become more relaxed.

In RIFT’s case, the security model can be applied at the port, node, fabric level or a combination:

The Port Association Model (PAM) exchanges keys between pairs of ports.
The Node Association Model (NAM) exchanges keys between pairs of nodes.
The Fabric Association Model (FAM) exchanges a key fabric wide.

RIFT facilitates these models by wrapping RIFT packets in a “security envelope”. This does more than secure things; it is also performant in that the entire packet doesn’t have to be decoded and re-encoded to validate or modify important fields, which reduces load on the device’s CPU.

Summary

By now, the benefits of deploying a RIFT-based IP fabric should be obvious. Remember that the SAMS model stands for Scalable, Available, Manageable, and Secure. Let’s quickly summarize how RIFT fits that model:

RIFT is Scalable: it efficiently exchanges, maintains, and processes reachability information, even in fabrics that extend to multihomed servers.

RIFT is Available: it makes effective use of its resources through weighted load balancing and contains the blast radius of failures to an absolute minimum.

RIFT is Manageable: it natively supports Zero Touch Provisioning without the need for complex variable planning or external tooling.

And, finally, RIFT is Secure: it supports a flexible security model.

As a final note, for readers that want a comprehensive understanding of RIFT’s fundamentals and complementary examples, the RIFT Day One book [7] is a great resource.

Terms

IS-IS: Intermediate System to Intermediate System
OSPF: Open Shortest Path First
BGP: Border Gateway Protocol
ASN: Autonomous System Number

References

[1] BGP – Rekhter, Yakov et al, “A Border Gateway Protocol 4 (BGP-4)”, RFC4271, 2006, IETF
[2] OSPF – Moy, John, “OSPF Version 2”, RFC2328, 1998, IETF
[3] ISIS – ISO/IEC, International Organization for Standardization, “Intermediate system to Intermediate system intra-domain routing information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode Network Service (ISO 8473)”, 2002
[4] RIFT – Przygienda, Antoni et al, “RIFT: Routing in Fat Trees”, draft-ietf-rift-rift-15, 2021, IETF
[5] IPR – Rijsman, Bruno, “Automatic Disaggregation in the Routing in Fat Trees Protocol”, 2021, The Internet Protocol Journal
[6] IPV6 – Hinden, Robert et al, “IPv6 Addressing Architecture”, RFC4291, 2006, IETF
[7] DAYONE – Aelmans, Melchior et al, “Day One: Routing in Fat Trees (RIFT), A complete look at the cutting-edge protocol”, 2020, Juniper Networks Books
[8] IEEE-CTN-IP-FABRICS – Przygienda, Antoni et al, “The Rise of IP Fabrics and Corresponding Novel Routing Paradigms – Part 1”, “Principles and Problems of Routing in IP Fabrics”, 2022, IEEE Communications Society Technology News

Statements and opinions given in a work published by the IEEE or the IEEE Communications Society are the expressions of the author(s). Responsibility for the content of published articles rests upon the authors(s), not IEEE nor the IEEE Communications Society.
