Architecting For High Availability

Designing solutions with resiliency is one of the most important aspects of network and edge architecture. While there is no correct answer to how much resiliency is needed (we often refer to these topics as being "in the eye of the beholder"), there are best practices, suggestions for different use cases and some specific services and packages that Network Edge offers.

Levels of Redundancy

If we imagine a network design from the origin of a data packet and moving from inside outward to its destination, each point where traffic is processed or traversed becomes a possible point of failure. The key for a customer is to decide which points of that infrastructure matter most or are the most open to issues and failure, and design ways to mitigate against that. While everyone probably wants as much redundancy and diverse layers as possible to ensure that services are always available, the competing pressure on that design is always price and complexity.
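The trade-off between points of failure and duplicated layers can be made concrete with simple availability arithmetic. The sketch below is illustrative only: it assumes failures are independent, and the availability figures are hypothetical placeholders, not Equinix SLA values.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must work."""
    return prod(components)


def parallel_availability(paths):
    """Availability of redundant paths where any single working path is enough."""
    return 1 - prod(1 - a for a in paths)


# Hypothetical per-hop availabilities (premise device, network, edge device).
single_path = series_availability([0.999, 0.999, 0.999])

# Duplicating the whole path, assuming the two copies fail independently.
redundant = parallel_availability([single_path, single_path])

print(f"single path:           {single_path:.6f}")
print(f"two independent paths: {redundant:.6f}")
```

Each hop added in series lowers the end-to-end figure, while each independent duplicate raises it. Shared dependencies between the two paths would reduce the benefit shown here.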

In a simple network flow from a consumer of data at a premise to the source of data inside a cloud or other external location, we can begin to look at the many steps in the flow, and layers of redundancy that would add to our need for high availability:

Starting from the left, traffic from a customer's server flows through the premise devices and into a network. It may pass through multiple networks on its way to the cloud, whether internet, MPLS or others. It eventually arrives at a router or device in the data center (we will treat this as the Network Edge device), before flowing into the ECX Fabric (an advanced interconnection solution that improves performance by providing a direct, private network connection), to a cloud provider and finally to the source or storage of the data.

Any one of these could be duplicated and/or isolated from each other to form complete redundancy, as in this example in which physical path, physical device, and physical locations are all completely diverse from one another:

However, this design is probably very expensive and very cumbersome to maintain and keep in synchronization. Most users choose some subset of these to implement, such as:

Or:

Or any number of nearly infinite combinations.

Of course, Equinix only controls a small portion of this overall solution, so the HA features will focus on certain aspects of the “middle” of this scenario, from which you can “bolt on” whatever left and right side combinations you see fit in a relatively modular way.

Without an explicit HA feature added to your virtual devices, there are several points of redundancy already built into the Network Edge platform and ECX Fabric. For example, some provider destinations allow or require dual connections to their environment, and this can be done through a single ECX Fabric router or two, depending on how the provider sets up their side.

Note: The Network Edge device owner has little control over this.

Or:

The Network Edge device owner could always connect to the same cloud through another metro, assuming the cloud provider allows such combinations or the same data is reachable, as well:

This may introduce other potential points of failure, but additional options are possible to mitigate against them.

All of these scenarios are possible using a single Network Edge device with a combination of other services. With the High Availability service (detailed below), the virtual device is duplicated in various, user-selectable ways, to introduce several new options into the above drawings.

The Network Edge platform always maintains dual LAG ports into both the ECX Fabric and the Public Internet services from each compute cluster. The Public Internet service is also known as Equinix Connect, a turnkey solution in which customers connect their equipment to an Equinix-provided router or switch by means of one or more physical cross connects:

ECX Fabric Redundancy Design With Providers:

When connecting to other Equinix services, all have some level of core device redundancy. The ECX Fabric product has at least 2 core routers in each metro (and frequently more) that are exclusively designated as either primary or secondary devices and are also deployed in different physical data centers. You can read more about ECX Fabric redundancy in that documentation.

Any customer who is buying a connection to a redundant provider connects from two ports (at left) primary and secondary, to equivalent ports facing the cloud provider (at right). See an example below:

This model assumes a physical implementation where the Pri and Sec ports are attached to two devices. In the Network Edge platform with a virtual device, there is no such thing as two physical ports. Therefore, we implement High Availability via duplication of virtual devices into two planes (at left):

Occasionally a provider or other destination does not have secondary ports facing their environment. When this occurs, Equinix will terminate both the primary and secondary requests to the same port, where available, by passing the secondary connection back to the primary plane:

Other services such as Equinix Connect (also referred to as “additional internet bandwidth”) and IX are offered in a redundant manner. If the primary device that connects Network Edge to the Internet fails, a secondary one will automatically take over. It is not necessary for customers to specify the level of redundancy available here. Equinix recommends that customers have an additional option for accessing their device at all times, regardless of Equinix architecture. This might be a private MPLS connection or an additional public internet service.

High Availability Service

The Network Edge platform offers a High Availability service on most devices. Although there is a separate section of the deployment guide describing services, the HA service is a large, complex, and important one that deserves its own article to understand and consider options.

The HA service creates a second, duplicate device that is permanently paired with the main one that the user configures, and places each in a different “plane” from the other, regardless of where each device is located (see the different deployment implementations below). The HA setup is always active/active, and can be deployed in a variety of diversity scenarios, depending on availability per metro.

The system also mandates duplicated connections, services, and other configuration from that point forward, on each device in the pair.

Unless specifically mentioned otherwise or limited by available account structure, Network Edge resources, or other uncommon situations, selection of an HA service always includes the selection of one of the following four deployment scenarios:

Diverse Compute

Diverse IBX

Diverse Metro

Diverse Regions
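One way to reason about the four scenarios is as an ordered comparison of where the two devices sit. The sketch below is an interpretation of the scenario names, not Equinix's implementation; the `Placement` fields and the example location values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one device of the HA pair is deployed (hypothetical model)."""
    region: str
    metro: str
    ibx: str
    compute: str  # hypothetical identifier for the compute cluster/plane


def diversity_scenario(primary: Placement, secondary: Placement) -> str:
    """Return the widest level at which the two placements differ."""
    if primary.region != secondary.region:
        return "Diverse Regions"
    if primary.metro != secondary.metro:
        return "Diverse Metro"
    if primary.ibx != secondary.ibx:
        return "Diverse IBX"
    if primary.compute != secondary.compute:
        return "Diverse Compute"
    raise ValueError("devices share the same compute; the pair has no diversity")


# Placeholder values for illustration only.
pri = Placement(region="region-1", metro="metro-a", ibx="ibx-1", compute="plane-1")
sec = Placement(region="region-1", metro="metro-a", ibx="ibx-2", compute="plane-2")
print(diversity_scenario(pri, sec))
```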

Note: For diverse regions, the basic layout and behavior is no different from diverse metros. However, where possible, some differences may be enforced. These differences are mostly transparent to the user:

  • If a “more local” version of the cloud you want to connect to is available in each region, the system may enforce interconnection to that destination

  • Some Equinix management systems, such as TACACS, orchestration, and collection engines, will be deployed per region and may not have “knowledge” of equivalent systems in another region for certain synchronization needs.

  • IP address pools and other resources may be significantly different than each other; you may also need to check for IP address overlap in your private space if you intend to connect devices and clouds together in some combination.

  • User accounts must have purchasing authority in each country in order to deploy in this way
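The IP address overlap check mentioned above can be done before any devices or clouds are connected together. The following is a minimal sketch using Python's standard `ipaddress` module; the prefix names and CIDR ranges are hypothetical examples.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(named_prefixes):
    """Return every pair of names whose prefixes share at least one address."""
    nets = {name: ip_network(cidr) for name, cidr in named_prefixes.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical private address plan spanning two regions and one cloud network.
plan = {
    "primary-lan": "10.0.0.0/16",
    "secondary-lan": "10.1.0.0/16",
    "cloud-network": "10.0.128.0/20",
}
print(find_overlaps(plan))
```

In this example plan, the cloud network falls inside the primary LAN range, so that pair is reported and would need renumbering or translation before the two are routed together.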

In order for HA to work, the following must always be true:

Equinix recommends several rules and best practices for HA:

  • Users should keep their secondary devices as “close” to the primary as is reasonable for the level of diversity your organization is comfortable with. The “closest” would be the first scenario above, and the system will always default to this unless users tell us otherwise.

  • Users should always connect to the cloud provider or destination “closest” to the device. In general, this means that the device and the most frequently used providers should be in the same metro.

  • When you begin reaching across metros in any configuration or scenario, additional charges will apply on your ECX Fabric connections.

  • When a user decides they will connect to different locations, metros, or regions of the same provider, the ability to access the same data source for primary and secondary cannot be guaranteed by Equinix; we recommend you consult with your cloud provider. While some natively provide reach to every region from any connection point (or have services you can add to access them), some providers have full isolation and diversity of data, regions, and zones. You should never assume that connecting to Metro A and Metro B of the same provider will grant access to the same customer environment:

Example of multiple edges ultimately connecting to the same data stores in the public cloud you are trying to reach.

Example of when the edges do not reach the same data stores. The user could enable duplicate storage with the cloud provider; or in some cases, the cloud provider may tether their regions together with network services they offer. In all cases, Equinix cannot control these settings.

Equinix always allows the user to view either the primary or secondary device in inventory, and the user can toggle between them. They always appear as a pair, however, and are not separable other than by deleting both entirely.

Below is a table of various services and activities and rules about whether they are required to be in sync or not:

 

| Item | Category | Sync Enforced | Additional Rules |
|---|---|---|---|
| User Access | Service | Yes | User may opt to have the same credentials on both devices, or unique credentials on each |
| Primary Interface | Interface | Yes | |
| Cloud Interface | Interface | Yes* | Because connections on primary and secondary may differ, the BGP (Border Gateway Protocol) and peering settings per cloud interface and VRF may differ over time, but the requirement to keep both active with the same destinations remains intact |
| Primary Int ACL | Service | No | Allowable IP addresses may be different per device |
| VPN Tunnel | Service | Yes* | Users are required to maintain a pair of tunnels each time they create one, for primary and secondary. However, the remote location of each tunnel may be different, as well as the settings |
| Device Throughput | Device Config | Yes | |
| Device Vendor, Model | Device Config | Yes | |
| Device Software package | Device Config | Yes | |
| Device Notifications | Device Config | Yes* | Users may opt for a unique notification list on secondary; otherwise the primary list is notified on both |
| Connection | Connection | Yes | Users must always create two connections each time they opt to go to a destination on the ECXF. However, per the rules below, users may have some level of differentiation within the settings and destination of those connections, if the provider allows |
| Connection Speed | Connection Config | No* | Users may select different speeds per connection up to the total throughput possible of the device. Some providers mandate a single ID and uniform connection settings; these will be enforced by provider only |
| Device Delete | Device Activity | Yes | Users must delete both devices in the pair; at this time the pairing cannot be decoupled to keep one or both devices as is |
| BGP Settings | Service | No | Users should note that services may not operate properly unless appropriate BGP settings are placed on both connections. The system will not enforce the same settings, timing, or BGP state, in case users do not create them at the same time or wish to set up BGP settings that allow for route preferences |
| Additional Internet Bandwidth | Service | No* | Both devices will always have at least the baseline Internet service of 15 Mbps; users may increase the amount per the service specification but will not be required to have the same amount on both devices |
| Network Interface | Interface | No | TBD future capabilities |
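The rows marked "Yes" under Device Config can be read as a simple drift check between the two devices in a pair. The sketch below assumes a hypothetical dictionary representation of each device; the key names are invented for illustration and are not Network Edge API fields.

```python
# Settings the table marks "Yes" under Device Config (hypothetical key names).
STRICT_SYNC_KEYS = ("throughput_mbps", "vendor", "model", "software_package")


def sync_violations(primary: dict, secondary: dict) -> list:
    """List the strictly synchronized settings that differ between the pair."""
    return [
        key
        for key in STRICT_SYNC_KEYS
        if primary.get(key) != secondary.get(key)
    ]


# Placeholder device descriptions for illustration only.
pri = {"throughput_mbps": 500, "vendor": "vendor-x", "model": "m1",
       "software_package": "base", "acl": ["198.51.100.0/24"]}
sec = {"throughput_mbps": 500, "vendor": "vendor-x", "model": "m1",
       "software_package": "base", "acl": ["203.0.113.0/24"]}

# The ACLs differ, which the table permits, so no violation is reported.
print(sync_violations(pri, sec))
```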

Creating Connections with HA

The Network Edge platform always enforces dual connections when an HA-paired device is selected as the A-end of any connection through the ECXF. Learn how to carry out the workflow in Creating A Connection.

While some providers may vary requirements, there are a few things to know about redundant connections:

  • Some providers do not mandate redundancy at all but HA devices will enforce it anyway

  • Users may choose to initiate a connection from either the secondary or the primary, but the system will enforce it with the other every time

  • Unless a provider mandates the same speed, Network Edge systems will allow differential connection speeds from each device. Equinix always recommends that you keep settings as similar as possible

  • If one connection request fails for any reason, the system will deprovision both, and a best effort will be made to inform the user why it failed. Because all connections are paired, the system queues both until they are able to proceed successfully.

  • The connection selected may never exceed the provisioned throughput of the device, even if the provider does not offer speeds that align with your device settings. This could also mean that theoretically there may be no viable solution from a device whose throughput is smaller than the smallest possible allowed speed to a provider. This is quite rare.

  • Although a provider may require redundancy to their cloud on primary and secondary in the same metro, the Network Edge platform does not require that your connections originate in that same metro. You should consult with your provider to ensure this is acceptable and that traffic will flow appropriately

  • When a user deletes one connection, the other one must also be deleted
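Several of the rules above can be expressed as a pre-flight check on a pair of connection requests. This is a rough sketch under stated assumptions, not the platform's validation logic: `ConnectionRequest`, `validate_pair`, and the parameter names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConnectionRequest:
    """One side of a paired connection request (hypothetical model)."""
    speed_mbps: int


def validate_pair(
    primary: Optional[ConnectionRequest],
    secondary: Optional[ConnectionRequest],
    device_throughput_mbps: int,
    provider_requires_same_speed: bool = False,
) -> List[str]:
    """Return a list of rule violations; an empty list means the pair looks valid."""
    if primary is None or secondary is None:
        return ["HA devices require both a primary and a secondary connection"]

    errors = []
    for label, conn in (("primary", primary), ("secondary", secondary)):
        if conn.speed_mbps > device_throughput_mbps:
            errors.append(f"{label} speed exceeds the device throughput")
    if provider_requires_same_speed and primary.speed_mbps != secondary.speed_mbps:
        errors.append("provider requires identical speeds on both connections")
    return errors


# Different speeds are acceptable unless the provider mandates otherwise.
print(validate_pair(ConnectionRequest(200), ConnectionRequest(50), 500))
print(validate_pair(ConnectionRequest(200), ConnectionRequest(50), 500, True))
```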

When a device or connection is provisioning, each goes into a queue.

Some common redundant connection models:

| Provider | Provider requirement | HA connect behavior | Failure behavior |
|---|---|---|---|
| Amazon Web Services Direct Connect | Does NOT require redundancy | AWS has primary and secondary ports in all Equinix markets and full diversity will be maintained. User needs a single AWS ID for both. User may choose any metro for any combination of primary and secondary | Both connections are active/active |
| Google Cloud Interconnection | Does NOT require redundancy; has two discrete edges to connect to; at customer discretion | Google maintains two discrete services on ECX Fabric for Zone 1 and Zone 2. To achieve redundancy, users should select one of each. Google issues IDs for both in a fully redundant scenario, but user may choose any combination of Zone 1 and 2 | Both connections are active. Google zones are never taken down at the same time |
| Microsoft Azure ExpressRoute | Always requires redundancy in the same metro | System will enforce a single ID and single metro for the Z-end of both connections. The connections MUST be the same speed | Both connections are active |
| Oracle Cloud Interconnect | Does NOT require redundancy | System requires a discrete ID for each connection; may be different sizes | |
| General provider with redundancy option | Does NOT require redundancy | System will enforce any rules spelled out by the provider in the service profile | At provider discretion |
| General provider with redundancy requirement | Always requires redundancy | System will enforce any rules spelled out by the provider in the service profile | At provider discretion |
| General provider with no redundancy options | Does NOT require redundancy | System will enforce any rules spelled out by the provider in the service profile; system will terminate both connections to the same provider port | Provider port failure will result in service outage |
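The provider-specific behavior in the table can be captured as data and checked against the Z-end of each connection pair. The sketch below encodes only what the table states; a value of `None` means the table does not say, so the rule is not checked. The class names, dictionary keys, and example IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProviderRules:
    single_id: Optional[bool]   # True: one ID for both; False: discrete ID each
    same_metro: Optional[bool]  # True: both Z-ends must be in one metro
    same_speed: Optional[bool]  # True: both connections must match in speed


# Transcribed from the table above; None marks anything the table leaves unstated.
RULES = {
    "aws": ProviderRules(single_id=True, same_metro=False, same_speed=None),
    "azure": ProviderRules(single_id=True, same_metro=True, same_speed=True),
    "oracle": ProviderRules(single_id=False, same_metro=None, same_speed=False),
}


@dataclass(frozen=True)
class ZEnd:
    provider_id: str
    metro: str
    speed_mbps: int


def check_pair(provider: str, pri: ZEnd, sec: ZEnd) -> List[str]:
    """Return the table rules that this primary/secondary pair would break."""
    rules = RULES[provider]
    problems = []
    if rules.single_id is True and pri.provider_id != sec.provider_id:
        problems.append("provider expects a single ID for both connections")
    if rules.single_id is False and pri.provider_id == sec.provider_id:
        problems.append("provider expects a discrete ID per connection")
    if rules.same_metro and pri.metro != sec.metro:
        problems.append("provider expects both connections in the same metro")
    if rules.same_speed and pri.speed_mbps != sec.speed_mbps:
        problems.append("provider expects both connections at the same speed")
    return problems


# Placeholder IDs and metros for illustration only.
print(check_pair("azure", ZEnd("id-1", "metro-a", 200), ZEnd("id-1", "metro-b", 100)))
```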

There are many possible combinations of redundancy that a user may choose or be obligated to do, depending on the available assets in each metro. In descending order, starting with the most cost effective and performance optimized layout, below is every combination you are likely to see when creating an HA device pair, and then creating connections. For the purpose of these simple diagrams, it is not relevant which is primary and which is secondary; though Equinix always recommends that customers design so the primary achieves the best performance.

Devices And Clouds Are All In The Same Metro

Possible Use Cases:

  • Traditional single metro deployment

  • CSP may require redundancy in a single metro

Pros:

  • Best performance

  • Least cost (will pay local ECX Fabric connection charges only)

  • Both primary and secondary paths will be extremely similar in performance

Cons:

  • Does not protect from wide metro-level failure

Devices Are Metro Diverse And There Is A Local Path To The Same Cloud In Each Metro

Possible Use Cases:

  • Full metro diversity w/ access to same data stores

  • Maintain consistent performance locally even though not in same metro

Pros:

  • Both primary and secondary paths will be similar in performance because both stay local

  • Avoids inter metro/remote charges

Cons:

  • Can only be done when CSP does not enforce pri/sec in same metro

Metro Diversity For Devices And Cloud Is In Primary

Possible Use Cases:

  • Metro diversity but CSP requires primary and secondary in the same metro

  • Second metro homes back to same CSP region to access same data set

Pros:

  • Devices maintain metro-level diversity; each can still be used to aggregate local traffic first and connect to other local clouds

Cons:

  • If connection to the Metro A cloud is the primary use case, and Metro A fails, the device in Metro B is largely useless

Devices In Same Metro But Clouds Are Metro Diverse

Possible Use Cases:

  • Cloud provider did not have a redundant/secondary connection in your deployed metro

Pros:

  • Minimizes management and cost for most of the solution

  • Same metro diversity for aggregation of branch or premise locations

Cons:

  • Performance, latency will be inconsistent and unpredictable

Devices In Metro And Cloud In Different Metro

Possible Use Cases:

  • Local deployment of devices to aggregate and route branch traffic before sending to cloud provider that is not present in your metro

Pros:

  • Performance will be consistent, predictable on both primary and secondary

Cons:

  • Cost and overall performance of remote connections

Devices In Metro And Clouds Are Metro Diverse

Possible Use Cases:

  • Cloud provider does not have redundancy in the closest metros

Pros:

  • Aggregate and route traffic locally before sending to cloud

  • Route only necessary traffic across the metros and minimize cost with smaller connection speeds

Cons:

  • Could be differences in performance and distance

Cloud Is In Single Metro And Devices Are Metro Diverse Not In Same As Cloud

Possible Use Cases:

  • Cloud provider requires redundancy in same metro

  • Aggregation of local traffic from multiple sources to a single data store

Pros:

  • Ensures access to same cloud provider resources and services

Cons:

  • Cost, performance of links
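The seven layouts above differ only in which metros hold the two devices and the two cloud connections. The sketch below maps those four metros to the section headings. It is one interpretation of the headings and use cases, since the original diagrams are not reproduced here, and the function name and metro labels are hypothetical.

```python
def classify_layout(dev_pri: str, dev_sec: str, cloud_pri: str, cloud_sec: str) -> str:
    """Map device and cloud metros to one of the layouts described above."""
    devices_same = dev_pri == dev_sec
    clouds_same = cloud_pri == cloud_sec

    if devices_same and clouds_same:
        if cloud_pri == dev_pri:
            return "Devices And Clouds Are All In The Same Metro"
        return "Devices In Metro And Cloud In Different Metro"

    if devices_same:
        # Clouds are split; check whether one of them is local to the devices.
        if dev_pri in (cloud_pri, cloud_sec):
            return "Devices In Same Metro But Clouds Are Metro Diverse"
        return "Devices In Metro And Clouds Are Metro Diverse"

    if clouds_same:
        # Devices are split; check whether the cloud sits with one of them.
        if cloud_pri in (dev_pri, dev_sec):
            return "Metro Diversity For Devices And Cloud Is In Primary"
        return "Cloud Is In Single Metro And Devices Are Metro Diverse Not In Same As Cloud"

    if cloud_pri == dev_pri and cloud_sec == dev_sec:
        return ("Devices Are Metro Diverse And There Is A Local Path "
                "To The Same Cloud In Each Metro")
    return "Unclassified"


print(classify_layout("metro-a", "metro-b", "metro-a", "metro-a"))
```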

What Happens During An Outage

While there are many points of failure in the solution, the creation of an HA solution ensures the highest SLA possible. Refer to the legal agreements and product documents for details and rules.

Because both solutions are active/active at all times, there is no need for a failover time or ongoing sync to determine what devices are running adequately. The secondary solution is running all the time.

The customer is responsible for ensuring that traffic is flowing properly, and that the secondary solution is not being used for additional, non-duplicative traffic needs, although there is nothing to stop this from occurring. If a customer is doing this and a failure occurs, and some traffic is lost or dropped, it will negate the SLA for that incident.

Many customers opt to use other customized means of routing traffic over one path or the other. This can be accomplished in many ways, such as AS path prepending or BGP communities. Although these are not explicitly supported by Network Edge services, they can often be configured manually through the CLI or by other methods.
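To illustrate why AS path prepending steers traffic, the following sketch models only the AS path length step of BGP route selection. It assumes every attribute evaluated earlier in the decision process (such as local preference) is equal on both paths. The ASN is taken from the private range, and the function names are hypothetical; this is a model of the idea, not device configuration.

```python
def advertised_path(local_asn: int, extra_prepends: int = 0) -> list:
    """AS path a neighbor sees: the origin ASN plus any extra prepended copies."""
    return [local_asn] * (1 + extra_prepends)


def preferred(paths: dict) -> str:
    """Pick the announcement with the shortest AS path (earlier tie-breakers assumed equal)."""
    return min(paths, key=lambda name: len(paths[name]))


LOCAL_ASN = 65001  # private-use ASN, chosen for illustration

announcements = {
    "primary": advertised_path(LOCAL_ASN),
    "secondary": advertised_path(LOCAL_ASN, extra_prepends=2),
}

# The secondary path is longer, so the remote side prefers the primary.
print(preferred(announcements))
```

Both devices still advertise the route, so the pair remains active/active; the longer path on the secondary only makes it less preferred while the primary is available.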

Consult with your engineering teams for setup and configuration assistance.