I’ve been fortunate to work in a space where I’ve been able to see and discuss the infrastructure, architecture, and design of some of the most heavily utilized and scalable networks in the world. Living in Silicon Valley has certainly helped, as I’ve probably learned just as much outside of work as I have on the job.
What I found most interesting is how highly automated the data centers of these companies have become. Given the level of automation and workflow orchestration, I consider these networks “intelligent” and extremely fluid in how they automatically adapt to their surrounding environment.
A few examples of the automated capabilities include the following:
- Identifying problem areas in the network/infrastructure or application and automatically generating trouble tickets
- Self-healing or automatically detecting and fixing issues
- Redirecting workloads based on workload type, workload size, and on current local link and route utilization
- Traffic engineering based on known and consistent traffic patterns that are automatically learned and analyzed
- Automatically identifying problems in the network/infrastructure or application by analyzing trends and historical data
- Detecting security threats by behavioral and historical data analysis
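To make the "analyzing trends and historical data" idea a bit more concrete, here is a minimal Python sketch of flagging a link whose utilization drifts far from its historical baseline and generating a ticket for it. The link name, the sample values, the three-standard-deviation threshold, and the `open_ticket` helper are all hypothetical and chosen only for illustration; this is not how any of the companies mentioned here actually implement it.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, threshold=3.0):
    """Return (index, value) pairs from `recent` that sit more than
    `threshold` standard deviations away from the historical mean."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value counts as an anomaly.
        return [(i, v) for i, v in enumerate(recent) if v != baseline]
    return [
        (i, v)
        for i, v in enumerate(recent)
        if abs(v - baseline) / spread > threshold
    ]


def open_ticket(link, index, value):
    """Hypothetical stand-in for a call into a ticketing system."""
    return f"[AUTO] {link}: sample {index} utilization {value:.0f}% outside normal range"


# Hypothetical link utilization samples, in percent.
history = [41, 39, 40, 42, 38, 40, 41, 39]
recent = [40, 43, 95, 41]

tickets = [
    open_ticket("leaf1-spine2", i, v)
    for i, v in find_anomalies(history, recent)
]
for ticket in tickets:
    print(ticket)
```

With these made-up samples only the 95% reading is flagged; the 43% reading stays within the normal spread, so no ticket is raised for it.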
The other thing I noted is the push to keep the hardware infrastructure as simple as possible with minimal requirements, while focusing on software for automation and reliability.
If you haven’t already, read this article from Facebook network engineer Alexey Andreyev. Some interesting statements from it are quoted below:
“A large fabric network – which has a more complex topology and a greater number of devices and interconnects – is definitely not the kind of environment that can be realistically configured and operated in a manual way. But the uniformity of the topology helps enable better programmability, and we can use software-based approaches to introduce more automation and more modularity to the network.”
“We developed a centralized BGP controller that is able to override any routing paths on the fabric by pure software decisions. We call this flexible hybrid approach ‘distributed control, centralized override.’”
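One way to picture “distributed control, centralized override” is the following Python sketch: each device normally picks its own path, but a central table can force a specific next hop for a prefix. This is only my illustration of the idea under simplifying assumptions. The data layout, the function names, and the shortest-AS-path rule are hypothetical, real BGP best-path selection has many more steps, and none of this is taken from Facebook’s controller.

```python
def distributed_best_path(candidates):
    """Stand-in for a device's local decision: prefer the shortest AS path,
    breaking ties by next-hop name."""
    return min(candidates, key=lambda p: (len(p["as_path"]), p["next_hop"]))


def select_path(prefix, candidates, overrides):
    """Use the centrally forced next hop if one is set and still available;
    otherwise fall back to the device's own distributed decision."""
    forced = overrides.get(prefix)
    if forced is not None:
        for path in candidates:
            if path["next_hop"] == forced:
                return path
    return distributed_best_path(candidates)


# Hypothetical paths learned for one prefix.
candidates = [
    {"next_hop": "spine1", "as_path": [65001, 65010]},
    {"next_hop": "spine2", "as_path": [65002, 65010]},
    {"next_hop": "spine3", "as_path": [65003, 65020, 65010]},
]

# No override: the local decision wins.
print(select_path("10.0.0.0/24", candidates, {})["next_hop"])

# Central override: software forces a path the local decision would not pick.
print(select_path("10.0.0.0/24", candidates, {"10.0.0.0/24": "spine3"})["next_hop"])
```

Note that if the forced next hop is not among the available paths, the sketch quietly falls back to the local decision, so the network keeps forwarding even when the central table is stale.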
In this article, Facebook’s Director of Network Engineering, Najam Ahmad, mentions a network automation tool they use, Facebook Auto-Remediation (FBAR), which sifts through 3.37 billion notifications from network devices each month and filters them down to roughly 750,000 alarms that require action. According to the article, FBAR resolves 99.6 percent of those alarms without human intervention.
I remember visiting Facebook a few years back and being surprised when one of the network engineers told me they had moved over to the networking team from one of the Facebook development teams. But it actually makes a lot of sense given Facebook’s approach of treating its network as one large distributed system where automation is crucial. Interestingly, the same article mentioned above describes this workforce transition at Facebook.
Now, it’s fair to say that the level of automation and orchestration mentioned above is not achieved in the average enterprise environment. There could be several reasons for this, including a lack of resources or investment, or of the necessary skill set.
I’ve been working on the VMware NSX team since last July, and what’s fascinating to me is that the approach taken by many of these successful high-tech companies is the same approach VMware takes with NSX: keep the hardware deployment and IP connectivity simple and implement the intelligence and automation with software. Once you follow this strategy, it becomes drastically easier to build “intelligent” and highly automated networks.
The nice thing about VMware NSX, for those looking to implement a similar software-based approach to networking, is that it is a pre-packaged solution ready to be deployed on any physical network and consumed via its REST API and existing automation and management tools.
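To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the three figures quoted above; the variable names are mine.

```python
notifications_per_month = 3_370_000_000  # raw notifications from network devices
alarms_per_month = 750_000               # what is left after noise filtering
auto_resolved_rate = 0.996               # share of alarms resolved without humans

# Share of raw notifications that become actionable alarms.
alarm_fraction = alarms_per_month / notifications_per_month

# Alarms handled automatically vs. alarms left for engineers.
auto_resolved = round(alarms_per_month * auto_resolved_rate)
needs_human = alarms_per_month - auto_resolved

print(f"Notifications that become alarms: {alarm_fraction:.4%}")
print(f"Alarms resolved automatically:    {auto_resolved:,}")
print(f"Alarms left for humans:           {needs_human:,}")
```

In other words, only about 0.02 percent of notifications become alarms, and of those roughly 3,000 a month still need a person to look at them.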