I also published this blog post about the CDO Mode feature introduced in NSX-V 6.3 on the VMware NSX Network Virtualization Blog on March 4, 2017. The full post is provided below and can also be read on the VMware NSX Network Virtualization Blog site.
VMware NSX Network Virtualization Blog
Title: NSX-V 6.3: Control Plane Resiliency with CDO Mode
Author: Humair Ahmed
Date Published: March 4, 2017
NSX-V 6.3, released in February, introduced many new features. In my last blog post, NSX-V 6.3: Cross-VC NSX Security Enhancements, I discussed several new Cross-VC NSX security features. In this post, I'll discuss another new feature, Controller Disconnected Operation (CDO) mode, which provides additional resiliency for the NSX control plane. Note: in NSX-V 6.3.1, CDO mode is a tech preview feature; the feature GA'ed in NSX-V 6.3.2.
The NSX Controllers already offer inherent resiliency for the control plane by design in several ways:
- complete separation of control plane and data plane (even if the entire controller cluster is down, the data plane keeps forwarding)
- a controller cluster of three nodes allows for the loss of a controller with no disruption to the NSX control plane
- vSphere HA provides additional resiliency by recovering the respective NSX Controller on another node if the host it's running on fails
For the reasons mentioned above, it is rare and unlikely that communication with the entire NSX Controller Cluster would be lost. In NSX-V 6.3, this control plane resiliency is enhanced even further via CDO mode.
CDO mode targets specific scenarios where control plane connectivity is lost; for example, a host losing connectivity to the controller cluster, or the NSX Controllers themselves being down. CDO mode enhances control plane resiliency for both single-site and multi-site environments. However, multi-site environments and typical multi-site solutions such as disaster recovery (DR) provide a good use case for CDO mode; this is explained in more detail further below. Below I dig into the details of how CDO mode works and how it provides additional resiliency for specific scenarios.
CDO mode is enabled from the NSX Manager at the transport zone level. It can be enabled on a local transport zone and/or on a universal transport zone. When enabled on a universal transport zone, it must be enabled from the Primary NSX Manager. The screenshot below shows CDO mode being enabled on a universal transport zone via the Primary NSX Manager.
In the initial release of the CDO mode feature, it can be enabled on multiple transport zones only if each of the transport zones is on a different VDS. If a VDS is shared by a universal transport zone and a local transport zone, CDO mode can still be enabled on the universal transport zone but not on the local transport zone; this allows for use of CDO mode on the universal transport zone, where it is likely preferred for Cross-VC NSX and multi-site use cases.
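To make this constraint concrete, below is a minimal Python sketch, purely conceptual rather than an NSX API call, that models the rule of one CDO-enabled transport zone per VDS; the transport zone and VDS names are hypothetical.

```python
# Conceptual sketch only: models the "one CDO-enabled transport zone per VDS"
# restriction described above. It does not call any NSX API.

# Hypothetical transport zone -> VDS mapping for illustration
transport_zones = {
    "universal-tz": {"vds": "vds-compute", "cdo_enabled": False},
    "local-tz":     {"vds": "vds-compute", "cdo_enabled": False},
    "edge-tz":      {"vds": "vds-edge",    "cdo_enabled": False},
}

def can_enable_cdo(tz_name: str) -> bool:
    """True if no other transport zone on the same VDS already has CDO enabled."""
    vds = transport_zones[tz_name]["vds"]
    return not any(
        tz["cdo_enabled"] and tz["vds"] == vds
        for name, tz in transport_zones.items()
        if name != tz_name
    )

# Enabling CDO on the universal transport zone succeeds...
if can_enable_cdo("universal-tz"):
    transport_zones["universal-tz"]["cdo_enabled"] = True

# ...but the local transport zone sharing the same VDS is then rejected.
print(can_enable_cdo("local-tz"))  # False -- vds-compute already has a CDO-enabled TZ
print(can_enable_cdo("edge-tz"))   # True  -- different VDS
```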
When CDO Mode is enabled, the next available VNI is designated for the CDO Logical Switch, which all hosts of the transport zone join. In the below example, I have not yet created any universal logical networks, so it selects the first available VNI from the Universal Segment ID Pool I configured when I set up my Cross-VC NSX environment. In this case, the CDO Logical Switch VNI is 900000 since my configured Universal Segment ID Pool is 900000 – 909999.
Looking at the logical switches in the GUI, it can be seen in the below screenshot that I have not yet created any local or universal logical switches and the CDO Logical Switch is not listed. Since the CDO Logical Switch is used only for control plane purposes, it is not visible under the Logical Switches tab.
If I create a new logical switch in the universal transport zone, it can be seen below that it skips VNI 900000 and selects VNI 900001, since VNI 900000 is already in use by the CDO Logical Switch.
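As a rough illustration of this allocation behavior, the Python sketch below models a segment ID pool in which the CDO Logical Switch takes the first free VNI and the next logical switch skips it; the pool range mirrors the 900000 – 909999 example above, but the allocator itself is purely illustrative, not NSX code.

```python
# Conceptual model of VNI allocation from a segment ID pool; the pool range
# mirrors the Universal Segment ID Pool used in this example environment.

class SegmentIdPool:
    def __init__(self, start: int, end: int):
        self.start, self.end = start, end
        self.allocated = set()

    def next_available(self) -> int:
        """Return the lowest VNI in the pool that has not been handed out yet."""
        for vni in range(self.start, self.end + 1):
            if vni not in self.allocated:
                self.allocated.add(vni)
                return vni
        raise RuntimeError("Segment ID pool exhausted")

pool = SegmentIdPool(900000, 909999)

# Enabling CDO mode takes the next available VNI for the CDO Logical Switch...
cdo_vni = pool.next_available()        # 900000

# ...so the first user-created logical switch skips it and gets the next VNI.
first_ls_vni = pool.next_available()   # 900001

print(cdo_vni, first_ls_vni)           # 900000 900001
```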
When CDO mode is enabled on a transport zone, all hosts in the transport zone join the CDO Logical Switch (next available VNI), and one controller in the cluster is designated with the responsibility of updating all hosts in the transport zone with the VTEP information of every other host in the transport zone. Since all hosts are members of the CDO Logical Switch, this effectively creates a Global VTEP List that is initially populated while control plane connectivity is up. If control plane connectivity is later lost, this Global VTEP List will be utilized.
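The Python sketch below is a loose model of that behavior: the designated controller assembles the VTEPs of all transport-zone members and pushes the list to every host while connectivity is up. The VTEP IPs are the ones used later in this example; everything else is illustrative only.

```python
# Loose conceptual model of the Global VTEP List built via the CDO Logical Switch.
# One controller pushes the VTEP of every transport-zone host to every other host,
# so each host knows all VTEPs even before sharing any workload VNI with them.

transport_zone_hosts = {
    "host-1": "192.168.125.51",   # VTEP IPs from this example environment
    "host-2": "192.168.135.51",
}

def build_global_vtep_list(hosts: dict) -> list:
    """Designated controller assembles the VTEPs of all transport-zone members."""
    return sorted(hosts.values())

def push_to_hosts(hosts: dict, global_vtep_list: list) -> dict:
    """Each host caches the full list while control plane connectivity is still up."""
    return {name: list(global_vtep_list) for name in hosts}

global_vtep_list = build_global_vtep_list(transport_zone_hosts)
host_caches = push_to_hosts(transport_zone_hosts, global_vtep_list)

print(host_caches["host-2"])   # ['192.168.125.51', '192.168.135.51']
```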
Even prior to the introduction of CDO Mode in NSX-V 6.3, due to the complete separation of the control and data planes, if control plane connectivity was lost as shown in the below figure, the data plane would continue to forward as expected. However, if control plane connectivity or the controllers were down and workloads on a logical switch moved to another host that was not already a member of that logical switch (the VTEP of the host not being a member of the VNI), there would be data plane disruption. CDO mode targets this specific scenario to provide overall better control plane resiliency.
I step through the behavior before and after CDO mode is enabled below. In this example, I use universal networks but use only one site for ease of demonstration/explanation.
In the below scenario, the NSX Controller Cluster is up and two VMs/workloads on Host 1 are on the same universal logical switch with VNI 900002. Here, prior to NSX-V 6.3 or with CDO disabled in NSX-V 6.3, everything works normally and communication flows as expected. In Figure 6 further below, looking at the VTEP table for VNI 900002 via Central CLI on the NSX Manager, it can be seen that the controllers have been informed that Host 1 (VTEP IP 192.168.125.51) is a member of the logical switch with VNI 900002. If other hosts had VMs/workloads on this same logical switch, the controllers would also have their VTEP entries and would ensure the VTEP table for VNI 900002 is distributed to all respective hosts that are members of the logical switch.
In this case, even if control plane connectivity or the NSX Controller Cluster were to go down as shown in Figure 7 below, communication between the VMs would continue to work and communication with any other VMs on the same universal logical switch on other hosts would also continue to work because the NSX controllers would have already distributed the correct VTEP table information to all respective hosts.
Even in the scenario below, where the two VMs communicating are on different hosts, communication would still continue to work if control plane connectivity or the NSX Controller Cluster were to go down. Again, this is because the two hosts already had VMs on the same logical switch and, as such, both hosts were already members of the logical switch/VNI.
Prior to shutting the NSX Controllers down for this example, I ran the below command from NSX Manager Central CLI to confirm both Host 1 (VTEP IP 192.168.125.51) and Host 2 (VTEP IP 192.168.135.51) were known to be members of VNI 900002.
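For reference, the same VTEP table can be pulled programmatically. The sketch below wraps what I understand to be the NSX Manager Central CLI API (POST /api/1.0/nsx/cli?action=execute); the endpoint, manager address, and credentials are assumptions to verify against the NSX-V API guide for your release.

```python
# Sketch: querying the controller's VTEP table for VNI 900002 through the NSX
# Manager Central CLI API. The endpoint, manager address, and credentials are
# assumptions -- verify against the NSX-V API documentation for your release.
import requests

NSX_MANAGER = "https://nsxmgr-01a.corp.local"   # hypothetical NSX Manager address
AUTH = ("admin", "changeme")                    # placeholder credentials

def central_cli(command: str) -> str:
    """Execute an NSX Central CLI command via the NSX Manager API and return raw text."""
    body = f"<nsxcli><command>{command}</command></nsxcli>"
    resp = requests.post(
        f"{NSX_MANAGER}/api/1.0/nsx/cli?action=execute",
        data=body,
        headers={"Content-Type": "application/xml", "Accept": "text/plain"},
        auth=AUTH,
        verify=False,   # lab environment with a self-signed certificate
    )
    resp.raise_for_status()
    return resp.text

# Controller-master view of the VTEP table for the universal logical switch
print(central_cli("show logical-switch controller master vni 900002 vtep"))
```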
In this scenario, even if the NSX Controllers are shut down, the VTEP information for the universal logical switch VNI 900002 has already been distributed to the respective ESXi hosts as shown in Figure 10 and Figure 11 below. This is why there is no disruption to data plane communication even if control plane connectivity is lost.
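The per-host copies of the VTEP table can be checked the same way; the snippet below reuses the central_cli() helper from the previous sketch, with the host IDs as placeholders for the actual vCenter host IDs in the environment.

```python
# Host-side view of the distributed VTEP table for VNI 900002, reusing the
# central_cli() helper from the previous sketch. "host-1" and "host-2" are
# placeholders for the real vCenter host IDs in this environment.
for host_id in ("host-1", "host-2"):
    print(central_cli(f"show logical-switch host {host_id} vni 900002 vtep"))
```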
In a scenario similar to the above, described in Figure 12 below, where two VMs on the same universal logical switch on Host 1 are communicating, if control plane connectivity or the NSX Controller Cluster were to go down and one of the VMs was then moved (either manually or automatically) to Host 2, communication would still continue.
The reason for this is that another VM already exists on the same universal logical switch on Host 2, and, as such, Host 2 is already a member of the universal logical switch VNI 900002. Prior to control plane loss, the host/VTEP membership information for the logical switch was already distributed to all other hosts that have membership. When the VM moves to Host 2 during the control plane connectivity loss period, a RARP is sent over the data plane to update any stale MAC table entries, and communication continues to work. Similarly, if a new VM is powered on on Host 2 during the control plane loss period, it would also be able to communicate with the VM on Host 1.
The specific scenarios that CDO Mode targets are the following:
- When a VM moves, either by manual vMotion/intervention or in an automated way (such as DRS), to another host that was not a member of the respective logical switch before control plane connectivity loss.
- When a new VM connected to a logical switch is powered on on a host that was not a member of the respective logical switch before control plane connectivity loss.
In both of the above scenarios, a new host has become a member of a specific logical switch/VNI; however, since control plane connectivity is lost or the controllers are unavailable, the NSX Controllers cannot be notified of the new member for the logical switch, and, without CDO mode, the new logical switch membership information cannot be distributed to the other hosts. Figure 13 below helps visualize the issue.
With the CDO mode feature introduced in NSX-V 6.3, both of the scenarios mentioned above are handled by using the Global VTEP List. As mentioned earlier, all hosts in the CDO-enabled transport zone automatically become members of the CDO Logical Switch (next available VNI). When a host determines control plane connectivity is lost for the logical switch in question, the Global VTEP List is leveraged and all BUM traffic is sent to all members of the transport zone. Figure 14 below helps visualize the result.
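A rough Python sketch of the host-side decision is below: under normal conditions BUM traffic is replicated only to the VTEPs the controllers reported for that VNI, but once control plane connectivity is lost the host falls back to the Global VTEP List learned over the CDO Logical Switch. This is a conceptual model only, assuming the two-host example environment above; it is not NSX code.

```python
# Conceptual model of how a host picks replication targets for BUM traffic when
# CDO mode is enabled. Not NSX code -- just the decision logic described above.

LOCAL_VTEP = "192.168.135.51"   # this host's VTEP (Host 2 in the example)

# Per-VNI VTEP table as last learned from the controllers; a VNI that only became
# active on this host after connectivity loss would be missing from it.
vni_vtep_table = {900002: ["192.168.125.51", "192.168.135.51"]}

# Global VTEP List learned over the CDO Logical Switch while the control plane was up.
global_vtep_list = ["192.168.125.51", "192.168.135.51"]

def bum_replication_targets(vni: int, control_plane_up: bool) -> list:
    """Return the VTEPs to which BUM traffic for this VNI should be replicated."""
    if control_plane_up and vni in vni_vtep_table:
        targets = vni_vtep_table[vni]   # normal case: controller-learned members only
    else:
        targets = global_vtep_list      # CDO fallback: all transport-zone members
    return [vtep for vtep in targets if vtep != LOCAL_VTEP]

print(bum_replication_targets(900002, control_plane_up=True))    # controller-learned list
print(bum_replication_targets(900003, control_plane_up=False))   # falls back to the global list
```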
In conclusion, CDO mode brings additional resiliency to the NSX control plane for specific scenarios and adds to the overall robustness of both single-site and multi-site solutions. For more information on NSX-V 6.3, check out the NSX-V 6.3 documentation.
Follow me on Twitter: @Humair_Ahmed