All about the 4.x SD WAN Routing Behaviour Part 1: Overlay Flow Control (OFC), Distributed Cost Calculation (DCC) and Lost Reason Codes

 

As already described in my blog post about lost OFC routes

https://sd-wahn.blogspot.com/2022/01/where-have-all-ofc-routes-gone-or-my.html

there were considerable changes in routing behaviour starting with version 3.4.0.

Excerpt from VMware SD-WAN Operator Guide Version 4.5

Configure Distributed Cost Calculation

By default, the Orchestrator is actively involved in learning the dynamic routes. VMware SD-WAN Edges and Gateways rely on the Orchestrator to calculate initial route preferences and return them to the Edge and Gateway. The Distributed Cost Calculation feature enables you to distribute the route cost calculation to the Edges and Gateways.

Note: Enabling Distributed Cost Calculation is recommended for all customers.

This default method of involving the Orchestrator in both dynamic route calculation and the distribution of those routes to Edges and Gateways has drawbacks of significant higher route convergence time and dependency on presence of VCO for routing updates

When a customer enterprise uses Distributed Cost Calculation, the Orchestrator is no longer actively involved in the route preference calculation and instead routes are properly inserted in order by the Edge and Gateway instantly upon learning them and then convey these preferences to the Orchestrator.

When you choose to enable Distributed Cost Calculation for the Edges and Gateways, the feature provides the following benefits:

  • Minimizes the impact on route learning when an Orchestrator is unreachable.
  • Route convergence time is reduced from minutes to seconds in large networks with thousands of dynamic routes.
  • Network delays are significantly reduced.
  • Provides instantaneous Data Plane convergence.
  • Supports enhanced re-ordering and pinning of routes on the Overlay Flow Control.
  • Provides an option to refresh routes in the Overlay Flow Control page. Whenever there is a change in the Overlay Flow Control policy, the Refresh Routes option applies the changes to the existing routes immediately, without the need to restart the Edge or Gateway.

So Edges and Gateways now do their own calculation based on their local view of the routes. The OFC route table is therefore only a copy, a non-normative display of the learned routes, although OFC still forwards the configured preferences and route order to the Edges and Gateways.


Whenever an administrator experiences problems, the only real reference for routing is the local routing table of the Edge or the Gateway.

NOTE: This is not the common Linux routing table; a separate internal routing daemon is used for this underlay/overlay routing.


However, when running the remote diagnostic, some fields and information in the output are unfortunately not documented anywhere.


So let's start with the Type field:

Routes for the same prefix are sorted by priority and each route is marked with a type.

Type = Cloud

Route to the VCG or offload directly to the interface

Type = Edge

Routes learned from another VCE as a result of Cloud VPN being enabled

Type = OSPF/BGP

Underlay routes that have a next-hop IP

Type = Connected

Local, connected IP addresses and networks

Type = NON-Velocloud Site

Non-SD-WAN connected network (NSD)

Type = N/A

Not documented (seems to be the local /32 addresses of the WAN interfaces)
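To make the type list easier to work with when reading a route table dump, here is a minimal Python sketch. It assumes the dump has already been turned into a list of dictionaries; the keys `prefix` and `type` and the sample rows are my own hypothetical choices, not an official export format. The descriptions simply repeat the observations from this post.

```python
from collections import defaultdict

# Route types as observed in this post (not taken from official documentation).
ROUTE_TYPES = {
    "Cloud": "route to the VCG or direct offload to the interface",
    "Edge": "learned from another VCE via Cloud VPN",
    "OSPF": "underlay route with a next-hop IP",
    "BGP": "underlay route with a next-hop IP",
    "Connected": "local, connected IP addresses and networks",
    "NON-Velocloud Site": "non-SD-WAN connected network (NSD)",
    "N/A": "not documented; seems to be /32 addresses of local WAN interfaces",
}


def group_by_prefix(rows):
    """Group rows by prefix, keeping the order in which the table listed them.

    The table already sorts routes for the same prefix by priority,
    so the first row of each group is the preferred route.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prefix"]].append(row)
    return dict(grouped)


def describe(row):
    """Return the description for a row's type, or a fallback for unknown types."""
    return ROUTE_TYPES.get(row["type"], "unknown type")


if __name__ == "__main__":
    # Hypothetical sample rows, only for illustration.
    sample = [
        {"prefix": "192.1.9.0/24", "type": "Edge"},
        {"prefix": "192.1.9.0/24", "type": "Cloud"},
    ]
    for prefix, routes in group_by_prefix(sample).items():
        for position, route in enumerate(routes, start=1):
            print(prefix, position, route["type"], "-", describe(route))
```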


Now to the second new column, Lost Reason.

The Lost Reason column displays the codes for different reasons for the routes being lost to next preferred route, on Edges and Gateways.

Unfortunately, this is the only information in the documentation about that field.

There are around 50 different Lost Reason codes for Edges and around 25 similar, but not identical, codes for Gateways.

Let's try to debug some of those codes for a specific prefix: 192.1.9.0/24

Here is the OFC view, where you can also click to see the metrics

and now the route table dump for that prefix on BR-3 (with explanation)

or for another prefix

There still seem to be some bugs in the Lost Reason coding in version 4.5: in the next example, some valid prefixes do not show the best route as LR_NO_ELECTION.
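If you have many prefixes, this kind of inconsistency can be spotted with a few lines of Python. The sketch below assumes the route table has been parsed into dictionaries with the hypothetical keys `prefix` and `lost_reason`, and that the best route is expected to carry LR_NO_ELECTION as seen in the examples here. The second code in the sample data is a made-up placeholder, not a real Lost Reason code.

```python
def prefixes_without_best_route(rows, best_code="LR_NO_ELECTION"):
    """Return the prefixes for which no route carries the best-route code."""
    has_best = {}
    for row in rows:
        prefix = row["prefix"]
        # Remember every prefix we see, defaulting to "no best route yet".
        has_best.setdefault(prefix, False)
        if row["lost_reason"] == best_code:
            has_best[prefix] = True
    return [prefix for prefix, found in has_best.items() if not found]


if __name__ == "__main__":
    # Hypothetical sample rows; "LR_PLACEHOLDER" is NOT a real code.
    sample = [
        {"prefix": "192.1.9.0/24", "lost_reason": "LR_NO_ELECTION"},
        {"prefix": "192.1.9.0/24", "lost_reason": "LR_PLACEHOLDER"},
        {"prefix": "192.0.2.0/24", "lost_reason": "LR_PLACEHOLDER"},
    ]
    print(prefixes_without_best_route(sample))
```

Any prefix returned by this function is a candidate for the display problem described above and worth a closer look in the dump.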

I hope that all Lost Reason codes will be documented in one of the next versions (currently only VMware-internal documents exist), so that troubleshooting route selection gets easier.

It would also help if the routing protocol metrics could be displayed in that table (similar to what is visible in the OFC).

Another item on my wishlist is the ability to easily save such tables as .xls or .csv for further evaluation.
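Until such an export exists, a workaround is to convert the table yourself. The following is only a rough sketch using Python's standard `csv` module; it assumes you have already copied the rows into a list of dictionaries, and the column names used are hypothetical, not the real column headers of the diagnostic output.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of row dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample row and column names, only for illustration.
    sample = [{"prefix": "192.1.9.0/24", "type": "Edge"}]
    print(rows_to_csv(sample, ["prefix", "type"]))
```

The resulting text can be written to a file and opened in any spreadsheet program.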
