All about the 4.x SD WAN Routing Behaviour Part 1: Overlay Flow Control (OFC), Distributed Cost Calculation (DCC) and Lost Reason Codes
As already described in my blog post on lost OFC routes
https://sd-wahn.blogspot.com/2022/01/where-have-all-ofc-routes-gone-or-my.html
there were considerable changes in routing behaviour starting with version 3.4.0.
Excerpt from VMware SD-WAN Operator Guide Version 4.5
Configure Distributed Cost Calculation
By default, the Orchestrator is actively involved in learning the dynamic routes. VMware SD-WAN Edges and Gateways rely on the Orchestrator to calculate initial route preferences and return them to the Edge and Gateway. The Distributed Cost Calculation feature enables you to distribute the route cost calculation to the Edges and Gateways.
Note: Enabling Distributed Cost Calculation is recommended for all customers. This default method of involving the Orchestrator in both dynamic route calculation and the distribution of those routes to Edges and Gateways has drawbacks of significantly higher route convergence time and dependency on presence of VCO for routing updates.
When a customer enterprise uses Distributed Cost Calculation, the Orchestrator is no longer actively involved in the route preference calculation and instead routes are properly inserted in order by the Edge and Gateway instantly upon learning them and then convey these preferences to the Orchestrator.
When you choose to enable Distributed Cost Calculation for the Edges and Gateways, the feature provides the following benefits:
- Minimizes the impact on route learning when an Orchestrator is unreachable.
- Route convergence time is reduced from minutes to seconds in large networks with thousands of dynamic routes.
- Network delays are significantly reduced.
- Provides instantaneous Data Plane convergence.
- Supports enhanced re-ordering and pinning of routes on the Overlay Flow Control.
- Provides an option to refresh routes in the Overlay Flow Control page. Whenever there is a change in the Overlay Flow Control policy, the Refresh Routes option applies the changes to the existing routes immediately, without the need to restart the Edge or Gateway.
So now Edges and Gateways do their own calculation based on their local view of the routes. The OFC route table is therefore only a copy, a non-normative display of the learned routes, although the Orchestrator still forwards the configured preferences and route order to Edges and Gateways.
Whenever an administrator runs into problems, the only real reference for routing is the local routing table of the Edge or the Gateway.
NOTE:
This is not the common Linux routing table; a separate internal
routing daemon is used for this Underlay/Overlay routing.
However, when you run the remote diagnostic, the output contains some fields and information that are unfortunately not documented anywhere.
So let's start with the Type field:
Routes for the same prefix are sorted by priority and each route is marked with a type.
• Type = Cloud
  – Route to the VCG or offload directly to the interface
• Type = Edge
  – Routes learned from another VCE as a result of Cloud VPN being enabled
• Type = OSPF/BGP
  – Underlay routes that have the next-hop IP
• Type = Connected
  – Local, connected IP addresses and networks
• Type = NON-Velocloud Site
  – Network connected via a Non SD-WAN Destination (NSD)
• Type = N/A
  – ??? (seems to be the local /32 addresses of the WAN interfaces)
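The per-prefix ordering described above can be pictured with a small sketch. This is not how the Edge's routing daemon is implemented; the field names (`prefix`, `route_type`, `next_hop`, `preference`), the next-hop labels and the numeric preference values are all assumptions made for illustration, as is the convention that a lower number means a more preferred route.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    route_type: str   # one of the Type values listed above
    next_hop: str     # hypothetical label, not real diagnostic output
    preference: int   # hypothetical: lower value = more preferred


def group_and_sort(routes):
    """Group routes by prefix and order each group by preference."""
    table = defaultdict(list)
    for route in routes:
        table[route.prefix].append(route)
    return {
        prefix: sorted(entries, key=lambda r: r.preference)
        for prefix, entries in table.items()
    }


# Invented sample entries for a single prefix.
routes = [
    Route("192.1.9.0/24", "Edge", "peer-edge", 20),
    Route("192.1.9.0/24", "Cloud", "gateway", 30),
    Route("192.1.9.0/24", "OSPF/BGP", "underlay-router", 10),
]

table = group_and_sort(routes)
best = table["192.1.9.0/24"][0]
print(best.route_type)
```

The first entry per prefix is the one that would be used for forwarding; every later entry has lost to the one above it for some reason.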
Now to the second new column: Lost Reason.
The Lost Reason column displays the codes for different reasons for the routes being lost to next preferred route, on Edges and Gateways.
Unfortunately this is the only information in documentation about that field.
There are around 50 different Lost Reason codes for Edges and around 25 similar, but not identical, codes for Gateways.
Let's try to debug some of those codes for a specific prefix: 192.1.9.0/24
Here is the OFC view, where you can also click to see the metrics:
And now the route table dump for that prefix on BR-3 (with explanation):
Or for another prefix:
There still seem to be some bugs in the Lost Reason coding in version 4.5: as the next example shows, some valid prefixes do not show the best route as LR_NO_ELECTION.
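If the rows of a route table dump are available as records, a quick check for this symptom could look like the sketch below. It assumes that each row is a dict with `prefix` and `lost_reason` keys (hypothetical names, not the column labels of the diagnostic output) and that the best route of every prefix is expected to carry `LR_NO_ELECTION`. `LR_PLACEHOLDER` is a made-up stand-in for any other code, since the real codes are not publicly documented, and the second prefix is an invented example.

```python
from collections import defaultdict

BEST_CODE = "LR_NO_ELECTION"


def prefixes_without_best(rows, best_code=BEST_CODE):
    """Return the prefixes for which no row carries the best-route code."""
    has_best = defaultdict(bool)
    for row in rows:
        has_best[row["prefix"]] |= row["lost_reason"] == best_code
    return sorted(prefix for prefix, ok in has_best.items() if not ok)


# Invented sample rows; LR_PLACEHOLDER is not a real Lost Reason code.
rows = [
    {"prefix": "192.1.9.0/24", "lost_reason": "LR_NO_ELECTION"},
    {"prefix": "192.1.9.0/24", "lost_reason": "LR_PLACEHOLDER"},
    {"prefix": "198.51.100.0/24", "lost_reason": "LR_PLACEHOLDER"},
    {"prefix": "198.51.100.0/24", "lost_reason": "LR_PLACEHOLDER"},
]

suspicious = prefixes_without_best(rows)
print(suspicious)
```

Any prefix returned here has routes installed but none of them marked as the winner, which is the inconsistency described above.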
I hope that all Lost Reason codes will be documented in one of the next versions (currently they are only described in VMware-internal documents), so that troubleshooting route selection gets easier.
It would also help if the routing protocol metrics could be displayed in that table, similar to what is visible in the OFC.
Another thing on my wishlist would be the possibility to easily save such tables as .xls or .csv for further evaluation.
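Until such an export exists, rows copied by hand from the table could be turned into CSV with a few lines of Python. This is only a sketch using the standard `csv` module; the column names in `FIELDS` and the sample row are assumptions and do not claim to match the actual columns of the route table dump.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
FIELDS = ["prefix", "type", "next_hop", "lost_reason"]


def rows_to_csv(rows, fieldnames=FIELDS):
    """Serialise a list of row dicts to CSV text; unknown keys are ignored."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Invented sample row; "peer-a" is a placeholder next hop.
rows = [
    {
        "prefix": "192.1.9.0/24",
        "type": "Edge",
        "next_hop": "peer-a",
        "lost_reason": "LR_NO_ELECTION",
    },
]

csv_text = rows_to_csv(rows)
print(csv_text)
```

The resulting text can be written to a `.csv` file and opened in a spreadsheet for sorting and filtering.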