The incident in the datacenter was triggered by an overload of monitoring resources on a specific switch model (N9K-C93180YC-FX3). This switch was configured to handle multiple types of SPAN sessions (Switched Port Analyzer), which are used to monitor network traffic. However, the switch reached its hardware limit for handling these sessions, leading to a failure.
The fault (F3849) indicated that the SPAN limit was exceeded, triggering a reboot due to a known bug in the system. This bug was activated when we tried to set up a combination of monitoring sessions and then removed some of them. This action caused the bug to affect all similar switches in the network.
In simpler terms, the switch was asked to do more monitoring than it could handle, which led to a failure and reboot. The specific bug in the system was triggered by the way we configured and then changed the monitoring settings, causing widespread issues across all similar switches at the same time.