Alert Correlation Logic

Related alerts are correlated into incidents for visibility into high-level, actionable issues.


Understanding how BigPanda determines which events are correlated into an alert and which alerts are grouped together into incidents can help you configure and use BigPanda more effectively.

After merging events into alerts, BigPanda provides additional noise suppression and improved visibility by clustering highly related alerts into a single, high-level incident. For example, a connectivity problem may cause several checks on the same host to enter a critical state. All of these alerts are clustered into a single incident so that you can see up-to-date information from each check on the same timeline. BigPanda uses correlation patterns to define relationships between alerts and applies pattern recognition to dynamically cluster alerts into incidents.

BigPanda can correlate a maximum of 300 alerts into a single incident.

Correlation patterns define the relationships between alerts by using the following parameters:

  • Source Systems - the integrated monitoring systems for which the pattern applies. For example, show alerts that come from Nagios, Datadog, etc.
  • Tags - the properties that indicate when alerts are related. For example, correlate all alerts that come from the same cluster and have the same check.

👍

Tags

To learn more about a tag, you can view an alert in the BigPanda UI, reference the documentation on standard tags, or review the custom tags defined for your organization.

  • Time window - The amount of time between when the alerts started. For example, network-related alerts may start within a short time from one another, while load issues may develop over a longer period of time.
  • Filter - (optional) The conditions that further refine which alerts this relationship applies to. For example, correlate only network-related alerts by data center.
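Taken together, these parameters can be pictured as a small configuration object. The sketch below is illustrative only; the field names (`source_systems`, `tags`, `time_window_minutes`, `filter`) are assumptions for this example, not BigPanda's actual schema:

```python
# Illustrative correlation pattern; field names are assumptions,
# not BigPanda internals.
correlation_pattern = {
    "source_systems": ["nagios", "datadog"],  # integrations the pattern applies to
    "tags": ["cluster", "check"],             # alerts must share these tag values
    "time_window_minutes": 30,                # alerts must start within this window
    "filter": None,                           # optional: further restrict matching alerts
}

def tags_match(pattern, alert_a, alert_b):
    """Two alerts satisfy the pattern's tag criteria when every
    correlation tag has the same value on both alerts."""
    return all(alert_a.get(t) == alert_b.get(t) for t in pattern["tags"])
```

For example, two alerts with `cluster: db` and `check: CPU` satisfy this pattern's tag criteria, while an alert from a different cluster does not.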

The default correlation patterns are:

  • Same primary property, started within a two-hour time window - Identifies related alerts from the same primary object, such as a host, application, or service. For example, several alerts on the same host with different checks may be related to the same problem.
  • Same cluster, started within a 30-minute time window - Identifies when different objects within the same topological area of your infrastructure may be experiencing the same problem. For example, high CPU alerts on several servers in your MySQL cluster.

📘

If multiple correlation patterns match an incident, the pattern with the longest time window is the one that appears in the UI.

Merge Alerts into an Incident from a Single Host

A single incident in BigPanda can contain one or more alerts. Alerts are merged into the same incident if they have the same application key and primary and secondary properties. The current status and properties of the incident in BigPanda represent the most recent alert, which is determined by the timestamp property.

In the following example, these two alerts would be merged into a single incident with a status of Warning and the description CPU is above warning limit (40%).

# First alert
{
  "status": "critical",
  "host": "production-database-1",
  "timestamp": 1492452456,  # 17 Apr 2017 18:07:36 GMT
  "check": "CPU overloaded",
  "description": "CPU is above upper limit (70%)"
}

# Second alert
{
  "status": "warning",
  "host": "production-database-1",
  "timestamp": 1492452578,  # 17 Apr 2017 18:09:38 GMT
  "check": "CPU overloaded",
  "description": "CPU is above warning limit (40%)"
}
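The merge rule can be sketched as follows. The grouping key and field names below are simplified assumptions for illustration, not BigPanda's internal representation:

```python
# Sketch of the merge rule: alerts with the same application key and
# primary/secondary properties (here host and check) share one incident,
# and the incident reflects the alert with the latest timestamp.
first_alert = {
    "status": "critical",
    "host": "production-database-1",
    "timestamp": 1492452456,
    "check": "CPU overloaded",
    "description": "CPU is above upper limit (70%)",
}
second_alert = {
    "status": "warning",
    "host": "production-database-1",
    "timestamp": 1492452578,
    "check": "CPU overloaded",
    "description": "CPU is above warning limit (40%)",
}

def merge_key(alert, app_key="example-app-key"):
    # application key + primary property (host) + secondary property (check)
    return (app_key, alert["host"], alert["check"])

def incident_state(alerts):
    """The incident's status and properties come from the most recent
    alert, determined by the timestamp property."""
    return max(alerts, key=lambda a: a["timestamp"])

assert merge_key(first_alert) == merge_key(second_alert)  # same incident
assert incident_state([first_alert, second_alert])["status"] == "warning"
```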

Send Multiple Alerts with the Alerts REST API

BigPanda uses the timestamp property to determine the latest status of an incident. If a timestamp is not included, BigPanda uses the time when the alert was first received. To ensure that BigPanda accurately reflects the most current status when you send multiple alerts in one request, include a timestamp for each alert or sort the alerts array in ascending order by when the alerts occurred.
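A minimal sketch of preparing such a payload, assuming each alert carries a Unix timestamp. The `alerts` key and payload shape here are illustrative, not a definitive rendering of the API's request body:

```python
import time

def prepare_payload(alerts):
    """Attach a timestamp to any alert that lacks one, then sort the
    array in ascending order so the last element reflects the current
    status. Payload shape is an assumption for illustration."""
    now = int(time.time())
    for alert in alerts:
        alert.setdefault("timestamp", now)
    alerts.sort(key=lambda a: a["timestamp"])
    return {"alerts": alerts}
```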

Apply Pattern Recognition

To achieve dynamic clustering, BigPanda keeps track of all the patterns that match an incident and evaluates them in real time. When a new alert is received, BigPanda evaluates it against the patterns of any active incidents whose start time windows are still open. If the alert matches a pattern for an existing incident, it is added as a related alert, and any patterns that no longer match all of the related alerts are removed from the incident's set of matching patterns. If the alert doesn't match an existing incident, a new incident is created with every pattern that matches the alert. The incident title is determined by the matching pattern with the widest time window.

As new alerts are received, the process continues, and incidents become better defined as more information becomes available. After the time windows of all matching patterns have elapsed, no new alerts are added to the incident. The incident's life cycle is determined by the alerts it contains (see Incident Life Cycle Logic), and the incident remains open until all related alerts are resolved.
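The dynamic clustering behavior can be approximated with a small routing loop. This is a hedged sketch, not BigPanda's implementation; the dict-based incidents, the `window` measured in seconds, and the `pattern_matches` helper are all assumptions for illustration:

```python
# Hypothetical sketch of dynamic clustering.
def pattern_matches(pattern, alerts):
    """A pattern matches when all alerts share the same value for each
    correlation tag and started within the pattern's time window."""
    starts = [a["timestamp"] for a in alerts]
    if max(starts) - min(starts) > pattern["window"]:
        return False
    return all(len({a.get(t) for a in alerts}) == 1 for t in pattern["tags"])

def route_alert(alert, incidents, patterns):
    for incident in incidents:
        # Keep only patterns that still match every related alert,
        # including the new one
        still = [p for p in incident["patterns"]
                 if pattern_matches(p, incident["alerts"] + [alert])]
        if still:
            incident["alerts"].append(alert)
            incident["patterns"] = still
            return incident
    # No existing incident matches: open a new one with every pattern
    # that matches the alert on its own
    new = {"alerts": [alert],
           "patterns": [p for p in patterns if pattern_matches(p, [alert])]}
    incidents.append(new)
    return new
```

Routing three alerts from the same cluster but different hosts through this loop shows the pattern set narrowing: once a second host appears, a same-host pattern stops matching all related alerts and is dropped, while a same-cluster pattern survives.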

Alert Correlation Steps

Once normalized and enriched, the alert begins the Alert Correlation process:

Check for Matching Alerts

First, BigPanda checks to see if the new event matches an existing alert in an incident. The system checks the event incident key to determine if the event is a match.

If the event properties match an alert in an active or recently resolved incident, the event is added to that incident as an alert.

If the new event changes the alert's status (for example, from Warning to Critical), the tag values are merged into the last event.

If the event does not match an existing alert, the next check is performed.

Check for Matching Correlation Patterns

BigPanda checks to see if the event matches any active correlation patterns. If the event matches, it is added as an alert to an existing incident. If more than one correlation pattern matches, the pattern with the largest correlation window is selected. The incident’s active correlation patterns are updated to include only patterns that apply to all active alerts.

If the event does not match an active correlation pattern, a new incident is created.

Correlation Patterns Are Updated

The incident’s active correlation pattern matches are updated to include only the patterns that apply to all active alerts. Any of the incident’s pattern matches that are no longer in the correlation window are deactivated.

📘

It’s possible for an alert to be added to an incident even if it does not match some of the active matched patterns. When this happens, the patterns that do not match are deactivated. Deactivated patterns remain attached to the incident, but alerts are no longer correlated into the incident based on these matches.

The Incident Title is Updated

The incident title is updated based on the active matched pattern with the largest correlation window.

If the incident has only one entity, the title is generated based on the primary and secondary properties of the active entity.

If the incident has multiple active entities, the title is generated based on the correlation tags of the pattern with the largest correlation window that matches all of the active entities.

Incident titles can change as alerts join the incident and change status.
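The title selection described above can be sketched as a small function. The field names (`primary`, `secondary`) and pattern shape are assumptions for illustration, not BigPanda internals:

```python
# Illustrative sketch of the title logic described above.
def incident_title(entities, active_patterns):
    if len(entities) == 1:
        e = entities[0]
        # Single entity: title from its primary and secondary properties
        return f"{e['primary']} · {e['secondary']}"
    # Multiple entities: title from the correlation tags of the
    # active matched pattern with the largest correlation window
    widest = max(active_patterns, key=lambda p: p["window"])
    sample = entities[0]
    return " · ".join(f"{t}: {sample[t]}" for t in widest["tags"])
```

For two hosts correlated by cluster and check, this sketch yields a main title like `cluster: database · check: CPU`, matching the form shown above.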

The Incident Status is Updated

The incident status is updated based on the status of the most severe active entity.
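The status rule amounts to taking the maximum severity across active entities. The severity ordering below is an assumption for this sketch:

```python
# Sketch of the status rule: the incident takes the status of the most
# severe active entity. The severity ordering is an assumption.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}

def incident_status(active_entities):
    return max((e["status"] for e in active_entities), key=SEVERITY.get)
```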

Incident Titles

Incident titles give you insight into the scope and impact of an incident. The titles are dynamically generated based on how the alerts are correlated and how many subjects are affected.

Key Features

Descriptive incident titles provide insight into:

  • Incident impact—What part of the infrastructure is affected by the incident. For example, see whether an entire cluster is down or just a single host.
  • Correlation logic—Why the alerts are grouped together, which can help you investigate the problem.
  • Related alerts—A summary of the individual alerts that the incident contains.

How It Works

Incident titles are generated based on the following logic:

  • Main title—shows why the alerts are correlated into an incident. For example, if an incident correlates all CPU alerts on the database cluster, the main title is: cluster: **database** · check: **CPU**
  • Subtitle—summarizes the subjects that are part of the incident. For example, if the incident is correlated by a cluster, the subtitle lists alerting hosts in the cluster. If the incident is correlated by host, the subtitle lists alerting checks. The subtitle includes the following elements:
    • Counter - number of unique subjects that are affected.
    • Type - type of subjects (for example, host or check).
    • Time period - amount of time during which the alerts were first triggered.
    • Examples - list of unique subjects. The subtitle shows as many examples as can fit, depending on the screen size.
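Assembling those subtitle elements can be sketched as follows; the exact wording and truncation behavior are assumptions for illustration:

```python
# Hypothetical sketch of subtitle assembly from the elements above:
# counter, subject type, time period, and a few examples.
def subtitle(subjects, subject_type, minutes, max_examples=3):
    unique = sorted(set(subjects))
    examples = ", ".join(unique[:max_examples])
    return f"{len(unique)} {subject_type}s in {minutes}m: {examples}"
```

For instance, four checks alerting on one host within a minute would render as `4 checks in 1m: cpu, disk, ping`, mirroring the example incidents below.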

Examples of Incident Titles

Checks Correlated by Host

In the example incident below, the main title indicates that all alerts belong to the api12.nyc.acme.com host. The subtitle indicates that 4 checks are alerting on the host and all of the checks were triggered within 1 minute of each other. Also, the subtitle lists a few examples of the checks.

Hosts Correlated by Cluster and Check

In the example incident below, the main title indicates that all alerts have a check of CPU load and they all occurred within the billing cluster. The subtitle indicates that a total of 6 hosts are involved in the incident. The alerts were triggered within 22 minutes of each other. Also, the subtitle lists a few examples of the hosts.

Ping Alerts Correlated Within a Data Center

This example demonstrates a type of incident that commonly occurs when a network switch fails. In the example below, the main title indicates that the Dublin data center is experiencing a ping issue. The subtitle indicates that a total of 243 hosts reported the problem over a time span of 4 minutes. Also, the subtitle lists a few examples of the hosts.

Examples of Alert Correlation

Multi-Node MySQL Cluster Experiencing Load Issues

In this example, six different nodes in the same MySQL cluster began sending load-related alerts within 29 minutes of each other. Some of the nodes recovered momentarily before returning to a critical state, while others remained critical or in a warning state. BigPanda grouped more than 75 critical and recovery events into a single incident and displays them on a single timeline that makes it easy to spot which alerts occurred first and which nodes alerted later.

[Figure: Multi-Node Correlation Timeline]

Multiple Flapping Alerts For A Single Application

In this example, two different alerts came in for the same application. The alerts then proceeded to resolve and reopen in rapid succession. When an alert is changing states frequently, or flapping, it may generate numerous events that are not immediately actionable. BigPanda grouped each of these potential notifications into one ongoing incident to maximize visibility without overwhelming your team with duplicate notifications.

[Figure: Flapping Incidents Timeline]

To learn more about when incidents are resolved, reopened, or considered in a flapping state, see our Incident Life Cycle Logic documentation.

Next Steps

  • Learn more about Managing Alert Correlation
  • Dig deeper into Managing Incident Enrichment
  • Learn more about Incident Life Cycle Logic