Alert Correlation Logic

BigPanda ingests the raw event data from your monitoring systems and merges the events into alerts so that you can visualize the life cycle of a detected issue over time. Then, related alerts are correlated into incidents for visibility into high-level, actionable issues.

Understanding how BigPanda determines which events are correlated into an alert and which alerts are grouped together into incidents can help you configure and use BigPanda more effectively, particularly if you are using the Alerts REST API to develop a custom integration or the correlation editor to modify a correlation pattern.

Primary and Secondary Properties

Internally, BigPanda designates certain properties of each alert as its primary and secondary properties. These two properties are used for several purposes:

  • During correlation, BigPanda uses both properties to identify which events are part of the same alert.
  • In the default correlation pattern, BigPanda uses the primary property to determine if alerts are related to each other.
  • In the UI, BigPanda uses the primary property to construct the title and the secondary property to construct the subtitle of an incident.

Note: The secondary property is optional. If the alert does not contain a value for the secondary property, BigPanda uses only the primary property to process the alert.

For example, in the payload below, the host property is considered primary, and the check property is considered secondary.

{
  "app_key": "123",
  "status": "critical",
  "host": "production-database-1",
  "timestamp": 1402302570,
  "check": "CPU overloaded",
  "description": "CPU is above upper limit (70%)",
  "cluster": "production-databases",
  "my_unique_attribute": "my_unique_value"
}
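
As a rough illustration of how these properties identify an alert, the sketch below builds a lookup key from the application key and the primary and secondary properties. The function name `alert_key` and the constants are hypothetical, and the code assumes that `host` and `check` play the primary and secondary roles, as in the payload above. It is not BigPanda's internal implementation.

```python
from typing import Optional, Tuple

# Assumption for this sketch: "host" is the primary property and "check" is
# the secondary property, as in the example payload.
PRIMARY_PROPERTY = "host"
SECONDARY_PROPERTY = "check"


def alert_key(event: dict) -> Tuple[str, str, Optional[str]]:
    """Return the tuple that decides which alert an event belongs to."""
    return (
        event["app_key"],
        event[PRIMARY_PROPERTY],
        event.get(SECONDARY_PROPERTY),  # optional, so it may be None
    )


event = {
    "app_key": "123",
    "status": "critical",
    "host": "production-database-1",
    "timestamp": 1402302570,
    "check": "CPU overloaded",
}
print(alert_key(event))
```

Two events that produce the same key would be treated as updates to the same alert. An event without a `check` value is keyed on the application key and the primary property alone.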

Processing Events in the Alert Life Cycle

In BigPanda, an alert represents the current state of a specific sensor in the source monitoring system. Every alert has its own life cycle: it starts at some point, ends at another, and occasionally flaps between states. For example, a CPU load alert may start with a warning event, then increase in severity with a critical event, and finally get resolved with an ok event. BigPanda uses deduping and merging to process events so that you can see the life cycle of an alert as it unfolds.

Event updates are displayed in the Incident Timeline. Each row in the timeline represents an individual alert, and each dot represents an event in the life cycle of that alert.

Event Deduping

Also known as event deduplication and event marshalling, deduping is the process by which BigPanda eliminates redundant data to reduce noise and simplify incident investigation. The following three scenarios can occur if BigPanda receives two or more events with similar payloads.

| Scenario | Action |
|---|---|
| The event payload (including the application key, timestamp, and primary and secondary properties) exactly matches an event that was already received. | The event is dropped. |
| The timestamp (or any other value in the event payload) has changed, but its status (ok/warning/critical) has not changed. | The event is merged with the previous event, updating the tag values from the new event. |
| The event payload's status has changed from the previous event. | The event is added. |
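
The three scenarios can be sketched as a small decision function. This is a minimal illustration that assumes events are plain dictionaries and that `previous` is the last event recorded for the same alert. The function name `dedupe` and the handling of the very first event are assumptions, not documented behavior.

```python
def dedupe(previous, incoming):
    """Classify an incoming event against the last event of the same alert.

    Returns "dropped", "merged", or "added".
    """
    if previous is None:
        # Assumption: the first event seen for an alert is always added.
        return "added"
    if incoming == previous:
        # Exact duplicate of an event that was already received.
        return "dropped"
    if incoming["status"] == previous["status"]:
        # Same status, other values changed: refresh the stored tag values.
        previous.update(incoming)
        return "merged"
    # The status changed, so this is a new point in the alert life cycle.
    return "added"


last = {"app_key": "123", "host": "db-1", "check": "CPU",
        "status": "critical", "timestamp": 100}
print(dedupe(last, dict(last)))                              # exact duplicate
print(dedupe(last, dict(last, timestamp=160)))               # same status
print(dedupe(last, dict(last, status="ok", timestamp=200)))  # status change
```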

Merging Alerts into an Incident from a Single Host

A single incident in BigPanda can contain one or more alerts. Alerts are merged into the same incident if they have the same application key and primary and secondary properties. The current status and properties of the incident in BigPanda represent the most recent alert, which is determined by the timestamp property.

In the following example, these two alerts would be merged into a single incident with a status of Warning and the description CPU is above warning limit (40%).

# First alert (timestamp: 17 Apr 2017 18:07:36 GMT)
{
  "status": "critical",
  "host": "production-database-1",
  "timestamp": 1492452456,
  "check": "CPU overloaded",
  "description": "CPU is above upper limit (70%)"
}

# Second alert (timestamp: 17 Apr 2017 18:09:38 GMT)
{
  "status": "warning",
  "host": "production-database-1",
  "timestamp": 1492452578,
  "check": "CPU overloaded",
  "description": "CPU is above warning limit (40%)"
}
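
To make the "most recent alert wins" rule concrete, the following sketch picks the payload with the highest timestamp from the two alerts above. The helper name `current_state` is hypothetical, and the code only illustrates the rule as described here. It does not reproduce BigPanda's internal logic.

```python
first_alert = {
    "status": "critical",
    "host": "production-database-1",
    "timestamp": 1492452456,
    "check": "CPU overloaded",
    "description": "CPU is above upper limit (70%)",
}
second_alert = {
    "status": "warning",
    "host": "production-database-1",
    "timestamp": 1492452578,
    "check": "CPU overloaded",
    "description": "CPU is above warning limit (40%)",
}


def current_state(alerts):
    """Return the alert with the latest timestamp, whatever the arrival order."""
    return max(alerts, key=lambda alert: alert["timestamp"])


# The arrival order is deliberately reversed to show that only the
# timestamp decides which alert is the most recent.
latest = current_state([second_alert, first_alert])
print(latest["status"], "-", latest["description"])
```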

Important: Sending Multiple Alerts with the Alerts REST API

BigPanda uses the timestamp to determine the latest status of an incident. If the timestamp is not included, BigPanda uses the time when the alert was first received. To ensure that BigPanda accurately reflects the most current status when you send multiple alerts in one request, either include the timestamp for each alert or sort the array of alerts in ascending order by when the alerts occurred.
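
A custom integration could guard against ordering problems before it sends a batch. The sketch below is one possible approach, using a hypothetical helper named `sort_for_sending`. It applies both safeguards at once (every alert must carry a timestamp, and the list is sorted oldest first), which is stricter than the requirement above. Building and posting the actual request is left out.

```python
def sort_for_sending(alerts):
    """Return the alerts sorted oldest first, refusing any without a timestamp."""
    missing = [alert for alert in alerts if "timestamp" not in alert]
    if missing:
        raise ValueError(f"{len(missing)} alert(s) have no timestamp")
    # sorted() is stable, so alerts with equal timestamps keep their order.
    return sorted(alerts, key=lambda alert: alert["timestamp"])


batch = [
    {"host": "production-database-1", "check": "CPU overloaded",
     "status": "warning", "timestamp": 1492452578},
    {"host": "production-database-1", "check": "CPU overloaded",
     "status": "critical", "timestamp": 1492452456},
]
for alert in sort_for_sending(batch):
    print(alert["timestamp"], alert["status"])
```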

Clustering Alerts into Incidents

After merging events into alerts, BigPanda provides additional noise suppression and improved visibility by clustering highly related alerts into a single, high-level incident. For example, a connectivity problem may cause several checks on the same host to enter a critical state. All of these alerts are clustered into a single incident, so that you can see up-to-date information from each check on the same timeline. BigPanda uses correlation patterns to define relationships between alerts and applies pattern recognition to dynamically cluster alerts into incidents.

BigPanda can correlate a maximum of 300 alerts into a single incident.

Defining Correlation Patterns

Correlation patterns define the relationships between alerts by using the following parameters:

  • Source systems - the integrated monitoring systems to which the pattern applies. For example, apply the pattern only to alerts that come from Nagios or Datadog.
  • Tags - the properties that indicate when alerts are related. For example, correlate all alerts that come from the same cluster and have the same check.
  • Time window - the maximum amount of time between the start times of the alerts. For example, network-related alerts may start within a short time of one another, while load issues may develop over a longer period of time.
  • Filter - (optional) the conditions that further refine which alerts this relationship applies to. For example, correlate only network-related alerts by data center.

Tip: Tags

To learn more about a tag, you can view an alert in the BigPanda UI, reference the documentation on standard tags, or review the custom tags defined for your organization.

The default correlation patterns are:

  • Same primary property, started within a 2-hour time window - Identifies related alerts from the same primary object, such as a host, application, or service. For example, several alerts on the same host with different checks may be related to the same problem.
  • Same cluster, started within a 30-minute time window - Identifies when different objects within the same topological area of your infrastructure may be experiencing the same problem. For example, high CPU alerts on several servers in your MySQL cluster.
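
The two default patterns can be modeled as data to see how tags and time windows interact. The sketch below is illustrative only: the class `CorrelationPattern`, the function `related`, and the `start` field are hypothetical names, `host` stands in for the primary property, and the source system and filter parameters are left out for brevity.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CorrelationPattern:
    tags: Tuple[str, ...]  # tags whose values must be equal on both alerts
    window_seconds: int    # maximum gap between the alerts' start times


# Default patterns as described above ("host" stands in for the primary property).
SAME_PRIMARY = CorrelationPattern(tags=("host",), window_seconds=2 * 60 * 60)
SAME_CLUSTER = CorrelationPattern(tags=("cluster",), window_seconds=30 * 60)


def related(pattern, a, b):
    """True if alerts a and b share every tag value and started close enough."""
    same_tags = all(tag in a and a.get(tag) == b.get(tag) for tag in pattern.tags)
    in_window = abs(a["start"] - b["start"]) <= pattern.window_seconds
    return same_tags and in_window


cpu_db1 = {"host": "db-1", "cluster": "mysql", "check": "CPU", "start": 0}
cpu_db2 = {"host": "db-2", "cluster": "mysql", "check": "CPU", "start": 20 * 60}
disk_db1 = {"host": "db-1", "cluster": "mysql", "check": "Disk", "start": 90 * 60}

print(related(SAME_CLUSTER, cpu_db1, cpu_db2))   # same cluster, 20 minutes apart
print(related(SAME_PRIMARY, cpu_db1, disk_db1))  # same host, 90 minutes apart
print(related(SAME_CLUSTER, cpu_db1, disk_db1))  # same cluster, outside 30 minutes
```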

Applying Pattern Recognition

To achieve dynamic clustering, BigPanda keeps track of all the patterns that match an incident, and matching patterns are evaluated in real time. When a new alert is received, BigPanda evaluates it against any patterns for active incidents that are within the start time window. If the alert matches a pattern for an existing incident, it is added as a related alert, and any patterns that no longer match all of the related alerts are removed from the incident's set of matching patterns. If the alert doesn't match an existing incident, a new incident is created with any patterns that match the alert. The incident title is determined by the matching pattern with the widest time window.

As new alerts are received, the process continues, and incidents become better defined as more information is available. After the time windows of matching patterns have elapsed, no new alerts are added to the incident. The life cycle of an incident is determined by the alerts it contains, and the incident remains open until all related alerts are resolved.
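
The clustering loop described above can be approximated in a few lines. This is a simplified sketch under several assumptions: patterns are plain dictionaries with hypothetical `tags` and `window` keys, the time window is measured from the first alert in the incident, and source systems, filters, and the 300-alert limit are ignored. It is meant to show how the set of matching patterns narrows over time, not to mirror the production algorithm.

```python
class Incident:
    def __init__(self, alert, patterns):
        self.alerts = [alert]
        self.start = alert["start"]
        # Start with every pattern whose tags are all present on the first alert.
        self.patterns = [p for p in patterns if all(t in alert for t in p["tags"])]

    def matching(self, alert):
        """Patterns that would still match every alert if `alert` joined."""
        first = self.alerts[0]
        return [
            p for p in self.patterns
            if abs(alert["start"] - self.start) <= p["window"]
            and all(alert.get(t) == first[t] for t in p["tags"])
        ]

    @property
    def is_open(self):
        # The incident stays open until all related alerts are resolved.
        return any(a["status"] != "ok" for a in self.alerts)


def ingest(alert, incidents, patterns):
    """Add the alert to the first incident it matches, or open a new one."""
    for incident in incidents:
        surviving = incident.matching(alert)
        if surviving:
            incident.alerts.append(alert)
            incident.patterns = surviving  # drop patterns that no longer match
            return incident
    incident = Incident(alert, patterns)
    incidents.append(incident)
    return incident


patterns = [
    {"name": "same host", "tags": ["host"], "window": 2 * 60 * 60},
    {"name": "same cluster", "tags": ["cluster"], "window": 30 * 60},
]
incidents = []
ingest({"host": "db-1", "cluster": "db", "status": "critical", "start": 0}, incidents, patterns)
ingest({"host": "db-2", "cluster": "db", "status": "critical", "start": 600}, incidents, patterns)
print(len(incidents), [p["name"] for p in incidents[0].patterns])
```

In this run the second alert shares only the cluster with the first, so the incident keeps the cluster pattern and the host pattern is eliminated.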

Incident Titles

Incident titles give you insight into the scope and impact of an incident. The titles are dynamically generated based on how the alerts are correlated and how many subjects are affected.

Key Features

Descriptive incident titles provide insight into:

  • Incident impact—What part of the infrastructure is affected by the incident. For example, see whether an entire cluster is down or just a single host.
  • Correlation logic—Why the alerts are grouped together, which can help you investigate the problem.
  • Related alerts—A summary of the individual alerts that the incident contains.

How It Works

Incident titles are generated based on the following logic:

  • Main title - shows why the alerts are correlated into an incident. For example, if an incident correlates all CPU alerts on the database cluster, the main title is: cluster: database · check: CPU
  • Subtitle - summarizes the subjects that are part of the incident. For example, if the incident is correlated by a cluster, the subtitle lists alerting hosts in the cluster. If the incident is correlated by host, the subtitle lists alerting checks. The subtitle includes the following elements:
    • Counter - number of unique subjects that are affected.
    • Type - type of subjects (for example, host or check).
    • Time period - amount of time during which the alerts were first triggered.
    • Examples - list of unique subjects. The subtitle shows as many examples as can fit, depending on the screen size.
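
The title logic can be illustrated with a short function that assembles a main title and a subtitle from a list of correlated alerts. The function `build_title`, its parameters, the `start` field, and the exact wording of the subtitle are assumptions made for this sketch. The real UI formats the text differently and fits the number of examples to the screen size.

```python
def build_title(alerts, correlated_by, subject_tag, max_examples=3):
    """Return (main_title, subtitle) for alerts correlated on the given tags."""
    first = alerts[0]
    # Main title: the tag values that every alert in the incident shares.
    main = " · ".join(f"{tag}: {first[tag]}" for tag in correlated_by)

    # Unique subjects in first-seen order (dict keys preserve insertion order).
    subjects = list(dict.fromkeys(alert[subject_tag] for alert in alerts))
    starts = [alert["start"] for alert in alerts]
    minutes = (max(starts) - min(starts)) // 60
    noun = subject_tag if len(subjects) == 1 else subject_tag + "s"
    subtitle = (
        f"{len(subjects)} {noun} within {minutes} min: "
        + ", ".join(subjects[:max_examples])
    )
    return main, subtitle


alerts = [
    {"cluster": "database", "check": "CPU", "host": "db-1", "start": 0},
    {"cluster": "database", "check": "CPU", "host": "db-2", "start": 120},
    {"cluster": "database", "check": "CPU", "host": "db-1", "start": 300},
]
main, subtitle = build_title(alerts, correlated_by=["cluster", "check"], subject_tag="host")
print(main)
print(subtitle)
```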

Examples of Incident Titles

Checks Correlated by Host

In this example incident, the main title indicates that all alerts belong to the api12.nyc.acme.com host. The subtitle indicates that 4 checks are alerting on the host and that all of the checks were triggered within 1 minute of each other. The subtitle also lists a few examples of the checks.

Hosts Correlated by Cluster and Check

In this example incident, the main title indicates that all alerts have a check of CPU load and that they all occurred within the billing cluster. The subtitle indicates that a total of 6 hosts are involved in the incident and that the alerts were triggered within 22 minutes of each other. The subtitle also lists a few examples of the hosts.

Ping Alerts Correlated Within a Data Center

This example demonstrates a type of incident that commonly occurs when a network switch fails. In this example, the main title indicates that the Dublin data center is experiencing a ping issue. The subtitle indicates that a total of 243 hosts reported the problem over a time span of 4 minutes. The subtitle also lists a few examples of the hosts.

Learn more...

To learn more about how BigPanda uses pattern recognition to cluster alerts into meaningful, actionable incidents, see our Algorithmic Correlation user guide.

To learn more about defining and managing correlation patterns, see our Working with Correlation Patterns guide.

To learn more about when incidents are resolved, reopened, or considered in a flapping state, see our Incident Life Cycle Logic guide.