Algorithmic Correlation

Correlation is a method of grouping highly related alerts into a single, high-level incident. BigPanda uses pattern recognition to automatically process the data generated by your monitoring systems and to dynamically cluster alerts into meaningful, actionable incidents. The default correlation patterns provide excellent noise suppression and get you up and running quickly. In addition, you can add and modify patterns to increase the effectiveness of the alert correlation for your organization.

Benefits of Correlation

By automating alert correlation with BigPanda, you gain:

  • Improved detection—find critical issues faster. During outages, massive alert volumes are intelligently clustered into incidents, so important alerts stand out from the noise and you can stay focused on the main issue.

  • Faster remediation—get the full context of an incident, instead of just one data point. For example, you can quickly learn that the entire MongoDB cluster is having a multitude of disk issues, instead of analyzing an isolated DISK IO alert.

  • Better productivity—reduce the number of tickets that operators have to handle, thereby improving their ability to effectively manage emergency situations.

  • More control—customize how alerts are correlated to improve accuracy and efficiency. The correlation logic is fully transparent and easily configurable, without writing any code and with only a handful of high-level patterns.

How It Works

BigPanda ingests the raw event data from your monitoring systems, such as Nagios, CloudWatch, and systems integrated via the Alerts API. The data is normalized into standard tags and enriched with configuration information, operational categories, and other custom tags. Then, the BigPanda alert correlation engine merges the events into alerts and clusters the alerts into high-level, actionable incidents by evaluating the properties against patterns in:

  • Topology—the host, host group, service, application, cloud, or other infrastructure element that emits the alerts. Alerts are more likely to be related when they come from the same area in your infrastructure.

  • Time—the rate at which related alerts occur. Alerts occurring around the same time are more likely to be related than alerts occurring far apart.

  • Context—the type of alerts. Some alert types imply a relationship between them, while others don’t.

As new alerts are received, BigPanda evaluates all matching patterns and determines whether to update an existing incident or create a new one. This algorithm lets BigPanda correlate alerts accurately and reduce your monitoring noise by as much as 90–99% in some environments. Correlations occur in under 100 ms, so you see updates in real time for maximum visibility into critical problems.

To learn more about merging events into alerts and clustering alerts into incidents, see Alert Correlation Logic. To learn about when incidents are resolved, reopened, or considered in a flapping state, see Incident Life Cycle Logic.

Custom Correlation Patterns

Correlation patterns are high-level definitions that determine how alerts are clustered into BigPanda incidents. To increase the effectiveness of your alert correlation, you can customize the correlation pattern definitions in BigPanda based on the structure and processes of your company's production infrastructure. For example, you can create patterns that correlate:

  • Network-related connectivity issues within the same data center.
  • Application-specific checks on the same host.
  • Load-related alerts from multiple servers in the same database cluster.
  • Low memory alerts on a distributed cache.

To learn how patterns are applied, see Alert Correlation Logic. To configure custom correlation patterns, see Defining Correlation Patterns.

Examples Of Alert Correlation

Multi-Node MySQL Cluster Experiencing Loads

In the example timeline shown below, each row represents a separate alert for a node in the same MySQL cluster. From the incident summary at the top of the timeline, you can see that six different hosts began sending load-related alerts within 29 minutes of each other. The timeline shows that some of the nodes recover momentarily before returning to a critical state. In this example, BigPanda grouped more than 75 critical and recovery events into a single incident.

Connectivity Issues Escalate into Multiple Service Failures For A Single Host

In this example, BigPanda effectively grouped the various alerts for a single host into one incident, despite time differences between events. From the timeline shown below, you can see how the connectivity issues escalated into multiple, related service failures.

Multiple Flapping Alerts For A Single Application

When an alert is changing states frequently, or flapping, it may generate numerous events that are not immediately actionable. In the example timeline shown below, you can see how hundreds of potential notifications are grouped into one incident for the application that is flapping.

Alert Correlation Logic

BigPanda ingests the raw event data from your monitoring systems and merges the events into alerts so that you can visualize the life cycle of a detected issue over time. Then, related alerts are correlated into incidents for visibility into high-level, actionable problems. Understanding how BigPanda processes your monitoring data can help you configure and use BigPanda more effectively, in particular if you are using the REST API to develop a custom integration or the correlation editor to modify a correlation pattern.

Primary and Secondary Properties

Internally, BigPanda considers certain properties of each alert as the primary and secondary properties. For example, in the payload below, the host property is considered primary, and the check property is considered secondary. These two properties are used for various purposes, including:

  • During correlation, BigPanda uses both properties to identify which events are part of the same alert.

  • In the default correlation pattern, BigPanda uses the primary property to determine if alerts are related to each other.

  • In the UI, BigPanda uses the primary property to construct the title and the secondary property to construct the subtitle of an incident.

The secondary property is optional. If the alert does not contain a value for the secondary property, BigPanda uses only the primary property to process the alert.

{
  "app_key": "123",
  "status": "critical",
  "host": "production-database-1",
  "timestamp": 1402302570,
  "check": "CPU overloaded",
  "description": "CPU is above upper limit (70%)",
  "cluster": "production-databases",
  "my_unique_attribute": "my_unique_value"
}
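
As a rough illustration of how the primary and secondary properties identify an alert, the sketch below builds an identity key from a payload. The function name `alert_identity` and the constants are hypothetical, and treating `host` and `check` as the primary and secondary properties simply follows the example payload; this is not BigPanda's internal implementation.

```python
from typing import Optional, Tuple

# Assumed for illustration: host is primary and check is secondary,
# as in the example payload.
PRIMARY = "host"
SECONDARY = "check"


def alert_identity(event: dict) -> Tuple[str, str, Optional[str]]:
    """Return (app_key, primary value, secondary value or None)."""
    # The secondary property is optional, so a missing value becomes None.
    return (event["app_key"], event[PRIMARY], event.get(SECONDARY))


event = {
    "app_key": "123",
    "status": "critical",
    "host": "production-database-1",
    "timestamp": 1402302570,
    "check": "CPU overloaded",
}
print(alert_identity(event))
```

Two payloads that produce the same key would be treated as events of the same alert under this sketch.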

Processing Events in the Alert Life Cycle

In BigPanda, an alert represents the current state of a specific sensor in the source monitoring system. Every alert has its own life cycle—it starts at some point, ends at another, and occasionally flaps between states. For example, a CPU load alert may start with a warning event, then increase in severity with a critical event, and finally get resolved with an ok event. BigPanda uses deduping and merging to process events so that you can see the life cycle of an alert as it unfolds.

Events Deduping

If BigPanda receives two events with the same application key, timestamp, and primary and secondary properties, the last of these events is dropped.

Sending Duplicate Events With The REST API

When an event sent via the Alerts API is deduplicated, BigPanda returns an HTTP response code of 204 No Content.
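
The deduplication rule can be approximated as a lookup on the four identifying fields. This is a minimal sketch, assuming `host` and `check` are the primary and secondary properties; the function name `dedupe` is hypothetical, and the sketch does not model the HTTP response.

```python
def dedupe(events):
    """Keep the first event seen for each (app_key, timestamp, primary, secondary)."""
    seen = set()
    kept = []
    for event in events:
        key = (event["app_key"], event["timestamp"], event["host"], event.get("check"))
        if key in seen:
            continue  # a later event with the same key is dropped
        seen.add(key)
        kept.append(event)
    return kept


first = {
    "app_key": "123",
    "timestamp": 1402302570,
    "host": "production-database-1",
    "check": "CPU overloaded",
    "status": "critical",
}
duplicate = dict(first)                                  # identical key
later = dict(first, timestamp=1402302630, status="ok")   # different timestamp
unique_events = dedupe([first, duplicate, later])
print(len(unique_events))
```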

Merging Events into an Alert

A single alert in BigPanda can contain one or more events. Events are merged into the same alert if they have the same application key and the same primary and secondary properties. The current status and properties of the alert represent the most recent event, which is determined by the timestamp property. In the following example, the two events are merged into a single alert with a status of Warning and a description of CPU is above warning limit (40%).

First event (timestamp is 17 Apr 2017 18:07:36 GMT):

{
  "status": "critical",
  "host": "production-database-1",
  "timestamp": 1492452456,
  "check": "CPU overloaded",
  "description": "CPU is above upper limit (70%)"
}

Second event (timestamp is 17 Apr 2017 18:09:38 GMT):

{
  "status": "warning",
  "host": "production-database-1",
  "timestamp": 1492452578,
  "check": "CPU overloaded",
  "description": "CPU is above warning limit (40%)"
}

A helpful way to understand how events are merged into alerts is to use the timeline to visualize the life cycle of an incident in BigPanda. Each row in the timeline represents an individual alert, and each dot in a row represents an event in the life cycle of that alert.
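
The rule that the most recent timestamp determines the current status can be sketched as follows. The function name `merge_by_identity` and the output shape are assumptions made for this example, and `host` and `check` are assumed to be the primary and secondary properties.

```python
from collections import defaultdict


def merge_by_identity(records):
    """Group records by identity and report the latest record's status per group."""
    groups = defaultdict(list)
    for record in records:
        key = (record.get("app_key"), record["host"], record.get("check"))
        groups[key].append(record)

    merged = {}
    for key, items in groups.items():
        latest = max(items, key=lambda r: r["timestamp"])  # most recent wins
        merged[key] = {
            "status": latest["status"],
            "description": latest["description"],
            "count": len(items),
        }
    return merged


records = [
    {"status": "critical", "host": "production-database-1", "timestamp": 1492452456,
     "check": "CPU overloaded", "description": "CPU is above upper limit (70%)"},
    {"status": "warning", "host": "production-database-1", "timestamp": 1492452578,
     "check": "CPU overloaded", "description": "CPU is above warning limit (40%)"},
]
merged = merge_by_identity(records)
print(merged)
```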

Sending Multiple Alerts With The REST API

BigPanda uses the timestamp to determine the latest status of an alert. If the timestamp is not included, BigPanda uses the time when the event is received. To ensure that BigPanda accurately reflects the current status when you send multiple events in one request, include the timestamp for each event or sort the alerts array by when the events occurred, in ascending order.
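
One way to follow this guidance on the client side is to stamp and sort a batch before sending it. The helper below is a sketch only: `order_for_sending` and `fallback_time` are hypothetical names, and filling a missing timestamp with the current time is an assumption that mirrors the receive-time behavior described above.

```python
import time


def order_for_sending(items, fallback_time=None):
    """Fill in missing timestamps, then sort ascending by occurrence time."""
    fallback = int(time.time()) if fallback_time is None else fallback_time
    stamped = [{**item, "timestamp": item.get("timestamp", fallback)} for item in items]
    return sorted(stamped, key=lambda item: item["timestamp"])


batch = [
    {"host": "production-database-1", "check": "CPU overloaded",
     "status": "warning", "timestamp": 1492452578},
    {"host": "production-database-1", "check": "CPU overloaded",
     "status": "critical", "timestamp": 1492452456},
]
ordered = order_for_sending(batch)
print([item["status"] for item in ordered])
```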

Clustering Alerts into Incidents

After merging events into alerts, BigPanda provides additional noise suppression and improved visibility by clustering highly related alerts into a single, high-level incident. For example, a connectivity problem may cause several checks on the same host to enter a critical state. All of these alerts are clustered into a single incident, so that you can see up-to-date information from each check on the same timeline. BigPanda uses correlation patterns to define relationships between alerts and applies pattern recognition to dynamically cluster alerts into incidents.

Defining Correlation Patterns

Correlation patterns define the relationships between alerts by using these parameters:

  • Tags—properties that indicate when alerts are related. For example, correlate all alerts that come from the same cluster and have the same check.

  • Time window—amount of time between when the alerts started. For example, network-related alerts may start within a short time from one another, while load issues may develop over a longer period of time.

  • Filter—(optional) conditions that further refine which alerts this relationship applies to. For example, correlate only network-related alerts by data center.

The default correlation patterns are:

  • Same primary property, started within a 2 hour time window—identifies related alerts from the same primary object, such as a host, application, or service. For example, several alerts on the same host with different checks may be related to the same problem.

  • Same cluster, started within a 30 minute time window—identifies when different objects within the same topological area of your infrastructure may be experiencing the same problem. For example, high CPU alerts on several servers in your MySQL cluster.

You can customize the correlation pattern definitions to increase the effectiveness of your alert correlation. To learn more, see Defining Correlation Patterns.

Applying Pattern Recognition

To achieve dynamic clustering, BigPanda keeps track of all the patterns that match an incident, and matching patterns are evaluated iteratively in real time. When a new alert is received, BigPanda evaluates it against any patterns for active incidents that are within the start time window. If the alert matches a pattern for an existing incident, it is added as a related alert, and any patterns that no longer match all of the related alerts are eliminated as matching the incident. If the alert doesn't match an existing incident, a new incident is created with any patterns that match the alert. The incident title is determined by the matching pattern with the widest time window.

As new alerts are received, the process continues, and incidents become more well-defined as more information is available. After the time windows of matching patterns have elapsed, no new alerts are added to the incident. The life cycle of an incident is determined by the alerts it contains, and the incident remains open until all related alerts are resolved. To learn more, see Incident Life Cycle Logic.

BigPanda correlates a maximum of 300 alerts into a single incident.
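
The clustering behavior described in this section can be sketched in a few lines. Everything here is a simplified assumption rather than BigPanda's actual engine: the names `Pattern`, `Incident`, `correlate`, and `title_pattern` are hypothetical, alerts carry a `start` time in seconds, filters are omitted, and the time window is measured from the first alert in the incident.

```python
from dataclasses import dataclass
from typing import List

MAX_ALERTS = 300  # maximum alerts per incident, as stated above


@dataclass(frozen=True)
class Pattern:
    tags: tuple       # tag names whose values must match
    window_min: int   # start-time window, in minutes


@dataclass
class Incident:
    alerts: List[dict]
    patterns: List[Pattern]  # patterns that still match every alert


def matches(pattern, alert, incident):
    first = incident.alerts[0]
    in_window = abs(alert["start"] - first["start"]) <= pattern.window_min * 60
    same_tags = all(t in alert and alert[t] == first.get(t) for t in pattern.tags)
    return in_window and same_tags


def correlate(alert, incidents, patterns):
    for incident in incidents:
        if len(incident.alerts) >= MAX_ALERTS:
            continue
        still_matching = [p for p in incident.patterns if matches(p, alert, incident)]
        if still_matching:
            incident.alerts.append(alert)
            incident.patterns = still_matching  # drop patterns that no longer fit
            return incident
    applicable = [p for p in patterns if all(t in alert for t in p.tags)]
    incident = Incident([alert], applicable)
    incidents.append(incident)
    return incident


def title_pattern(incident):
    """Pattern with the widest time window, or None if no pattern matches."""
    return max(incident.patterns, key=lambda p: p.window_min, default=None)


# The two default patterns described above.
defaults = [Pattern(("host",), 120), Pattern(("cluster",), 30)]
incidents = []
for alert in (
    {"host": "db1", "cluster": "prod", "start": 0},
    {"host": "db2", "cluster": "prod", "start": 600},
    {"host": "db1", "cluster": "other", "start": 900},
):
    correlate(alert, incidents, defaults)
print(len(incidents), [len(i.alerts) for i in incidents])
```

In this run the second alert joins the first incident through the cluster pattern, which eliminates the host pattern for that incident, so the third alert starts a new incident even though it shares a host with the first alert.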

Defining Correlation Patterns

You can define and manage the correlation patterns that determine how alerts from an integrated monitoring source are clustered into BigPanda incidents. Use custom patterns to optimize alert correlation based on the structure and processes of your company's production infrastructure.

Prerequisites

Creating New Correlation Patterns

Correlation patterns determine how alerts are clustered into incidents based on the monitoring source, the values of specific tags, the starting time, and a filter. For example, you can create a pattern to correlate AppDynamics alerts with the same application, starting within 30 minutes of one another, in the production cluster. To create a new correlation pattern:

  1. In the top right, click the Settings icon, and then click Correlation Patterns.
    A list of the existing patterns appears.

  2. Click New Pattern.

  3. Define the conditions that indicate the alerts are related.

Field
Description

Source Systems

One or more integrated monitoring systems for which this pattern applies. As you type, the field displays matching source system names. Click an item to add it.

If you enter more than one source system, see the Allow Cross Source Correlation check box.

Allow Cross Source Correlation

Option to correlate alerts from different source systems into the same incident. This option applies only if you select more than one source system for the pattern.

  • Select the check box to correlate alerts from different source systems into the same incident, when applicable.

  • Clear the check box to correlate only alerts from the same source into the same incident. The pattern still applies to every alert from every selected source.

Tags

Tag names for which alerts with matching values are correlated. For example, enter cluster and check to correlate all alerts that come from the same cluster and have the same check. You can enter up to five tags. As you type, the field displays matching tag names and relevant source systems. Click an item to add it.

Time Window

Maximum duration between the start time of correlated alerts. You can select a time window from 1 minute up to 4320 minutes (3 days).

Query Filter

(Optional) Query that further refines which alerts are correlated. For example, you can specify a tag of datacenter and then enter a query of check=*ping* to correlate only ping alerts by data center.

Click Add a Query Filter and enter a query in BigPanda Query Language (BPQL). As you type, the field displays suggested tags and relevant source system names.

Note

(Optional) Short description of the pattern. Consider adding a note that explains why the pattern is important and how it works.

Create As Inactive

Option to save the pattern definition without affecting your BigPanda instance. If you do not select the check box, BigPanda begins correlating new alerts according to the pattern immediately after it is created.

Tags

To learn more about a tag, you can view an alert in the BigPanda UI, reference the documentation on standard tags, or review the custom tags defined for your organization.

Correlation Time Window vs. Incident Reopen Window

The correlation time window applies to the first event for a new alert. Alerts are correlated into the same incident only if their first event falls within the same time window (that is, they started around the same time).

The reopen window for an incident applies to existing alerts that change status from closed to open. If an alert on a resolved incident reoccurs within the reopen window, the incident is reopened. If the reopen window has elapsed, the reoccurrence is treated as a new alert.

  4. Use the preview to test the pattern, and adjust the pattern definition as necessary to optimize the correlation.

  5. Click Create Pattern.
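
The field limits in the table above can be expressed as a small data structure. This is a sketch under stated assumptions: `PatternDefinition` and its attribute names are hypothetical, and the filter handles only a single tag compared against a wildcard value (as in check=*ping*) using case-sensitive shell-style matching, which is far narrower than BPQL.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Tuple

MAX_TAGS = 5
MIN_WINDOW_MIN, MAX_WINDOW_MIN = 1, 4320  # 1 minute up to 3 days


@dataclass
class PatternDefinition:
    source_systems: Tuple[str, ...]
    tags: Tuple[str, ...]
    window_min: int
    filter_tag: Optional[str] = None    # e.g. "check"
    filter_glob: Optional[str] = None   # e.g. "*ping*"
    cross_source: bool = False
    active: bool = True

    def validate(self):
        """Return a list of problems; an empty list means the limits are met."""
        errors = []
        if len(self.tags) > MAX_TAGS:
            errors.append(f"at most {MAX_TAGS} tags are allowed")
        if not MIN_WINDOW_MIN <= self.window_min <= MAX_WINDOW_MIN:
            errors.append("time window must be between 1 and 4320 minutes")
        return errors

    def filter_accepts(self, alert):
        """Apply the optional wildcard filter to one alert."""
        if self.filter_tag is None:
            return True
        value = alert.get(self.filter_tag)
        return value is not None and fnmatchcase(str(value), self.filter_glob)


definition = PatternDefinition(
    source_systems=("nagios",),
    tags=("datacenter",),
    window_min=30,
    filter_tag="check",
    filter_glob="*ping*",
)
print(definition.validate())
print(definition.filter_accepts({"datacenter": "dublin", "check": "host ping check"}))
```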

When Patterns Are Applied

When you create a new correlation pattern or activate a previously disabled pattern, any new alerts are correlated according to the pattern. Existing incidents are not affected by the pattern.

After an alert is correlated into an incident, it remains in that incident. If more than one pattern matches an incident, the incident title is based on the pattern with the largest time window.

Using The Preview

When creating or editing a pattern, you can use the preview to test the pattern against your real historical data in BigPanda, without affecting any live data.

  1. (Optional) In the right pane, select a time frame for the preview, and click Regenerate to update the preview.

The preview displays up to 50 incidents within the selected time frame that match the correlation pattern.

  2. Evaluate the correlation results in the preview for:

    • Effectiveness—review the compression rate to see the percentage of alerts that are correlated into incidents. You can adjust the time frame to see whether it impacts the compression rate. If a pattern is not as effective as it used to be, you may need to optimize the pattern to account for infrastructure changes.

    • Accuracy—review how actual alerts would have been correlated into incidents according to this pattern. Confirm that alerts in each incident are related to the same problem. If the correlation patterns are not accurate, they won’t help you resolve issues faster and may lead to confusion or missed issues.

  3. If necessary, adjust the pattern definition in the left pane and click Regenerate to update the preview.

  4. Repeat as necessary to optimize the correlation pattern.
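
When comparing preview runs, it can help to compute a compression figure yourself. The formula below is one common definition (the share of alerts that did not become separate incidents) and is an assumption; the preview may calculate its compression rate differently. The function name `compression_rate` is hypothetical.

```python
def compression_rate(alert_count, incident_count):
    """Percentage reduction from alerts to incidents (assumed definition)."""
    if alert_count <= 0:
        return 0.0
    return 100.0 * (1 - incident_count / alert_count)


# 200 alerts grouped into 25 incidents.
print(compression_rate(200, 25))
```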

Managing Existing Correlation Patterns

You can edit, duplicate, temporarily disable, or permanently delete correlation patterns.

  1. In the top right, click your name, then click Settings.

  2. In the left pane, click Correlation Patterns. The existing patterns are sorted with the most recently used pattern listed first.

  3. Locate the pattern you want to change.

Searching For Patterns

You can filter the list of patterns by entering a search term in the field above the list. For example, enter Nagios to see all of the correlation patterns that have Nagios included as a source system.

  4. Use any of the following options to manage the pattern:

    • To edit the pattern, click the Edit icon, and then modify the conditions under which alerts are correlated. Click Update Pattern to apply the changes.

    • To duplicate the pattern, click the Duplicate icon, and then modify the pattern as necessary. Click Duplicate Pattern to save it as a new pattern. For example, to create similar patterns for two data sources that use different tag names, create the first pattern, then duplicate it and modify the copy to select the other source system and tag.

    • To temporarily disable the pattern, click the pattern, and then click the Active toggle at the top right of the pattern details pane. Click Deactivate to confirm the change. You can re-enable the pattern by clicking the Inactive toggle.

    • To permanently delete the pattern, click the Delete icon, and then click Delete to confirm the deletion.

Changes Affect Only New Incidents

Changes to correlation patterns affect only new incidents, not existing incidents. When you disable or delete a pattern, new alerts are no longer correlated according to it. However, existing incidents stay correlated according to the pattern logic for the remaining life cycle of the incident.

Post-Requisites

Ensure the correlation patterns are giving the desired results by periodically reviewing incidents and obtaining feedback from your team.

Incident Life Cycle Logic

The life cycle of an incident is defined by the life cycle of the alerts it contains. An incident remains active while at least one of its alerts is active, is automatically resolved when all of its alerts are resolved, and is reopened when a resolved alert becomes active again.

Alert Resolution and Closing Incidents

An incident remains open as long as at least one of the alerts associated with it is open. When BigPanda receives an event with a status of ok, the related alert is automatically resolved.

Alerts that have not been resolved remain open in BigPanda. The corresponding incident also remains open and continues to appear in the Incident Feed.
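
The relationship between alert states and the incident state can be sketched as below. The function names and the `active`/`resolved` labels are assumptions for illustration; the only rules taken from the text are that an ok event resolves its alert and that an incident stays open while any alert is open.

```python
def apply_event(alert_states, alert_id, status):
    """Update one alert's state: an 'ok' event resolves it, anything else opens it."""
    alert_states[alert_id] = "resolved" if status == "ok" else "active"


def incident_state(alert_states):
    """An incident is active while at least one of its alerts is active."""
    return "active" if any(s == "active" for s in alert_states.values()) else "resolved"


states = {}
apply_event(states, "cpu", "critical")
apply_event(states, "disk", "warning")
apply_event(states, "cpu", "ok")
after_one_ok = incident_state(states)    # disk is still open
apply_event(states, "disk", "ok")
after_all_ok = incident_state(states)
print(after_one_ok, after_all_ok)
```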

Resolving Alerts With The REST API

To keep only the most relevant information in the incident feed, send a resolving event (an event with a status of ok) to BigPanda when an alert is no longer active.

Reopening Incidents

Resolved incidents are reopened when any of the alerts associated with them reopen. This rule applies regardless of how the incident was resolved—manually, due to inactivity, or automatically when all associated alerts were resolved. Alerts are reopened if they reoccur within 4 hours of when they were resolved. If an alert reoccurs more than 4 hours later, it is handled as a new alert.

New alerts are not correlated to resolved incidents, and resolved incidents are reopened only when existing alerts become active again. Incidents that are more than 30 days old are never reopened. If the associated alerts reoccur, a new incident is created.
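
The reopen rules can be summarized as a small decision function. This is a sketch, not BigPanda's implementation: `on_recurrence` and its return labels are hypothetical, times are in seconds, and measuring incident age from the incident's start time is an assumption.

```python
REOPEN_WINDOW_S = 4 * 3600         # alerts reopen within 4 hours of resolution
MAX_INCIDENT_AGE_S = 30 * 86400    # incidents older than 30 days never reopen


def on_recurrence(resolved_at, incident_started_at, now):
    """Decide what happens when a resolved alert fires again."""
    if now - incident_started_at > MAX_INCIDENT_AGE_S:
        return "new_incident"
    if now - resolved_at <= REOPEN_WINDOW_S:
        return "reopen"
    return "new_alert"


print(on_recurrence(resolved_at=1000, incident_started_at=0, now=1000 + 3600))
print(on_recurrence(resolved_at=1000, incident_started_at=0, now=1000 + 5 * 3600))
```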

Flapping Incidents

Flapping occurs when a monitored object, such as a service or host, changes state too frequently, making the cause and severity of the incident unclear. For example, flapping can be indicative of configuration problems (such as thresholds set too low), troublesome services, or real network problems.

In BigPanda, an incident enters the flapping state when one or more of the related alerts are flapping. By default, an alert is considered to be flapping when it has changed states more than 4 times in one hour. Contact BigPanda support if you need to configure custom logic (number of state changes within a period of time) for your organization or for a specific integration.

When an incident enters the flapping state, all subscribed users are notified and no additional state change notifications are sent. Subscribed users still receive a daily email reminding them about the incident. An incident exits the flapping state when all related alerts stop flapping (no longer meet the criteria for number of state changes in a period of time). BigPanda checks the flapping criteria every 15 minutes.
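
The default flapping criterion (more than 4 state changes in one hour) can be checked with a sliding window. The function name `is_flapping` and its parameters are assumptions for illustration, with times given in seconds.

```python
def is_flapping(change_times, now, max_changes=4, period_s=3600):
    """True if more than max_changes state changes occurred in the last period."""
    recent = [t for t in change_times if now - period_s < t <= now]
    return len(recent) > max_changes


now = 7200
five_changes = [now - 3000, now - 2400, now - 1800, now - 1200, now - 600]
print(is_flapping(five_changes, now))
print(is_flapping(five_changes[1:], now))
```

An incident-level check under the same assumptions would be true when any of its alerts is flapping.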

Incident Titles

Incident titles give you insight into the scope and impact of an incident. They are dynamically generated based on how the alerts are correlated and how many subjects are affected.

Benefits of Incident Titles

With descriptive incident titles, you can quickly gain insights into:

  • Incident impact—see what infrastructure is affected by the incident. For example, see whether an entire cluster is down or just a single host.

  • Correlation logic—understand why the alerts are grouped together, which can help you investigate the problem.

  • Related alerts—see a summary of the individual alerts that the incident contains.

How It Works

Incident titles are generated based on the following logic:

  • Main title—shows why the alerts are correlated into an incident. For example, if an incident correlates all CPU alerts on the database cluster, the main title is: cluster: database · check: CPU

  • Subtitle—summarizes the subjects that are part of the incident. For example, if the incident is correlated by a cluster, the subtitle lists alerting hosts in the cluster. If the incident is correlated by host, the subtitle lists alerting checks. The subtitle includes the following elements:

    • Counter—number of unique subjects that are affected.

    • Type—type of subjects (for example, hosts or checks).

    • Time period—amount of time during which the alerts were first triggered.

    • Examples—list of unique subjects. The subtitle shows as many examples as can fit, depending on the screen size.
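
To make the title logic concrete, the sketch below assembles a main title and a subtitle from a list of alerts. The exact wording and layout of real titles are not specified here, so the output format, the function names, the `start` field (in seconds), and the fixed `max_examples` cutoff are all assumptions.

```python
def main_title(correlation_tags, alert):
    """Join the correlating tags and their shared values."""
    return " · ".join(f"{tag}: {alert[tag]}" for tag in correlation_tags)


def subtitle(alerts, subject_tag, max_examples=3):
    """Counter, type, time period, and a few example subjects."""
    subjects = list(dict.fromkeys(a[subject_tag] for a in alerts))  # unique, in order
    starts = [a["start"] for a in alerts]
    minutes = max(1, round((max(starts) - min(starts)) / 60))
    examples = ", ".join(subjects[:max_examples])
    return f"{len(subjects)} {subject_tag}s within {minutes} min: {examples}"


alerts = [
    {"cluster": "database", "check": "CPU", "host": "db1", "start": 0},
    {"cluster": "database", "check": "CPU", "host": "db2", "start": 300},
]
print(main_title(("cluster", "check"), alerts[0]))
print(subtitle(alerts, "host"))
```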

Example of Incident Titles

Checks Correlated by Host

In the example incident below, the main title indicates that all alerts belong to the api12.nyc.acme.com host. The subtitle indicates that 4 checks are alerting on the host and all of the checks were triggered within 1 minute of each other. Also, the subtitle lists a few examples of the checks.

Hosts Correlated by Cluster and Check

In the example incident below, the main title indicates that all alerts have a check of CPU load and they all occurred within the billing cluster. The subtitle indicates that a total of 6 hosts are involved in the incident. The alerts were triggered within 22 minutes of each other. Also, the subtitle lists a few examples of the hosts.

Ping Alerts Correlated Within a Data Center

This example demonstrates a type of incident that commonly occurs when a network switch fails. In the example below, the main title indicates that the Dublin data center is experiencing a ping issue. The subtitle indicates that a total of 243 hosts reported the problem over a time span of 4 minutes. Also, the subtitle lists a few examples of the hosts.
