Algorithmic Correlation

Correlation is a process of grouping related alerts into a single, high-level incident. BigPanda uses pattern recognition to automatically process the data generated by your monitoring systems and to dynamically cluster alerts into meaningful, actionable incidents. BigPanda provides default correlation patterns as well as the option to tailor patterns to your organization.

Algorithmic Correlation Process

Key Features

BigPanda's automatic alert correlation provides:

  • Improved detection - Find critical issues faster. During outages, massive alert volumes are intelligently clustered into incidents, so important alerts stand out from the noise and you can stay focused on the main issue.
  • Faster remediation - Get the full context of an incident instead of just one data point. For example, you can quickly learn that the entire MongoDB cluster is having a multitude of disk issues, instead of analyzing isolated DISK IO alerts to find their commonality.
  • Better productivity - Reduce the number of tickets that operators have to handle, thereby improving their ability to effectively manage emergency situations.
  • More control - Customize how alerts are correlated to improve accuracy and efficiency. The correlation logic is fully transparent and easily configurable from right within the BigPanda UI.

How It Works

BigPanda ingests the raw event data from monitoring systems such as Nagios, CloudWatch, and systems integrated via the Alerts API. The data is normalized into standard tags and enriched with configuration information, operational categories, and other custom tags. Then, the BigPanda alert correlation engine merges the events into alerts and clusters the alerts into high-level, actionable incidents by evaluating alert properties against patterns in:

  • Topology - The host, host group, service, application, cloud, or other infrastructure element that emits the alerts. Alerts are more likely to be related when they come from the same area in your infrastructure.
  • Time - The rate at which related alerts occur. Alerts occurring around the same time are more likely to be related than alerts occurring far apart.
  • Context - The type of alerts. Some alert types imply a relationship between them, while others don't.

As new alerts are received, BigPanda evaluates all matching patterns and determines whether to update an existing incident or create a new one. With this algorithm, BigPanda can correlate alerts accurately and reduce your monitoring noise by as much as 90–99% in some environments. Correlations occur in under 100ms, so you see updates in real time for maximum visibility into critical problems.
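
The update-or-create decision described above can be pictured with a small sketch. This is not BigPanda's implementation. It is a simplified illustration that assumes a pattern is just a set of tag names that must match (topology), a time window (time), and an optional alert type (context). The `Alert`, `Pattern`, `Incident`, and `Correlator` names, and the `check` tag, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    tags: dict        # normalized tags, e.g. {"host": "db-1", "cluster": "mysql-a"}
    timestamp: float  # seconds since some reference point


@dataclass
class Pattern:
    tags: tuple                  # topology: tag names whose values must be equal
    window: float                # time: max seconds since the incident's latest alert
    check: Optional[str] = None  # context: only consider alerts of this type, if set

    def matches(self, alert):
        if self.check is not None and alert.tags.get("check") != self.check:
            return False
        return all(name in alert.tags for name in self.tags)

    def key_for(self, alert):
        return tuple(alert.tags[name] for name in self.tags)


@dataclass
class Incident:
    pattern: Optional[Pattern]
    key: tuple
    alerts: list = field(default_factory=list)


class Correlator:
    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.incidents = []

    def ingest(self, alert):
        matching = [p for p in self.patterns if p.matches(alert)]
        # First try to update an existing incident under any matching pattern.
        for pattern in matching:
            key = pattern.key_for(alert)
            for incident in self.incidents:
                if (incident.pattern is pattern
                        and incident.key == key
                        and alert.timestamp - incident.alerts[-1].timestamp <= pattern.window):
                    incident.alerts.append(alert)
                    return incident
        # Otherwise open a new incident (under the first matching pattern, if any).
        pattern = matching[0] if matching else None
        key = pattern.key_for(alert) if pattern else ()
        incident = Incident(pattern, key, [alert])
        self.incidents.append(incident)
        return incident
```

In this sketch, two load alerts from different hosts in the same cluster join one incident as long as they arrive within the pattern's window, while an alert from another cluster, or one that arrives after the window has passed, opens a new incident.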

Terminology

See our Glossary for the differences and hierarchy of raw Events to merged Alerts to correlated Incidents.
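
As a rough illustration of the first step in that hierarchy, the sketch below folds raw events into alerts. It assumes, purely for illustration, that events sharing the same `host` and `check` values belong to the same alert and that the latest event sets the alert's status. The field names and the `merge_events` function are hypothetical, not BigPanda's schema.

```python
def merge_events(events):
    """Fold raw events into alerts keyed by (host, check).

    Each event is a dict with "host", "check", "status", and "timestamp".
    The alert keeps every event and takes its status from the latest one.
    """
    alerts = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["host"], event["check"])
        alert = alerts.setdefault(
            key, {"host": event["host"], "check": event["check"], "events": []}
        )
        alert["events"].append(event)
        alert["status"] = event["status"]
    return alerts


events = [
    {"host": "db-1", "check": "disk_io", "status": "critical", "timestamp": 10},
    {"host": "db-1", "check": "disk_io", "status": "ok", "timestamp": 50},
    {"host": "db-2", "check": "disk_io", "status": "warning", "timestamp": 20},
]
alerts = merge_events(events)
```

Here three events collapse into two alerts, one per host and check pair. Correlating those alerts into incidents is a separate step.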

Custom Correlation Patterns

Correlation patterns are high-level definitions that determine how alerts are clustered into BigPanda incidents. To increase the effectiveness of your alert correlation, you can customize the correlation pattern definitions in BigPanda based on the structure and processes of your company's production infrastructure. For example, you can create patterns that correlate:

  • Network-related connectivity issues within the same data center.
  • Application-specific checks on the same host.
  • Load-related alerts from multiple servers in the same database cluster.
  • Low memory alerts on a distributed cache.
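
To make the examples above concrete, here is one way such patterns could be written down as data. This is a sketch only. The `PatternSpec` structure, the tag names (`datacenter`, `cluster`, `cache_cluster`, `category`, `check`), and the window values are all assumptions chosen for illustration, and they do not reflect BigPanda's actual pattern syntax or defaults.

```python
from dataclasses import dataclass
from typing import Callable


def _always(tags):
    return True


@dataclass(frozen=True)
class PatternSpec:
    name: str
    tags: tuple                                # tag names whose values must be equal
    window_minutes: int                        # how long alerts may keep joining
    condition: Callable[[dict], bool] = _always  # optional filter on the alert's tags


# Hypothetical definitions mirroring the four examples in the list above.
PATTERNS = [
    PatternSpec("network-by-datacenter", ("datacenter",), 30,
                lambda t: t.get("category") == "network"),
    PatternSpec("app-checks-by-host", ("host",), 60,
                lambda t: t.get("category") == "application"),
    PatternSpec("db-load-by-cluster", ("cluster",), 30,
                lambda t: t.get("check") == "load"),
    PatternSpec("cache-low-memory", ("cache_cluster",), 30,
                lambda t: t.get("check") == "low_memory"),
]


def correlation_keys(alert_tags):
    """Return {pattern name: key} for every pattern the alert qualifies for."""
    keys = {}
    for spec in PATTERNS:
        if spec.condition(alert_tags) and all(tag in alert_tags for tag in spec.tags):
            keys[spec.name] = tuple(alert_tags[tag] for tag in spec.tags)
    return keys
```

Alerts that produce the same key under the same pattern are candidates for the same incident. For example, load alerts tagged with the same `cluster` value would share a key under the `db-load-by-cluster` entry.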

To learn more about defining and managing correlation patterns, see our Working with Correlation Patterns guide.

Examples of Alert Correlation

Multi-Node MySQL Cluster Experiencing Loads

In this example, six different nodes in the same MySQL cluster began sending load-related alerts within 29 minutes of each other. Some of the nodes recovered momentarily before returning to a critical state, while others remained in a critical or warning state. BigPanda grouped more than 75 critical and recovery events into a single incident and displayed them together in a timeline that makes it easy to spot which alerts occurred first and which nodes alerted later.

Multi-Node Correlation Timeline

Multiple Flapping Alerts For A Single Application

In this example, two different alerts came in for the same application. The alerts then proceeded to resolve and reopen in rapid succession. When an alert is changing states frequently, or flapping, it may generate numerous events that are not immediately actionable. BigPanda grouped each of these potential notifications into one ongoing incident to maximize visibility without overwhelming your team with duplicate notifications.
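
One simple way to think about flapping is to count how often an alert changes state within a recent window. The sketch below does that. The one-hour window and the threshold of four changes are arbitrary assumptions for illustration, not BigPanda's actual flapping criteria, and the function names are hypothetical.

```python
def count_state_changes(events, now, window_seconds):
    """Count status transitions among events in the last `window_seconds`.

    `events` is a list of (timestamp, status) tuples sorted by timestamp.
    """
    recent = [status for ts, status in events if now - ts <= window_seconds]
    return sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)


def is_flapping(events, now, window_seconds=3600, max_changes=4):
    """Treat an alert as flapping when it changed state more than `max_changes` times."""
    return count_state_changes(events, now, window_seconds) > max_changes


flappy = [(0, "critical"), (60, "ok"), (120, "critical"),
          (180, "ok"), (240, "critical"), (300, "ok")]
steady = [(0, "critical"), (3000, "ok")]
```

With these inputs, `is_flapping(flappy, now=300)` is true because the status changed five times in five minutes, while `is_flapping(steady, now=3000)` is false because there was a single change.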

Flapping Incidents Timeline

To learn more about when incidents are resolved, reopened, or considered in a flapping state, see our Incident Life Cycle Logic documentation.
