Automating alert correlation with BigPanda provides:
- Improved detection - Find critical issues faster. During outages, massive alert volumes are intelligently clustered into incidents, so important alerts stand out from the noise and you can stay focused on the main issue.
- Faster remediation - Get the full context of an incident, instead of just one data point. For example, you can quickly learn that the entire MongoDB cluster is having a multitude of disk issues, instead of analyzing an isolated DISK IO alert.
- Better productivity - Reduce the number of tickets that operators have to handle, thereby improving their ability to effectively manage emergency situations.
- More control - Customize how alerts are correlated to improve accuracy and efficiency. The correlation logic is fully transparent and easily configurable, without writing any code and with only a handful of high-level patterns.
BigPanda ingests the raw event data from monitoring systems such as Nagios, CloudWatch, and systems integrated via the Alerts API. The data is normalized into standard tags and enriched with configuration information, operational categories, and other custom tags. Then, the BigPanda alert correlation engine merges the events into alerts and clusters the alerts into high-level, actionable incidents by evaluating the alert properties against patterns in:
- Topology - The host, host group, service, application, cloud, or other infrastructure element that emits the alerts. Alerts are more likely to be related when they come from the same area in your infrastructure.
- Time - The rate at which related alerts occur. Alerts occurring around the same time are more likely to be related than alerts occurring far apart.
- Context - The type of alerts. Some alert types imply a relationship between them, while others don’t.
As new alerts are received, BigPanda evaluates all matching patterns and determines whether to update an existing incident or create a new one. In this way, BigPanda can correlate alerts accurately and reduce your monitoring noise by as much as 90 to 99% in some environments. Correlations occur in under 100 ms, so you see updates in real time for maximum visibility into critical problems.
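To make the update-or-create decision concrete, here is a minimal Python sketch of clustering by shared tags and a time window. It is not BigPanda's implementation: the `Alert` and `Incident` classes, the `correlate` function, the tag names, and the 30-minute window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    # Hypothetical normalized alert: a timestamp in seconds plus a dict of tags.
    timestamp: float
    tags: dict


@dataclass
class Incident:
    key: Optional[tuple]  # tag values shared by the alerts, or None if unmatched
    last_seen: float      # timestamp of the most recently added alert
    alerts: list = field(default_factory=list)


def correlate(alerts, pattern_tags, window_seconds):
    """Cluster alerts that share the values of pattern_tags (topology) and
    arrive within window_seconds of the incident's latest alert (time)."""
    incidents = []
    open_by_key = {}  # most recent incident for each combination of tag values
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if all(tag in alert.tags for tag in pattern_tags):
            key = tuple(alert.tags[tag] for tag in pattern_tags)
        else:
            key = None  # the pattern does not apply to this alert
        current = open_by_key.get(key) if key is not None else None
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            # Update the existing incident.
            current.alerts.append(alert)
            current.last_seen = alert.timestamp
        else:
            # Create a new incident.
            incident = Incident(key=key, last_seen=alert.timestamp, alerts=[alert])
            incidents.append(incident)
            if key is not None:
                open_by_key[key] = incident
    return incidents


# Illustrative data only: three alerts from one cluster, one from another.
sample_alerts = [
    Alert(0, {"host": "db-1", "cluster": "mongo-prod", "check": "disk_io"}),
    Alert(60, {"host": "db-2", "cluster": "mongo-prod", "check": "disk_space"}),
    Alert(90, {"host": "web-1", "cluster": "frontend", "check": "latency"}),
    Alert(7200, {"host": "db-3", "cluster": "mongo-prod", "check": "disk_io"}),
]
sample_incidents = correlate(sample_alerts, pattern_tags=("cluster",), window_seconds=1800)
print([len(incident.alerts) for incident in sample_incidents])  # prints [2, 1, 1]
```

The first two `mongo-prod` alerts land in one incident because they share a cluster and arrive a minute apart. The last one opens a new incident because it arrives well outside the assumed window.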
Correlation patterns are high-level definitions that determine how alerts are clustered into BigPanda incidents. To increase the effectiveness of your alert correlation, you can customize the correlation pattern definitions in BigPanda based on the structure and processes of your company's production infrastructure. For example, you can create patterns that correlate:
- Network-related connectivity issues within the same data center.
- Application-specific checks on the same host.
- Load-related alerts from multiple servers in the same database cluster.
- Low memory alerts on a distributed cache.
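As a rough illustration, the four example patterns above could be modeled as data. In BigPanda itself, patterns are configured without writing code, so the Python sketch below only shows the shape of the idea. The `CorrelationPattern` class, the tag names, the filter values, and the time windows are hypothetical and are not BigPanda settings.

```python
from dataclasses import dataclass, field


@dataclass
class CorrelationPattern:
    name: str
    correlate_on: tuple   # tags whose values must match across alerts
    window_minutes: int   # arbitrary illustrative time window
    where: dict = field(default_factory=dict)  # tag values an alert must have


# Hypothetical definitions mirroring the four examples in the list above.
PATTERNS = [
    CorrelationPattern("datacenter-network", ("datacenter",), 30, {"category": "network"}),
    CorrelationPattern("host-application", ("host", "application"), 60),
    CorrelationPattern("db-cluster-load", ("cluster",), 30, {"category": "load"}),
    CorrelationPattern("cache-low-memory", ("service",), 20, {"check": "low_memory"}),
]


def applicable_patterns(tags, patterns=PATTERNS):
    """Return the names of patterns whose correlation tags are all present
    on the alert and whose filter values all match."""
    names = []
    for pattern in patterns:
        has_tags = all(tag in tags for tag in pattern.correlate_on)
        passes_filter = all(tags.get(k) == v for k, v in pattern.where.items())
        if has_tags and passes_filter:
            names.append(pattern.name)
    return names
```

For example, an alert tagged `{"host": "db-1", "cluster": "mysql-prod", "category": "load"}` would match only the `db-cluster-load` pattern in this sketch, because it lacks the tags the other three patterns correlate on.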
In the example timeline shown below, each row represents a separate alert for a node in the same MySQL cluster. From the incident summary at the top of the timeline, you can see that six different hosts began sending load-related alerts within 29 minutes of each other. The timeline shows that some of the nodes recover momentarily before returning to a critical state. In this example, BigPanda grouped more than 75 critical and recovery events into a single incident.
In another example, BigPanda grouped the various alerts for a single host into one incident, despite time differences between events. From the timeline shown below, you can see how the connectivity issues escalated into multiple, related service failures.
When an alert is changing states frequently, or flapping, it may generate numerous events that are not immediately actionable. In the example timeline shown below, you can see how hundreds of potential notifications are grouped into one incident for the application that is flapping.
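One simple way to think about flapping is to count state changes within a recent time window. The sketch below does that in Python. It is an assumption for illustration, not BigPanda's actual criteria (those are described in the Incident Life Cycle Logic guide), and the function names and thresholds are hypothetical.

```python
def count_state_changes(statuses):
    """Count how many times consecutive statuses differ."""
    return sum(1 for prev, cur in zip(statuses, statuses[1:]) if prev != cur)


def is_flapping(events, window_seconds, max_changes):
    """Return True if the alert changed state more than max_changes times
    within window_seconds of its latest event.

    events: list of (timestamp_seconds, status) tuples sorted by timestamp.
    """
    if not events:
        return False
    latest = events[-1][0]
    recent = [status for ts, status in events if latest - ts <= window_seconds]
    return count_state_changes(recent) > max_changes
```

With a 10-minute window and a threshold of three changes, an alert that alternates between `critical` and `ok` every minute for five events would count as flapping here. An alert that recovers once after an hour would not.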
To learn more about how BigPanda merges events into alerts and clusters alerts into incidents, see Alert Correlation Logic.
To learn more about defining and managing correlation patterns, see our Working with Correlation Patterns guide.
To learn more about when incidents are resolved, reopened, or considered in a flapping state, see our Incident Life Cycle Logic guide.