BigPanda's automatic alert correlation provides:
- Improved detection - Find critical issues faster. During outages, massive alert volumes are intelligently clustered into incidents, so important alerts stand out from the noise and you can stay focused on the main issue.
- Faster remediation - Get the full context of an incident instead of just one data point. For example, you can quickly learn that the entire MongoDB cluster is having a multitude of disk issues, instead of analyzing isolated DISK IO alerts to find their commonality.
- Better productivity - Reduce the number of tickets that operators have to handle, thereby improving their ability to effectively manage emergency situations.
- More control - Customize how alerts are correlated to improve accuracy and efficiency. The correlation logic is fully transparent and easily configurable directly in the BigPanda UI.
BigPanda ingests the raw event data from monitoring systems such as Nagios, CloudWatch, and systems integrated via the Alerts API. The data is normalized into standard tags and enriched with configuration information, operational categories, and other custom tags. Then, the BigPanda alert correlation engine merges the events into alerts and clusters the alerts into high-level, actionable incidents by evaluating the alert properties against patterns in:
- Topology - The host, host group, service, application, cloud, or other infrastructure element that emits the alerts. Alerts are more likely to be related when they come from the same area in your infrastructure.
- Time - The rate at which related alerts occur. Alerts occurring around the same time are more likely to be related than alerts occurring far apart.
- Context - The type of alerts. Some alert types imply a relationship between them, while others don’t.
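The ingestion steps described above (normalize each raw event into standard tags, then merge events into alerts) can be sketched roughly as follows. This is a minimal illustration, not BigPanda's implementation: the field names in `FIELD_MAPS`, the `(host, check)` merge key, and the functions `normalize` and `merge_events` are all assumptions made for the example.

```python
# Hypothetical mapping from source-specific field names to standard tags.
FIELD_MAPS = {
    "nagios": {"host": "hostname", "check": "service_desc", "status": "state"},
    "cloudwatch": {"host": "InstanceId", "check": "AlarmName", "status": "NewStateValue"},
}


def normalize(raw, source):
    """Rename the source's fields to the standard tag names (illustrative mapping)."""
    field_map = FIELD_MAPS[source]
    return {std: raw[src] for std, src in field_map.items() if src in raw}


def merge_events(events):
    """Merge normalized events into alerts, assuming (host, check) identifies an alert."""
    alerts = {}
    for event in events:
        key = (event["host"], event["check"])
        alert = alerts.setdefault(
            key, {"host": event["host"], "check": event["check"], "status": None, "events": 0}
        )
        alert["status"] = event["status"]  # the latest event determines the current status
        alert["events"] += 1
    return alerts


raw_events = [
    ("nagios", {"hostname": "db-1", "service_desc": "load", "state": "CRITICAL"}),
    ("nagios", {"hostname": "db-1", "service_desc": "load", "state": "OK"}),
    ("cloudwatch", {"InstanceId": "i-123", "AlarmName": "cpu", "NewStateValue": "ALARM"}),
]
alerts = merge_events(normalize(raw, source) for source, raw in raw_events)
```

Here three raw events from two sources collapse into two alerts, because the two Nagios events describe the same host and check.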
As new alerts are received, BigPanda evaluates all matching patterns and determines whether to update an existing incident or create a new one. With this algorithm, BigPanda can accurately correlate alerts and reduce your monitoring noise by as much as 90 to 99% in some environments. Correlations occur in under 100 ms, so you see updates in real time for maximum visibility into critical problems.
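The update-or-create decision described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual BigPanda algorithm: the `Alert`, `Pattern`, and `Incident` classes, the `check` tag, and the rule that the time window is measured from the start of the incident are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Alert:
    tags: dict        # normalized tags, e.g. host, cluster, check
    timestamp: float  # seconds


@dataclass(frozen=True)
class Pattern:
    name: str
    correlation_tags: tuple   # topology: tag values that must be equal
    window_seconds: float     # time: how long an incident accepts new alerts
    alert_types: tuple = ()   # context: allowed check types (empty means any)


@dataclass
class Incident:
    pattern_name: Optional[str]
    key: tuple
    started_at: float
    alerts: list = field(default_factory=list)


def pattern_key(alert, pattern):
    """Return the alert's correlation key under this pattern, or None if it does not match."""
    if pattern.alert_types and alert.tags.get("check") not in pattern.alert_types:
        return None
    if any(tag not in alert.tags for tag in pattern.correlation_tags):
        return None
    return tuple(alert.tags[tag] for tag in pattern.correlation_tags)


def correlate(alert, patterns, incidents):
    """Add the alert to a matching open incident, or create a new incident for it."""
    matches = [(p, pattern_key(alert, p)) for p in patterns]
    matches = [(p, key) for p, key in matches if key is not None]
    for p, key in matches:
        for incident in incidents:
            same_group = incident.pattern_name == p.name and incident.key == key
            in_window = alert.timestamp - incident.started_at <= p.window_seconds
            if same_group and in_window:
                incident.alerts.append(alert)
                return incident
    # No existing incident fits: start a new one (uncorrelated if no pattern matched).
    if matches:
        p, key = matches[0]
        incident = Incident(p.name, key, alert.timestamp, [alert])
    else:
        incident = Incident(None, (), alert.timestamp, [alert])
    incidents.append(incident)
    return incident


patterns = [Pattern("load-by-cluster", ("cluster",), 1800, ("load",))]
incidents = []
stream = [
    Alert({"host": "db-1", "cluster": "mysql-prod", "check": "load"}, 0),
    Alert({"host": "db-2", "cluster": "mysql-prod", "check": "load"}, 600),
    Alert({"host": "web-1", "cluster": "web", "check": "load"}, 700),
    Alert({"host": "db-3", "cluster": "mysql-prod", "check": "load"}, 4000),
]
for alert in stream:
    correlate(alert, patterns, incidents)
```

In this run the first two `mysql-prod` alerts share one incident, the `web` alert gets its own, and the late `db-3` alert falls outside the 30-minute window and opens a third incident.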
Correlation patterns are high-level definitions that determine how alerts are clustered into BigPanda incidents. To increase the effectiveness of your alert correlation, you can customize the correlation pattern definitions in BigPanda based on the structure and processes of your company's production infrastructure. For example, you can create patterns that correlate:
- Network-related connectivity issues within the same data center.
- Application-specific checks on the same host.
- Load-related alerts from multiple servers in the same database cluster.
- Low memory alerts on a distributed cache.
To learn more about defining and managing correlation patterns, see our Working with Correlation Patterns guide.
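As a rough illustration, the four example patterns listed above could be written down as data. The structure below (a list of tags to correlate on, a time window, and an optional filter) and every tag name and value in it are assumptions made for this sketch; the actual definitions are created in the BigPanda UI as described in the guide.

```python
# Hypothetical pattern definitions; tag names, values, and windows are illustrative only.
PATTERNS = [
    {"name": "network-by-datacenter", "tags": ["datacenter"],
     "window_minutes": 30, "filter": {"category": "network"}},
    {"name": "app-checks-by-host", "tags": ["host", "application"],
     "window_minutes": 60, "filter": {}},
    {"name": "load-by-db-cluster", "tags": ["cluster"],
     "window_minutes": 30, "filter": {"check": "load"}},
    {"name": "low-memory-by-cache", "tags": ["cache_cluster"],
     "window_minutes": 20, "filter": {"check": "low_memory"}},
]


def matching_patterns(alert_tags, patterns=PATTERNS):
    """Return the names of patterns whose filter matches and whose tags are all present."""
    names = []
    for pattern in patterns:
        filter_ok = all(alert_tags.get(k) == v for k, v in pattern["filter"].items())
        tags_ok = all(tag in alert_tags for tag in pattern["tags"])
        if filter_ok and tags_ok:
            names.append(pattern["name"])
    return names


db_alert = {"host": "db-3", "cluster": "mysql-prod", "check": "load"}
app_alert = {"host": "web-1", "application": "checkout", "check": "http"}
db_matches = matching_patterns(db_alert)
app_matches = matching_patterns(app_alert)
```

With these definitions, the database load alert matches only `load-by-db-cluster` and the application check matches only `app-checks-by-host`.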
In the example, six different nodes in the same MySQL cluster began sending load-related alerts within 29 minutes of each other. Some of the nodes recovered momentarily before returning to a critical state, while others remained critical or in a warning state. BigPanda grouped more than 75 critical and recovery events into a single incident and displays them together in a timeline that makes it easy to spot which alerts occurred first and which nodes alerted later.
In this example, two different alerts came in for the same application. The alerts then proceeded to resolve and reopen in rapid succession. When an alert is changing states frequently, or flapping, it may generate numerous events that are not immediately actionable. BigPanda grouped each of these potential notifications into one ongoing incident to maximize visibility without overwhelming your team with duplicate notifications.
To learn more about when incidents are resolved, reopened, or considered in a flapping state, see our Incident Life Cycle Logic documentation.
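One simple way to picture flapping is to count how often an alert changes status within a recent time window. The sketch below is only an illustration: the one-hour window, the threshold of four transitions, and the function names are assumptions, and the rules BigPanda actually applies are the ones described in the Incident Life Cycle Logic documentation.

```python
def count_transitions(events, window_seconds, now):
    """Count status changes among events that fall within the last window_seconds."""
    recent = [status for ts, status in sorted(events) if now - ts <= window_seconds]
    return sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)


def is_flapping(events, window_seconds=3600, max_transitions=4, now=None):
    """Treat an alert as flapping once it reaches max_transitions within the window."""
    if now is None:
        # Default to the time of the most recent event.
        now = max((ts for ts, _ in events), default=0)
    return count_transitions(events, window_seconds, now) >= max_transitions


# (timestamp in seconds, status) pairs for two hypothetical alerts.
noisy = [(0, "critical"), (60, "ok"), (120, "critical"), (180, "ok"), (240, "critical")]
steady = [(0, "critical"), (600, "critical"), (1200, "ok")]
noisy_flapping = is_flapping(noisy)
steady_flapping = is_flapping(steady)
```

The first alert changes status four times in four minutes and is flagged, while the second changes status once and is not.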