Correlation Best Practices
Learn best practices for creating correlation patterns in BigPanda.
Correlation allows your team to easily cut through unnecessary clutter and visual noise with event aggregation and alert groupings. By combining related alerts across tools and systems, correlation helps you understand the scope and impact of issues faster.
Correlation patterns are high-level definitions that determine how alerts are clustered into BigPanda incidents. To increase the effectiveness of your alert correlation, you can customize the correlation pattern definitions in BigPanda based on the structure and processes of your company's production infrastructure.
As new alerts are received, BigPanda evaluates all matching patterns and determines whether to update an existing incident or create a new one. With this algorithm, BigPanda can correlate alerts accurately and reduce your monitoring noise by as much as 90 to 99%. Correlation occurs in under 100 ms, so you see updates in real time.
For example, you can create patterns that correlate:
- Network-related connectivity issues within the same data center.
- Application-specific checks on the same host.
- Load-related alerts from multiple servers in the same database cluster.
- Low memory alerts on a distributed cache.
This guide uses the functional and contextual tags from the Data Normalization Best Practices and organizes the newly standardized metadata into groupings based on your most critical services and applications.
Correlation basics
Looking for just the basics? Check out the Correlation one-page document in BigPanda University!
Benefits of Correlation
Alert correlation provides the following key benefits:
- Reduction in alert and incident noise, allowing for a consolidated view of related impacts
- Reduction in tickets, because alerts no longer map one-to-one to tickets
- Better understanding of cross-source relationships and workflows
Key Principles of Correlation
- Good data normalization and enrichment are prerequisites for successful correlation. If you have not completed those steps, we recommend you start there first. For more information, check out the Data Normalization Best Practices.
- Your business work streams, operational processes, and operational maturity set the requirements for correlation. Understanding how teams operate today and building that information into tags is vital to proper correlation.
- Understanding alerting behavior from your monitoring sources helps set the proper configuration for correlation, including source system, tags, time window, and filter conditions written in BigPanda Query Language (BPQL).
- Consider infrastructure, network, and application relationships, then identify common alerting metadata to correlate on.
Best Practice and Tips
- Use a minimal number of powerful correlation patterns that cover many use cases. If you find the need to build several filters for patterns, consider building composition tags under a correlation-specific enrichment to meet these requirements. See the Advanced Correlation section for more information.
- Correlation is an evolutionary approach. Iterate upon existing patterns as metadata, enrichment, and use cases change.
- As you build enrichment tags, you can correlate using generic tags such as category_type, alert_type, and parent/child.
- There are typically two types of correlation: pattern-based (general) and rules-based (targeted) correlation. It can be difficult to know when to use certain types of correlation patterns. Here are some tips as you build your correlation strategy:
- We recommend an 80/20 ratio of pattern-based to rules-based correlation patterns. Rules-based correlation is typically built first. As you understand the alert landscape, you’ll want to consolidate to a more generic pattern covering several use cases. This has less overhead and a larger impact.
- Use functional tags (host, check, support_group, datacenter) as the primary correlation tags, since they generalize well. A contextual tag or query can then refine the pattern to focus on workflows unique to your organization.
Actionability
An important concept within monitoring and observability is actionability. Consider what operators or automation should do to meet the service level indicator/objective (SLI/SLO). Your tags will tell the story.
If we use a tag as an actionability utility (make a ticket, chat with someone, page someone in PagerDuty/OpsGenie, send an email) we can then use it to guide the work. This is called a Workflow Tag.
Actionable alerts can be remediated through automation or by a user (L1, L2, SRE, SME, or any other resolver). Knowing how many times an alert was actioned is a useful metric.
Ask yourself what functional and contextual information you need to create a unit of work. Be sure to qualify what is truly actionable. Creating a ticket doesn’t necessarily make an alert actionable. If the ticket is automatically closed or ignored without any meaningful resolution, either the alert's SLI/SLO needs to be adjusted, or the alert should be treated as noise.
Noise is telemetry without adequate metadata. To reduce noise, we can enrich or eliminate low-quality alerts from our ecosystem and focus only on alerts that support monitoring the top applications.
Correlation Pattern Configuration
Correlation can be configured in many ways. This section will focus on implementing simple patterns and then identifying advanced-level correlations.
For more information about how correlation works, see the BigPanda Alert Correlation Logic documentation.
Correlation Properties
The BigPanda alert correlation engine clusters high-quality alerts into actionable incidents by looking at four properties:
- Source system
- Tags
- Time window
- Filter (optional)
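To make the four properties concrete, the following minimal sketch models a pattern as a small data structure. It is an illustration only, not BigPanda's actual schema or API: the class name `CorrelationPattern`, its field names, and the alert dictionary shape are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical alert shape: {"source": str, "start": int, "tags": {name: value}}
Alert = dict


@dataclass(frozen=True)
class CorrelationPattern:
    """Illustrative model of a pattern (not BigPanda's real schema)."""
    source_systems: tuple           # e.g. ("nagios",); an empty tuple means any source
    tags: tuple                     # tag names whose values must match across alerts
    time_window_minutes: int = 120  # maximum spread between alert start times
    filter_fn: Optional[Callable[[Alert], bool]] = None  # optional extra condition

    def applies_to(self, alert: Alert) -> bool:
        """Return True if the alert is eligible for this pattern."""
        if self.source_systems and alert["source"] not in self.source_systems:
            return False
        # An alert can only be correlated on tags it actually carries.
        if any(name not in alert["tags"] for name in self.tags):
            return False
        return self.filter_fn is None or self.filter_fn(alert)


pattern = CorrelationPattern(source_systems=("nagios",), tags=("cluster", "check"))
alert = {"source": "nagios", "start": 0, "tags": {"cluster": "db-1", "check": "load"}}
eligible = pattern.applies_to(alert)
```

In this sketch an alert is eligible only when its source matches, it carries every correlation tag, and it passes the optional filter.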
Source System
The integrated monitoring systems to which the pattern applies. For example, you can set up a pattern to show alerts that come from Nagios, Datadog, etc.
Best practice
- Single-source correlation can be a good way to segregate alerts that need a specific workflow.
- Cross-source correlation is helpful if two different sources monitor components of the same service or host. This should be selected if the alerts will support the same workstream. Common tags could be a shared host, check, or assignment_group.
Tags
Tag names correlate alerts with matching values. For example, enter ‘cluster’ and ‘check’ to correlate all alerts from the same cluster with the same check.
You can enter up to five tags in any order.
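The tag-matching idea can be sketched as grouping alerts by the values of the chosen tags. This is a simplified illustration, not BigPanda's implementation; the function names `correlation_key` and `group_by_tags` are hypothetical, and each alert is represented as a plain dictionary of tags.

```python
from collections import defaultdict


def correlation_key(alert_tags, pattern_tags):
    """Build a grouping key from the pattern's tag values, or None if a tag is missing."""
    if any(name not in alert_tags for name in pattern_tags):
        return None
    # Sorting the tag names makes the key independent of the order they were entered in.
    return tuple((name, alert_tags[name]) for name in sorted(pattern_tags))


def group_by_tags(alerts, pattern_tags):
    """Group alerts whose values match on every tag in the pattern."""
    groups = defaultdict(list)
    for alert in alerts:
        key = correlation_key(alert, pattern_tags)
        if key is not None:
            groups[key].append(alert)
    return dict(groups)


alerts = [
    {"host": "db-1a", "cluster": "db-1", "check": "load"},
    {"host": "db-1b", "cluster": "db-1", "check": "load"},
    {"host": "db-2a", "cluster": "db-2", "check": "load"},
    {"host": "web-1", "check": "load"},  # no cluster tag, so it is not correlated
]
groups = group_by_tags(alerts, ["cluster", "check"])
```

Here the two `db-1` alerts land in one group, the `db-2` alert forms its own group, and the alert without a `cluster` tag is left out.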
Time Window
The time window is the maximum amount of time allowed between the start times of alerts for them to be correlated. For example, network-related alerts may start within a short time of one another, while load issues may develop over a longer period of time.
Best practice
Understanding alert trigger behaviors from various monitoring sources can guide time windows. BigPanda sets the default window to 120 minutes (2 hours).
Consider the following scenarios when setting the time window:
- Known failure intervals: Some alerting will require very specific groupings. If a known subset of alerts should be grouped, set the appropriate time window.
- Noisy alerts: For services with many alerts and high volume, you could reduce the default to a 30-minute or even 5-minute time window based on how often the alert triggers.
- Long-interval alerts: If an alert triggers once every 24 hours, a specific correlation pattern can be set to a 1440-minute (24-hour) time window.
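The effect of the window size can be sketched with a small grouping routine. It assumes the window is measured from the start of the first alert in each group, which is a simplification for illustration and may not match BigPanda's exact logic; `split_by_time_window` is a hypothetical helper.

```python
def split_by_time_window(start_minutes, window_minutes=120):
    """Group alert start times (in minutes) that fall within the window
    of the first alert in the current group. Assumption: the window is
    anchored at the first alert's start time."""
    groups = []
    for start in sorted(start_minutes):
        if groups and start - groups[-1][0] <= window_minutes:
            groups[-1].append(start)
        else:
            groups.append([start])
    return groups


starts = [0, 10, 100, 130, 400]
default_window = split_by_time_window(starts)      # 120-minute window
short_window = split_by_time_window(starts, 30)    # 30-minute window for noisy alerts
```

With the 120-minute default the example yields `[[0, 10, 100], [130], [400]]`, while the 30-minute window yields `[[0, 10], [100, 130], [400]]`, showing how a shorter window splits a slow-developing issue into more groups.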
Filter (optional)
The filter defines conditions that further refine which alerts the pattern applies to. For example, you could correlate only network-related alerts by data center.
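The data center example can be sketched as a condition applied before grouping. In BigPanda the condition is written as a query; the Python predicate below only stands in for that query and does not reproduce its syntax. The names `correlate_with_filter` and `is_network` are hypothetical.

```python
from collections import defaultdict


def correlate_with_filter(alerts, tag, condition):
    """Group alerts by one tag, keeping only alerts that satisfy the condition."""
    groups = defaultdict(list)
    for alert in alerts:
        if condition(alert) and tag in alert:
            groups[alert[tag]].append(alert)
    return dict(groups)


def is_network(alert):
    """Stand-in for a filter query that selects network-related alerts."""
    return alert.get("category_type") == "network"


alerts = [
    {"host": "sw01", "datacenter": "nyc", "category_type": "network"},
    {"host": "sw02", "datacenter": "nyc", "category_type": "network"},
    {"host": "db01", "datacenter": "nyc", "category_type": "db_instance"},
    {"host": "sw09", "datacenter": "lon", "category_type": "network"},
]
groups = correlate_with_filter(alerts, "datacenter", is_network)
```

The database alert in `nyc` is excluded by the filter, so only network alerts are grouped per data center.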
Simple Correlation
Functional tags can be used in simple correlation immediately. However, we suggest keeping these patterns minimal and avoiding over-correlating.
Generally, you shouldn’t need to set up a blanket correlation pattern. There are times when correlating on one functional tag will be more confusing than helpful.
We recommend starting simple. However, these patterns are not “set and forget.” Focus on continuing conversations with your team and making updates as your processes change.
Examples of simple correlations:
- Single tag correlation: host, site or location, ip_class, or application/service
- Integration correlation: Alerts from a specific integration could be correlated, especially if it only monitors one thing. However, you will still need a single tag to correlate on. For example, you could do an integration correlation with a timestamp tag based on set time intervals.
- Assignment group correlation: This is based only on groups who want to see their alerts grouped into an ITSM ticket. This can be paired with a functional tag such as host, ci, location, or an application the team owns. Using this can narrow the scope of the correlation grouping, so we recommend you use this for smaller subsets of alerting that require a very accurate assignment group.
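For the integration correlation example above, a timestamp tag based on set intervals could be derived roughly as follows. This is a sketch of the idea only: the helper `time_bucket_tag` and the 15-minute default are assumptions, and in practice the tag would be produced by your enrichment configuration.

```python
from datetime import datetime, timezone


def time_bucket_tag(epoch_seconds, interval_minutes=15):
    """Round an alert timestamp down to the start of its interval and
    return it as a tag value, so alerts in the same interval share a value."""
    interval = interval_minutes * 60
    bucket_start = epoch_seconds - (epoch_seconds % interval)
    return datetime.fromtimestamp(bucket_start, tz=timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


first = time_bucket_tag(1700000000)        # alert at 22:13:20 UTC
second = time_bucket_tag(1700000000 + 60)  # alert one minute later
same_bucket = first == second
```

Two alerts that start in the same 15-minute interval receive the same tag value and can therefore be correlated on it. Note that alerts on either side of an interval boundary will not share a value, even if they are seconds apart.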
Advanced Correlation
When creating patterns, you are building rules based on your functional and contextual tags that describe what is happening and what actions need to be taken. We recommend using your business goals and high-level services to guide groupings of related alert subsets.
Reading the grouped data can quickly tell a story about what is happening. You should be able to easily understand what the alert means and how to begin solving the issue.
Tribal knowledge from operators and artifacts like runbooks and metadata within contextual tags can also be transferred into correlation patterns. Accurately using this information marks successful contextual correlation.
Here are some examples of advanced correlations to group workflows:
location_type
location_type is a regional-based correlation that can be useful for tracking geographic impacts.
Example tags include location, datacenter, site, and interconnect.
category_type
category_type can loosely follow the Open Systems Interconnection (OSI) model to segregate tags by layer.
Example tags include network, infrastructure, db_instance, etc.
alert_type / check_type
alert_type or check_type can be useful for rule-based system checks. A common example of this is a specific type of alerting within a category tag such as disk, cpu, memory, firewall, loadbalancer, BGP, OSPF, etc.
For example, if you have CPU utilization that is important for a specific set of servers, you can use the query condition to filter on specific server types. Or perhaps storage disk-all is more critical to watch. A composition tag under check_type can help keep a library of important check types.
We recommend populating check_type through an extraction/composition mapping to account for various use cases, and then using check_type as an option for correlation.
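The logic of such a mapping can be sketched as an ordered list of rules that derive check_type from the raw check name. The rules and the function `extract_check_type` below are hypothetical examples; the real mapping is configured in your enrichment, not written in Python.

```python
import re

# Hypothetical rule library: the first matching rule wins.
CHECK_TYPE_RULES = [
    (re.compile(r"cpu", re.IGNORECASE), "cpu"),
    (re.compile(r"disk|filesystem", re.IGNORECASE), "disk"),
    (re.compile(r"mem", re.IGNORECASE), "memory"),
    (re.compile(r"bgp", re.IGNORECASE), "bgp"),
]


def extract_check_type(check):
    """Return the normalized check_type for a raw check name, or None if no rule matches."""
    for pattern, check_type in CHECK_TYPE_RULES:
        if pattern.search(check):
            return check_type
    return None


cpu_type = extract_check_type("CPU Utilization High")
disk_type = extract_check_type("storage disk-all")
unknown = extract_check_type("Ping")
```

Keeping the rules in one place mirrors the idea of a library of important check types: differently named checks from different sources collapse into a small set of values that a single pattern can correlate on.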
parent/child
parent/child is a relational correlation. Complete enrichment is required for this to work well.
A common use case for this correlation type is when a network device goes down, and downstream impacts occur. In this situation, you need to have a tag that shows commonality to the device that affects the downstream alerting. Knowing root cause information helps guide strategy around this type of correlation when determining which tag to use.
For example, you could use router = device01 for the network router. Adding the router tag to enrich alerts from downstream objects ties together all servers or dependencies and provides a way to group all related devices or objects.
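The router example can be sketched as an enrichment step followed by grouping on the new tag. The topology table `UPSTREAM_ROUTER` and both helper functions are hypothetical; in practice the relationship would come from your CMDB or enrichment maps.

```python
from collections import defaultdict

# Hypothetical topology: each host, and the router itself, maps to its upstream router.
UPSTREAM_ROUTER = {
    "device01": "device01",
    "web01": "device01",
    "web02": "device01",
    "db07": "device02",
}


def enrich_with_router(alert):
    """Return a copy of the alert with a router tag added when the host is known."""
    enriched = dict(alert)
    router = UPSTREAM_ROUTER.get(alert.get("host"))
    if router is not None:
        enriched["router"] = router
    return enriched


def group_by_router(alerts):
    """Group enriched alerts by router, listing the affected hosts per router."""
    groups = defaultdict(list)
    for alert in map(enrich_with_router, alerts):
        if "router" in alert:
            groups[alert["router"]].append(alert["host"])
    return dict(groups)


alerts = [
    {"host": "device01", "check": "ping"},
    {"host": "web01", "check": "http"},
    {"host": "web02", "check": "http"},
    {"host": "db07", "check": "replication"},
]
incidents = group_by_router(alerts)
```

The router's own alert and the alerts from the servers behind it end up in one group, which is why complete enrichment matters: a host missing from the table would not be tied to its parent.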
Validate Correlation Patterns
Validating correlation patterns ensures that the groupings created in BigPanda are accurate. This step is short, but should happen on a continuous basis.
Be sure to get the right stakeholders to review this, such as management, user groups, technical teams, and leadership. Service group owners need to provide feedback on how the correlation is working. Determine what needs to be changed and iterate upon the existing configurations.
Best practices
- Use the preview pane and adjust the time window, tags, and filter queries to understand how the correlation will work with the pattern in place.
- Use Split/Merge as a direct feedback mechanism in the tool, and review the results in a dashboard.
Configuration Drift
As you change monitoring tools, alert/CMDB enrichments, and naming conventions, make sure you account for any correlation patterns that rely on tags produced by them.
Best practice
Audit your patterns regularly and keep them up to date. When tags or alerts are removed from your environment, remove them from your correlation patterns as well.
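One way to approach such an audit is to compare the tags each pattern relies on with the tags that still appear on recent alerts. The sketch below assumes you can export both lists; `find_stale_tags`, the pattern names, and the data are hypothetical.

```python
def find_stale_tags(patterns, observed_tags):
    """Return {pattern name: [tags no longer seen on alerts]} for patterns that have drifted."""
    observed = set(observed_tags)
    stale = {}
    for name, tags in patterns.items():
        missing = sorted(set(tags) - observed)
        if missing:
            stale[name] = missing
    return stale


# Hypothetical export of pattern definitions and of tags seen on recent alerts.
patterns = {
    "network-by-datacenter": ["datacenter", "category_type"],
    "app-by-host": ["host", "application"],
}
observed = {"host", "check", "datacenter", "category_type", "service"}
report = find_stale_tags(patterns, observed)
```

In this example the `application` tag was renamed to `service`, so the `app-by-host` pattern is flagged for review.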
Unified Analytics
BigPanda provides dashboards to help measure the effectiveness of your correlation patterns. See the Correlation Patterns Dashboard documentation to learn how to measure your correlation effectiveness in Unified Analytics.
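As a rough illustration of what effectiveness means here, noise reduction can be expressed as the share of alerts that were absorbed into incidents. This formula is an assumption for illustration; the dashboard's own metric definitions may differ.

```python
def noise_reduction(alert_count, incident_count):
    """Fraction of alerts that did not become their own incident."""
    if alert_count <= 0:
        raise ValueError("alert_count must be positive")
    return (alert_count - incident_count) / alert_count


reduction = noise_reduction(1000, 50)  # 1,000 alerts correlated into 50 incidents
```

Under this definition, 1,000 alerts correlated into 50 incidents corresponds to a 95% reduction, and a value of 0 means every alert became its own incident.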