Root Cause Changes (RCC)

BigPanda’s Root Cause Changes (RCC) feature highlights changes related to incidents, helping you find and fix change-related incidents faster.

Identifying the root cause of an outage or a poorly performing application is one of the biggest challenges that IT organizations face today.

Most enterprises experience thousands of incidents every week. IT Ops, NOC and DevOps teams must be able to quickly understand how each incident is impacting the business and prioritize response accordingly, before users and customers are affected. However, because operational and business context is often missing from monitoring data, operators must manually search for context before taking action, a process that wastes precious time and is prone to human error.

  • Modern IT environments can experience thousands of changes every week
  • “Over 85% of outages impacting mission-critical services can be traced back to changes” - Gartner
  • Code or configuration changes can account for over 50% of an organization’s incidents
  • The relationship between a change and its effect is often indirect; even domain experts might only guess at the right cause
  • Manually investigating changes related to an incident is often the longest step in detecting the root cause of an incident

To help combat this struggle, BigPanda provides a single pane of glass for Ops teams to view, manage, and triage incidents, complete with in-line root cause change suggestions.

BigPanda’s Root Cause Changes capability uses Open Box Machine Learning (OBML) to mark the “most likely” suspects right in the UI, helping teams identify changes in infrastructure and applications. By pinpointing the root cause of incidents and outages in real time, BigPanda helps enterprises and their IT Ops, NOC and DevOps/SRE teams rapidly investigate and resolve those incidents and outages.

The Changes Tab

Key Features

  • Integration - Funnel all your change integrations into BigPanda's Open Integrations Hub to see all your changes organized and correlated in one place.
  • Visualization - See a consolidated list of all the system changes related to each incident.
  • Correlation - Use BigPanda's OBML or manually correlate changes to incidents to enable Root Cause Analysis.
  • Collaboration - Collaborate with other users to investigate which change is the Root Cause of the incident.

BigPanda’s Root Cause Changes feature streamlines the cause investigation process for your team, shortening the troubleshooting phase of incident resolution. By giving your team instant visibility and easy collaboration, RCC reduces MTTR for change-related incidents.

How It Works

BigPanda integrates with your change feeds, such as CI/CD pipelines, Change Management tools, auditing systems, and orchestration tools, to collect change data. Change data such as managed changes, code deployments, software updates, configuration changes, upgrades, and more is stored and organized into the Related Changes table within the Incidents tab. Changes and incidents are updated systematically so that the changes related to each incident remain current.

Change data is normalized with searchable, correlatable tags. By bringing change data together from across the different layers of your environment, BigPanda helps your Ops teams get visibility on the system as a whole.
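As an illustration of what normalizing change data into shared tags can look like, here is a minimal Python sketch. It is not BigPanda's implementation: the `FIELD_ALIASES` mapping, the `normalize_change` function, and the sample field names are all hypothetical, chosen only to show how records from two different tools could end up with the same searchable tags.

```python
# Hypothetical mapping from source-specific field names to shared tag names.
# These aliases are assumptions for the example, not BigPanda configuration.
FIELD_ALIASES = {
    "machine": "host",
    "hostname": "host",
    "server": "host",
    "svc": "service",
    "application": "service",
}


def normalize_change(raw: dict) -> dict:
    """Return tags with lowercase names, aliased keys, and lowercase values."""
    tags = {}
    for key, value in raw.items():
        name = key.strip().lower()
        name = FIELD_ALIASES.get(name, name)
        tags[name] = str(value).strip().lower()
    return tags


# Two made-up change records describing the same target in different shapes.
pipeline_change = {"Machine": "DB-01", "svc": "Billing"}
ticket_change = {"hostname": "db-01", "Application": "billing"}

# After normalization both records carry identical tags.
print(normalize_change(pipeline_change) == normalize_change(ticket_change))  # True
```

Once records share tag names and value formatting, a single search or correlation rule can cover changes from every source.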

Once integrated with all your change feeds/tools, BigPanda's OBML (Open Box Machine Learning) algorithms detect connections between changes made to the system and incidents in real-time, identifying changes that may have caused the outage.
Changes that are correlated strongly enough to imply causation are floated up onto the Incident Overview as suggested related changes, with a comment from BigPanda explaining why the change was suggested.

Team members can review change data and investigate root cause right in BigPanda, marking changes as matches and collaborating with their team using BigPanda’s deep integrations and sharing capabilities.

By automatically flagging suspect changes, so that operators no longer need to sift through changes manually and guess at the root cause, RCC helps enterprises and their IT Ops, NOC and DevOps/SRE teams rapidly investigate and resolve incidents and outages, reducing MTTR.

BigPanda Machine Learning

At its core, BigPanda’s Root Cause Analysis relies on pattern recognition.

The Root Cause Analysis algorithms run calculations on key connections between incidents and changes, including:

  • Categories: The machine learning engine sorts alerts and changes into matching categories based on specific keys and values
  • Time Factor: Each change is evaluated on whether it occurred before or during an incident start, and how closely the timelines match
  • Alerts Coverage: All of the alerts in an incident are weighed to see how many of the alerts match the change

In complex modern systems, the root cause may be tied to unexpected systems or architecture, so BigPanda uses a dual-pronged algorithm to help you spot even the most unusual root cause changes.
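To make the Time Factor and Alerts Coverage ideas concrete, the following Python sketch scores a single change against an incident. It is an assumption-laden illustration, not the actual BigPanda calculation: the function names, the 24-hour window, the linear decay, and the rule that a change starting after the incident scores zero are all hypothetical.

```python
from datetime import datetime, timedelta


def time_factor(change_time, incident_start, window=timedelta(hours=24)):
    """Score 1.0 for a change right at incident start, decaying linearly
    to 0.0 for a change `window` earlier. Changes after the incident
    start, or older than the window, score 0.0 (an assumption)."""
    gap = incident_start - change_time
    if gap < timedelta(0) or gap > window:
        return 0.0
    return 1.0 - gap / window


def alerts_coverage(change_tags, alerts):
    """Fraction of alerts that share at least one tag value with the change."""
    if not alerts:
        return 0.0
    change_values = set(change_tags.values())
    hits = sum(1 for alert in alerts if change_values & set(alert.values()))
    return hits / len(alerts)


# Made-up data: a change 12 hours before the incident, matching 2 of 3 alerts.
change = {"machine": "db-01"}
alerts = [
    {"host": "db-01"},
    {"host": "db-02"},
    {"host": "db-01", "check": "cpu"},
]
print(time_factor(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 12, 0)))  # 0.5
print(round(alerts_coverage(change, alerts), 2))  # 0.67
```

A real scoring model would combine these factors with category matching and tuned weights; the sketch only shows the shape of each individual signal.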

Text-Based and Causal Algorithms

BigPanda’s RCC feature looks for related changes using two different algorithms: text-based and causal. These algorithms are built to complement and strengthen each other.

  • Text Similarity - A customizable algorithm that looks for specific matches in the text or tags of a change and incident
  • Causality - A machine learning algorithm that uses historical data to discover if-then connections beyond simple matching

| | Text Similarity | Causality |
|---|---|---|
| Configurable? | Yes | No |
| How it works | Changes/Alerts are broken down into tokens, which are compared to find exact matches | Creates incident type-clusters from Change and Alert data, then compares the behavior of clusters over time |
| Works best when | There are a lot of details and specifics in the Change and the Alert tags | There is a large amount of Change/Alert data and user-marked Matches |
| Timing | Algorithm can be manually adjusted at any time through BigPanda Support | Type-clusters are adjusted once a month using the previous 3 months of data |
| Example description | The matcher ‘XYZ1’ has similar text on both the incident tag ‘host’ and the change tag ‘machine’ | Changes with ‘VIP’ in the description often cause incidents with ‘heartbeat’, ‘center’ in the description |

Together, the two algorithms ensure that all potentially related changes are caught and highlighted for your Ops team.

BigPanda is configured to suggest up to 2 related changes per algorithm, but only changes that are highly correlated will be suggested. An incident may have 0-4 recommendations at any time.
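The limit described above (at most 2 suggestions per algorithm, and only for highly correlated changes) can be sketched as a simple filter. This is a hedged illustration: the `pick_suggestions` function, the tuple layout, and the 0.8 threshold are hypothetical and do not reflect BigPanda's actual scoring scale.

```python
def pick_suggestions(scored, threshold=0.8, per_algorithm=2):
    """Keep the top `per_algorithm` changes for each algorithm, dropping
    anything below `threshold`.

    scored: list of (algorithm, change_id, score) tuples (assumed layout).
    """
    picked = []
    for algorithm in ("text", "causal"):
        candidates = [s for s in scored if s[0] == algorithm and s[2] >= threshold]
        candidates.sort(key=lambda s: s[2], reverse=True)
        picked.extend(candidates[:per_algorithm])
    return picked


# Made-up scores: three strong text matches and one weak causal match.
scored = [
    ("text", "CHG1", 0.95),
    ("text", "CHG2", 0.85),
    ("text", "CHG3", 0.90),
    ("causal", "CHG4", 0.50),
]
print(pick_suggestions(scored))
# [('text', 'CHG1', 0.95), ('text', 'CHG3', 0.9)]
```

With two algorithms and a cap of two each, the result always holds between 0 and 4 suggestions, which matches the range stated above.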

Examples of Suggested Related Changes

Text-Based Suggestion

Deep enrichment allows BigPanda to pull dozens of alert tags and metadata together into one incident. For a human operator, finding a single matching value between changes and incidents is a time-consuming and tedious process.

This is where the text-based algorithm comes in.

Sample Text-Based Suggestion

In this example, the incident was enriched with Configuration Item (CI) data from the CMDB. This CI value was also found in the change information - meaning this change was affecting the exact configuration item that was now encountering trouble.

As changes occurring on the same item or system are likely linked, the algorithm highlighted this as a suspected change.
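A token comparison of this kind can be sketched in a few lines of Python. The sketch below assumes a very simple tokenizer and exact token equality; the tag names (`ci`, `configuration_item`), the sample values, and both functions are hypothetical and stand in for whatever matchers are actually configured.

```python
import re


def tokenize(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_tokens(incident_tags, change_tags):
    """Return {(incident_tag, change_tag): shared_tokens} for every pair of
    tags whose values have at least one token in common."""
    matches = {}
    for i_name, i_value in incident_tags.items():
        for c_name, c_value in change_tags.items():
            shared = tokenize(i_value) & tokenize(c_value)
            if shared:
                matches[(i_name, c_name)] = shared
    return matches


# Made-up incident and change that reference the same configuration item.
incident = {"ci": "xyz1.prod", "check": "disk full"}
change = {"configuration_item": "XYZ1", "summary": "Resize volume"}
print(matching_tokens(incident, change))
# {('ci', 'configuration_item'): {'xyz1'}}
```

Note that the match is found even though the tag names differ, because the comparison runs on tag values rather than on tag names.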

Causal Suggestion

In the modern integrated world of computing, a change in one system can trigger effects down the line through multiple other systems, eventually causing problems on what initially appear to be unrelated systems.

This is where the causality algorithm excels.

Sample Causal Suggestion

In this example, the incident appeared to be caused by an expired security certificate. In the past, a similar incident had been traced to a change that updated the GitLab registry.

Based on the past match, the algorithm looked through recent changes to see if there had been another change updating the GitLab registry. There had been, and the algorithm highlighted this as a suspected change.
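The idea of learning from past matches can be illustrated with a much simpler stand-in than the cluster-based algorithm described earlier: a keyword co-occurrence lookup. The Python sketch below is only that stand-in. The functions, the keyword sets, and the change IDs are hypothetical, and the real causality algorithm works on type-clusters rather than raw keyword pairs.

```python
from collections import Counter


def learn_pairs(history):
    """Count how often a change keyword appeared with an incident keyword
    in past user-marked matches.

    history: list of (change_keywords, incident_keywords) set pairs.
    """
    pairs = Counter()
    for change_kw, incident_kw in history:
        for c in change_kw:
            for i in incident_kw:
                pairs[(c, i)] += 1
    return pairs


def suspects(pairs, incident_kw, recent_changes, min_count=1):
    """Return IDs of recent changes whose keywords previously co-occurred
    with any of the incident's keywords at least `min_count` times."""
    found = []
    for change_id, change_kw in recent_changes.items():
        if any(pairs[(c, i)] >= min_count for c in change_kw for i in incident_kw):
            found.append(change_id)
    return found


# One past match: a registry update that preceded a certificate incident.
history = [({"gitlab", "registry"}, {"certificate", "expired"})]
pairs = learn_pairs(history)

recent = {
    "CHG-1": {"gitlab", "registry", "update"},
    "CHG-2": {"dns", "ttl"},
}
print(suspects(pairs, {"certificate", "expired"}, recent))  # ['CHG-1']
```

The change and the incident share no text at all, so a text-based matcher would miss this pair; only the historical link surfaces it.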

Once enabled, both algorithms run together to help you find the root cause of even the most complex issues.

To learn more about how to use BigPanda’s Root Cause Changes feature, see the Correlating Changes With Incidents documentation.


Recommended Reading

To learn more about working with the Related Changes section and to see relevant integrations, see: