Glossary
Reserved words
Terms with an asterisk (*) in the title are reserved for system use and cannot be changed or redefined for custom enrichment.
A
- Actionable incident
An Actionable Incident is an incident that contains high-quality alerts enriched with both technical and business context.
Unified Analytics uses the following criteria to determine if an incident is actionable:
Incident was explicitly defined as actionable using the bp_v_actionable tag
Incident was enriched with business context using the bp_v_business_segment tag
Incident was acted upon
Incident was not defined as noise using the bp_v_alert_noise tag
The default field for actionable incidents is bp_v_actionable. See Unified Analytics Key Metrics for more information.
- Actioned Incident
An incident is considered actioned in BigPanda when it has had one of the following actions taken on it:
Comment
Assignment to a user
Manual share
Automated share
As actioned incidents represent outages and system issues that your team was able to assign, share, and resolve, they are a key metric in determining the efficacy of BigPanda configuration and workflows.
For organizations on a consumption pricing model, actioned incidents are one of the two credit-metrics. As BigPanda functionality continues to improve and evolve, additional actions may apply in the future. See the Usage Data Dashboard documentation for more information.
- Activity (Action)
In BigPanda reporting, an activity is any user action performed on an incident in the BigPanda console. Activities include automated actions like AutoShare.
Activities include:
Assign
Comment
Share
Merge
Split
Snooze
Manual-resolve
Entering the flapping state
- Alert
An alert is the combined life cycle of a single system issue.
Monitoring tools generate events when potential problems are detected in your infrastructure. Over time, status updates and repeat events may occur from the same system issue. In BigPanda, raw event data is merged into a singular alert so that you can visualize the life cycle of a detected issue over time.
For example, a CPU load alert may start with a warning event, then increase in severity with a critical event, and finally get resolved with a resolution event. All three of these events will be merged into a single alert. Common events that are sent to BigPanda include “CPU > 95% for more than 5 minutes” and “Port X on Router ABC down.”
BigPanda correlates related alerts into incidents for visibility into high-level, actionable problems.
Alert terminology
Some monitoring tools refer to ‘events’ as ‘alarms’ or ‘alerts.’ In BigPanda documentation ‘alert’ is always used to refer to the complete lifecycle of an event.
- Alert quality
Alert quality is the categorization of alerts by applying concrete rules to check for defined attributes contributing to actionability.
High Quality - alerts meet criteria for high actionability by support teams, meaning that technical, business context data and resolution steps are included.
Ownership and routing to the assignment group who should respond
Business impact of the alert, which can be expressed as priority level, application tier, etc.
Runbooks and URLs on how the alert should be resolved
Dependency to understand which services and applications are impacted
Enrichment
For an alert to be high-quality, it must include ownership and routing information, business impact, and at least one of runbooks, dependency, or enrichment context.
Medium Quality - alerts indicate the minimum level of information and context within alerts to support operator action, while lacking some valuable elements such as business context, dependencies or resolution steps. For an alert to be considered medium quality, it must include both:
The configuration item (CI)
Symptom of the problem (Check)
Low Quality - alerts are either misconfigured or lack meaningful information required to support any action by the response team. They present overhead without value.
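The quality rules above can be sketched as a small classifier. This is only an illustration, not BigPanda's implementation: the tag names (`ownership`, `business_impact`, `runbook`, `dependency`, `enrichment`, `ci`, `check`) are hypothetical placeholders, and checking the high-quality rule first is an assumption.

```python
def classify_alert_quality(tags: dict) -> str:
    """Return "high", "medium", or "low" for an alert's tags.

    All tag names used here are hypothetical placeholders.
    """
    def has(name: str) -> bool:
        # A tag counts only if it is present and non-empty.
        return bool(tags.get(name))

    # High: ownership/routing AND business impact AND at least one of
    # runbooks, dependency, or enrichment context.
    if (
        has("ownership")
        and has("business_impact")
        and (has("runbook") or has("dependency") or has("enrichment"))
    ):
        return "high"

    # Medium: both the configuration item (CI) and the symptom (check).
    if has("ci") and has("check"):
        return "medium"

    # Low: anything else.
    return "low"


quality = classify_alert_quality(
    {"ownership": "netops", "business_impact": "P1", "runbook": "https://example.com/runbook"}
)
```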
For more information, see the Unified Analytics Key Metrics documentation.
- Alert_updates*
New updates only
The alert_updates tag is a new feature as of May 20th, 2024. Only alert updates received after that date will be counted in the tag.
In BigPanda, raw event data from the same system issue is marshaled into a singular alert so that you can visualize the life cycle of a detected issue over time. BigPanda’s built-in deduplication process intelligently parses these incoming raw events to reduce noise. Exact duplicates are filtered out of the UI, but updates to existing alerts are accumulated rather than creating a brand-new alert.
The alert_updates tag tracks the number of updates made to an alert between its creation and the latest update. This tag gives you visibility into the number of events merged into a single alert, and allows you to prioritize incidents based on update frequency. The alert_updates tag is also included in AutoShares.
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
- API
Application Programming Interfaces (APIs) are software intermediaries that allow applications to talk to each other. BigPanda has several APIs available that allow you to integrate with external tools and manage incidents and BigPanda elements in bulk. They are core tools for self-service driven customers, and empower custom solutions and deep 2-way integrations.
BigPanda API specifications can be found in the API Reference.
With each request to the BigPanda API, you must include an HTTP header with the authentication token for your organization. BigPanda APIs use two different types of authentication tokens: an organization-wide bearer token or a user-specific API key.
The Alerts API allows you to build a custom integration between BigPanda and your monitoring system. Monitoring systems generally send out events when problems are detected and when problems have been resolved (fixed).
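As a minimal sketch of the authentication header described above, the following Python snippet builds (but does not send) a request that carries a bearer token. The endpoint URL, the `app_key` field, and the payload values are assumptions for illustration; confirm the actual values in the API Reference.

```python
import json
import urllib.request

# Assumed endpoint and placeholder credentials; confirm in the API Reference.
API_URL = "https://api.bigpanda.io/data/v2/alerts"
ORG_TOKEN = "<YOUR_ORG_TOKEN>"


def build_alert_request(payload: dict, token: str = ORG_TOKEN) -> urllib.request.Request:
    """Build a POST request with the authentication token in an HTTP header."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical event payload; field names depend on your integration.
request = build_alert_request(
    {"app_key": "<YOUR_APP_KEY>", "status": "critical", "host": "web-01", "check": "CPU load"}
)
# Sending is left out on purpose: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the headers before anything leaves your machine.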
- Artificial intelligence
Also known as machine intelligence, artificial intelligence (AI) is the ability of machine systems to mimic human cognitive functions such as learning and problem solving. The goal of artificial intelligence is to create machines or programs that can work, react, and respond to complex situations.
For most business initiatives, the focus of artificial intelligence is to design programs that can develop and progress in a specific task without using explicit instructions, allowing the program to rely on patterns and inference instead. Machine learning allows for a machine or program to develop and create a solution on its own, once limitations and standards are set, rather than simply following programming.
BigPanda’s Pragmatic AI combines the power of AI with transparency and customization through explainable AI. With BigPanda Pragmatic AI, the logic is explained to IT Operations teams in plain English. Teams can then edit this logic to add situational and tribal knowledge to strengthen it on their own, without requiring expert data scientists. From there, teams can test and run what-if experiments on real live production data to make sure their changes work as intended, before deploying them, promoting higher trust and adoption of machine learning throughout the organization.
Learn more about how BigPanda uses machine learning in the Advanced Insights Module documentation.
D
- Deduplication (deduping)
BigPanda’s built-in deduplication process reduces noise by intelligently parsing incoming raw events. Also known as event deduping and event marshalling, this process eliminates redundant data to reduce noise and simplify incident investigation.
Exact duplicate matches add clutter to the system and are not actionable. If BigPanda receives two or more event payloads where the entire payload exactly matches, the event will be deduplicated and not shown in the UI. However, updates to existing alerts are merged rather than creating a brand new alert.
Events that have passed through the BigPanda deduplication process are considered deduped events.
The following three scenarios can occur if BigPanda receives two or more events with similar payloads.
| Scenario | Action |
| --- | --- |
| The event payload (including the application key, timestamp, and primary and secondary properties) exactly matches an event that was already received. | The event is dropped. |
| The timestamp (or any other value in the event payload) has changed. | The event is merged with the previous event, updating the tag values from the new event. |
| The event payload status has changed from the previous event. | The event is merged with the previous event, updating the status from the new event. |
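The three scenarios above can be sketched with an in-memory store. This is an illustration under stated assumptions, not BigPanda's implementation: the field names (`app_key`, `primary`, `secondary`, `status`) are hypothetical, events are matched by application key plus primary and secondary properties, and an incoming event is compared against the merged state of the alert rather than against every earlier payload.

```python
def apply_event(alerts: dict, event: dict) -> str:
    """Apply one incoming event to the alert store and report what happened."""
    # Assumed identity of an alert: application key + primary + secondary property.
    key = (event["app_key"], event["primary"], event.get("secondary"))
    current = alerts.get(key)

    if current is None:
        alerts[key] = dict(event)  # first event for this issue creates the alert
        return "created"

    if current == event:
        return "dropped"  # exact duplicate payload: nothing to update

    status_changed = current.get("status") != event.get("status")
    current.update(event)  # merge: values from the new event win
    return "merged_status" if status_changed else "merged_tags"


alerts = {}
first = {"app_key": "k1", "primary": "web-01", "secondary": "cpu",
         "status": "warning", "timestamp": 100}
results = [
    apply_event(alerts, first),                          # new alert
    apply_event(alerts, dict(first)),                    # exact duplicate
    apply_event(alerts, dict(first, timestamp=160)),     # timestamp changed
    apply_event(alerts, dict(first, timestamp=220, status="critical")),  # status changed
]
```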
- Description*
Description of an event sent by a monitoring tool. Description can be used in BPQL to create extraction enrichment tags.
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
E
- Event
An event is a point in time that represents the state of a service, application or infrastructure component. The state can be from a specific component of a service, application or infrastructure. Monitoring tools can generate events when potential problems are detected in your infrastructure.
Alert terminology
Some monitoring tools refer to ‘events’ as ‘alarms’ or ‘alerts.’ In BigPanda documentation ‘alert’ is always used to refer to the complete lifecycle of an event.
Events can be ingested as two unique types:
Monitoring Events: Webhook Calls, Emails, SNMP Traps
Change Events: Service Changes, Deploys, Builds
Over time, status updates and repeat events may occur from the same system issue. In BigPanda, raw event data is merged into a singular alert so that you can visualize the life cycle of a detected issue over time.
For example, a CPU load alert may start with a warning event, then increase in severity with a critical event, and finally get resolved with a resolution event. All three of these events will be merged into a single alert. Common events that are sent to BigPanda include “CPU > 95% for more than 5 minutes” and “Port X on Router ABC down.”
BigPanda correlates related alerts into incidents for visibility into high-level, actionable problems.
- Event aggregation
Event Aggregation is the process of combining multiple events that have occurred within the IT environment.
In order to aggregate disparate events, they need to go through the following steps:
The output from event aggregation in BigPanda is an alert.
F
- Flapping
Flapping occurs when a monitored entity, such as a service or host, changes state too frequently, making the cause and severity of the incident unclear. For example, flapping can be indicative of configuration issues (such as thresholds set too low), troublesome services, or real network problems.
In BigPanda, an incident enters the flapping state when one or more of the associated alerts are flapping. By default, an alert is considered to be flapping when it has changed states more than 4 times in a one-hour time window. If you need to configure custom logic (updates to the number of state changes within a period of time or the time window) for your organization, contact BigPanda Support and request a product change.
BigPanda checks the flapping criteria every 15 minutes. BigPanda sends one notification when an incident enters the flapping state and another when the incident is no longer flapping. An incident exits the flapping state when all related alerts stop flapping (that is, they no longer meet the criteria for number of state changes in a period of time). Any activity between these two notifications will be filtered.
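The default flapping rule described above (more than 4 state changes in a one-hour window) can be sketched as follows. The function names and the use of epoch seconds are assumptions for illustration; the periodic 15-minute check and the notifications are not modeled.

```python
def is_alert_flapping(state_change_times, now, max_changes=4, window_seconds=3600):
    """Return True if the alert changed state more than max_changes times in the window.

    state_change_times: epoch seconds at which the alert changed state (assumed unit).
    """
    window_start = now - window_seconds
    recent_changes = [t for t in state_change_times if window_start < t <= now]
    return len(recent_changes) > max_changes


def is_incident_flapping(alerts_state_changes, now):
    """An incident is flapping when one or more of its alerts are flapping."""
    return any(is_alert_flapping(times, now) for times in alerts_state_changes)


# Five state changes within the last hour: flapping under the default rule.
flapping = is_alert_flapping([0, 600, 1200, 1800, 2400], now=3000)
```

The thresholds are parameters here only to mirror the note that custom logic can change the number of state changes or the time window.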
I
- Incident
An incident is the correlation of one or more alerts that represent an issue that can impact the business through a service disruption. It represents a high-level issue in your system.
A single production issue often manifests itself in multiple alerts. For example, a disk issue can trigger a disk IO alert that, in turn, triggers a series of CPU, memory, database, and application alerts. Additionally, each alert may change as an issue progresses. An alert may start as a warning, and then increase in severity to a critical status. In these cases, diagnosing and fixing the issue requires up-to-date information from multiple sources, which is very difficult to gather and maintain manually.
BigPanda digests all of the raw data from your integrated monitoring systems and automatically correlates this complex data into single issue incidents, which gives you the visibility you need to investigate and resolve issues quickly.
All active and recently resolved incidents appear on the Incidents tab, where you can manage incidents through the operations workflow with BigPanda as your unified console. You can also escalate incidents through external ticketing and/or collaboration systems—manually as needed, or automatically as a smart ticketing solution—and BigPanda will keep the external systems up to date with the latest information.
The life cycle of an incident is defined by the life cycle of the alerts it contains. See Incidents in BigPanda for more information.
- Incident enrichment
Incident Enrichment is a process for applying contextualized business logic through enrichment to an incident (or group of alerts) with varying quality. The Incident Enrichment process provides you with additional contextual information on your incidents enabling you to accurately and quickly detect, understand and resolve system issues. Incident enrichment also enables powerful automation so that you can respond to issues faster.
Incident Enrichment in BigPanda is powered by incident metadata and incident tags. Incident tags are created by taking raw data from your systems and normalizing it into key-value pairs. Each tag has two parts: the tag name and the tag value. Tags are the fundamental data model for your alerts and incidents and provide vital incident enrichment.
Incident tags allow you to quickly see summary information for a particular incident rather than needing to review all of the related alerts. Incident tags can leverage any available information that may aid in resolution, such as the cluster and data center where an object resides or links to relevant time series metrics and runbooks.
Read more about how incident enrichment drives monitoring success in the Incident Intelligence documentation.
Incident tags need to be set up by a BigPanda Admin within the BigPanda UI at Settings > Incident Enrichment. See Manage Incident Enrichment for more information.
- Incident_identifier*
During alert correlation, BigPanda assigns correlated events an incident identifier. This identifier is used throughout the BigPanda system to recognize if two events are related to each other and is critical to ensure that BigPanda events can be resolved. Incident identifiers are created based on the tags and event data sent to BigPanda for each event.
By default, the incident identifier is a combination of the correlating events’ primary and secondary properties.
The incident_identifier may also be called the incident_key. The value for the incident_key can be overridden by explicitly setting a property in an alert payload, such as "incident_identifier": "${field1}${field2}".
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
Lowercase only
When sending this field to BigPanda ensure that it is lowercase only.
L
- Live dashboard
BigPanda Live Dashboards provide easy-to-read operational health metrics in a consolidated view. Ideal for NOC displays and status monitoring, each Dashboard is made up of a series of widgets showing color-coded key information on incident severity and status.
Each widget shows information for a single environment, making it easy to track incident metrics by region, team, or infrastructure types. For example, you might have environments for each business service so that you can track metrics on each separately.
Open the Dashboards by clicking the Dashboards tab at the top of the screen. You can configure each Dashboard to meet your business needs. Learn more about dashboards in the Manage Live Dashboards documentation.
M
- Machine learning
Machine learning is an important element of artificial intelligence. Machine learning focuses on the ability of a program to develop and progress in a specific task without using explicit instructions, allowing the program to rely on patterns and inference instead. Machine learning allows for a machine or program to develop and create a solution on its own once limitations and standards are set, rather than simply following programming.
Learn more about BigPanda’s machine learning processes in the Advanced Insights Module documentation.
- Mean time between failures (MTBF)
The average amount of time between failures, or the time between when an incident is resolved and when/how often it reoccurs. MTBF is a crucial maintenance metric to measure performance and reliability, especially for critical or complex assets.
Monitoring MTBF allows you to see how often certain issues occur and helps you measure the reliability of services. By highlighting systems that may have low MTBF, BigPanda helps you assess when maintenance or replacement is required and to improve the overall performance of the system.
For more information about how to calculate MTBF, see the Unified Analytics Key Metrics documentation.
- MTTx (Mean time to)
Mean Time metrics measure the effectiveness and burden on IT Ops systems and teams. In BigPanda reporting, we often focus on four key MTTx measurements.
Mean Time to Assign (MTTA) - The average amount of time it takes the IT Ops team to assign the incident. In BigPanda, MTTA is calculated based on the time until the assign action is used.
The calculation for MTTA is (First assigned time - Start time)/60. First assigned time comes from the activity_type assigned, and the time is from the created_time field.
Mean Time to Engage (MTTE) - The average amount of time it takes the IT Ops team to engage in handling the incident. In BigPanda, this is measured by the time it takes to perform an action other than assign. Activities can include the activity_type comment, snooze, or share.
The calculation for MTTE is (First activity time - Start time)/60.
Mean Time to Fix (MTTF) - The average amount of time between engagement and resolution. In BigPanda, MTTF is automatically calculated from the time someone performs an action on the incident, to the resolution of the incident.
The calculation for MTTF is MTTR - MTTE - MTTA (when the action is earlier than the resolution time).
Mean Time to Resolve (MTTR) - The average amount of time it took to get back to service. MTTR looks at the repair of alert symptoms as opposed to the complete resolution of the incident. In BigPanda, it is calculated from when the first event was received, to the resolution of the last alert.
The calculation for MTTR is (End time - Start time)/60. End time is the end_time from Raw Incidents and Start time is the start_time from Raw Incidents.
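As a worked example of the MTTA, MTTE, and MTTR formulas above, the sketch below computes each value per incident and then averages across incidents. It assumes the timestamps are epoch seconds, so dividing by 60 yields minutes; the dictionary keys and sample values are hypothetical and do not reflect the actual report schema.

```python
from statistics import mean


def minutes_between(start: float, end: float) -> float:
    """(end - start) / 60, assuming both values are epoch seconds."""
    return (end - start) / 60


# Hypothetical incidents with made-up timestamps.
incidents = [
    {"start_time": 0, "first_assigned_time": 300, "first_activity_time": 600, "end_time": 3600},
    {"start_time": 1000, "first_assigned_time": 1900, "first_activity_time": 2200, "end_time": 8200},
]

# MTTA: (First assigned time - Start time) / 60, averaged over incidents.
mtta = mean(minutes_between(i["start_time"], i["first_assigned_time"]) for i in incidents)
# MTTE: (First activity time - Start time) / 60, averaged over incidents.
mtte = mean(minutes_between(i["start_time"], i["first_activity_time"]) for i in incidents)
# MTTR: (End time - Start time) / 60, averaged over incidents.
mttr = mean(minutes_between(i["start_time"], i["end_time"]) for i in incidents)
```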
For more information about MTTx, see Unified Analytics Key Metrics.
N
- Noise
Noise occurs when monitoring systems send alerts that are misconfigured, duplicates, or irrelevant. Alerts that are defined as noise are not actionable and are given the bp_v_alert_noise tag.
Noise often results in low quality alerts.
BigPanda reduces noise by deduplicating redundant alerts and silencing noisy alerts via filtering, maintenance plans, and environment views.
P
- Primary_property*
The primary and secondary properties are key fields used for normalization, deduplication and correlation of events in BigPanda.
To eliminate redundant data and reduce noise, BigPanda creates an incident identifier for each incoming event. By default, this identifier is created using the primary and secondary properties. These two properties are important through the whole BigPanda pipeline.
Primary and Secondary properties are key data fields that drive correlation, event normalization, and deduplication. See the Open Integration Manager documentation for more information.
BigPanda uses both properties to identify matches during the deduplication process.
In the UI, when there is no correlation pattern for the incident, BigPanda uses the primary property to construct the title and the secondary property to construct the subtitle of an incident.
As an incident progresses, the primary and secondary properties are key to ensuring that an incident’s severity, scope, and status are updated to match the ongoing outage.
If a "primary_property"="tag_name" is not specified, the primary property will be defined as one of the following: host, service, application, or device. Some integrations may allow you to customize which field a tool uses as the primary property. See the integration-specific instructions for details on primary property field defaults and customization.
Required property
BigPanda cannot receive events without a primary property. The secondary property is optional. If the event does not contain a value for the secondary property, BigPanda uses only the primary property to process the event.
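The fallback behavior described in this glossary can be sketched as below. This is an illustration only: the order in which the default fields are tried is an assumption, the secondary defaults (check, sensor) come from the Secondary_property entry, and the function name is hypothetical.

```python
PRIMARY_DEFAULTS = ("host", "service", "application", "device")
SECONDARY_DEFAULTS = ("check", "sensor")


def resolve_properties(event: dict):
    """Return the tag names used as (primary, secondary) property for an event."""
    # An explicit "primary_property" wins; otherwise use the first default present.
    # The order in which defaults are tried is an assumption.
    primary = event.get("primary_property") or next(
        (name for name in PRIMARY_DEFAULTS if name in event), None
    )
    if primary is None:
        # Events cannot be received without a primary property.
        raise ValueError("event has no primary property")

    # The secondary property is optional, so None is an acceptable result.
    secondary = event.get("secondary_property") or next(
        (name for name in SECONDARY_DEFAULTS if name in event), None
    )
    return primary, secondary


properties = resolve_properties({"host": "web-01", "check": "cpu", "status": "critical"})
```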
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
Lowercase only
When sending this field to BigPanda ensure that it is lowercase only.
R
- Raw event
BigPanda receives event data from your monitoring applications. Monitoring integrations allow BigPanda to receive alerts from systems such as Nagios, SolarWinds, and AppDynamics.
When events first enter BigPanda and have not yet gone through the BigPanda deduplication or correlation process, they are considered raw events. Raw events are normalized during the ingestion process.
Raw event payloads are not stored in BigPanda.
- Resolved
Incidents are marked Resolved when BigPanda believes the system event to be repaired.
An incident in BigPanda is resolved in one of the following ways:
When all associated alerts have an OK status.
When an external collaboration tool sends a resolve message back to BigPanda to resolve an incident.
When a user in the BigPanda console manually resolves an incident.
When the last open alert reaches the automatic resolution time set by the sending integration. (Not currently in use by all organizations. See Time Based Alert Resolution for more information.)
Once resolved, incidents may be reopened if the problem reoccurs within a short time frame. To learn more about incident lifecycle, see the Alert/Incident Status documentation.
- Root cause analysis
Root cause analysis is the process of identifying the root causes of system errors or problems. Identifying the root cause of a poorly performing application is one of the biggest challenges for enterprise IT Ops, NOC, DevOps and SRE teams. Rapid Root Cause Analysis dramatically condenses the time it takes to resolve incidents/outages. BigPanda includes several key features to help your root cause analysis efforts.
Aggregate and correlate alerts from every monitoring tool in your environment
Enrich alerts with changes and topology data from every change and topology tool in your environment
Correlate all of this data together to identify the probable root cause of a problem, incident, or outage
In BigPanda, root cause analysis is done in real-time and can go a long way in helping IT Ops teams resolve incidents/outages.
- Root cause changes (RCC)
Advanced Insight Module
This feature is part of the Advanced Insight Module. If your organization has not purchased this module, you may not have access to the feature.
If you are interested in upgrading to the Advanced Insight Module, contact your BigPanda account team.
BigPanda’s Root Cause Changes (RCC) feature integrates a customer’s change information into BigPanda, to highlight changes that might be related to incoming incidents. BigPanda integrates with your change feeds to collect change data such as managed changes, code deployments, software updates, configuration changes, and upgrades, and organizes them in the Changes table within the Incidents tab.
Key Features:
Integration - Funnel all your change integrations into BigPanda's Open Integration Manager to see all your changes organized and correlated in one place.
Visualization - See a consolidated list of all the system changes related to each incident.
Correlation - Correlate changes to incidents to enable Root Cause Analysis.
Collaboration - Collaborate with other users to investigate which change is the Root Cause of the incident.
For more information, see Root Cause Change Prediction and Changes Tab.
S
- Secondary_property*
The primary and secondary properties are key fields used for normalization, deduplication, and correlation of events in BigPanda.
To eliminate redundant data and reduce noise, BigPanda creates an incident identifier for each incoming event. By default, this identifier is created using the primary and secondary properties. These two properties are important through the whole BigPanda pipeline.
Primary and Secondary properties are key data fields that drive correlation, event normalization, and deduplication. See the Open Integration Manager documentation for more information.
During correlation, BigPanda uses both properties to identify which events are part of the same alert.
In the UI, when there is no correlation pattern for the incident, BigPanda uses the primary property to construct the title and the secondary property to construct the subtitle of an incident.
As an incident progresses, the primary and secondary properties are key to ensuring that an incident’s severity, scope, and status are updated to match the ongoing outage.
If a "secondary_property"="tag_name" is not specified, the secondary property is defined as one of the following: check or sensor. Some integrations may allow you to customize which field a tool uses as the secondary property. See the integration-specific instructions for details on secondary property field defaults and customization.
Required field
BigPanda cannot receive events without a primary property. The secondary property is optional. If the event does not contain a value for the secondary property, BigPanda uses only the primary property to process the event.
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
Lowercase only
When sending this field to BigPanda ensure that it is lowercase only.
- Severity*
Incident severity determines the seriousness and urgency of a BigPanda incident. Severity determines incident priority within BigPanda, and helps your team triage and focus on the most important outages first. Incident severity is determined by the highest severity status of any of the active alerts within an incident. As each alert enters BigPanda, it will include a status for the event, from: critical, warning, ok, or acknowledged. The highest status in the incident will set the severity.
Severity is a useful tool and can be configured with the priority tag to help your team work on the most important incidents first.
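The rule that the highest alert status sets the incident severity can be sketched as follows. The ranking below is an assumption for illustration, and statuses whose rank is not stated in this glossary (such as acknowledged) are ignored by this sketch.

```python
# Assumed ranking from least to most severe. Statuses such as "acknowledged"
# are left out because their position in the ranking is not stated.
STATUS_RANK = {"ok": 0, "warning": 1, "critical": 2}


def incident_severity(alert_statuses):
    """Return the highest-ranked status among the given alert statuses."""
    ranked = [s.lower() for s in alert_statuses if s.lower() in STATUS_RANK]
    if not ranked:
        raise ValueError("no alert with a ranked status")
    return max(ranked, key=lambda status: STATUS_RANK[status])


severity = incident_severity(["warning", "critical", "ok"])
```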
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
Lowercase only
When sending this field to BigPanda ensure that it is lowercase only.
See Prioritize Incidents for more information.
- Source* and Source_System*
For each incoming alert, BigPanda records the name of the integrated tool as part of the alert data. Source_System is a particularly useful tag for creating environments, searching incidents, and creating reports.
Unique reserved word
source, _source, and source_system are reserved system words within BigPanda and cannot be used as the name of a custom tag, or defined as part of the API payload. BigPanda will automatically calculate source and source_system values based on the name of the sending system in the <source type>.<integration name> format.
source_system is a unique reserved word - it can be used as a filter condition when creating correlation patterns, other custom tags, and unified searches.
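As a small illustration of the `<source type>.<integration name>` format, the sketch below splits a source_system value into its two parts and checks a tag name against the reserved words. The helper names and the sample value are hypothetical, and splitting on the first dot is an assumption.

```python
RESERVED_SOURCE_WORDS = {"source", "_source", "source_system"}


def is_reserved_source_word(tag_name: str) -> bool:
    """Return True if the name cannot be used for a custom tag."""
    return tag_name in RESERVED_SOURCE_WORDS


def split_source_system(value: str):
    """Split "<source type>.<integration name>" on the first dot (assumed rule)."""
    source_type, separator, integration_name = value.partition(".")
    if not (separator and source_type and integration_name):
        raise ValueError(f"not in <source type>.<integration name> format: {value!r}")
    return source_type, integration_name


# Hypothetical value for an integration named "production_nagios".
parts = split_source_system("nagios.production_nagios")
```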
- Status*
As each alert enters BigPanda, it will include a status for the event, from: critical, warning, ok, or unknown. An additional acknowledged status may be configured during integration setup. This status is based on specific requirements set within your monitoring tools and the normalization between your tool and BigPanda. The highest status alert will set the severity for an incident. Status is a useful tool that can be configured with the priority tag to help your team work on the most important incidents first.
Some integrations may allow you to customize which fields and values a tool uses to determine status. See the integration-specific instructions for details on status field defaults and customization.
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
Lowercase only
When sending this field to BigPanda ensure that it is lowercase only.
T
- Tag
BigPanda normalizes alert data from integrated monitoring systems into standard key-value pairs, called tags. BigPanda leverages both enrichment alert level tags, and incident tags, which are added at different points.
Each tag has two parts: the tag name and the tag value. Tags are the fundamental data model for your alerts and incidents and provide vital incident enrichment. The values of these tags are used to correlate highly related alerts into incidents and to allow BigPanda to find connections between alerts and changes through your system.
Tags enable your team to:
Correlate alerts into incidents
Filter alerts
Create maintenance plans
Search the incident feed
Define filter conditions for environments
Search with BigPanda Query Language (BPQL)
View incident information in the UI
Collect analytics
Incident tags are key-value pairs that can be added to incidents for additional incident enrichment, giving your teams insight into priority and business impacts of incidents. To learn more about how incident tags work, please see the Incident Tags documentation.
- Timestamp*
The time, in Unix epoch format, at which a monitoring tool triggered an event or an action happened in BigPanda.
All timestamps are stored in Unix epoch format within BigPanda. When displayed in the UI, timestamps will be automatically converted to date-time format in the timezone of the user.
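The conversion from Unix epoch format to a date-time in the user's timezone can be sketched as below. This assumes the timestamp is in epoch seconds and represents the user's timezone as a fixed UTC offset for simplicity; it is not how the UI itself performs the conversion.

```python
from datetime import datetime, timedelta, timezone


def to_display_time(epoch_seconds: int, utc_offset_hours: int) -> datetime:
    """Convert epoch seconds to a date-time at the given fixed UTC offset."""
    user_timezone = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).astimezone(user_timezone)


# One hour after the epoch, shown for a user at UTC+1.
display = to_display_time(3600, utc_offset_hours=1)
```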
Reserved word
This term is reserved for system use and cannot be changed or redefined for custom enrichment.
Lowercase only
When sending this field to BigPanda ensure that it is lowercase only.
V
- Virtualization
Virtualization is the development of a virtual version of an IT resource, such as a server, storage, device, or even operating system. It simulates software and hardware that allows software to run. Virtualization gives rise to virtual machines where you can run programs very much like on a physical machine.
A virtual machine is an environment that runs another operating system on a machine using virtualization software. The virtual system is segregated from the main system. Reasons to run a virtual machine include trying a new operating system before installing it, running old or incompatible software, and testing suspicious files.
Cloud virtualization enables companies to unlock scalability, business continuity, and cost saving measures, but dramatically increases the difficulty of monitoring and management for IT Ops teams. The added complexity, layers, and additional tooling needed to manage cloud and hybrid systems can rapidly overwhelm Ops teams.
BigPanda is designed with the complexities of modern virtualization tools in mind. Learn more about how BigPanda can help your teams make sense of the complexity of modern IT Ops in our Getting Started documentation.
W
- Widget
Widgets are easy to read, individually managed components within a larger interface. Widgets enable you to visualize specific data sets individually and together. Widgets make up both Live Dashboards and Unified Analytics.
Live Dashboards provide easy-to-read operational health metrics in a consolidated view. Ideal for NOC displays and status monitoring, each Dashboard is made up of a series of widgets showing color-coded key information on incident severity and status for incidents in a specific environment.
Each widget shows information for a single environment, making it easy to track incident metrics by region, team, or infrastructure types. For example, you might have environments for each business service so that you can track metrics on each separately.
In addition, Unified Analytics dashboards are made up of individual widgets that display visualizations of your monitoring data. Reporting widgets may include charts, graphs, and tables. Each dashboard is configured to show a specific group of widgets to make visualizing business impact easy.
Reporting widgets may be configured to home in on tags, resources, and KPIs of special interest to your team.