An event is a point in time that represents the state of a service, application, or infrastructure component. Monitoring tools can generate events when potential problems are detected in your infrastructure.
BigPanda aggregates, normalizes, and enriches events collected from fragmented tools and correlates that data into actionable insights. The platform allows you to detect incidents as they form, in real time, before they escalate into outages.
The BigPanda event lifecycle includes the following event types and actions:
- Event Ingestion
- Event Deduplication
- Event Filtering
- Keep-Alive Events
- Post-Dedupe Events
- Alert Formation
- Incident Formation
- Incident Enrichment
- Incident Classification
View the steps below to learn more about how incidents are formed from events in BigPanda.
The process starts when BigPanda receives event data from your monitoring applications. Monitoring integrations allow BigPanda to receive alerts from systems such as Nagios, SolarWinds, and AppDynamics. See Integrate with BigPanda for more information.
BigPanda’s built-in deduplication process reduces noise by parsing incoming events and eliminating redundant data, which simplifies incident investigation. This process is also known as event deduplication, or deduping.
Precise duplicates of existing events are immediately discarded. However, updates to existing alerts are merged rather than creating a brand new alert.
Exact duplicate matches add clutter to the system and are not actionable. If BigPanda receives two or more event payloads where the entire payload exactly matches, the event will be deduplicated and not shown in the UI.
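The exact-match rule can be illustrated with a short Python sketch. This is not BigPanda's implementation: the `Deduplicator` class and the choice of a SHA-256 fingerprint over canonical JSON are assumptions made only to show the idea of discarding events whose entire payload has already been seen.

```python
import hashlib
import json


def payload_fingerprint(payload):
    """Hash a canonical JSON form so key order does not affect the match."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduplicator:
    """Hypothetical helper: drops events whose entire payload was already seen."""

    def __init__(self):
        self._seen = set()

    def accept(self, payload):
        """Return True if the event is new, False if it is an exact duplicate."""
        fingerprint = payload_fingerprint(payload)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

In this sketch, an event that differs in any field (for example, a status change) gets a different fingerprint and is kept, which mirrors the distinction between exact duplicates and updates.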
BigPanda gives users the ability to filter out or suppress events generated for nodes or CIs that are under maintenance, in non-production environments, or that match other special circumstances where operators don’t need to be notified of potential outages or incidents.
Event Filtering in BigPanda filters out events that are unactionable and would only add clutter. The event filter uses BigPanda Query Language to define criteria for events that will be dropped upon ingestion and never be visible in the Incident feed.
Below are some examples of events that you might want filtered:
- Misconfiguration (events that are missing tags that are critical for assignment and prioritization, making it impossible to triage)
- Lowest severity (events that signal system issues that don’t need to be actioned)
- Non-Prod (events from Dev/QA environments)
- Non-alerts (info events, logs, etc.)
See the Manage Alert Filtering documentation for more information.
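The filter categories listed above can be sketched as plain Python predicates. Actual filters are written in BigPanda Query Language, which this page does not show, so the sketch below is only a stand-in. The tag names (`team`, `severity`, `environment`, `type`) and their values are hypothetical.

```python
# Hypothetical tag names and values, chosen only for illustration.
REQUIRED_TAGS = ("host", "check", "team")
NON_PROD_ENVIRONMENTS = {"dev", "qa"}
IGNORED_SEVERITIES = {"lowest"}
NON_ALERT_TYPES = {"info", "log"}


def drop_reason(tags):
    """Return why an event would be filtered out, or None to keep it."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        return "misconfiguration: missing " + ", ".join(missing)
    if tags.get("severity") in IGNORED_SEVERITIES:
        return "lowest severity"
    if str(tags.get("environment", "")).lower() in NON_PROD_ENVIRONMENTS:
        return "non-prod"
    if tags.get("type") in NON_ALERT_TYPES:
        return "non-alert"
    return None
```

Returning a reason rather than a bare boolean makes it easier to audit which rule dropped an event while the criteria are being tuned.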
Keep-Alives are events received by BigPanda with an OK status that were not deduplicated or filtered, but were never correlated into an incident. Keep-Alives do not indicate a system issue; instead, they indicate that the connection between the two systems is working as expected. Keep-Alives cannot be viewed in the incident feed.
The post-dedupe event count is the number of events that remain after deduplication, event filtering, and keep-alive separation, prior to alert formation and incident creation.
After these steps take place, events are then aggregated and formed into alerts. The number of post-dedupe events is generally higher than the number of alerts because the alert creation process includes the aggregation of update events into single alerts.
Post-dedupe events are clustered into alerts, which represent a single issue within your environment.
Monitoring tools generate events when potential problems are detected in your infrastructure. Over time, status updates and repeat events may occur for the same system issue. In BigPanda, raw event data is merged into a single alert so that you can visualize the life cycle of a detected issue over time.
Alerts in BigPanda are formed from the events that remain after deduplication, event filtering, and keep-alive separation, with related events aggregated into a single alert.
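As a rough illustration of how update events fold into one alert, the sketch below keys alerts on a `(host, check)` pair and records each status change. The key choice, the `Alert` class, and the `apply_event` function are assumptions for the example, not the product's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    key: tuple
    status: str
    tags: dict
    history: list = field(default_factory=list)  # (timestamp, status) pairs


def apply_event(alerts, event):
    """Merge an event into its alert, creating the alert on first sight."""
    key = (event["host"], event["check"])  # hypothetical alert identity
    alert = alerts.get(key)
    if alert is None:
        alert = Alert(key=key, status=event["status"], tags=dict(event))
        alerts[key] = alert
    else:
        # An update to a known issue changes the alert in place.
        alert.status = event["status"]
        alert.tags.update(event)
    alert.history.append((event["timestamp"], event["status"]))
    return alert
```

Because updates are merged in place, two events for the same issue produce one alert with a two-entry history, which is why the alert count is usually lower than the post-dedupe event count.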
BigPanda uses alert tags to add key contextual information to your alerts. Tags drive alert normalization and deduplication, correlation into incidents, incident enrichment, and automation. For more information, see Manage Alert Enrichment.
An incident is the correlation of one or more alerts that represent an issue that can impact the business through a service disruption. It represents a high-level issue in your system.
A single production issue often manifests itself in multiple alerts. For example, a disk issue can trigger a disk IO alert that, in turn, triggers a series of CPU, memory, database, and application alerts. Additionally, each alert may change as an issue progresses. An alert may start as a warning, and then increase in severity to a critical status. In these cases, diagnosing and fixing the issue requires up-to-date information from multiple sources, which is very difficult to gather and maintain manually.
BigPanda digests all the raw data from your integrated monitoring systems and automatically correlates this complex data into single-issue incidents, giving you the visibility you need to investigate and resolve issues quickly.
After an incident is formed, operators can quickly view and take action within the Incident Feed. For more information about the actions that can be taken on incidents, see Triage Incidents and Remediate Incidents.
BigPanda uses correlation patterns to group similar alerts into the same incident. Correlation patterns cluster alerts together based on source system, tags, time window, and (optionally) a customizable query filter. Alert correlation patterns can be created and customized to fit the needs of your organization. For more information, see Manage Alert Correlation.
Once alerts have been normalized and enriched, the process of correlating them into incidents begins. The following steps are taken as part of the alert correlation process:
BigPanda checks to see if the new event matches an existing alert in an incident. The system checks the event incident key to determine if the event is a match.
If the event properties match an alert in an active or recently resolved incident, the event is added to that incident as an alert.
If the new event status changes (for example, from Warning to Critical), the tag values will be merged into the last event.
BigPanda checks to see if the event matches any active correlation patterns. If the event matches, it is added as an alert to an existing incident. If more than one correlation pattern matches, the pattern with the largest correlation window is selected. The incident’s active correlation patterns are updated to include only patterns that apply to all active alerts.
If the event does not match an active correlation pattern, a new incident is created.
The incident’s active correlation pattern matches are updated to include only the patterns that apply to all active alerts. Any of the incident’s pattern matches that are no longer in the correlation window are deactivated.
It’s possible for an alert to be added to an incident even if it does not match some of the active matched patterns. When this happens, the patterns that do not match are deactivated. Deactivated patterns remain attached to the incident, but alerts are no longer correlated into the incident based on these matches.
The incident title is updated based on the active matched pattern with the largest correlation window.
If the incident has multiple active alerts, the title is generated based on the correlation tags of the pattern with the largest correlation window.
Incident titles can change as alerts join the incident and change status.
The incident status is updated based on the status of the most severe active alert.
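Two of the rules above, choosing the matching pattern with the largest correlation window and deriving incident status from the most severe active alert, can be sketched as follows. The `Pattern` class, the way the window is measured (seconds since the incident's first alert), and the severity ranking are all assumptions for illustration.

```python
from dataclasses import dataclass

# Assumed severity ranking; "ok" means the alert is resolved.
SEVERITY_ORDER = {"ok": 0, "warning": 1, "critical": 2}


@dataclass(frozen=True)
class Pattern:
    tags: tuple          # correlation tags that must share the same value
    window_minutes: int  # correlation window


def pattern_matches(pattern, incident_alert, new_alert):
    """True if the correlation tags agree and the new alert is inside the window."""
    same_tags = all(
        tag in new_alert and new_alert[tag] == incident_alert.get(tag)
        for tag in pattern.tags
    )
    elapsed = new_alert["timestamp"] - incident_alert["timestamp"]
    return same_tags and elapsed <= pattern.window_minutes * 60


def select_pattern(patterns, incident_alert, new_alert):
    """Pick the matching pattern with the largest window, or None if none match."""
    matches = [p for p in patterns if pattern_matches(p, incident_alert, new_alert)]
    if not matches:
        return None  # caller would create a new incident
    return max(matches, key=lambda p: p.window_minutes)


def incident_status(alert_statuses):
    """Status of the most severe active alert, or "ok" when none are active."""
    active = [s for s in alert_statuses if s != "ok"]
    if not active:
        return "ok"
    return max(active, key=lambda s: SEVERITY_ORDER[s])
```

A `None` result from `select_pattern` corresponds to the case where no active pattern matches and a new incident is created.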
Incident Enrichment in BigPanda is powered by incident metadata and incident tags. Incident tags are created by taking raw data from your systems and normalizing it into key-value pairs. Each tag has two parts: the tag name and the tag value. Tags are the fundamental data model for your alerts and incidents and provide vital incident enrichment.
Incident tags allow you to quickly see summary information for a particular incident rather than needing to review all of the related alerts. Incident tags can leverage any available information that may aid in resolution, such as the cluster and data center where an object resides or links to relevant time series metrics and runbooks.
For more information, see Manage Incident Enrichment.
After incidents are formed, BigPanda classifies them into related groups for visibility and automation.
A single incident can appear in multiple environments in BigPanda.
Environments filter incidents on properties such as source and priority and group them together for easy visibility and action. Environments make it easy for your team to focus on the incidents relevant to their role and responsibilities. Environments can be used to filter the incident feed, or can be used to create dashboards, set up sharing rules, and simplify incident search.
See Manage Environments for more information.
The life cycle of an incident is defined by the life cycle of the alerts it contains. An incident remains active if at least one of the alerts is active, is automatically resolved when all the alerts are resolved, and is reopened when a resolved alert becomes active again.
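The lifecycle rule can be expressed as a small state holder. This is a minimal sketch under the assumption that each alert is either `"active"` or `"resolved"`; the `Incident` class and its method names are hypothetical.

```python
class Incident:
    """Hypothetical model: incident state is derived from its alerts."""

    def __init__(self, alert_statuses):
        self.alerts = dict(alert_statuses)  # alert key -> "active" | "resolved"
        self.state = self._derive()
        self.transitions = []

    def _derive(self):
        # Active while at least one alert is active, resolved otherwise.
        return "active" if "active" in self.alerts.values() else "resolved"

    def update_alert(self, key, status):
        """Record an alert status change and return the resulting incident state."""
        self.alerts[key] = status
        new_state = self._derive()
        if new_state != self.state:
            self.transitions.append("reopened" if new_state == "active" else "resolved")
            self.state = new_state
        return self.state
```

For example, resolving the last active alert moves the incident to `"resolved"`, and a later update that reactivates any alert records a `"reopened"` transition.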
Time-Based Alert Resolution
The Time-Based Alert Resolution feature automatically resolves orphaned or outstanding alerts, allowing for easy noise reduction and reduced MTTR. For more information, see the Time-Based Alert Resolution documentation.
The incident timeline shows Status Changes along the Status line.
|Icon|Meaning|
|---|---|
|Orange dot|The incident was created, or the incident was reopened.|
|Green dot|The incident was resolved.|
|Orange and green dot|The incident was marked as flapping.|
Additional incident actions are shown along the Activities line.
|Icon|Meaning|
|---|---|
|Orange bust with plus|The incident was assigned to a user.|
|Blue arrow|The incident was manually or automatically shared.|
|Grey up and down arrows|The incident priority was manually updated.|
|Grey up and down arrows with a line through them|The incident priority was manually removed.|
|Yellow dialogue bubble|A comment was added to the incident.|
|Orange bell|The incident was snoozed.|
|Grey paragraph lines|A value was manually changed for a single-value tag.|
|Grey bullet point lines|A value was manually added, changed, or removed for a multi-value tag.|
|Green checkmark|The incident was manually resolved, or one of the included alerts was manually or automatically resolved.|
An incident remains open as long as at least one of the alerts associated with it is open. When BigPanda receives an event with a status of ok, the related alert is automatically resolved.
Alerts that have not been resolved remain open in BigPanda. The corresponding incident also remains open and continues to appear in the Incident Feed.
Resolving Alerts with the Alerts REST API
To maintain only the most relevant information in the incident feed, send a resolving event to BigPanda using the Alerts REST API when an alert is no longer active.
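A resolving event could be built along the following lines. The endpoint URL, the Bearer token header, and the `app_key`, `host`, and `check` field names are assumptions here and are not taken from this page; confirm them against the Alerts REST API reference before use. The sketch only constructs the request so that it can be inspected without making a network call.

```python
import json
import urllib.request

# Assumed endpoint; verify against the Alerts REST API reference.
ALERTS_URL = "https://api.bigpanda.io/data/v2/alerts"


def build_resolve_request(token, app_key, host, check):
    """Build (but do not send) a POST that reports an alert as resolved."""
    payload = {
        "app_key": app_key,  # assumed field: identifies the integration
        "status": "ok",      # an ok status resolves the related alert
        "host": host,        # assumed field: the monitored object
        "check": check,      # assumed field: the check that fired
    }
    return urllib.request.Request(
        ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the event, pass the returned request to `urllib.request.urlopen`. The `host` and `check` values must match those of the original alert so that the resolving event updates it rather than creating a new one.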
Resolved incidents are reopened when any of the alerts associated with them reopen. This rule applies regardless of how the incident was resolved—manually, due to inactivity, or automatically when all associated alerts were resolved. Alerts are reopened if they reoccur within 60 minutes of when they were resolved. If an alert reoccurs more than 60 minutes later, it is handled as a new alert.
Snoozed Incidents Exception
The incident will also reopen if a new event that matches one of the correlation patterns of the incident comes in. Incidents that are more than 30 days old are never reopened. If the associated alerts reoccur, a new incident is created.
The time frame of the reopen window can be customized to fit your monitoring needs if necessary. Keep in mind this is a global setting that impacts all incidents. Please contact your BigPanda Support and request a product change if you'd like to change the time frame for incident reopening.
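The default reopen rules can be summarized in a small decision function. This is a sketch under stated assumptions: incident age is measured from incident creation, the 60-minute boundary is treated as inclusive, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

REOPEN_WINDOW = timedelta(minutes=60)  # default alert reopen window
MAX_INCIDENT_AGE = timedelta(days=30)  # older incidents are never reopened


def handle_recurrence(resolved_at, incident_created_at, now):
    """Decide what happens when a previously resolved alert reoccurs."""
    if now - incident_created_at > MAX_INCIDENT_AGE:
        return "new_incident"
    if now - resolved_at <= REOPEN_WINDOW:
        return "reopen_alert"
    return "new_alert"
```

Because the reopen window is a global setting, a customized value would replace `REOPEN_WINDOW` for every incident rather than being set per incident.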
Flapping occurs when a monitored object (for example, a service or host) changes state too frequently, making the cause and severity of the incident unclear. Flapping can indicate configuration problems (such as thresholds set too low), troublesome services, or real network problems.
When an alert changes states frequently, it may generate numerous events that are not immediately actionable. In the example timeline shown below, you can see how hundreds of potential notifications are grouped into one incident for the application that is flapping.
In BigPanda, an incident enters the flapping state when one or more of the associated alerts are flapping. By default, an alert is considered to be flapping when it has changed states more than 4 times in a one-hour time window. If you need to configure custom logic (updates to the number of state changes within a period of time or the time window) for your organization, contact BigPanda Support and request a product change.
When an incident enters the flapping state, all subscribed integrations are notified and no additional state change notifications are sent. Email integrations will send a daily email reminding users about the incident. An incident exits the flapping state when all related alerts stop flapping (no longer meet the criteria for number of state changes in a period of time). BigPanda checks the flapping criteria every 15 minutes.
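The default flapping criterion (more than 4 state changes within a one-hour window) can be sketched with a sliding window of change timestamps. The `FlapDetector` class is hypothetical, and it evaluates the criterion on demand rather than on the 15-minute schedule described above.

```python
from collections import deque


class FlapDetector:
    """Hypothetical sliding-window check for one alert's state changes."""

    def __init__(self, max_changes=4, window_seconds=3600):
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self._changes = deque()   # timestamps of state changes
        self._last_status = None

    def observe(self, status, timestamp):
        """Record a status report and return whether the alert is flapping."""
        if self._last_status is not None and status != self._last_status:
            self._changes.append(timestamp)
        self._last_status = status
        return self.is_flapping(timestamp)

    def is_flapping(self, now):
        """True if more than max_changes state changes fall inside the window."""
        while self._changes and now - self._changes[0] > self.window_seconds:
            self._changes.popleft()
        return len(self._changes) > self.max_changes
```

An incident would be treated as flapping while any of its alerts reports `True`, and would exit the flapping state once every alert's detector returns `False`.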
Learn more about Incidents in BigPanda.
Find steps to Triage Incidents in BigPanda.
Learn how to Remediate Incidents.