Key Concepts

The following video explains how Zebrium can automatically show you the root cause of any kind of software or infrastructure problem, without any manual training or rules:

Zebrium Root Cause as a Service (RCaaS)

Zebrium Root Cause as a Service (RCaaS) uses unsupervised machine learning on logs to automatically find the root cause of software problems. It does not require manual rules or training, and it typically becomes accurate within 24 hours of starting to ingest logs.

As Zebrium ingests logs, the Zebrium artificial-intelligence machine-learning (AI/ML) engine analyzes the logs, looking for abnormal log line clusters that resemble problems, such as abnormally correlated rare and error events from across all log streams.

When the AI/ML engine detects one of these "abnormal" clusters, it generates a suggestion, which appears on the Alerts page (the home page) of the Zebrium user interface along with the existing alerts:

On the Alerts page, the summary report for a suggestion and an alert contains the following main elements:

  • AI-generated title. Displayed at the top of the summary pane, this title is generated by GPT Services, which use generative AI models. You can enable or disable GPT Services for a specific Zebrium deployment in the GPT Services column on the Deployments page (Settings > Deployments).
  • Word Cloud. A set of relevant words chosen by the AI/ML engine from the log lines contained in the alert. Click a word in the cloud to highlight that word in the list of logs on the left.
  • Significance icon. Since not all suggestions that the AI/ML engine generates will relate to problems that actually impact users, the engine attempts to reason over the data and assess whether a problem actually requires attention. Hover over this icon at the top of the list of logs to view the confidence level of the AI/ML engine for this suggestion. A red icon means "High" confidence, and a yellow icon means "Medium" confidence.
  • AI Assessment. The AI/ML engine's assessment of whether the problem actually requires attention. Depending on the quality of the data, some suggestions might not include an AI Assessment. The Zebrium user interface shows one of the following AI Assessment values:
    • "No Attention Needed" for content that the AI/ML engine assesses as unlikely to require immediate attention.
    • "Needs Your Attention" for content that the AI/ML engine believes should be looked into.
  • Root Cause (RCA) Report Summary. The report contains the actual cluster of anomalous log lines that was identified by the AI/ML engine. Up to eight of these log lines are shown in the summary view. You can click anywhere in the summary to view the full Root Cause report.
  • Alert Key. One or two log lines, denoted with a key icon, that are used to identify the suggestion if this type of suggestion occurs again. The alert keys make up an alert rule.

You can click anywhere in the summary report for a suggestion or an alert to view a more detailed Root Cause Report page for that suggestion or alert. For more information, see Root Cause Reports.

Suggestions are generated when the AI/ML engine finds a cluster of correlated anomalies in your logs that resembles a problem. However, this does not mean that all suggestions relate to actual important problems. This is especially true during the first few days of using Zebrium, as the AI/ML engine learns the normal patterns in your logs.

When you start getting suggestions on the Alerts page, you can review the word clouds and event logs that display in the summary views for the Root Cause reports for the suggestions. As a best practice, identify a specific time frame when a possible problem occurred, and then start looking at the reports that have the most interesting or relevant information related to the possible root cause of the problem.

You can choose to "accept" or "reject" a suggestion. For more information, see Assessing Suggestions.

You can also decide what action to take if the same type of alert occurs again, such as sending a notification by Slack, email, or another notification channel. For more information, see Notification Channels.

If you currently use SL1 from ScienceLogic, you can configure an integration that lets you view Zebrium suggestions in SL1 dashboards as well as on the SL1 Events page. For more information, see ScienceLogic Integrations.

Root Cause Reports (RCA Reports)

A Root Cause Report or RCA Report is a report generated by the AI/ML engine that consists of a group of log events that the AI/ML engine identified as being part of a problem.

A full RCA Report page (below) appears after you click the summary view for that report on the Alerts page:

Image of an RCA Report page.

The RCA report contains the actual cluster of anomalous log lines that was identified by the AI/ML engine. There are typically between 10 and 100 log events in a report. Up to eight of these log lines are shown in the summary view. Clicking a summary on the Alerts page takes you to the full RCA report.

Each RCA report matches a particular "fingerprint" of log events. You can add notes, summaries, Jira links, and alert preferences to the alert rules for the RCA report so that future occurrences of the same type of problem will reflect these preferences and notes.

For more information, see Suggestions and Root Cause Reports.

Alert Rules and Alert Keys

An alert rule is made up of one or two log events that best represent a specific type of problem. These events often provide clues as to the nature of the problem. These notable log events are called alert keys, and the AI/ML engine uses these keys to trigger an alert when new log data is ingested.

A key icon appears next to an alert key in the list of log events on the Alerts page and on the RCA Report page:

Image of two alert keys.

The AI/ML engine also uses the alert keys as a "signature" for a particular type of alert. There are typically two hallmark events:

  • The first event in the sequence, which is usually a rare event or anomaly and often relates to the root cause.
  • A high severity event, either as determined by log severity, or other indicators, such as certain words or phrases indicating a problem, like "exception", "failed", "could not restart", and so on.
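The two hallmark events above can be illustrated with a minimal Python sketch. This is not Zebrium's implementation: the `LogEvent` type, the `is_rare` flag, the `HIGH_SEVERITIES` set, and the `pick_alert_keys` function are all hypothetical names introduced for illustration. Only the example phrases ("exception", "failed", "could not restart") come from the text above.

```python
from dataclasses import dataclass
from typing import List

# Phrases taken from the description above; the severity names are an assumption.
PROBLEM_PHRASES = ("exception", "failed", "could not restart")
HIGH_SEVERITIES = {"ERROR", "CRITICAL", "FATAL"}


@dataclass
class LogEvent:
    """Hypothetical log event; is_rare stands in for the engine's anomaly detection."""
    timestamp: float
    severity: str
    message: str
    is_rare: bool


def looks_high_severity(event: LogEvent) -> bool:
    """High severity by log level, or by words that indicate a problem."""
    text = event.message.lower()
    return event.severity.upper() in HIGH_SEVERITIES or any(
        phrase in text for phrase in PROBLEM_PHRASES
    )


def pick_alert_keys(events: List[LogEvent]) -> List[LogEvent]:
    """Return up to two events: the first rare event and a high-severity event."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    first_rare = next((e for e in ordered if e.is_rare), None)
    high = next(
        (e for e in ordered if looks_high_severity(e) and e is not first_rare),
        None,
    )
    return [e for e in (first_rare, high) if e is not None]


events = [
    LogEvent(1.0, "INFO", "config reloaded from unexpected path", True),
    LogEvent(2.0, "INFO", "heartbeat ok", False),
    LogEvent(3.0, "ERROR", "worker failed: could not restart", False),
]
for key in pick_alert_keys(events):
    print(key.severity, key.message)
```

In this sketch the rare event at the start of the sequence and the later error are selected, while the routine heartbeat line is ignored.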

You can edit the alert keys of any Root Cause (RCA) report to select different log events if you believe those log events are more useful. Future matches of this type of RCA report will match against your user-defined alert keys, and carry forward your notes, summaries, Jira links, and alert preferences.

For more information, see Editing Alert Keys.

Log Collectors

When you are setting up your Zebrium system, one of the first tasks you need to do is configure a method for gathering log data to send to Zebrium so the AI/ML engine can begin to analyze the log data.

You would typically configure one or more log collectors to gather logs and send those logs to Zebrium for automated incident detection. For example, the following dialog explains how to set up a Linux log collector:

Image of the Linux log collector dialog.

You can also use a file upload method using ze, the Zebrium command-line interface for uploading log events from files or streams.

For more information, see Log Collectors and File Uploads.

Service Groups

A Service Group is the collection of log types, pods, hosts, and other items that are all part of a "failure domain". In other words, logs from the microservices and processes that could all interact with each other to contribute to an incident should be part of a service group. The AI/ML engine will only attempt to correlate anomalies and errors across logs that fall within a service group. For more complex applications, you can have multiple service groups if there is more than one failure domain.

For example, in the following image, sockshop and shop2 are two separate service groups where the same event occurred:

Image of Service Groups on a report.

You can view a list of service groups by clicking the Filtering button on the Alerts page. The Selected Filter dialog contains a list of service groups in the Service Groups filter.

A service group lets you collect logs from multiple applications or support cases while keeping the logs of each isolated, so that they are not mixed in an RCA report.

If omitted, the service group is set to "default", which means that the service group represents shared services. For example, a database that is shared between two otherwise distinctly separate applications would be considered a shared service. In this example scenario, you would set the service group to "app01" for one application and "app02" for the other application. For the database logs, you would either omit the service group setting, or you could explicitly set it to "default".

With this configuration, RCA reports will consider correlated anomalies across the following:

  • "app01" log events and default log events (that is, the database logs)
  • "app02" log events and default log events (that is, the database logs)

RCA reports will not consider correlated anomalies across "app01" and "app02".

For more information, see Suggestions and Root Cause Reports.
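The correlation rule in the "app01" and "app02" example can be modeled with a short Python sketch. It only restates the documented behavior and is not how the AI/ML engine is implemented; the function names `effective_group` and `can_correlate` are hypothetical.

```python
DEFAULT_GROUP = "default"  # name used when the service group is omitted


def effective_group(service_group=None):
    """An omitted service group is treated as "default" (shared services)."""
    return service_group or DEFAULT_GROUP


def can_correlate(group_a, group_b):
    """Return True if anomalies from both groups may appear in one RCA report."""
    a, b = effective_group(group_a), effective_group(group_b)
    # Logs correlate within one group, or between any group and shared services.
    return a == b or DEFAULT_GROUP in (a, b)


print(can_correlate("app01", None))       # app01 with the shared database logs
print(can_correlate("app02", "default"))  # app02 with the shared database logs
print(can_correlate("app01", "app02"))    # two separate applications
```

The first two calls print True and the last prints False, matching the example: each application correlates with the shared database logs, but the two applications never correlate with each other.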

Notification Channels

Notification Channels provide a mechanism to define the methods that Zebrium will use to send notifications from RCA reports. The supported types of notification channels include email, as well as Slack, Microsoft Teams, and Webex Teams notifications.

Image of the Create Slack Notification dialog.

After you have created one or more notification channels, you can link any number of these to any RCA report created by the AI/ML engine. Linking a set of notification channels to an RCA report will send notifications of future RCA reports of the same type to those channels.

For more information, see Notification Channels.

ScienceLogic Integrations

You can integrate the Zebrium Root Cause service with the SL1 platform from ScienceLogic to send suggestions and alerts to the SL1 dashboards or to SL1 Events, Devices, and Services pages.

The following image shows the interactive Root Cause Timeline widget in an SL1 dashboard:

To enable a ScienceLogic integration, go to the Integrations & Collectors page (Settings > Integrations & Collectors), select an integration type, and follow the instructions for setting up that integration.

For more information, see ScienceLogic Integrations.

Incident Management Integrations

You can configure an integration between Zebrium and your third-party Incident Management application to automatically add Root Cause (RCA) reports to your incidents in the third-party application. Each Zebrium RCA report includes a summary, a word cloud, and a set of log events that display symptoms and root cause, along with a link to the full report in the Zebrium user interface.

After you complete the configuration, you can view details of the root cause and direct the incident to the appropriate team. All of these features lead to faster Mean Time to Repair (MTTR) and less time spent manually hunting for the root cause.

Image of an incident management integration in the Zebrium user interface

For more information, see Incident Management Integrations.

Integrations Using Webhooks

Zebrium provides support for using webhooks so you can build your own custom integrations.

Image of a webhook integration in the Zebrium user interface

Zebrium provides the following webhooks:

  • Outgoing Root Cause Report Webhook
  • Incoming Root Cause Report Webhook
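As a rough illustration of the receiving side of a custom integration, the following Python sketch accepts an HTTP POST with a JSON body and reports which top-level fields it contains. It deliberately assumes nothing about the payload format of the Outgoing Root Cause Report Webhook; the names `summarize_payload`, `ReportWebhookHandler`, and `make_server`, and the port number, are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_payload(body: bytes) -> list:
    """Parse a JSON webhook body and return its sorted top-level field names."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return sorted(payload)


class ReportWebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver: logs the field names of each posted report."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            fields = summarize_payload(self.rfile.read(length))
        except ValueError:
            # Malformed or non-object JSON: reject the request.
            self.send_response(400)
            self.end_headers()
            return
        print("received report payload with fields:", fields)
        self.send_response(204)
        self.end_headers()


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) a server; call serve_forever() to run it."""
    return HTTPServer(("", port), ReportWebhookHandler)
```

To try it, call `make_server().serve_forever()` and point a webhook at the listening address. A real integration would replace the `print` call with logic that reads the fields documented for the webhook.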

For more information, see Using Webhooks to Create Integrations.

Zebrium On Prem

In addition to the standard cloud configuration for Zebrium, you can also choose a Zebrium on-premises (On Prem) configuration that is not located in the cloud.

For more information, see Zebrium On Prem.