Getting Started

This chapter provides an overview of how Zebrium works and how to get started using it.

Before you can start watching for suggestions and reviewing Root Cause reports, you will need to configure a method for gathering log data to send to Zebrium. For more information, see Log Collectors and File Uploads.

How Zebrium Works

When skilled engineers troubleshoot software, they typically ask the following questions:

  1. Where are the problems or events occurring? The events could be clusters of errors, warnings, stack traces, or other indicators of bad outcomes.
  2. Were there unusual events upstream that could help explain these bad outcomes? This might be configuration changes, a new deployment, user actions, and so on.

In modern software, these events are often generated by different micro-services or software components, so you might have to switch between many log streams and then mentally correlate the events across them.

The Zebrium AI/ML engine emulates the workflow of a skilled engineer by performing the following actions:

  1. Automatically build a catalog of all of the event types generated by the software.
  2. Track the patterns of each event type in each log stream, such as the logs generated by a specific container, pod, or host. 
  3. Automatically identify unusual and "bad" events.
  4. Identify unusually correlated clusters of rare and bad events that appear to be due to the same incident. The AI/ML engine scores each such collection based on a combination of how rare the underlying events are, and how bad the events are, such as how many warnings or errors are generated.
  5. "Fingerprint" each cluster of such events as a unique type of issue. The events that rise above a specified threshold can be considered a potential Root Cause report, and they are summarized using Natural Language Processing (NLP) for Machine Learning.
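
The scoring idea in steps 4 and 5 can be illustrated with a toy sketch. This is not Zebrium's actual algorithm: the function names, the severity weights, the scoring formula, and the threshold value are all hypothetical, chosen only to show how rareness and "badness" might combine into a single score for a cluster of events.

```python
import math
from collections import Counter

# Hypothetical severity weights (an assumption, not Zebrium's real values).
SEVERITY_WEIGHT = {"INFO": 0.0, "WARN": 1.0, "ERROR": 2.0}

# Hypothetical score above which a cluster is treated as a candidate report.
THRESHOLD = 10.0


def rareness(event_type, history):
    """Return a higher value for event types seen less often in history."""
    total = sum(history.values())
    # Add-one smoothing so unseen event types get a finite, large value.
    frequency = (history.get(event_type, 0) + 1) / (total + 1)
    return -math.log(frequency)


def cluster_score(cluster, history):
    """Sum rareness plus severity weight over (event_type, severity) pairs."""
    return sum(
        rareness(event_type, history) + SEVERITY_WEIGHT.get(severity, 0.0)
        for event_type, severity in cluster
    )


def is_candidate(cluster, history):
    """True if the cluster scores above the hypothetical threshold."""
    return cluster_score(cluster, history) > THRESHOLD


# Counts of previously seen event types (illustrative data only).
history = Counter({"heartbeat ok": 9990, "db timeout": 9})

rare = [("db timeout", "ERROR"), ("pool exhausted", "ERROR")]
common = [("heartbeat ok", "INFO"), ("heartbeat ok", "INFO")]

print(is_candidate(rare, history), is_candidate(common, history))
```

In this sketch, a cluster of rare errors scores far higher than a cluster of routine informational events, which mirrors the description above of scoring by how rare and how bad the underlying events are.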

When the AI/ML engine detects one of these "abnormal" clusters, it generates a suggestion, which appears on the Alerts page (the home page) of the Zebrium user interface along with the existing alerts.

On the Alerts page, the summary report for a suggestion and an alert contains the following main elements:

  • AI-generated title. Displayed at the top of the summary pane, this title is generated using GPT Services that use new Generative AI models. You can enable or disable GPT services for a specific deployment of Zebrium by using the GPT Services column on the Deployments page (Settings > Deployments).
  • Word Cloud. A set of relevant words chosen by the AI/ML engine from the log lines contained in the alert. Click a word in the cloud to highlight that word in the list of logs on the left.
  • Significance icon. Since not all suggestions that the AI/ML engine generates will relate to problems that actually impact users, the engine attempts to reason over the data and assess whether a problem actually requires attention. Hover over this icon at the top of the list of logs to view the confidence level of the AI/ML engine for this suggestion. A red icon means "High" confidence, and a yellow icon means "Medium" confidence.
  • AI Assessment. The engine's conclusion about whether the suggestion requires attention. Depending on the quality of the data, some suggestions might not include an AI Assessment. The Zebrium user interface shows one of the following values:
    • "No Attention Needed" for content that the AI/ML engine assesses as unlikely to require immediate attention.
    • "Needs Your Attention" for content that the AI/ML engine believes should be looked into.
  • Root Cause (RCA) Report Summary. The report contains the actual cluster of anomalous log lines that was identified by the AI/ML engine. Up to eight of these log lines are shown in the summary view. You can click anywhere in the summary to view the full Root Cause report.
  • Alert Key. One or two log lines, denoted with a key icon, that are used to identify the suggestion if this type of suggestion occurs again. The alert keys make up an alert rule.

You can click anywhere in the summary report for a suggestion or an alert to view a more detailed Root Cause Report page for that suggestion or alert. For more information, see Root Cause Reports.

Suggestions are generated when the AI/ML engine finds a cluster of correlated anomalies in your logs that resembles a problem. However, this does not mean that all suggestions relate to actual important problems. This is especially true during the first few days of using Zebrium, as the AI/ML engine learns the normal patterns in your logs.

When you start getting suggestions on the Alerts page, you can review the word clouds and event logs that display in the summary views for the Root Cause reports for the suggestions. As a best practice, identify a specific time frame when a possible problem occurred, and then start looking at the reports that have the most interesting or relevant information related to the possible root cause of the problem.

You can choose to "accept" or "reject" a suggestion. For more information, see Assessing Suggestions.

You can also decide what action to take if the same type of alert occurs again, such as sending a notification to Slack, email, or another notification channel. For more information, see Notification Channels.

If you currently use SL1 from ScienceLogic, you can configure an integration that lets you view Zebrium suggestions in SL1 dashboards as well as on the SL1 Events page. For more information, see ScienceLogic Integrations.

Consuming Root Cause Reports

You can consume the AI/ML engine-generated Root Cause reports in one of the following ways:

  1. Recommended. Connect Zebrium to a ScienceLogic integration, such as the SL1 Enhanced (12.x) integration on the Integrations & Collectors page (Settings > Integrations & Collectors). After you configure the integration, data from the Root Cause reports from Zebrium will display in SL1 and you can correlate the reports with any spikes or alerts occurring at the same time. For more information, see ScienceLogic Integrations.

    For more details, or to take action on one of these reports, click the URL to go directly to the detailed Root Cause report in the Zebrium user interface. For more information, see Working with Suggestions and Root Cause Reports.

  2. Connect Zebrium to your incident management tool, such as Opsgenie, PagerDuty, or Slack. After you configure the incident management tool, an RCA report is automatically created and sent back to the incident management tool.

  3. Evaluate the feed of auto-detected incident Root Cause reports on the Alerts page in the Zebrium user interface, particularly around times when you know things went wrong. You can also force the AI/ML engine to do a deep scan and create a report on demand by clicking the Scan for RC button on the Settings menu. Any Root Cause reports generated by that scan include a lightning bolt icon and the text "Result of RC Scan". For more information, see Working with Suggestions and Root Cause Reports.

Customizing Your Zebrium Results

You can customize your Zebrium results on the Alerts page (the Zebrium home page) by selecting one or more filters at the top of the page. You can use these filters to manage the number of suggestions and alerts that display on the Alerts page.

For example, by default only the First occurrence of each incident type is visible on dashboards and alert channels, unless you create filters that specify that the incident deserves an alert or suggestion.

You can also filter the list of suggestions by Significance: the AI/ML engine assigns a value of Low, Medium, or High to each alert. Significance is a cumulative score for each suggestion, based on the rareness and "badness" (log severity level) of the log events within that alert. If you have a high Significance setting, the Root Cause events will have to be more rare and more "bad" to show up in the list of suggestions.

By default, only suggestions with a significance of Medium and High are shown on the Alerts page, so if you want to also see alerts with Low significance, select Low or greater for this filter. You can edit the default Significance setting by editing the Root Cause Significance setting on the Report Settings page (Settings > Root Cause Settings).

These filters appear on the Selected Filter dialog, which displays when you click the Filtering button on the Alerts page.

There is also a Search bar at the top of the Alerts page that you can use for text or regular expression (regex) searches, and a toggle for Core Events and All Events.

For more information about filtering, see Using the Filters on the Alerts Page in Zebrium.
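
The behavior of the Significance filter described above can be modeled with a short sketch. This is only an illustration of the "minimum level or greater" logic; the data structure, field names, and function name are hypothetical and do not represent a Zebrium API.

```python
# Significance levels in ascending order, as named in the text above.
SIGNIFICANCE_ORDER = ["Low", "Medium", "High"]


def filter_by_significance(suggestions, minimum="Medium"):
    """Keep suggestions whose significance is at or above the minimum level.

    The default of "Medium" mirrors the default view described above,
    which shows only Medium and High suggestions.
    """
    floor = SIGNIFICANCE_ORDER.index(minimum)
    return [
        s for s in suggestions
        if SIGNIFICANCE_ORDER.index(s["significance"]) >= floor
    ]


# Illustrative data only; real suggestions carry many more fields.
sample = [
    {"title": "cache warm-up noise", "significance": "Low"},
    {"title": "db connection errors", "significance": "High"},
    {"title": "slow upstream responses", "significance": "Medium"},
]

default_view = filter_by_significance(sample)
everything = filter_by_significance(sample, minimum="Low")
print(len(default_view), len(everything))
```

Selecting "Low or greater" in this model corresponds to passing minimum="Low", which returns every suggestion.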

Evaluating Zebrium

The best way to try Zebrium is on a system that is experiencing an actual problem. If there are no real problems, Zebrium will not find anything useful.

As an alternative, you can try Zebrium in an environment where you can simulate a real problem. You can also use this step-by-step guide to set up a demonstration online shopping application and cause a failure by using an open source chaos tool.

Signing Up for a New Account

To sign up for a new account and start sending your logs to Zebrium, watch the five-minute "Getting Started" video.

The video covers how to:

  1. Sign up for a new account by visiting https://www.zebrium.com/ and clicking the blue Get Started Free button.

  2. Install the Kubernetes log collector by using the customized Helm command found on the Welcome page. After you have configured the log collector, Zebrium can begin reviewing your logs.

    You will need to set your Timezone and Service Group (zebrium.deployment) when installing the collector.

What does Zebrium Do with Your Logs?

As logs are received by Zebrium, the AI/ML engine automatically structures and categorizes each type of log event. This allows the AI/ML engine to identify anomalous log events. Many factors are used for anomaly detection, but the two most important are the rareness and the severity of each log line.
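
As a rough illustration of what "structuring and categorizing" log events can mean, the sketch below reduces each log line to an event type by masking its variable parts, and then counts how often each type occurs. This is a deliberately simplistic stand-in, not the technique Zebrium uses; the function name, the placeholder tokens, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter


def event_type(line):
    """Mask variable tokens so lines that differ only in values share a type."""
    # Mask hexadecimal values first so their digits are not split up below.
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    # Mask any remaining runs of digits (IDs, durations, address octets).
    line = re.sub(r"\d+", "<NUM>", line)
    return line


# Illustrative log lines only.
lines = [
    "Connection to 10.0.0.5 timed out after 30 ms",
    "Connection to 10.0.0.9 timed out after 45 ms",
    "Freed buffer at 0x7ffe12ab",
]

# The catalog maps each event type to the number of times it was seen.
catalog = Counter(event_type(line) for line in lines)
for etype, count in catalog.items():
    print(count, etype)
```

With a catalog like this, an event type that has a low count relative to the others is a natural candidate for being treated as rare.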

The AI/ML engine then looks for abnormal clusters of correlated anomalies across all the logs within a Service Group, also known as a failure domain. These clusters usually occur because of an actual problem.

If the AI/ML engine finds one of these clusters, it generates a Suggestion. The suggestion contains a payload that includes the cluster of log lines.

Log data other than the log events contained in alerts is discarded after a few hours.