This chapter provides an overview of how Skylar Automated RCA works and explains how to get started with it.
Before you can start watching for suggestions and reviewing Root Cause reports, you will need to configure a method for gathering log data to send to Skylar Automated RCA. For more information, see Log Collectors and File Uploads.
How Skylar Automated RCA Works
When skilled engineers troubleshoot software, they typically ask the following questions:
- Where are the problems or events occurring? The events could be clusters of errors, warnings, stack traces, or other indicators of bad outcomes.
- Were there unusual events upstream that could help explain these bad outcomes? These might include configuration changes, a new deployment, user actions, and so on.
In modern software, these events are often generated by different micro-services or software components, so you might have to switch between many log streams and then mentally correlate the events across them.
The Skylar AI emulates the workflow of a skilled engineer by performing the following actions:
- Automatically build a catalog of all of the event types generated by the software.
- Track the patterns of each event type in each log stream, such as the logs generated by a specific container, pod, or host.
- Automatically identify unusual and "bad" events.
- Identify unusually correlated clusters of rare and bad events that appear to be due to the same incident. The Skylar AI scores each such collection based on a combination of how rare the underlying events are, and how bad the events are, such as how many warnings or errors are generated.
- "Fingerprint" each cluster of such events as a unique type of issue. The events that rise above a specified threshold can be considered a potential Root Cause report, and they are summarized using Natural Language Processing (NLP) for Machine Learning.
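The steps above can be pictured with a small sketch. It is not the Skylar AI's actual algorithm, which this chapter does not describe. The function names, the severity weights, the event-type names, and the threshold are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter

# Hypothetical severity weights; the real engine's weighting is not documented here.
SEVERITY_WEIGHT = {"INFO": 0.0, "WARNING": 1.0, "ERROR": 2.0}


def rareness(event_type, catalog, total):
    """Rarer event types score higher (negative log of observed frequency)."""
    count = catalog.get(event_type, 0)
    # Add-one smoothing so never-seen event types get a finite, high score.
    return -math.log((count + 1) / (total + 1))


def score_cluster(cluster, catalog, total):
    """Combine how rare and how bad the events in a cluster are."""
    return sum(
        rareness(etype, catalog, total) + SEVERITY_WEIGHT.get(sev, 0.0)
        for etype, sev in cluster
    )


def fingerprint(cluster):
    """Stable identifier for a cluster, independent of event order."""
    key = "|".join(sorted({etype for etype, _ in cluster}))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Catalog of event types seen so far: event type -> occurrence count (made-up data).
catalog = Counter({"conn_ok": 9000, "cache_miss": 900, "db_timeout": 2, "oom_kill": 1})
total = sum(catalog.values())

routine = [("conn_ok", "INFO"), ("cache_miss", "INFO")]
incident = [("db_timeout", "ERROR"), ("oom_kill", "ERROR")]

THRESHOLD = 10.0  # hypothetical cut-off for raising a suggestion
for name, cluster in (("routine", routine), ("incident", incident)):
    s = score_cluster(cluster, catalog, total)
    print(name, round(s, 2), fingerprint(cluster), s > THRESHOLD)
```

In this toy data, the cluster of rare errors scores well above the threshold while the cluster of common informational events does not, which mirrors the idea of scoring on both rareness and badness.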
When the Skylar AI detects one of these "abnormal" clusters, it generates a suggestion, which appears on the Alerts page (the home page) of the Skylar Automated RCA user interface along with the existing alerts.
On the Alerts page, the summary report for a suggestion and an alert contains the following main elements:
- AI-generated title. Displayed at the top of the summary pane, this title is generated using GPT Services that use new Generative AI models. You can enable or disable GPT services for a specific deployment of Skylar Automated RCA by using the GPT Services column on the Deployments page (Settings > Deployments).
- Word Cloud. A set of relevant words chosen by the Skylar AI from the log lines contained in the alert. On the RCA report page, you can click a word in the cloud to highlight that word in the list of logs.
- Significance icon. Since not all suggestions that the Skylar AI generates will relate to problems that actually impact users, the engine attempts to reason over the data and assess whether a problem actually requires attention. Hover over this icon at the top of the list of logs to view the confidence level of the Skylar AI for this suggestion:
  - A red icon means "High" confidence.
  - A yellow icon means "Medium" confidence.
  - A blue icon means "Low" confidence.
- AI Assessment. The Skylar AI's assessment of whether a problem actually requires attention, since not every suggestion relates to a problem that impacts users. Depending on the quality of the data, some suggestions might not include an AI Assessment. The assessment appears in the Skylar Automated RCA user interface as one of the following values:
  - "Your Attention Needed" for content that the Skylar AI believes should be looked into.
  - "No Attention Needed" for content that the Skylar AI assesses as unlikely to require immediate attention.
- Root Cause (RCA) Report Summary. The report contains the actual cluster of anomalous log lines that was identified by the Skylar AI. Up to eight of these log lines are shown in the summary view. You can click anywhere in the summary to view the full Root Cause report.
- Alert Key. One or two log lines, denoted with a key icon, that are used to identify the suggestion if this type of suggestion occurs again. The alert keys make up an alert rule.
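To make the role of alert keys more concrete, here is a minimal sketch of how one or two key event types could act as an alert rule. The `AlertRule` class and the event-type names are hypothetical. The actual matching logic in Skylar Automated RCA is not described in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule: fires when all of its key event types are present."""
    name: str
    keys: frozenset  # one or two event types that identify the issue

    def matches(self, cluster_event_types):
        return self.keys.issubset(cluster_event_types)


rule = AlertRule(name="database outage", keys=frozenset({"db_timeout", "conn_refused"}))

# Event types found in two later clusters (made-up data).
recurrence = {"db_timeout", "conn_refused", "retry_exhausted"}
unrelated = {"db_timeout", "cache_miss"}

print(rule.matches(recurrence))  # True: both key event types are present
print(rule.matches(unrelated))   # False: one key event type is missing
```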
You can click anywhere in the summary report for a suggestion or an alert to view a more detailed Root Cause Report page for that suggestion or alert. For more information, see Root Cause Reports.
Suggestions are generated when the Skylar AI finds a cluster of correlated anomalies in your logs that resembles a problem. However, this does not mean that all suggestions relate to actual important problems. This is especially true during the first few days of using Skylar Automated RCA, as the Skylar AI learns the normal patterns in your logs.
When you start getting suggestions on the Alerts page, you can review the word clouds and event logs that display in the summary views for the Root Cause reports for the suggestions. As a best practice, identify a specific time frame when a possible problem occurred, and then start looking at the reports that have the most interesting or relevant information related to the possible root cause of the problem.
You can choose to "accept" or "reject" a suggestion. For more information, see Assessing Suggestions.
You can also decide on the action to take if the same alert type occurs again, such as sending a notification to Slack, email, or another notification channel. For more information, see Notification Channels.
If you currently use SL1 from ScienceLogic, you can configure an integration that lets you view Skylar Automated RCA suggestions in SL1 dashboards as well as on the SL1 Events page. For more information, see ScienceLogic Integrations.
Consuming Root Cause Reports
You can consume the Skylar AI-generated Root Cause reports in one of the following ways:
- Recommended. Connect Skylar Automated RCA to a ScienceLogic integration, such as the SL1 Enhanced (12.x) integration on the Integrations & Collectors page (Settings > Integrations & Collectors). After you configure the integration, data from the Root Cause reports from Skylar Automated RCA will display in SL1 and you can correlate the reports with any spikes or alerts occurring at the same time. For more information, see ScienceLogic Integrations.
  For more details, or to take action on one of these reports, click the URL to go directly to the detailed Root Cause report in the Skylar Automated RCA user interface. For more information, see Working with Suggestions and Root Cause Reports.
- Connect Skylar Automated RCA to your incident management tool, such as Opsgenie, PagerDuty, or Slack. After you configure the incident management tool, an RCA report is automatically created and sent back to the incident management tool.
- Evaluate the feed of auto-detected incident Root Cause reports on the Alerts page in the Skylar Automated RCA user interface, particularly around times where you know things went wrong. You can also force the Skylar AI to do a deep scan and create a report on demand by clicking the button on the Settings menu. Any Root Cause reports generated by that scan include a lightning bolt icon and the text "Result of RC Scan". For more information, see Working with Suggestions and Root Cause Reports.
Customizing Your Skylar Automated RCA Results
You can customize your Skylar Automated RCA results on the Alerts page (the Skylar Automated RCA home page) by selecting one or more filters at the top of the page. You can use these filters to manage the number of suggestions and alerts that display on the Alerts page.
For example, by default only the First occurrence of each incident type is visible on dashboards and alert channels, unless you create filters that specify that the incident deserves an alert or suggestion.
You can also filter the list of suggestions by Significance: the Skylar AI assigns a value of Low, Medium, or High to each alert. Significance is a cumulative score for each suggestion, based on the rareness and "badness" (log severity level) of the log events within that alert. If you have a high Significance setting, the Root Cause events will have to be more rare and more "bad" to show up in the list of suggestions.
By default, only suggestions with a significance of Medium and High are shown on the Alerts page, so if you want to also see alerts with Low significance, select Low or greater for this filter. You can edit the default Significance setting by editing the Root Cause Significance setting on the Report Settings page (Settings > Root Cause Settings).
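The effect of the Significance filter can be sketched as a simple threshold over the three levels. The `Significance` enum, the `visible` helper, and the sample suggestions below are hypothetical and only illustrate the "Medium and High by default, Low or greater on request" behavior described above.

```python
from enum import IntEnum


class Significance(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def visible(suggestions, minimum=Significance.MEDIUM):
    """Keep suggestions whose significance is at or above the minimum."""
    return [s for s in suggestions if s["significance"] >= minimum]


# Made-up suggestions for illustration.
suggestions = [
    {"title": "Pod restart loop", "significance": Significance.HIGH},
    {"title": "Slow query burst", "significance": Significance.MEDIUM},
    {"title": "Single retry warning", "significance": Significance.LOW},
]

default_view = visible(suggestions)                  # Medium and High only
everything = visible(suggestions, Significance.LOW)  # "Low or greater"

print([s["title"] for s in default_view])
print(len(everything))
```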
These filters appear on the Selected Filter dialog, which displays when you click the button on the Alerts page.
There is also a Search bar at the top of the Alerts page that you can use for text or regular expression (regex) searches, and a toggle for Core Events and All Events.
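As an illustration of the difference between a text search and a regex search, the sketch below uses Python's `re` module on made-up log lines. The regex dialect that the Search bar accepts is not specified in this chapter, so treat the pattern syntax as an assumption.

```python
import re

# Made-up log lines for illustration.
log_lines = [
    "2024-05-01 12:00:01 ERROR db: connection timeout after 30s",
    "2024-05-01 12:00:02 INFO api: request served in 12ms",
    "2024-05-01 12:00:03 WARNING db: connection retry 3 of 5",
]

# Plain text search: a simple substring match.
text_hits = [line for line in log_lines if "connection" in line]

# Regex search: ERROR or WARNING lines that mention the db component.
pattern = re.compile(r"\b(ERROR|WARNING)\b.*\bdb\b")
regex_hits = [line for line in log_lines if pattern.search(line)]

for line in regex_hits:
    print(line)
```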
For more information about filtering, see Using the Filters on the Alerts Page in Skylar Automated RCA.
What Does Skylar Automated RCA Do with Your Logs?
As logs are received by Skylar Automated RCA, the Skylar AI automatically structures and categorizes each type of log event. This allows the Skylar AI to identify anomalous log events. Many factors are used for anomaly detection, but the two most important are the rareness and the severity of each log line.
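One common way to picture "structuring and categorizing" log events is to mask the variable parts of each line so that lines of the same kind collapse into one event type, and then count how often each type occurs. The sketch below does this with a single regular expression. It is a simplified stand-in, not the Skylar AI's method, and the `event_type` helper and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches standalone numbers, including dotted forms such as IP addresses.
NUMBER = re.compile(r"\b\d+(?:\.\d+)*\b")


def event_type(line):
    """Reduce a log line to a template by masking its numeric fields."""
    return NUMBER.sub("<NUM>", line)


# Made-up log lines for illustration.
lines = [
    "user 1001 logged in from 10.0.0.5",
    "user 1002 logged in from 10.0.0.9",
    "user 1003 logged in from 10.0.0.7",
    "disk /dev/sda1 failed after 3 retries",
]

# Count occurrences of each event type; types seen only once are "rare" here.
counts = Counter(event_type(line) for line in lines)
rare = [template for template, count in counts.items() if count == 1]

for template, count in counts.items():
    print(count, template)
```

A real system would also weigh severity alongside rareness, as the paragraph above notes; this sketch covers only the rareness side.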
The Skylar AI then looks for abnormal clusters of correlated anomalies across all the logs within a Service Group, also known as a failure domain. These clusters usually occur because of an actual problem.
If the Skylar AI finds one of these clusters, it generates a Suggestion. The suggestion contains a payload that includes the cluster of log lines.
All log data other than the log events contained in alerts is discarded after a few hours.