SL1 Agent Troubleshooting

This section contains troubleshooting processes that you can use to address issues with an SL1 agent.

On a Windows agent, you can run the following diagnostic command to generate a *.tar.gz file that contains useful information for troubleshooting issues: ...\SiloAgent.exe --diag

On a Linux agent, use --diag for the diagnostic option.

Use the following menu options to navigate the SL1 user interface:

To view a pop-out list of menu options, click the menu icon.
To view a page containing all of the menu options, click the Advanced menu icon.

To troubleshoot potential issues with SL1 agents, perform the following procedures, in the following order:

Was the Agent Download Successful?

If the Windows agent download failed with a "500 Internal Server Error", restart the uwsgi service on the SL1 system from which you are downloading the agent.

From an administrator command prompt in Windows, run the following command on the SL1 system:

systemctl restart uwsgi

Is the Windows Installation or Upgrade Failing?

See Installing a Windows Agent for more information.

Is the Agent Process Running?

When running the agent as a dedicated user on Windows, the "Log on as a service" user right is required for the service to start.

As a first step, always locate the following logs from the Message Collectors when troubleshooting:

/var/log/streamer_prime/streamer_prime.log
/var/log/uwsgi/streamer_prime.log

To determine if the agent process is running:

Check the Windows Task Manager or run the "tasklist" command (Windows) or the "top" command (Linux).
Look for SiloAgent.exe (Windows) or scilogd (Linux):

Windows: If SiloAgent.exe is not running, check the "Application" event log for events with source=SiloAgent.
Linux: If scilogd is not running, check /var/log/messages or /var/log/syslog for relevant log messages.

If you are using the SL1 Extended Architecture, determine if the agent was deleted from the Agents tab instead of being uninstalled.

If the agent was deleted in the SL1 user interface, SL1 shuts down the agent instead of uninstalling the agent. You should re-install the agent that you deleted in SL1.

To re-install the agent that was shut down:

Uninstall the agent that you shut down.
Delete that agent's configuration from one of the following locations:

Windows: C:\Program Files\ScienceLogic\SiloAgent\conf\scilog.conf
Linux: /etc/scilog/scilog.conf

Install a new agent.

If the agent was not deleted, then the issue could be with the agent. You should generate diagnostics information to share with your ScienceLogic contact.

To generate diagnostics information for an agent:

From an administrator command prompt, run one of the following commands:

Windows: "C:\Program Files\ScienceLogic\SiloAgent\bin\SiloAgent.exe" -diag
Linux: /usr/bin/scilogd --diag

Share the contents of the newly created diagnostic file in the current directory with your ScienceLogic contact. Depending on your operating system, the file name is:

Windows: scilog-<current date>.diag.tgz
Linux: sl-diag.tar.gz
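
The process check at the start of this procedure can also be run against a saved process listing. The following is a minimal Python sketch and is not part of the SL1 tooling: the function name agent_process_listed and the "windows" and "linux" keys are hypothetical, and the sketch only inspects text that you have already captured from "tasklist" (Windows) or "ps -e" (Linux).

```python
# Process names taken from the procedure above.
AGENT_PROCESS = {"windows": "SiloAgent.exe", "linux": "scilogd"}


def agent_process_listed(listing: str, platform: str) -> bool:
    """Return True if the agent process name appears in a captured listing.

    listing is the text output of `tasklist` (Windows) or `ps -e` (Linux).
    platform is "windows" or "linux" (illustrative keys, not SL1 terms).
    """
    name = AGENT_PROCESS[platform].lower()
    for line in listing.splitlines():
        for token in line.split():
            # Compare whole tokens, ignoring any leading directory path,
            # so that "/usr/bin/scilogd" matches but "scilogd-old" does not.
            if token.rsplit("/", 1)[-1].lower() == name:
                return True
    return False
```

If the function returns False, continue with the event log or syslog checks described in the procedure.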

Is the Agent Configuration File Valid?

Check the agent configuration file in one of the following locations:

Windows: C:\Program Files\ScienceLogic\SiloAgent\conf\scilog.conf
Linux: /etc/scilog/scilog.conf

Check the configuration item CollectorID:

If there is no CollectorID tag, then the agent has not been able to reach the stream or message collector.
The CollectorID should be a GUID similar to "4179b06ef502129c3023a0f8d58f3c37".

Check the configuration item URLfront, which is where the agent attempts to get the configuration file:

Determine if you can ping the URLfront.
If you are using the SL1 Distributed Architecture, URLfront should be the URL of the Message Collector.
If you are using the SL1 Extended Architecture, URLfront should be the URL of the Streamer service.
If the URL for URLfront is not correct, then re-install the agent.
If the URL for URLfront is correct, then determine if you can ping the host portion of URLfront. If you cannot ping the host, then there are likely firewall or NAT issues in the customer network.
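
The CollectorID and URLfront checks above can be sketched in Python. This is an illustration under stated assumptions, not a ScienceLogic utility: it assumes that scilog.conf holds one whitespace-separated "Key Value" pair per line (inferred from the "RequireWebCert true" line mentioned later in this section), and the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# A 32-character hexadecimal string, like the sample CollectorID in the text.
GUID_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def read_conf_items(conf_text: str) -> dict:
    """Parse "Key Value" lines into a dict (assumed layout, not confirmed)."""
    items = {}
    for line in conf_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip()
    return items


def collector_id_status(conf_text: str) -> str:
    """Classify the CollectorID entry as "missing", "malformed", or "ok"."""
    value = read_conf_items(conf_text).get("CollectorID")
    if value is None:
        return "missing"
    return "ok" if GUID_PATTERN.fullmatch(value) else "malformed"


def url_front_host(conf_text: str) -> str:
    """Return the host portion of URLfront, or "" if it cannot be found."""
    url_front = read_conf_items(conf_text).get("URLfront", "")
    # urlsplit only recognizes the host when "//" is present, so add it
    # for bare values such as "mc.example.com".
    candidate = url_front if "//" in url_front else "//" + url_front
    return urlsplit(candidate).hostname or ""
```

The host returned by url_front_host is the value to use for the ping test described above.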

Has SL1 Discovery Completed?

To check if SL1 discovery finished:

Check the agent record in the SL1 Distributed Architecture by using SSH to communicate with the Message Collector and running the following commands:

redis-cli -p 6380

keys *

hgetall agent_<GUID>

where <GUID> is the specific value from scilog.conf.
Look for a field "did". The following line is the EM7-device-id.
If the EM7-device-id is not present, then discovery has not completed.
Alternatively, you can look on the Devices page for the device to compare the shown device-id to the value in the agent record.
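
The "did" lookup above can be sketched in Python for output that you have saved from the hgetall command. This is a minimal illustration: the function names are hypothetical, and the sketch assumes the usual redis-cli layout in which each field name is followed by its value on the next line, with or without the numbered prefix and quotation marks.

```python
import re

# Strips an optional redis-cli index such as `1) ` and surrounding quotes.
_ENTRY = re.compile(r'^\s*(?:\d+\)\s*)?"?(.*?)"?\s*$')


def parse_hgetall(output: str) -> dict:
    """Turn alternating field/value lines from `hgetall` into a dict."""
    cleaned = [
        _ENTRY.match(line).group(1)
        for line in output.splitlines()
        if line.strip()
    ]
    # Field names sit at even positions; each value is on the following line.
    return dict(zip(cleaned[0::2], cleaned[1::2]))


def discovery_completed(output: str) -> bool:
    """Treat discovery as complete when a non-empty "did" value is present."""
    return bool(parse_hgetall(output).get("did"))
```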

Is the Agent Able to Upload Data?

Check the Agent Upload Directory

Check the upload directory for the agent for directories and files in one of the following locations:

Windows: C:\Program Files\ScienceLogic\SiloAgent\data
Linux: /opt/scilog/data

If there are many items, then the agent is unable to upload.

If the number of items is decreasing, the agent might have had an issue. The agent is slowly catching up, but this situation indicates that a previous issue existed.

If the number of items continues to increase overall, check the configuration item URL:

The URL is the location where the agent attempts to upload files.
Determine if the host portion of the URL is reachable. If the host portion is reachable, the name of the oldest item indicates the approximate time of the issue.

NOTE: To prevent consuming the disk with backed-up data, the agent limits the size and count of items in the upload directory.
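
To watch whether the backlog in the upload directory is growing or shrinking, you can count the items at intervals. The following Python sketch is one way to do that. The function name is hypothetical, and the sketch uses each item's modification time as a stand-in for the time encoded in the item name, which is an assumption.

```python
import os
from datetime import datetime, timezone


def upload_backlog(directory: str):
    """Return (item_count, oldest_name, oldest_mtime_utc) for a directory.

    oldest_name and oldest_mtime_utc are None when the directory is empty.
    """
    with os.scandir(directory) as entries:
        items = [(entry.stat().st_mtime, entry.name) for entry in entries]
    if not items:
        return 0, None, None
    # The smallest modification time identifies the oldest item.
    mtime, name = min(items)
    return len(items), name, datetime.fromtimestamp(mtime, tz=timezone.utc)
```

Calling upload_backlog on the data directory every few minutes and comparing the counts shows whether the number of items is increasing overall or whether the agent is catching up.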

A procedural note regarding backed-up data:

For a new installation, the agent reaches out to the Streamer for a configuration file. If the agent can't reach the Streamer, the agent goes into a slow poll mode, waiting for a good configuration file. In the meantime, the agent does nothing else (it does not generate data or log files). As a result, even though it looks like there is no backup of data files, in reality, there are no data files.

After the agent receives a valid configuration file from the Streamer service:

After a restart, the agent reaches out to the Streamer service for a new configuration file.
If the agent can't reach the Streamer, the agent will still generate data files, because it has a valid configuration file from a previous run. In this situation, you will see data files backing up if the Streamer is unreachable.

NOTE: You can set RequireWebCert to true in the configuration file that the Streamer sends to the agent so that the validation process is successful.

In summary, if the agent has a valid configuration, it generates data files. If those data files are not backing up, the Streamer can be reached.

Run the Agent in Debug Mode (Linux)

You might need to preface the following commands with sudo if you are not in root-privileged mode.

Stop the agent daemon by running the following command:

service scilogd stop

Start the agent from the command line:

scilogd -d 2>&1 | tee /tmp/scilogd.log

Let the agent run for about five minutes.
Press Ctrl+C and examine the output file.
Restart the agent by running the following command:

service scilogd start

Is SL1 Receiving Agent Data?

If you are using the SL1 Distributed Architecture:

SSH into the Message Collector and run the following command:

sudo tail -n 100 /var/log/uwsgi/streamer_prime_uwsgi.log

Look for lines containing the hostname of the monitored device, such as the following:

10.2.16.40 - - [19/Apr/2018:17:04:55 +0000] "POST /SaveData.py/save_data HTTP/1.1" 200 59 "-" "Windows SiloAgent : aym-win2012r2-0"

If there are no matching lines, then the Streamer service is not getting data from that agent.
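
The search for the monitored device's hostname in the uwsgi log can be sketched in Python. The pattern below is modeled only on the sample line shown above, so treat it as an assumption about the log layout. The function name is hypothetical.

```python
import re

# Modeled on the sample access-log line; other layouts will not match.
ACCESS_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def uploads_from_host(log_text: str, hostname: str) -> list:
    """Return (time, status) pairs for lines whose user agent names hostname."""
    matches = []
    for line in log_text.splitlines():
        parsed = ACCESS_LINE.match(line.strip())
        # The hostname appears in the final quoted field of the sample line.
        if parsed and hostname.lower() in parsed["agent"].lower():
            matches.append((parsed["time"], int(parsed["status"])))
    return matches
```

An empty result for a hostname corresponds to the "no matching lines" case described above.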

If you are using the SL1 Extended Architecture:

SSH into the Management Node and view the logs.
Look for lines containing the hostname of the monitored device.
If there are no matching lines, then the Streamer service is not getting data from that agent.

Is the Agent Not Reporting Vital Data and Metrics?

If the agent is not reporting Windows vitals or PowerShell Dynamic Applications are not being collected:

Ensure the agent is running as a dedicated user.
Ensure the user is enrolled in the "Performance Monitor Users" group to be able to collect the desired metrics.

Can SL1 Process Agent Data?

Check the Message Collector log files or SL1 Streamer log files:

If you are using the SL1 Distributed Architecture, locate the following files from the SL1 Message Collector and provide the files to your ScienceLogic contact:

/var/log/uwsgi/streamer_prime_uwsgi.log
/var/log/streamer_prime/streamer_prime.log

If you are using the SL1 Extended Architecture, check all logs for ERR or Except:

for p in $(kubectl get pods --no-headers | cut -f 1 -d ' '); do echo $p; kubectl logs $p | grep -E "(ERR|Except)"; done

Search for "ERROR", "Exception", and "HARAKIRI".
Contact your ScienceLogic contact with any error messages you find in the log files.

If you do not find any error messages, then the issue is most likely with the Dynamic Application that runs on the Data Collector.
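
The search for "ERROR", "Exception", and "HARAKIRI" described above can be sketched in Python for log text that you have already collected. The function name is hypothetical, and the sketch only counts matching lines. It does not interpret them.

```python
import re
from collections import Counter

# The three markers named in the procedure above.
MARKERS = re.compile(r"ERROR|Exception|HARAKIRI")


def count_markers(log_text: str) -> Counter:
    """Count the lines on which each marker appears."""
    counts = Counter()
    for line in log_text.splitlines():
        # A set ensures each marker is counted at most once per line.
        for marker in set(MARKERS.findall(line)):
            counts[marker] += 1
    return counts
```

An empty Counter corresponds to the case in which no error messages are found.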

Is the Number of Processes Inconsistent with Other Applications?

On Linux, many outputs from the ps command list the kernel threads (the processes listed in square brackets). Because the agent is not in the kernel, it will not list kernel threads.
Be aware that the agent reports processes that are running as well as processes that started and may have stopped, while top or ps commands show processes that exist when they are executed.
Check the agent configuration. Due to back-end space limitations, many configuration combinations can limit what data the agent sends. A combination of parameters to get all processes includes the following:

NIPD True. The agent library cannot get into all processes at times, often on install. Non-intercepted process discovery reports processes that are not intercepted via the library.
SLPAggregation. This parameter takes short-lived processes that exist for less than 80 seconds and rolls information about the processes into the information for their parents. As a result, the short-lived processes will not be seen.

Is the Agent Unable to Connect to the Streamer Endpoint?

Some SL1 agents on Windows 2012 R2 might have issues connecting with the streamer endpoint if there is not a match with the default TLS ciphers.

To address this issue, add the ssl-ciphers configuration to the existing Kubernetes ConfigMap nginx-configuration.

To add ssl-ciphers in on-premises environments:

SSH to the Manager Node and enter the Ansible shell.
Run the following commands to install vim:

sudo apt update

sudo apt install vim -y

Run the following command:

kubectl edit configmaps -n ingress-nginx nginx-nginx-ingress-controller

In the nginx-configuration file, add the missing ciphers after apiVersion:

Line breaks were added to the following list of ciphers to make the code more readable.

data:
  ssl-ciphers: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:
ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:
ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:
DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:
DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:
ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:
ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:
DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256:AES128-GCM-SHA256:
AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:
DES-CBC3-SHA
  ssl-protocols: TLSv1 TLSv1.1 TLSv1.2 TLSv1.3
  use-proxy-protocol: "false"

Exit with :wq. In a few minutes, the agent will be automatically recognized. If the agent is not recognized, try restarting the agent.

To add ssl-ciphers in AWS environments:

SSH to the bastion host or jump host (Bastion/JH) and enter the Ansible shell.
Run the following command:

kubectl edit configmaps -n ingress-nginx nginx-nginx-ingress-controller

In the nginx-configuration file, add the missing ciphers after apiVersion:

data:
  ssl-ciphers: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:
ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:
ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:
DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384:
DHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256:
ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA384:
ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES256-SHA:
DHE-RSA-AES128-SHA256:DHE-RSA-AES256-SHA256:AES128-GCM-SHA256:
AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:
DES-CBC3-SHA
  ssl-protocols: TLSv1 TLSv1.1 TLSv1.2 TLSv1.3
  use-proxy-protocol: "true"

Exit with :wq. In a few minutes, the agent will be automatically recognized. If the agent is not recognized, try restarting the agent.
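
Because line breaks were added to the cipher lists above for readability, you will likely need to join the list back into a single line before pasting it as the ssl-ciphers value. The following Python sketch shows one way to do that. The function name is hypothetical.

```python
def join_cipher_list(wrapped: str) -> str:
    """Join a line-wrapped, colon-separated cipher list into one line."""
    pieces = []
    for line in wrapped.splitlines():
        # Wrapped lines end with ":", so drop the empty pieces after splitting.
        pieces.extend(part.strip() for part in line.split(":") if part.strip())
    return ":".join(pieces)
```

The result contains the same ciphers in the same order, separated by single colons, with no line breaks.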

Validating Agent TLS Connections to the SL1 Streamer Service

As of SL1 12.1.1, customers who use the SL1 Gen 3 agent with on-premises Extended Architecture systems have the option to turn on TLS certificate validation when deploying the Streamer service. This provides additional security to confirm that the agent's connection to SL1 is valid.

To enable this TLS validation, the extended cluster must be configured with a valid TLS certificate and the "requireTls" setting in the Streamer helm chart must be set to "true" when deploying the Streamer, such as in the following command:

helm upgrade --version 1.2.13 streamer sl1/sl1-streamer -f output-files/streamer-values.yml --set requireTls=true

If you update this setting, the Streamer pods will restart and the agent will download the new configuration upon its next communication with the cluster.

This TLS validation is currently disabled by default for on-premises Extended Architecture deployments.

If you want to enable this feature, first ensure that the Streamer endpoint that is provided via the URLFRONT installation option is configured with a valid TLS certificate. If the agent is configured to validate the TLS connection but the cluster it is trying to communicate with does not have a valid TLS certificate, the agent will be unable to communicate with that cluster.

If this occurs, you can disable the validation by updating the Streamer deployment to disable the "requireTls" setting, updating the scilog.conf file to remove or alter the "RequireWebCert true" line, and then restarting the agent.

This feature can be enabled on SaaS SL1 deployments by submitting a Service Request case to the SRE queue at the ScienceLogic Support site at https://support.sciencelogic.com/s/, or by contacting your ScienceLogic customer service manager.
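
Before you enable "requireTls" on an on-premises system, you can check whether the Streamer endpoint presents a certificate that passes standard validation. The following Python sketch uses the standard library ssl module. The function names and the default port 443 are assumptions, and the check uses the trust store of the machine that runs the script, which might differ from the trust store the agent uses.

```python
import socket
import ssl


def validating_context() -> ssl.SSLContext:
    """Build a client context that verifies the certificate chain and host name."""
    # create_default_context() enables CERT_REQUIRED and host name checking.
    return ssl.create_default_context()


def endpoint_certificate_valid(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    """Return True if a fully validated TLS handshake with host succeeds.

    Returns False for certificate failures and also when the host cannot
    be reached at all, so confirm reachability separately.
    """
    context = validating_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError and certificate verification errors are OSError subclasses.
        return False
```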

Troubleshooting Examples

Example /var/log/streamer_prime/streamer_prime.log for successful discovery

2019-01-04T17:07:42.355291+00:00 amateen-em7 journal: SCILO_SP:6954|logger:log_info:132|INFO|Agent config request received with init flag set to True. Generated Temp AID: 2ae22a6b4489457abb14373cd3816076. Request: <WSGIRequest: GET '/api/collector/config/?collector_key=aEf34$aq3TGSDdf&tenant_id=0&host_name=aym-win2012r2-1&init=&os=windows&collector_id=0'>

2019-01-04T17:07:42.619082+00:00 amateen-em7 journal: SCILO_SP:6954|logger:log_info:132|INFO|Calling Agent version with: <QueryDict: {'collector_id': ['2ae22a6b4489457abb14373cd3816076'], 'type': ['windows_64'], 'tenant_id': ['0'], 'host_name': ['aym-win2012r2-1'], 'collector_key': ['aEf34$aq3TGSDdf'], 'version': ['115']}>

2019-01-04T17:07:43.028457+00:00 amateen-em7 journal: SCILO_SP:16717|logger:log_warning:127|WARNING|System file received from aym-win2012r2-1

2019-01-04T17:07:43.032897+00:00 amateen-em7 journal: SCILO_SP:16717|logger:log_info:132|INFO|Making discovery call for agent 2ae22a6b4489457abb14373cd3816076

2019-01-04T17:07:43.746284+00:00 amateen-em7 journal: SCILO_SP:30843|logger:log_warning:127|WARNING|System file received from aym-win2012r2-1

2019-01-04T17:07:43.750553+00:00 amateen-em7 journal: SCILO_SP:30843|logger:log_warning:127|WARNING|Discovery call within time threshold, sleeping.

2019-01-04T17:07:46.676827+00:00 amateen-em7 journal: SCILO_SP:16717|logger:log_info:132|INFO|Update agent request did: 4, oid: 0, ip: 10.7.6.119, agent id: 2ae22a6b4489457abb14373cd3816076

2019-01-04T17:07:46.677114+00:00 amateen-em7 journal: SCILO_SP:16717|logger:log_info:132|INFO|Discovery complete, getting new agent device id. Downloading new config for device: 2ae22a6b4489457abb14373cd3816076.

2019-01-04T17:07:47.420509+00:00 amateen-em7 journal: SCILO_SP:6954|logger:log_warning:127|WARNING|Agent id: 2ae22a6b4489457abb14373cd3816076 being given a return code: 2
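
To pull the assigned device ID out of a log like the one above, you can match the "Update agent request" line. The following Python sketch is modeled only on that sample line, so the pattern is an assumption about the log layout. The function name is hypothetical.

```python
import re

# Modeled on the "Update agent request" line in the sample log above.
UPDATE_LINE = re.compile(
    r"Update agent request did: (?P<did>\d+), oid: (?P<oid>\d+), "
    r"ip: (?P<ip>\S+), agent id: (?P<agent_id>[0-9a-f]{32})"
)


def device_id_for_agent(log_text: str, agent_id: str):
    """Return the last device ID logged for agent_id, or None if not found."""
    device_id = None
    for found in UPDATE_LINE.finditer(log_text):
        if found["agent_id"] == agent_id:
            device_id = int(found["did"])
    return device_id
```

A return value of None means that no matching "Update agent request" line was found for that agent in the supplied text.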

Example /var/log/uwsgi/streamer.log for successful discovery in the SL1 Distributed Architecture

10.234.196.19 - - [29/Sep/2017:14:04:52 +0000] "POST /api/update_agent/agent/ HTTP/1.1" 200 89 "-" "python-requests/2.26.0"

Save incoming data for a specific device ID in the SL1 Distributed Architecture

PYTHONPATH=/opt/em7/lib/python3:/opt/streamer_prime python3 /opt/streamer_prime/streamer_prime/manage.py agent_save_xml -d <agent guid> -e true

Save incoming data for a specific device ID in the SL1 Extended Architecture

kubectl exec -it $(kubectl get pods -l app=streamer -o jsonpath="{.items[0].metadata.name}") -- python -m streamer agent_save_data --host_id <host id> --enable true

You can find the host ID in the ADS URL (such as https://<sl1_address>/ads/servers/13/system). You can locate the files in the /tmp/save_agent_data directory.

Additional Troubleshooting Situations and Best Practices

The following situations might occur while configuring or working with agents:

Situation	Cause / Resolution
Two device records exist for the same device on the Devices page in SL1.	This situation occurs when SL1 first discovered this device with SNMP, and then the agent was installed and started polling that device. This duplication of records also occurs if the agent was installed first, and then you ran an SNMP discovery. To address this issue, you can merge the device records using the classic user interface. For more information, see Merging Devices.
The SNMP device record has IPv4, but the agent device record has IPv6.	The agent reports all network interfaces to the message collector. The Message Collector uses the first "bound" IP address reported by the agent. To address this issue, you can manually edit the agent device record in the "classic" user interface and update the IP address.
If you uninstall an agent and then run a different installation executable file, you still see the same organization ID for the agent record.	After you uninstall the agent, the scilog.conf file is left on the server in case the agent is reinstalled. SL1 can reuse the same device record and maintain historical performance data for that agent. To address this issue, delete the scilog.conf file after you run the uninstallation. If you install this agent again, SL1 assigns a new organization ID to the agent and creates a new device record.

Agent Communication with SL1

This section covers the various codes that are sent to the SL1 agent when additional actions are needed.

Return Codes

The core method of communication between the SL1 agent and SL1 involves the data files the agent uploads to the Streamer. When the agent uploads a file to the Streamer, a number of conditions are checked and the return code given to the agent determines what action should be taken, if any.

Return Code	Meaning
200	No action needed.
409	The data file uploaded by the agent has a problem. The Streamer did not accept it and it should be deleted from the host.
503	Data uploads are occurring more quickly than normal. The agent should refrain from sending its next file until a specified amount of time has passed.
505	The agent should stop streaming.
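
The return codes in the table above can be expressed as a simple lookup. The following Python sketch is illustrative only: the action labels and the function name are hypothetical and are not names used by the agent.

```python
# Codes come from the table above; the action labels are illustrative.
RETURN_CODE_ACTIONS = {
    200: "no_action",
    409: "delete_rejected_file",
    503: "wait_before_next_upload",
    505: "stop_streaming",
}


def action_for(return_code: int) -> str:
    """Map a Streamer return code to an action label, or "unknown"."""
    return RETURN_CODE_ACTIONS.get(return_code, "unknown")
```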

Sub-codes

The Streamer communicates another set of actions in a JSON response included with the return code to the SL1 agent.

JSON Response Field	Type	Meaning
send_system_file	boolean	Send a system config file as the next upload from this agent.
get_agent_config	boolean	Retrieve an updated agent config from the Streamer.
get_agent_update	boolean	Download a new version of the agent to update.
get_pd_config	boolean	Retrieve an updated polled data config.
get_log_config	boolean	Retrieve an updated log config.
purge_uploads	boolean	Delete any backed-up data and only upload new files.
set_data_backoff	integer	Do not upload a data file until the set number of seconds have passed.
set_log_backoff	integer	Do not upload a log file until the set number of seconds have passed.
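
The sub-codes in the table above can be read from the JSON response as follows. This Python sketch assumes that the fields appear at the top level of a JSON object, which the table does not confirm. The function name is hypothetical.

```python
import json

# Field names and types come from the table above.
BOOLEAN_FIELDS = (
    "send_system_file",
    "get_agent_config",
    "get_agent_update",
    "get_pd_config",
    "get_log_config",
    "purge_uploads",
)
INTEGER_FIELDS = ("set_data_backoff", "set_log_backoff")


def requested_actions(response_body: str) -> dict:
    """Return only the actions that the response actually requests."""
    data = json.loads(response_body)
    actions = {name: True for name in BOOLEAN_FIELDS if data.get(name) is True}
    for name in INTEGER_FIELDS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool) and value > 0:
            actions[name] = value
    return actions
```

For example, a response body that sets send_system_file to true and set_data_backoff to 60 yields a dictionary with only those two entries.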