Troubleshooting Business Services

This section covers some of the issues you might encounter while working with services and policies on the Business Services page, and how to resolve those issues.

Use the following menu options to navigate the SL1 user interface:

  • To view a pop-out list of menu options, click the menu icon.
  • To view a page containing all of the menu options, click the Advanced menu icon.

Business Services Have Empty Values

All Business Services Have Empty Values

If all of your business services show empty values, as shown in the figure below, ensure that you have given your admin processes adequate time to complete. To populate these values, both the "Business Services: Service Management Engine" and "Business Services: Service Topology Engine" processes must run once. With default settings, it could take up to 30 minutes to see your first results.

In SL1 platform version 10.1.0 and later, services are not evaluated if they have an empty filter. For more information on using a filter, see the section on Creating a Service. The figure below shows the results of using a filter to find all devices for which the IP address contains "10".

Some Business Services Have Empty Values

If only some of your business services are missing values, troubleshoot using the following procedure.

To troubleshoot a business service that is missing values:

  1. Ensure that your business service has some constituents:
    1. Go to the Business Services page.
    2. Click on the service that is missing values.
    3. Click on the Devices or Services tab and review the devices or services listed. Modify your query as needed.
  2. Ensure that your service filter results in some constituents. Click on the Status Policy tab and modify your service filter as needed.
    • Rule filters select a subset of the devices or services defined by the service filter. For example, if a device service filter results in five devices, the rule filter will select some subset of those five devices. A rule filter might exclude all devices or services for a given business service, resulting in no metric values.
    • Example. The following rule filter will select only the devices that have a state of "4", meaning "Critical". If no devices have a state of "4", the resulting list of devices will be empty; therefore, it will be impossible to get device metric values back. In this example, we are counting the devices, so the count will be zero. Values are produced based on the condition table. If the metric had been a normal device metric, such as latency, the result would have been null, because gathering the average latency on zero devices results in null.
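
The difference between a count and an average over an empty device list can be illustrated with a small shell sketch. The function names and the one-value-per-line input are assumptions made for illustration only; this is not how SL1 computes metrics internally.

```shell
# Illustration only: aggregate one numeric value per line read from stdin.

# Counting zero devices still yields a number (0).
count_values() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Averaging zero devices has no defined result, so print "null".
average_values() {
  awk 'NF { sum += $1; n++ } END { if (n == 0) print "null"; else print sum / n }'
}
```

With no input lines, count_values prints 0 while average_values prints null, which mirrors why a count-based rule still produces a value but a latency-style metric does not.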

Services Missing Up-to-Date Values

If you have disabled the default administrator account ("em7admin"), you will need to identify another account to use for running business services and run a database query to change the account used for internal communication in SL1.

To change the internal account:

  1. Go to the User Accounts page (Registry > Accounts > User Accounts).
  2. Identify the account you want SL1 to use for internal communication. In this example, notice that the "em7admin" account is suspended. We want to use the account with ID "5" instead.

  3. Update the internal account.
    1. Go to the Database Tool page (System > Tools > DB Tool).

    2. Select "master" as the database.

    3. Enter the following SQL Query and then click Go:

      UPDATE
         master.system_settings_core
      SET
         api_internal_account = <account_id>

      Where <account_id> is the ID number of the account you want to use. In the example, we use "5".
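
Because the statement has no WHERE clause, it can help to validate the account ID before pasting the query into the DB Tool. The following is a minimal sketch, assuming you prepare the statement in a shell first; the function name build_internal_account_sql is hypothetical and not part of SL1.

```shell
# Print the UPDATE statement for a given account ID.
# Refuses empty input and anything that is not made up of digits only.
build_internal_account_sql() {
  account_id=$1
  case $account_id in
    ''|*[!0-9]*)
      echo "account ID must contain digits only" >&2
      return 1
      ;;
  esac
  printf 'UPDATE master.system_settings_core SET api_internal_account = %s;\n' "$account_id"
}
```

For the example above, build_internal_account_sql 5 prints the statement with the ID 5 filled in.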

Some Services Fail to Generate Health, Availability, or Risk Values

In this situation, some services in SL1 do not generate any values for Health, Availability, or Risk. For example, a dash might appear instead of a value in one of the widgets on the Service Investigator page:

Image of the Service Status table

To address this issue, review the following settings and suggestions:

Step 1: Turn up the log level to trace:

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.
  2. Log in as user em7admin.
  3. Open the file /usr/local/silo/nextui/nextui.env with vi or another text editor:

sudo vi /usr/local/silo/nextui/nextui.env

  4. Change the log setting to the following: NEXT_UI_LOG_LEVEL=all:trace
  5. Restart SL1 and GraphQL with the following command:

sudo systemctl restart nextui

  6. Tail the log with the following command:

sudo journalctl -u nextui -f
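
If you prefer not to edit the file interactively, the log-level change can be scripted. The following is a minimal sketch, assuming nextui.env uses plain KEY=VALUE lines; the function name set_env_value is hypothetical.

```shell
# Set KEY=VALUE in an env-style file: replace the line if KEY already
# exists, otherwise append a new line at the end.
set_env_value() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || return 1
  if grep -q "^${key}=" "$file"; then
    # Rewrite only the matching line and copy every other line unchanged.
    awk -v k="$key" -v v="$value" \
      'index($0, k "=") == 1 { print k "=" v; next } { print }' "$file" > "$tmp"
  else
    cat "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
  fi
  # Write back in place so the original ownership and mode are kept.
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, set_env_value /usr/local/silo/nextui/nextui.env NEXT_UI_LOG_LEVEL all:trace would apply the setting from this step. The file is edited with sudo above, so the function would need to run in a shell with equivalent privileges, and the nextui restart is still required.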

Step 2: Ensure that your service policy is valid:

  1. In SL1, navigate to your service on the Business Services page.
  2. Review the policy used by that service for any validation errors, as in the following example:

Image of a Service Policy error

  3. Address any errors in the service policy.

Step 3: Ensure that your service contains at least one service or device:

  1. Navigate to the Business Services page.
  2. Navigate to the Devices or Services tab for the service or services that are not displaying values.

Image of an empty Device Policy

  3. Ensure that at least one device or service appears in the Preview section. If not, create a new filter to search for devices or services.

Step 4: Ensure that your service policy rules contain at least one service or device:

  1. Rule filters select a subset of the devices or services defined by the service filter. If a device service filter results in five devices, the rule filter selects some subset of those five devices. You might create rule filters that exclude all devices or services in the service, resulting in no metric values.
  2. The following rule filter only selects the devices with a state of 4, or Critical. If no devices have a state of 4, the resulting list of devices for that filter will be empty, and you cannot get any device metric values:

Image of a Device Policy

  3. In this case, we are counting devices, so the count is zero and produces a value based on the condition table.
  4. If the metric had been a normal device metric like latency, the result would have been "null," because getting the average latency from zero devices results in null.

Step 5: Generate audit data by running onDemandProcessing with the GraphiQL interface:

  1. In a browser, type the URL or IP address for the new user interface, and then type /gql at the end of the URL or IP address. The GraphiQL interface appears.

  2. On the left side of the GraphiQL editor, type the following query:

    query onDemand {
      harProviderOnDemandProcessing(ids: []) {
        results { serviceId timestamp health availability risk }
        auditHistory { serviceId ruleSetId ruleId timestamp sequence message }
      }
    }

  3. Click the Execute Query (Play) button to send the query to the GraphQL server and get the results.

  4. Review the resulting audit information on the right side of the GraphiQL editor.
  5. If you know the service ID you are looking for, search for it by clicking inside the right pane and pressing Ctrl+F. The GraphiQL interface highlights the services that match the ID you searched for:

Image of the highlighted results of a GraphiQL Query

  6. Scroll down to see the audit information for this service (look for the highlighted information):

Image of highlighted audit results

  7. After you run onDemandProcessing with the GraphiQL interface and set the log level on the server to all:trace, trace-level log messages appear in the terminal where you ran sudo journalctl -u nextui -f.

  8. Review the log messages for errors and warnings:

Image of Log Errors and Warnings
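
Trace-level output is verbose, so it can help to filter it. The following is a minimal sketch, assuming that relevant nextui log lines contain the words "error" or "warn" in some letter case; the function name filter_log_problems is hypothetical.

```shell
# Read log text on stdin and print only the lines that mention
# errors or warnings (case-insensitive). Prints nothing if none match.
filter_log_problems() {
  grep -iE 'error|warn' || true
}
```

For example: sudo journalctl -u nextui --since "10 minutes ago" | filter_log_problems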

All Services Fail to Generate Health, Availability, and Risk Values

In this situation, all of your services in SL1 fail to generate any values for Health, Availability, or Risk.

To address this issue, review the following settings and suggestions.

Step 1: Confirm that the Business Services processes exist:

  1. Go to the Process Manager page (System > Settings > Admin Processes) and start typing "Business" in the Process Name filter.
  2. Ensure that the "Business Services: Service Management Engine" and "Business Services: Service Topology Engine" processes appear and are enabled.

Step 2: Follow the steps in "Step 5: Generate audit data by running onDemandProcessing with the GraphiQL interface," above. If the process times out, the processing took more than two minutes to complete, and no computed results are stored.

Step 3: Look for logs from the Python process:

  1. The Python process calls the onDemandProcessing GraphQL query. If Python is having trouble connecting to GraphQL, it could be an authentication problem or some other code-related issue.
  2. Look in /var/log/em7 for newly created logs. Run ls -lrt in that directory to see whether any new error logs were created with "business" in the file name.
  3. Also check the silo.log for messages related to the business_service_management process by using the following command:

grep service /var/log/em7/silo.log
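
The check for newly created log files can also be scripted. The following is a minimal sketch; the function name recent_business_logs and the 60-minute default window are assumptions for illustration.

```shell
# List files directly inside LOG_DIR whose names contain "business"
# (any letter case) and that were modified within the last MINUTES
# minutes (default: 60).
recent_business_logs() {
  log_dir=$1
  minutes=${2:-60}
  find "$log_dir" -maxdepth 1 -type f -iname '*business*' -mmin "-$minutes"
}
```

For example, recent_business_logs /var/log/em7 30 lists matching files changed in the last half hour. Reading that directory might require sudo.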

Device Services Fail to Load After an Upgrade

If you have upgraded your appliance from an earlier version of SL1 and your device services are not loading on Business Service pages, you might have outdated device class filters in your user preferences.

To clear the older device class filters:

  1. Open the GraphiQL interface on your appliance by appending "/gql" to your appliance name (or IP address) in a browser window.

  2. Enter the following in the left side of the GQL interface and execute the mutation by pressing the Execute Query button:

    mutation deletePreference{
      deletePreference(preferenceId: "services.detaildevices.table.sort.order") {
        id
        preferenceValue
      }
    }

502, 503, or 504 Errors: Health, Availability, and Risk Values are All the Same or are Inaccurate

Step 1: Check the number of services you have configured. If you are seeing 503 errors in the nextui log or within the SL1 user interface, use the following procedure to check the number of services you have configured on your ScienceLogic SL1 system.

To determine the number of services you have:

  1. Open the GraphiQL editor on your system:

http://<SL1_IP_address>/gql

  2. Enter the following query:

    query harProviders {
      harProviders {
        pageInfo {
          matchCount
        }
      }
    }

  3. Click [Execute Query] (Play) to see the number of services. In this example, the results show that 10 services are configured.

    {
      "data": {
        "harProviders": {
          "pageInfo": {
            "matchCount": 10
          }
        }
      }
    }
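
If you save the query response to a file, the service count can be pulled out and compared against a limit. The following is a minimal sketch, assuming the response is standard JSON with a quoted "matchCount" key; the function names are hypothetical, and the 2500 default mirrors the default maximum service count cited in the Advanced Troubleshooting section of this manual.

```shell
# Print the first matchCount value found in a saved GraphQL JSON response.
extract_match_count() {
  sed -n 's/.*"matchCount"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Succeed (exit status 0) when the count is greater than LIMIT.
# LIMIT defaults to 2500.
exceeds_service_limit() {
  count=$(extract_match_count "$1")
  limit=${2:-2500}
  [ -n "$count" ] && [ "$count" -gt "$limit" ]
}
```

For example, extract_match_count response.json prints 10 for the sample result above, where response.json is a placeholder file name.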

Step 2: (503 Errors) Confirm that the nginx configuration has an appropriate limit set. In some cases, the limit_conn value might be set to 20. Increase the value to 200.

To address this issue:

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.

  2. Log in as user em7admin.

  3. Confirm that the nginx config file has the limit_conn perip value set to 200 instead of 20:

    sudo vi /etc/nginx/conf.d/em7_limits.conf

  4. If needed, update the line to say:

    limit_conn perip 200;

  5. Run the following command:

    sudo systemctl restart nginx
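
To check the current value without opening an editor, you could read it from the file directly. The following is a minimal sketch, assuming the directive appears on its own line as shown above; the function name get_limit_conn_perip is hypothetical.

```shell
# Print the value of the first "limit_conn perip" directive in an
# nginx configuration file, without the trailing semicolon.
get_limit_conn_perip() {
  awk '$1 == "limit_conn" && $2 == "perip" { sub(/;.*$/, "", $3); print $3; exit }' "$1"
}
```

For example, get_limit_conn_perip /etc/nginx/conf.d/em7_limits.conf prints 20 on a system that still has the lower limit.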

Step 3: (503 Errors) Check to see if the nginx server is rate-limiting you.

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.
  2. Log in as user em7admin.
  3. Enter the following command:

    sudo grep excess /var/log/em7/ngx.log

  4. If you see any results from the above command, then the nginx proxy is rate-limiting requests to your database. In that case, you should increase the rate limit to 100 requests per second. Edit the em7_limits.conf file:

    sudo vi /etc/nginx/conf.d/em7_limits.conf

  5. Change the rate in the following line to 100r/s from the default of 5r/s:

    limit_req_zone $binary_remote_addr zone=addr_req:10m rate=100r/s;

  6. Restart NGINX so that the new limit takes effect:

    sudo systemctl restart nginx
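
To see how often rate limiting occurs rather than only whether it occurs, you could count the matching log lines. The following is a minimal sketch; the function name count_rate_limited is hypothetical.

```shell
# Print how many lines in a log file contain "excess".
# Prints 0 if there are no matches or the file cannot be read.
count_rate_limited() {
  n=$(grep -c 'excess' "$1" 2>/dev/null) || n=0
  printf '%s\n' "$n"
}
```

For example, count_rate_limited /var/log/em7/ngx.log, run with privileges that allow reading the log. Comparing the count before and after a change gives a rough indication of whether the new rate is sufficient.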

Step 4: (502 Errors) Check node memory usage.

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.

  2. Log in as user em7admin.
  3. Enter the following command:

    sudo journalctl -u nextui|grep "JavaScript heap out of memory"

  4. If you see any results from the above command, the Node.js process is running out of memory. In that case, increase the memory limit allocated to it. Edit the nextui.service file to increase the limit to 4096 or 8192 MB, depending on how much memory you have at your disposal:

    ExecStart=/usr/bin/node --max-old-space-size=4096 /usr/local/silo/nextui/index.js

  5. Restart your SL1 system.

    sudo systemctl restart nextui

Step 5: (504 Errors) Check Nginx timeout.

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.
  2. Log in as user em7admin.
  3. Edit the nextui.fragment file:

    sudo vi /opt/em7/share/config/nginx.d/nextui.fragment

  4. Change the proxy_read_timeout under "location /gql" to 900 as follows:

    proxy_read_timeout 900;

  5. Restart your SL1 system.

    sudo systemctl restart nextui
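
To confirm the timeout that is currently set for the /gql location, you could read it from the fragment file. The following is a minimal sketch, assuming the "location /gql" block contains no nested braces; the function name get_gql_read_timeout is hypothetical.

```shell
# Print the proxy_read_timeout value set inside the "location /gql"
# block of an nginx configuration file.
# Sketch only: assumes the block contains no nested braces.
get_gql_read_timeout() {
  awk '
    /location[ \t]+\/gql/ { in_block = 1; next }
    in_block && $1 == "proxy_read_timeout" { sub(/;.*$/, "", $2); print $2; exit }
    in_block && /}/ { in_block = 0 }
  ' "$1"
}
```

For example, get_gql_read_timeout /opt/em7/share/config/nginx.d/nextui.fragment prints 900 after the change described in this step.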

Advanced Troubleshooting

Customization for Environments with More Than 2,500 Services

If you have an environment that has more than 2,500 services, you might need to modify some default settings in SL1, as described in this section.

Update Settings and Increase Default Values

To update your settings and increase your default values:

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.
  2. Log in as user em7admin.
  3. Increase the maximum service count variable. (The default value is 2500.)
    1. At the command line, enter

      sudo vi /opt/em7/nextui/nextui.env

    2. Add the following line (or modify it, if it already exists), where "new_service_limit" is the maximum number of services you need in your environment:

      BUSINESS_SERVICES_MAX_SERVICES=new_service_limit

  4. Increase the Node.js memory limit.
    1. At the command line, enter

      sudo vi /etc/systemd/system/multi-user.target.wants/nextui.service

    2. Change the ExecStart line to the following, where the size is either 4096 or 8192, depending on how much memory you have available:

      ExecStart=/usr/bin/node --max-old-space-size=size /usr/local/silo/nextui/index.js

  5. Restart nextui by entering the following at the command line:

    sudo systemctl restart nextui
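
The memory-limit change can also be scripted. The following is a minimal sketch, assuming the ExecStart line already contains a --max-old-space-size flag (the function only rewrites an existing flag and does not add one); the function name set_max_old_space_size is hypothetical.

```shell
# Rewrite --max-old-space-size=N in a systemd unit file.
# SIZE must be 4096 or 8192, the two values suggested in this manual.
set_max_old_space_size() {
  unit_file=$1 size=$2
  case $size in
    4096|8192) ;;
    *) echo "size must be 4096 or 8192" >&2; return 1 ;;
  esac
  tmp=$(mktemp) || return 1
  sed "s/--max-old-space-size=[0-9][0-9]*/--max-old-space-size=$size/" "$unit_file" > "$tmp" &&
    cat "$tmp" > "$unit_file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Note that systemd generally needs sudo systemctl daemon-reload after a unit file is edited before a restart picks up the new ExecStart line.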

Modify NGINX Rate Limit

If you have a large number of services in your environment and are seeing 503 errors, you might need to increase your NGINX rate limit.

To increase your NGINX rate limit:

  1. Either go to the console of the SL1 server or use SSH to access the SL1 appliance.
  2. Log in as user em7admin.
  3. At the command line, enter the following:

    sudo grep excess /var/log/em7/ngx.log

  4. If you see any results from this command, consider increasing your NGINX rate limit to 100 requests per second.

    1. Enter the following at the command line to edit the limit file:

      sudo vi /etc/nginx/conf.d/em7_limits.conf

    2. Change the value in the following line to "300r/s" from the default value of "100r/s":

      limit_req_zone $binary_remote_addr zone=addr_req:10m rate=value;

      If this value is set too high, the database will begin seeing errors for too many connections.

  5. Restart NGINX.

    sudo systemctl restart nginx
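
To record the current rate before and after the change, you could read it from the limits file. The following is a minimal sketch, assuming the limit_req_zone directive is on a single line as shown above; the function name get_req_rate is hypothetical.

```shell
# Print the rate (for example "100r/s") from the first limit_req_zone
# directive in an nginx configuration file.
get_req_rate() {
  sed -n 's/^[[:space:]]*limit_req_zone .*rate=\([0-9][0-9]*r\/[sm]\).*/\1/p' "$1" | head -n 1
}
```

For example, get_req_rate /etc/nginx/conf.d/em7_limits.conf. Running sudo nginx -t before the restart is a common way to catch syntax errors in the edited file.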