
Appendix A: Integration Service for Multi-tenant Environments


This appendix describes the best practices and troubleshooting solutions for deploying the Integration Service in a multi-tenant environment that supports multiple customers in a highly available fashion. This document also covers how to perform an upgrade of the Integration Service with minimal downtime.


Quick Start Checklist for Deployment

  1. Deploy and cluster the initial High Availability stack. Label these nodes as "core".
  2. For a desired customer, create the Integration Service configuration for the customer systems. This configuration information includes the SL1 IP address, the ServiceNow user and domain, and other related information.
  3. Deploy and cluster the worker node or nodes for the customer.
  4. Label the worker node or nodes specifically for the customer.
  5. Update the docker-compose.yml file on a core node.
  6. Schedule the required integrations for this customer.
  7. Modify the Run Book Automations in SL1 to trigger the integrations to run on the queue for this customer.

Deployment

The following sections describe how to deploy the Integration Service in a multi-tenant environment. After the initial High Availability (HA) core services are deployed, the multi-tenant environment differs in the deployment and placement of workers and use of custom queues.

Core Service Nodes

For a multi-tenant deployment, ScienceLogic recommends that you dedicate at least three nodes to the core Integration Service services. These core Integration Service services are shared by all workers and customers. As a result, it is essential that these services are clustered to handle failovers.

Because these core services are critical, ScienceLogic recommends that you initially allocate a fairly large amount of resources to these services. Allocating more resources than necessary to these nodes allows you to further scale workers in the future. If these nodes become overly taxed, you can add another node dedicated to the core services in the cluster.

These core services nodes are dedicated to the following services:

It is critical to monitor these core service nodes, and to always make sure these nodes have enough resources for new customers and workers as they are on-boarded.

To ensure proper failover and persistence of volumes and cluster information, the core services must be pinned to each of the nodes. For more information, see Configuring Core Service Nodes, below.

Requirements

3 nodes (or more for additional failover support) with 6 CPUs and 56 GB memory each

Configuring Core Service Nodes

Critical Elements to Monitor on Core Nodes

Worker Service Nodes

Separate from the core services are the worker services. These worker services are intended to be deployed on nodes separate from the core services and from other workers, and they provide processing only for specified dedicated queues. Separating the VMs or nodes where worker services are deployed ensures that one customer's workload, no matter how heavy it gets, will not negatively affect the core services or other customer workloads.

Requirements

The resources allocated to the worker nodes depend on the worker sizing chosen: the more resources provided to a worker, the faster its throughput. The table below provides a brief sizing guideline. Note that even if you exceed the number of event syncs per minute, events are queued up, so the sizing does not have to be exact.

Event Sync Throughput Node Sizing

CPU | Memory | Worker count | Time to sync a queue full of 10,000 events | Events synced per second
----|--------|--------------|--------------------------------------------|-------------------------
2   | 16 GB  | 6            | 90 minutes                                 | 1.3
4   | 32 GB  | 12           | 46 minutes                                 | 3.6
8   | 54 GB  | 25           | 16.5 minutes                               | 10.1
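As a rough sanity check, the drain time in the table can be derived directly from the throughput column. For example, using the 3.6 events-per-second figure from the 4-CPU row:

```shell
# estimate how long 10,000 queued events take to drain at 3.6 events/sec
awk 'BEGIN { printf "%.0f minutes\n", 10000 / 3.6 / 60 }'
# prints "46 minutes", matching the 4-CPU row above
```

The same arithmetic can be applied to your own measured throughput to estimate drain times for larger or smaller queue backlogs.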

Test Environment and Scenario

  1. Each Event Sync consists of Integration Service workers reading from the pre-populated queue of 10,000 events. The sync interprets and transforms each event, and then POSTs the new event as a correlated ServiceNow incident into ServiceNow. The process then queries ServiceNow for the new sys_id generated for the incident, transforms it, and POSTs it back to SL1 as an external ticket to complete the process.

Configuring the Worker Node

Initial Worker Node Deployment Settings

It is required that there is always at least one worker instance listening on the default queue for proper functionality. The default worker can run on any node.

Worker Failover Considerations and Additional Sizing

When deploying a new worker, especially if it is going to be a custom queue dedicated worker, it is wise to consider deploying an extra worker listening on the same queues. If you have only a single worker node listening to a dedicated customer queue, processing for that queue can stop completely if that single node fails.

For this reason, ScienceLogic recommends that for each customer dedicated worker you deploy, you deploy a second one as well. This way there are two nodes listening to the customer dedicated queue, and if one node fails, the other node will continue processing from the queue with no interruptions.

When deciding on worker sizing, it's important to take this into consideration. For example, if you have a customer that requires a four-CPU node for optimal throughput, an option would be to deploy two nodes with two CPUs, so that there is failover if one node fails.

Knowing When More Resources are Necessary for a Worker

Monitoring the memory, CPU, and pending integrations in the queue can indicate whether more resources are needed for the worker. Generally, when queue times start to build up and tickets are not synced over in an acceptable time frame, more workers are required for task processing.

Although more workers will process more tasks, they will be unable to do so if the memory or CPU required by the additional workers is not present. When adding workers, watch the memory and CPU utilization: as long as utilization is under 75%, it should be okay to add another worker. If utilization is consistently over 80%, add more resources to the system before adding additional workers.
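The 75%/80% guidance above reduces to a simple threshold check. In this sketch, UTIL is a stand-in for whatever utilization percentage your monitoring reports:

```shell
# decide whether it is safe to add a worker based on current utilization percent
# UTIL=72 is an assumed example value; in practice, pull it from your monitoring
UTIL=72
if [ "$UTIL" -lt 75 ]; then
  echo "under 75%: okay to add another worker"
else
  echo "over threshold: add resources before adding workers"
fi
# prints "under 75%: okay to add another worker"
```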

Keeping a Worker Node on Standby for Excess Load Distribution

Even if you have multiple workers dedicated to a single customer, there are still scenarios in which a particular customer queue spikes in load and you want an immediate increase in throughput to handle it. In this scenario, you do not have time to deploy a new Integration Service node and configure it to distribute the load; you need increased throughput immediately.

This situation can be handled by keeping a node on standby. This node has the same Integration Service RPM version installed, and sits idle in the stack (or is turned off completely). When a spike happens and you need more resources to distribute the load, apply the label corresponding to the customer whose queues spiked to the standby node. After setting the label on the standby node, scale up the worker count for that particular customer. Now, with the standby node labeled for that customer's work, additional worker instances will be distributed to and started on the standby node.

When the spike has subsided, you can return the node to standby by reversing the process: decrease the worker count to its earlier value, and then remove the customer-specific label from the node.
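As a sketch, the standby workflow above comes down to a few commands run from a Swarm manager node. The node ID, customer label, stack-prefixed service name, and replica counts here are assumptions; substitute your own values:

```shell
# label the standby node for the customer whose queue spiked
docker node update --label-add customer=acme <standby-node-id>

# scale up that customer's steprunner service so new workers land on the standby node
docker service scale iservices_steprunner-acme=20

# when the spike subsides, reverse the process
docker service scale iservices_steprunner-acme=15
docker node update --label-rm customer <standby-node-id>
```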

Critical Elements to Monitor in a Steprunner

Advanced RabbitMQ Administration and Maintenance

This section describes how multi-tenant deployments can use separate virtual hosts and users for each tenant.

Using an External RabbitMQ Instance

In certain scenarios, you might not want to use the default RabbitMQ queue that is prepackaged with the Integration Service. For example, you might already have a production RabbitMQ cluster that you want to connect to the Integration Service. You can do this by defining a new virtual host in RabbitMQ and then configuring the broker URL of the contentapi, steprunner, and scheduler services to point to the new virtual host.

Any use of an external RabbitMQ server will not be officially supported by ScienceLogic if there are issues in the external RabbitMQ instance.

Setting a User other than Guest for Queue Connections

By default, RabbitMQ contains the default credentials guest/guest. The Integration Service uses these credentials by default when communicating with RabbitMQ in the swarm cluster. All communication is encrypted and secured within the overlay Docker network.

To add another user, or to change the user that the Integration Service uses when communicating with the queues:

  1. Create a new user in RabbitMQ that has full permissions to a virtual host. For more information, see the RabbitMQ documentation.
  2. Update the broker_url environment variable with the new credentials in the docker-compose file and then re-deploy.
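For reference, creating such a user with rabbitmqctl from inside the RabbitMQ container might look like the following. The user, password, and virtual host names are placeholders:

```shell
# run these inside the rabbitmq container (docker exec <rabbitmq-container> ...)
rabbitmqctl add_user isuser <password>
rabbitmqctl add_vhost customer-vhost
# grant full configure/write/read permissions on the virtual host
rabbitmqctl set_permissions -p customer-vhost isuser ".*" ".*" ".*"
```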

Configuring the Broker (Queue) URL

When using an external RabbitMQ system, or if you are using credentials other than guest/guest to authenticate, you need to update the broker_url environment variable in the contentapi, steprunner, and scheduler services. You can do this by modifying the environment section of the services in docker-compose and changing broker_url. The following line is an example:

broker_url: 'pyamqp://username:password@rabbitmq-hostname/v-host'
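In docker-compose terms, that override is just an environment entry on each of the three services. A minimal sketch for one service (hostname, credentials, and virtual host are placeholders; repeat for contentapi and scheduler):

```yaml
steprunner:
  environment:
    broker_url: 'pyamqp://username:password@rabbitmq-hostname/v-host'
```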

Onboarding a Customer

When a new SL1 system is onboarded into the Integration Service, its integrations execute on the default queue by default. In large multi-tenant environments, ScienceLogic recommends separate queues for each customer. If desired, each customer can also have specific queues.

Create the Configuration

The first step in onboarding a new customer is to create a configuration with variables that satisfy all integrations. The values should be specific to the new customer you are onboarding (such as the SL1 IP address, username, and password).

See the example configuration for a template you can fill out for each customer.

Because integrations might update their variable names from EM7 to SL1 in the future, ScienceLogic recommends covering variables with both the em7_ and sl1_ prefixes. The example configuration contains this information.

Label the Worker Node Specific to the Customer

For example, if you want a worker node to be dedicated to a customer called "acme", you could create a node label called "customer" and set its value to "acme". Setting this label now makes it easier to cluster in additional workers and distribute load dynamically in the future.

Creating a Node Label

This topic outlines creating a label for a node. Labels provide the ability to deploy a service to specific nodes (determined by labels) and to categorize the nodes for the work they will be performing. Take the following actions to set a node label:

# get the list of nodes available in this cluster (must run from a manager node)

docker node ls

 

# example of adding a label to a docker swarm node

docker node update --label-add customer=acme <node id>

 

Placing a Service on a Labeled Node

After you create a node label, refer to the example below for updating your docker-compose-override.yml file and ensuring the desired services deploy to the matching labeled nodes:

# example of placing workers on a specific labeled node
steprunner-acme:
  ...
  deploy:
    placement:
      constraints:
        - node.labels.customer == acme
    resources:
      limits:
        memory: 1.5G
    replicas: 15
  ...

Dedicating Queues Per Integration or Customer

Dedicating a separate queue for each customer is beneficial in that work and events created from one system will not affect or slow down work created from another system, provided the multi-tenant system has enough resources allocated. In the example below, we create two new queues in addition to the default queue, and allocate workers to them. Both of these worker services use separate queues as described below, but run on the same labeled worker node.

Example Queues to Deploy:

Add Workers for the New Queues

First, define additional workers in your stack that are responsible for handling the new queues. All modifications are made in docker-compose-override.yml:

  1. Copy an existing steprunner service definition.
  2. Change the steprunner service name to something unique for the stack.
  3. Modify the "replicas" value to specify how many workers should be listening to this queue.
  4. Add a new environment variable labeled "user_queues". This environment variable tells the worker which queues to listen to.
  5. To ensure that these workers can be easily identified for the queue to which they are assigned, modify the hostname setting.
  6. After the changes have been made, run /opt/iservices/scripts/compose-override.sh to validate that the syntax is correct.
  7. When you are ready to deploy, re-run docker stack deploy with the new compose file.

After these changes have been made, your docker-compose entries for the new steprunners should look similar to the following:

steprunner-acme-catchup:
  image: sciencelogic/is-worker:latest
  depends_on:
    - couchbase
    - rabbitmq
    - redis
  hostname: "acme-catchup-{{.Task.ID}}"
  deploy:
    placement:
      constraints:
        - node.labels.customer == acme
    resources:
      limits:
        memory: 2G
    replicas: 3
  environment:
    user_queues: 'acmequeue-catchup'
  ..

steprunner-acme:
  image: sciencelogic/is-worker:latest
  depends_on:
    - couchbase
    - rabbitmq
    - redis
  hostname: "acmequeue-{{.Task.ID}}"
  deploy:
    placement:
      constraints:
        - node.labels.customer == acme
    resources:
      limits:
        memory: 2G
    replicas: 15
  environment:
    user_queues: 'acmequeue'
  ..

 

Once deployed via docker stack deploy, you should see the new workers in Flower, as in the following image:

You can verify the queues being listened to by looking at the "broker" section of Flower, or by clicking into a worker and clicking the Queues tab:

Create Integration Schedules and Automation Settings to Utilize Separate Queues

After the workers have been configured with specific queue assignments, schedule your integrations to run on those queues, and configure Run Book Automations (RBAs) to place the integrations on those queues.

Scheduling an Integration with a Specific Queue and Configuration

To run an integration on a specific queue using a configuration for a particular system, you can use the "params" override available in the scheduler. Below is an example of a scheduled integration that utilizes the acmequeue-catchup queue:
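A schedule entry of this shape, taken from the full schedule example later in this appendix, looks like:

```json
{
  "application_id": "cisco_correlation_queue_manager",
  "entry_id": "acme catchup events",
  "params": {
    "configuration": "acme-scale-config",
    "queue": "acmequeue-catchup"
  },
  "schedule": {
    "schedule_info": {
      "run_every": 300
    },
    "schedule_type": "frequency"
  }
}
```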

In the example above, cisco_correlation_queue_manager is scheduled to run every 300 seconds, using the acme configuration, and runs on the acmequeue-catchup queue. You can have any number of scheduled integration runs per integration. To add additional customers, you would add a new schedule entry with a differing configuration and queue for each.

Configuring Automations to Utilize a Specific Queue and Configuration

The last step in onboarding is to ensure that integrations for your newly onboarded SL1 system run on the correct queue: update the Run Book Automations in SL1 to provide the configuration and queue to use when the Run Book Automation triggers an event.

Modify the Event-correlation policy with the following changes:

  1. IS4_PASSTHROUGH= {"queue":"acmequeue"}
  2. CONFIG_OVERRIDE= 'acme-scale-config'

Failure Scenarios

This topic covers how the Integration Service handles situations where certain services fail.

Worker Containers

In case of failure, when can the worker containers be expected to restart?

What happens when a worker container fails?

What processing is affected when service is down?

What data can be lost?

API

When can the API be expected to restart?

What happens when it fails?

What processing is affected when service is down?

What data can be lost?

Couchbase

If a core service node running Couchbase fails, the database should continue to work normally and continue processing events, as long as a suitable number of clustered nodes are still up and running. Three core service nodes provide automatic failover handling for one node failure, five core service nodes for two node failures, and so on. See the High Availability section for more information.

If there are enough clustered core nodes still running, the failover will occur with no interruptions, and the failing node can be added back at any time with no interruptions.

NOTE: For optimal performance and data distribution after rejoining a cluster, you can click the Re-balance button from the Couchbase user interface, if needed.

If there are not enough clustered core nodes still running, you will have to fail over the Couchbase Server manually. In this scenario, because automatic failover could not be performed (too few nodes available), there will be disruption in event processing. For more information, see the Manual Failover section.

In case of failure, when can Couchbase be expected to restart?

What happens when it fails?

What processing is affected when service is down?

What data can be lost?

RabbitMQ

RabbitMQ clustered among all core service nodes provides full mirroring to each node. So long as there is at least one node available running RabbitMQ, the queues should exist and be reachable. This means that even a multiple-node failure will have no effect on the RabbitMQ services, and they should continue to operate normally.

In case of failure, when can RabbitMQ be expected to restart?

What happens when RabbitMQ fails?

What processing is affected when service is down?

What data can be lost?

Integration Service User Interface

In case of failure, when can the user interface be expected to restart?

What happens when it fails?

What data can be lost?

Redis

If the Redis service fails, it will automatically be restarted and will be available again in a few minutes. The impact is that task processing in the Integration Service is delayed slightly, as the worker services pause themselves and wait for the Redis service to become available again.

Consistent Redis failures

Consistent failures and restarts in Redis typically indicate that your system has too little memory, or that the Redis service memory limit is set too low or not set at all. By default, Integration Service version 1.8.1 and later ships with a Redis memory limit of 8 GB, which ensures that the Redis service only ever uses 8 GB of memory and ejects entries if it is about to exceed that limit. This limit is typically sufficient, though if you have enough workers running large enough integrations to overfill the memory, you might need to increase the limit.

Before increasing Redis memory limit, be sure that there is suitable memory available to the system.
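If you do need to raise the limit after confirming that host memory is available, a sketch of a compose override might look like the following. This assumes the limit is enforced as a standard compose resource limit on the redis service; verify the actual key in your shipped docker-compose.yml before editing, since the eviction behavior described above may also be governed by a Redis maxmemory setting:

```yaml
redis:
  deploy:
    resources:
      limits:
        memory: 12G   # illustration only; the default described above is 8 GB
```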

Known Issue for Groups of Containers

If you see multiple containers restarting at the same time on the same node, it indicates an over-provisioning of resources on that node. This only occurs on Swarm manager nodes, as the nodes are not only responsible for the services they are running, but also for maintaining the Swarm cluster and communicating with other manager nodes.

If resources become over-provisioned on one of those manager nodes (as they were with the core nodes when we saw the failure), the Swarm manager will not be able to perform its duties and may cause a docker restart on that particular node. This failure is indicated by "context deadline exceeded" and "heartbeat failure" messages in the output of journalctl --no-pager | grep docker | grep err.

This is one of the reasons why Docker recommends running "manager-only" nodes, in which the manager nodes are only responsible for maintaining the Swarm, and not for running other services. If any nodes that run Integration Service services are also Swarm managers, make sure that those nodes are not over-provisioned; otherwise, the containers on that node may restart. For this reason, ScienceLogic recommends monitoring these nodes and placing thresholds at 80% utilization.

To reduce the risk of over-provisioning affecting the Docker Swarm manager, apply resource constraints on the services for nodes that are also Swarm managers, so that Docker operations always have some extra memory or CPU on the host. Alternatively, you can use only drained nodes, which are not running any services, as Swarm managers, and not apply any extra constraints.

For more information about Swarm management, see https://docs.docker.com/engine/swarm/admin_guide/.

Examples and Reference

Example of an Integration Service Configuration Object

[
  {
    "encrypted": false,
    "name": "em7_host",
    "value": "<ip address>"
  },
  {
    "encrypted": false,
    "name": "sl1_host",
    "value": "${config.em7_host}"
  },
  {
    "encrypted": false,
    "name": "sl1_id",
    "value": "${config.em7_id}"
  },
  {
    "encrypted": false,
    "name": "sl1_db_port",
    "value": 7706
  },
  {
    "encrypted": false,
    "name": "snow_host",
    "value": "<arecord>.service-now.com"
  },
  {
    "encrypted": true,
    "name": "em7_password",
    "value": "<password>"
  },
  {
    "encrypted": false,
    "name": "sl1_user",
    "value": "${config.em7_user}"
  },
  {
    "encrypted": false,
    "name": "sl1_password",
    "value": "${config.em7_password}"
  },
  {
    "encrypted": false,
    "name": "sl1_db_user",
    "value": "${config.em7_db_user}"
  },
  {
    "encrypted": false,
    "name": "sl1_db_password",
    "value": "${config.em7_db_password}"
  },
  {
    "encrypted": false,
    "name": "em7_user",
    "value": "<username>"
  },
  {
    "encrypted": false,
    "name": "em7_db_user",
    "value": "root"
  },
  {
    "encrypted": false,
    "name": "em7_db_password",
    "value": "<password>"
  },
  {
    "encrypted": false,
    "name": "snow_user",
    "value": "<username>"
  },
  {
    "encrypted": true,
    "name": "snow_password",
    "value": "<password>"
  },
  {
    "encrypted": false,
    "name": "Domain_Credentials",
    "value": {
      "c9818d2c4a36231201624433851894bb": {
        "password": "3m7Admin!",
        "user": "is4DomainUser2"
      }
    }
  },
  {
    "name": "region",
    "value": "ACMEScaleStack"
  },
  {
    "encrypted": false,
    "name": "em7_id",
    "value": "${config.region}"
  },
  {
    "encrypted": false,
    "name": "generate_report",
    "value": "true"
  }
]

Example of a Schedule Configuration

[
  {
    "application_id": "device_sync_sciencelogic_to_servicenow",
    "entry_id": "dsync every 13 hrs acme",
    "last_run": null,
    "params": {
      "configuration": "acme-scale-config",
      "mappings": {
        "cmbd_ci_ip_router": [
          "Cisco Systems | 12410 GSR",
          "Cisco Systems | AIR-AP1141N",
          "Cisco Systems | AP 1200-IOS",
          "Cisco Systems | Catalyst 5505"
        ],
        "cmdb_ci_esx_resource_pool": [
          "VMware | Resource Pool"
        ],
        "cmdb_ci_esx_server": [
          "VMware | ESXi 5.1 w/HR",
          "VMware | Host Server",
          "VMware | ESX(i) 4.0",
          "VMware | ESX(i) w/HR",
          "VMware | ESX(i) 4.0 w/HR",
          "VMware | ESX(i)",
          "VMware | ESX(i) 4.1 w/HR",
          "VMware | ESXi 5.1 w/HR",
          "VMware | ESXi 5.0 w/HR",
          "VMware | ESX(i) 4.1",
          "VMware | ESXi 5.1",
          "VMware | ESXi 5.0"
        ],
        "cmdb_ci_linux_server": [
          "ScienceLogic, Inc. | EM7 Message Collector",
          "ScienceLogic, Inc. | EM7 Customer Portal",
          "ScienceLogic, Inc. | EM7 All-In-One",
          "ScienceLogic, Inc. | EM7 Integration Server",
          "ScienceLogic, Inc. | EM7 Admin Portal",
          "ScienceLogic, Inc. | EM7 Database",
          "ScienceLogic, Inc. | OEM",
          "ScienceLogic, Inc. | EM7 Data Collector",
          "NET-SNMP | Linux",
          "RHEL | Redhat 5.5",
          "Virtual Device | Content Verification"
        ],
        "cmdb_ci_vcenter": [
          "VMware | vCenter",
          "Virtual Device | Windows Services"
        ],
        "cmdb_ci_vcenter_cluster": [
          "VMware | Cluster"
        ],
        "cmdb_ci_vcenter_datacenter": [
          "VMware | Datacenter"
        ],
        "cmdb_ci_vcenter_datastore": [
          "VMware | Datastore",
          "VMware | Datastore Cluser"
        ],
        "cmdb_ci_vcenter_dv_port_group": [
          "VMware | Distributed Virtual Portgroup"
        ],
        "cmdb_ci_vcenter_dvs": [
          "VMware | Distributed Virtual Switch"
        ],
        "cmdb_ci_vcenter_folder": [
          "VMware | Folder"
        ],
        "cmdb_ci_vcenter_network": [
          "VMware | Network"
        ],
        "cmdb_ci_vmware_instance": [
          "VMware | Virtual Machine"
        ]
      },
      "queue": "acmequeue",
      "region": "ACMEScaleStack"
    },
    "schedule": {
      "schedule_info": {
        "run_every": 47200
      },
      "schedule_type": "frequency"
    },
    "total_runs": 0
  },
  {
    "application_id": "device_sync_sciencelogic_to_servicenow",
    "entry_id": "dsync every 12 hrs on .223",
    "last_run": null,
    "params": {
      "configuration": "em7-host-223",
      "mappings": {
        "cmdb_ci_esx_resource_pool": [
          "VMware | Resource Pool"
        ],
        "cmdb_ci_esx_server": [
          "VMware | ESXi 5.1 w/HR",
          "VMware | Host Server",
          "VMware | ESX(i) 4.0",
          "VMware | ESX(i) w/HR",
          "VMware | ESX(i) 4.0 w/HR",
          "VMware | ESX(i)",
          "VMware | ESX(i) 4.1 w/HR",
          "VMware | ESXi 5.1 w/HR",
          "VMware | ESXi 5.0 w/HR",
          "VMware | ESX(i) 4.1",
          "VMware | ESXi 5.1",
          "VMware | ESXi 5.0"
        ],
        "cmdb_ci_linux_server": [
          "ScienceLogic, Inc. | EM7 Message Collector",
          "ScienceLogic, Inc. | EM7 Customer Portal",
          "ScienceLogic, Inc. | EM7 All-In-One",
          "ScienceLogic, Inc. | EM7 Integration Server",
          "ScienceLogic, Inc. | EM7 Admin Portal",
          "ScienceLogic, Inc. | EM7 Database",
          "ScienceLogic, Inc. | OEM",
          "ScienceLogic, Inc. | EM7 Data Collector",
          "NET-SNMP | Linux",
          "RHEL | Redhat 5.5",
          "Virtual Device | Content Verification"
        ],
        "cmdb_ci_vcenter": [
          "VMware | vCenter",
          "Virtual Device | Windows Services"
        ],
        "cmdb_ci_vcenter_cluster": [
          "VMware | Cluster"
        ],
        "cmdb_ci_vcenter_datacenter": [
          "VMware | Datacenter"
        ],
        "cmdb_ci_vcenter_datastore": [
          "VMware | Datastore",
          "VMware | Datastore Cluser"
        ],
        "cmdb_ci_vcenter_dv_port_group": [
          "VMware | Distributed Virtual Portgroup"
        ],
        "cmdb_ci_vcenter_dvs": [
          "VMware | Distributed Virtual Switch"
        ],
        "cmdb_ci_vcenter_folder": [
          "VMware | Folder"
        ],
        "cmdb_ci_vcenter_network": [
          "VMware | Network"
        ],
        "cmdb_ci_vmware_instance": [
          "VMware | Virtual Machine"
        ]
      }
    },
    "schedule": {
      "schedule_info": {
        "run_every": 43200
      },
      "schedule_type": "frequency"
    },
    "total_runs": 0
  },
  {
    "application_id": "cisco_correlation_queue_manager",
    "entry_id": "acme catchup events",
    "last_run": {
      "href": "/api/v1/tasks/isapp-a20d5e08-a802-4437-92ef-32d643c6b777",
      "start_time": 1544474203
    },
    "params": {
      "configuration": "acme-scale-config",
      "queue": "acmequeue-catchup"
    },
    "schedule": {
      "schedule_info": {
        "run_every": 300
      },
      "schedule_type": "frequency"
    },
    "total_runs": 33
  },
  {
    "application_id": "cisco_incident_state_sync",
    "entry_id": "incident sync every 5 mins on .223",
    "last_run": {
      "href": "/api/v1/tasks/isapp-52b19097-e0bf-450b-948c-487aff33fc3b",
      "start_time": 1544474203
    },
    "params": {
      "configuration": "em7-host-223"
    },
    "schedule": {
      "schedule_info": {
        "run_every": 300
      },
      "schedule_type": "frequency"
    },
    "total_runs": 2815
  },
  {
    "application_id": "cisco_incident_state_sync",
    "entry_id": "incident sync every 5 mins acme",
    "last_run": {
      "href": "/api/v1/tasks/isapp-dde1dba5-2343-4026-8801-35a02e4e57a1",
      "start_time": 1544474202
    },
    "params": {
      "configuration": "acme-scale-config",
      "queue": "acmequeue"
    },
    "schedule": {
      "schedule_info": {
        "run_every": 300
      },
      "schedule_type": "frequency"
    },
    "total_runs": 1587
  },
  {
    "application_id": "cisco_correlation_queue_manager",
    "entry_id": "qmanager .223",
    "last_run": {
      "href": "/api/v1/tasks/isapp-cb7cc2e5-eab1-474a-907a-055f26dbc36d",
      "start_time": 1544474203
    },
    "params": {
      "configuration": "em7-host-223"
    },
    "schedule": {
      "schedule_info": {
        "run_every": 300
      },
      "schedule_type": "frequency"
    },
    "total_runs": 1589
  }
]

 

Test Cases

Load Throughput Test Cases

Event throughput testing with the Integration Service only:

The following test cases can be attempted with any number of dedicated customer queues. The expectation is that each customer queue will be filled with 10,000 events, and then you can time how long it takes to process through all 10,000 events in each queue.

  1. Disable any steprunners.
  2. Trigger 10,000 events through SL1, and let them build up in the Integration Service queue.
  3. After all 10,000 events are waiting in queue, enable the steprunners to begin processing.
  4. Time the throughput of all event processing to get an estimate of how many events per second and per minute the Integration Service will handle.
  5. The results from the ScienceLogic test system are listed in the sizing section of worker nodes.

Event throughput testing with SL1 triggering the Integration Service:

This test is executed in the same manner as the event throughput test described above, but in this scenario you never disable the steprunners, and you let the events process through the Integration Service as they are alerted to by SL1.

  1. Steprunners are running.
  2. Trigger 10,000 events through SL1, and let the steprunners handle the events as they come in.
  3. Time the throughput of all event processing to get an estimate of how many events per second and per minute the Integration Service will handle.

The difference between the timing of this test and the previous test shows how much delay SL1 adds in alerting the Integration Service about an event and, subsequently, syncing it.

Failure Test Cases

  1. Validate that bringing one of the core nodes down does not impact the overall functionality of the Integration Service system. Also, validate that bringing the core node back up rejoins the cluster and the system continues to be operational.
  2. Validate that bringing down a dedicated worker node only affects that specific worker's processing. Also validate that adding a new "standby" node is able to pick up the work where the previous failed worker left off.
  3. Validate that when the Redis service fails on any node, it is redistributed and is functional on another node.
  4. Validate that when an integration application fails, you can view the failure in the Integration Service Timeline.
  5. Validate that you can query for and filter only for failing tasks for a specific customer.

Separated queue test scenarios

  1. Validate that scheduling two runs of the same integration application with differing configurations and queues works as expected.
  2. Validate that each SL1 triggering event correctly sends the appropriate queue and configuration on which the event sync should run.
  3. Validate the behavior of a node left "on standby" waiting for a label to start picking up work. As soon as a label is assigned and workers are scaled, that node should begin processing the designated work.

Backup Considerations

This section covers the items you should back up in your Integration Service system, and how to restore backups.

What to Back Up

When taking backups of the Integration Service environment, collect the following information from the host level of your primary manager node (this is the node from which you control the stack):

Files in /opt/iservices/scripts:

All files in /etc/iservices/

In addition to the above files, make sure you are storing Couchbase dumps somewhere by using the cbbackup command, or the "Integration Service Backup" integration application.
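For example, a Couchbase dump might be taken like the following. The container name, credentials, and backup path are placeholders; adjust them for your environment:

```shell
# run cbbackup from inside the couchbase container to dump the bucket data
docker exec <couchbase-container> cbbackup http://localhost:8091 \
    /opt/couchbase/var/backup -u <username> -p <password>
```

Store the resulting backup directory somewhere off the node, alongside the files listed above.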

Fall Back and Restore to a Disaster Recovery (Passive) System

You should do a data-only restore if:

You should do a full restore if:

Once failed over, be sure to disable the "Integration Service Backup" integration application from being scheduled.

Resiliency Considerations

The RabbitMQ Split-brain Handling Strategy (SL1 Default Set to Autoheal)

If multiple RabbitMQ cluster nodes are lost at once, the cluster might enter a "Network Partition" or "Split-brain" state. In this state, the queues will become paused if there is no auto-handling policy applied. The cluster will remain paused until a user takes manual action. To ensure that the cluster knows how to handle this scenario as the user would want, and not pause waiting for manual intervention, it is essential to set a partition handling policy.

For more information on RabbitMQ Network partition (split-brain) state, how it can occur, and what happens, see: http://www.rabbitmq.com/partitions.html.

By default, ScienceLogic sets the partition policy to autoheal in favor of continued service if any nodes go down. However, depending on the environment, you might wish to change this setting.

For more information about the automatic split-brain handling strategies that RabbitMQ provides, see: http://www.rabbitmq.com/partitions.html#automatic-handling.

autoheal is the default setting set by SL1, and as such, queues should always be available, though if multiple nodes fail, some messages may be lost.

If you are using pause_minority mode and a "split-brain" scenario occurs for RabbitMQ in a single cluster, when the split-brain situation is resolved, new messages that are queued will be mirrored (replicated between all nodes once again).

ScienceLogic Policy Recommendation

ScienceLogic's recommendations for applying changes to the default policy include the following:

Changing the RabbitMQ Default Split-brain Handling Policy

The best way to change the SL1 default split-brain strategy is to make a copy of the RabbitMQ config file from a running rabbit system, add your change, and then mount that config back into the appropriate place to apply your overrides.

  1. Copy the config file from a currently running container:

docker cp <container-id>:/etc/rabbitmq/rabbitmq.conf /destination/on/host

 

  2. Modify the config file: change the cluster_partition_handling value.

 

  3. Update your docker-compose.yml file to mount that file over the config for all rabbitmq nodes:

mount "[/path/to/config]/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf"
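Putting the copy, edit, and mount steps together, a minimal sketch follows. The /tmp path and the pause_minority value are assumptions for illustration; the copy step is simulated here so the edit can be shown end to end:

```shell
# Step 1 on a real node would copy the config out of a running container, e.g.:
#   docker cp <container-id>:/etc/rabbitmq/rabbitmq.conf /tmp/rabbitmq.conf
# Simulated here with the default value so the example is self-contained:
printf 'cluster_partition_handling = autoheal\n' > /tmp/rabbitmq.conf

# Step 2: change the partition handling policy (pause_minority as an example).
sed -i 's/^cluster_partition_handling.*/cluster_partition_handling = pause_minority/' /tmp/rabbitmq.conf
grep cluster_partition_handling /tmp/rabbitmq.conf

# Step 3: in docker-compose.yml, mount the edited file over each rabbitmq
# node's config with a volumes entry such as:
#   volumes:
#     - "/tmp/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf"
```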

 

Using Drained Managers to Maintain Swarm Health

To maintain swarm health, ScienceLogic recommends that you deploy some swarm managers that do not take on any of the application workload. The only purpose of these managers is to maintain the health of the swarm. Separating these workloads ensures that a spike in application activity will not affect the swarm management services.

ScienceLogic recommends that these systems have 2 CPU and 4 GB of memory.

To deploy a drained manager node:

  1. Add your new manager node into the swarm.
  2. Drain it with the following command:

docker node update --availability drain <node>

 

Draining the node ensures that no containers will be deployed to it.

For more information, see https://docs.docker.com/engine/swarm/admin_guide/.

Updating the Integration Service Cluster with Little to No Downtime

There are two potential update workflows for updating the Integration Service cluster. The first workflow involves using a Docker registry that is connectable to swarm nodes on the network. The second workflow requires manually copying the Integration Service RPM or containers to each individual node.

Updating Offline (No Connection to a Docker Registry)

  1. Copy the Integration Service RPM over to all swarm nodes.
  2. Install the RPM on all nodes, but do not run docker stack deploy yet. This RPM installation automatically extracts the latest Integration Service containers, making them available to each node in the cluster.
  3. From the primary manager node, make sure your docker-compose file has been updated, and is now using the appropriate version tag: either latest for the latest version on the system, or 1.x.x.
  4. If all swarm nodes have the RPM installed, the container images should be runnable and the stack should update itself. If the RPM installation was missed on any node, that node might not have the required images, and as a result, services might not deploy to it.

Updating Online (All Nodes Have a Connection to a Docker Registry)

  1. Install the Integration Service RPM only onto the master node.
  2. Make sure the RPM doesn't contain any host-level changes, such as Docker daemon configuration updates. If there are host-level updates, you might want to make those updates on the other nodes in the cluster.
  3. Populate your Docker registry with the latest Integration Service images.
  4. From the primary manager node, make sure your docker-compose file has been updated, and is now using the appropriate version tag: either latest for the latest version on the system, or 1.x.x.
  5. Docker stack deploy the services. Because all nodes have access to the same Docker registry, which has the designated images, all nodes will download the images automatically and update with the latest versions as defined by the docker-compose file.
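Before running the stack deploy, it can help to confirm that every image line in the compose file carries the tag you expect. A small sketch follows; the compose content below is simulated so the check is self-contained, and on a real system you would point the check at your actual docker-compose.yml:

```shell
# Write a sample compose file so the check can be demonstrated end to end.
cat > /tmp/docker-compose.yml <<'EOF'
services:
  contentapi:
    image: repository.auto.sciencelogic.local:5000/is-api:1.8.1
  steprunner:
    image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1
EOF

# Print any image lines NOT using the expected tag; report a clean result otherwise.
if grep 'image:' /tmp/docker-compose.yml | grep -v ':1.8.1$'; then
  echo "unexpected tags found"
else
  echo "all image tags at 1.8.1"
fi
```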

Additional Sizing Considerations

This section covers the sizing considerations for the Couchbase, RabbitMQ, Redis, contentapi, and GUI services.

Sizing for Couchbase Services

The initial sizing provided for Couchbase nodes in the multi-tenant cluster (6 CPUs and 56 GB of memory) should be more than enough to handle multiple customer event syncing workloads.

ScienceLogic recommends monitoring the CPU and memory utilization percentages of the Couchbase nodes to understand when it is a good time to increase resources, such as when memory and CPU are consistently above 80%.

Sizing for RabbitMQ Services

The only special consideration for RabbitMQ sizing is how many events you plan to have in the queue at once.

Every 10,000 events populated in the Integration Service queue consume approximately 1.5 GB of memory. This memory is freed as soon as the events leave the queue.
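Applying that ratio to a planned queue depth gives a rough estimate (a back-of-envelope figure, not a guarantee):

```shell
# Estimate RabbitMQ memory for a planned queue depth,
# at approximately 1.5 GB per 10,000 queued events.
events=50000
awk -v e="$events" 'BEGIN { printf "approx %.1f GB\n", (e / 10000) * 1.5 }'
```

For 50,000 queued events, this prints "approx 7.5 GB".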

Sizing for Redis Services

The initial sizing deployment for Redis should be sufficient for multiple customer event syncing.

The only time you might need to increase the memory allocated to Redis is if you are attempting to view logs from a previous run and the logs are not available. A lack of run logs from a recently run integration indicates that the Redis cache does not have enough room to store all the step and log data from recently executed runs.

Sizing for contentapi Services

The contentapi service should remain limited to 2 GB of memory, as set by default.

If you notice timeouts or HTTP 500 errors when a large load is going through the Integration Service system, you might want to increase the number of contentapi replicas.

For more information, see the Node Placement Considerations section, and ensure the API replicas are deployed in the same location as the Redis instance.

Sizing for the GUI Service

The GUI service should not need to be scaled up at all, as it merely acts as an ingress proxy to the rest of the Integration Service services.

Sizing for Workers: Scheduler, Steprunner, Flower

Refer to the worker sizing charts provided by ScienceLogic for the recommended steprunner sizes.

Flower and Scheduler do not need to be scaled up at all.

Node Placement Considerations

Preventing a Known Issue: Place contentapi and Redis services in the Same Physical Location

An issue exists where, if latency exists between contentapi and Redis, the Integrations page may not load. This issue is caused by the API making too many calls before returning. The added latency for each individual call can cause the overall endpoint to take longer to load than the designated timeout window of thirty seconds.

The only impact of this issue is that the applications/ page won't load. There is no operational impact on the integrations as a whole, even if workers are in separate geographies from Redis.

There is also no risk to High Availability (HA) in placing the API and Redis services in the same geography. If for whatever reason that geography drops out, the containers will be restarted automatically in the other location.

Common Problems, Symptoms, and Solutions

Tool: Docker Visualizer

Issue: Docker Visualizer shows some services as "undefined".

Symptoms: When viewing the Docker Visualizer user interface, some services are displayed as "undefined", and states are not accurate.

Impact: You cannot use Visualizer to get the current state of the stack.

Cause: A failing docker stack deployment. See https://github.com/dockersamples/docker-swarm-visualizer/issues/110.

Solution: Ensure your stack is healthy and services are deployed correctly. If no services are failing and services still show as "undefined", elect a new swarm leader.

To prevent: Ensure your configuration is valid before deploying.

Tool: RabbitMQ

Issue: RabbitMQ queues encountered a node failure and are in a "Network partition" (split-brain) state.

Symptoms: The workers are able to connect to the queue, and there are messages on the queue, but the messages are not being distributed to the workers. Logging in to the RabbitMQ admin user interface displays a message similar to "RabbitMQ experienced a network partition and the cluster is paused".

Impact: The RabbitMQ cluster is paused and waiting for user intervention to clean the split-brain state.

Cause: A multi-node failure occurred, and RabbitMQ was not able to determine which node should be the new master. This occurs only if there is no partition handling policy in place. Note: ScienceLogic sets the autoheal policy by default.

Solution: Handle the split-brain partition state and resynchronize your RabbitMQ queues.

To prevent: Set a partition handling policy (autoheal is enabled by default). See the Resiliency Considerations section for more information.

Tool: RabbitMQ (continued)

Symptoms: Execing into the RabbitMQ container and running rabbitmqctl cluster_status shows nodes in a partition state like the following:

[{nodes,
    [{disc,
        ['rabbit@rabbit_node1.isnet','rabbit@rabbit_node2.isnet',
         'rabbit@rabbit_node3.isnet','rabbit@rabbit_node4.isnet',
         'rabbit@rabbit_node5.isnet','rabbit@rabbit_node6.isnet']}]},
 {running_nodes,['rabbit@rabbit_node4.isnet']},
 {cluster_name,<<"rabbit@rabbit_node1">>},
 {partitions,
    [{'rabbit@rabbit_node4.isnet',
        ['rabbit@rabbit_node1.isnet','rabbit@rabbit_node2.isnet',
         'rabbit@rabbit_node3.isnet','rabbit@rabbit_node5.isnet',
         'rabbit@rabbit_node6.isnet']}]},
 {alarms,[{'rabbit@rabbit_node4.isnet',[]}]}]

   
Tool: Integration Service steprunners and RabbitMQ

Issue: Workers constantly restart with no real error message.

Symptoms: Workers of a particular queue are not stable and constantly restart.

Impact: One queue's workers will not be processing.

Cause: A multi-node failure in RabbitMQ, when it loses majority and cannot fail over; the queues go out of sync because of a broken swarm.

Solution: Recreate the queues for the particular worker, or resynchronize the queues.

To prevent: Deploy enough nodes to ensure quorum for failover.

Tool: Couchbase

Issue: A Couchbase node is unable to restart due to an indexer error.

Symptoms: This issue can be monitored in the Couchbase logs:

Service 'indexer' exited with status 134. Restarting. Messages:
sync.runtime_Semacquire(0xc4236dd33c)

Impact: One Couchbase node becomes corrupt.

Cause: Memory is removed from the database while it is in operation (memory must be dedicated to the VM running Couchbase). The Couchbase node encounters a failure, which causes the corruption.

Solution and prevention: Ensure that the memory allocated to your database nodes is dedicated and not shared among other VMs.

Tool: Couchbase

Issue: Couchbase is unable to rebalance.

Symptoms: Couchbase nodes will not rebalance, usually with an error saying "exited by janitor".

Impact: Couchbase nodes cannot rebalance and provide even replication.

Cause: Network issues, such as missing firewall rules or blocked ports, or a Docker swarm network that is stale because of a stack failure.

Solution: Validate that all firewall rules are in place and that no external firewalls are blocking ports. Reset the Docker swarm network status by electing a new swarm leader.

To prevent: Validate the firewall rules before deployment, and use drained managers to maintain the swarm.

Tool: Integration Service steprunners to Couchbase

Issue: Steprunners are unable to communicate with Couchbase.

Symptoms: Steprunners cannot communicate with the Couchbase database, with errors like "client side timeout" or "connection reset by peer".

Impact: Steprunners cannot access the database.

Cause: Missing environment variables in the compose file, or a stale Docker network.

Solution: Check the db_host setting for the steprunner and make sure it specifies all available Couchbase hosts. Validate the Couchbase settings, ensuring that the proper aliases, hostname, and environment variables are set. Validate the deployment configuration and network settings of your docker-compose file, and redeploy with valid settings. In the event of a swarm failure or a stale swarm network, reset the Docker swarm network status by electing a new swarm leader.

To prevent: Validate hostnames, aliases, and environment settings before deployment, and use drained managers to maintain the swarm.

Tool: Flower

Issue: The worker display in Flower is disorganized and hard to read, and it shows many old workers in an offline state.

Symptoms: Flower shows all containers that previously existed, even if they failed, cluttering the dashboard.

Impact: The Flower dashboard is disorganized and hard to read.

Cause: Flower has been running for a long time while workers are restarted or scaled up and down, so it maintains the history of all the old workers. Another possibility is a known issue in task processing due to the --max-tasks-per-child setting: at high CPU workloads, this setting causes workers to exit prematurely.

Solution: Restart the Flower service by running the following command:

docker service update --force iservices_flower

You can also remove the --max-tasks-per-child setting from the steprunners.

Tool: All containers on a particular node

Issue: All containers on a particular node do not deploy.

Symptoms: Services are not deploying to a particular node; instead, they are being moved to other nodes.

Impact: The node is not running anything.

Cause: One of the following situations could cause this issue: an invalid label deployment configuration; the node does not have the containers you are telling it to deploy; or the node is missing a required directory to mount into the container.

Solution: Make sure the node that you are deploying to is labeled correctly, and that the services you expect to be deployed there are properly constrained to that system. Go through the steps in the Identify the Cause of a Service not Deploying section to check that the service is not missing a requirement on the host. Check the node status for errors:

docker node ls

To prevent: Validate your configuration before deploying.

Tool: All containers on a particular node

Issue: All containers on a particular node periodically restart at the same time.

Symptoms: All containers on a particular node restart at the same time, and the system logs indicate an error like:

error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Impact: All containers on the node restart.

Cause: This issue occurs in single-node deployments when the only manager allocates too many resources to its containers. The manager node gets overloaded by container workloads, cannot handle swarm management, and the swarm loses quorum, so all containers restart when the swarm drops.

Solution: Use drained manager nodes for swarm management to separate the workloads.

To prevent: Use drained managers to maintain the swarm.

Tool: General Docker service

Issue: A Docker service does not deploy; replicas remain at 0/3.

Symptoms: The Docker service does not deploy.

Cause: There are a variety of reasons for this issue; you can reveal most causes by checking the service logs.

Solution: Identify the cause of the service not deploying (see the Identify the Cause of a Service not Deploying section).
Tool: Integration Service user interface

Issue: The Timeline or the Integrations page does not appear in the user interface.

Symptoms: The Timeline is not showing accurate information, or the Integrations page is not rendering.

Cause: One of the following situations could cause these issues: indexes do not exist on a particular Couchbase node; latency between the API and the Redis service is too great for the API to collect all the data it needs before the 30-second timeout is reached; or the indexer cannot keep up with a large number of requests, and Couchbase requires additional resources to service the requests.

Solutions: Verify that indexes exist. Place the API and Redis containers in the same geography so there is little latency (this issue will be fixed in a future release of the Integration Service). Increase the amount of memory allocated to the Couchbase indexer service.

Common Resolution Explanations

This section contains a set of solutions and explanations for a variety of issues.

Elect a New Swarm Leader

When managers lose connection to each other, whether through latency or a workload spike, the swarm sometimes needs to be reset or refreshed. By electing a new leader, you can effectively force the swarm to redo service discovery and refresh the metadata for the swarm. This procedure is highly preferable to removing and re-deploying the whole stack.

To elect a new swarm leader:

  1. Make sure there are at least three swarm managers in your stack.
  2. To identify which node is the current leader, run the following command:

docker node ls

 

  3. Demote the current leader with the following command:

docker node demote <node>

 

  4. Wait until a new node is elected leader:

docker node ls

 

  5. After a new node is elected leader, promote the old node back to a manager:

docker node promote <node>

 

Recreate RabbitMQ Queues and Exchanges

If you do not want to retain any messages in the queue, the following procedure is the best method for recreating the queues. If you do have data that you want to retain, you can resynchronize RabbitMQ queues.

To recreate RabbitMQ queues:

  1. Identify the queue or queues you need to delete:
  2. Delete the queue and exchange through the RabbitMQ admin console:
  3. Alternatively, delete the queue and exchange through the command-line interface:

rabbitmqadmin delete queue name=name_of_queue

 

rabbitmqadmin delete exchange name=name_of_queue

 

After you delete the queues, the queues will be recreated the next time a worker connects.

Resynchronize RabbitMQ Queues

If your RabbitMQ cluster ends up in a "split-brain" or partitioned state, you might need to manually decide which node should become the master. For more information, see http://www.rabbitmq.com/partitions.html#recovering.

To resynchronize RabbitMQ queues:

  1. Identify which node you want to be the master. In most cases, the master is the node with the most messages in its queue.
  2. After you have identified which node should be master, scale down all other RabbitMQ services:

docker service scale iservices_rabbitmq<x>=0

 

  3. After all RabbitMQ services except the master have been scaled down, wait a few seconds, and then scale the other RabbitMQ services back to 1. Bringing all nodes except your new master down and back up again forces all nodes to sync to the state of the master that you chose.

Identify the Cause of a Service not Deploying

Step 1: Obtain the ID of the failed container for the service

Run the following command for the service that failed previously:

docker service ps --no-trunc <servicename>

 

For example:

docker service ps --no-trunc iservices_redis

 

In the output of this command, you might see, for example, that a container with the ID 3s7s86n45skf failed when previously running on node is-scale-03 (non-zero exit) and that another container was restarted in its place.

At this point, check whether the failure was caused by a deployment configuration issue or by an entire node failure. If neither is the case, the problem exists within the service itself. Continue to Step 2 if this is the case.

Step 2: Check for any interesting error messages or logs indicating an error

Using the ID obtained in Step 1, collect the logs from the failed container with the following command:

docker service logs <failed-id>

 

For example:

docker service logs 3s7s86n45skf

Review the service logs for any explicit errors or warning messages that might indicate why the failure occurred.

Repair Couchbase Indexes

Index stuck in “created” (not ready) state

This situation usually occurs when a node starts creating an index, but another index creation was performed at the same time by another node. After the index is created, you can run a simple query to build the index, which changes its state from "created" to "ready":

BUILD INDEX ON `content`(`idx_content_content_type_config_a3f867db_7430_4c4b_b1b6_138f06109edb`) USING GSI

 

Deleting an index

If you encounter duplicate indexes, such as a situation where indexes were manually created more than once, you can delete an index:

DROP index content.idx_content_content_type_config_d8a45ead_4bbb_4952_b0b0_2fe227702260

 

Recreating all indexes on a particular node

To recreate all indexes on a particular Couchbase node, exec into the couchbase container and run the following command:

Initialize_couchbase -s

Running this command recreates all indexes, even if the indexes already exist.

Add a Broken Couchbase Node Back into the Cluster

To remove a Couchbase node and re-add it to the cluster:

  1. Stop the node in Docker.
  2. In the Couchbase user interface, you should see the node go down. Fail it over manually, or wait the appropriate time until it automatically fails over.
  3. Clean the Couchbase data directory on the necessary host by running the following command:

rm -rf /var/data/couchbase/*

 

  4. Restart the Couchbase node and watch it get added back into the cluster.
  5. Click the Rebalance button to replicate data evenly across the nodes.

Restore Couchbase Manually

Backup

  1. Exec into the Couchbase container and run the following command:

cbbackup http://couchbase.isnet:8091 /opt/couchbase/var/backup -u [user] -p [password] -x data_only=1

 

  2. Exit the Couchbase shell, and then copy the backup file in /var/data/couchbase/backup to a safe location, such as /home/isadmin.

Delete Couchbase

rm -rf /var/data/couchbase/*

 

Restore

  1. Copy the backup file into /var/data/couchbase/backup.
  2. Exec into the Couchbase container.
  3. The following command restores the content:

cbrestore /opt/couchbase/var/backup http://couchbase.isnet:8091 -b content -u <user> -p <password>

 

  4. The following command restores the logs:

cbrestore /opt/couchbase/var/backup http://couchbase.isnet:8091 -b logs -u <user> -p <password>

 

Integration Service Multi-tenant Upgrade Process

This section describes how to upgrade the Integration Service in a multi-tenant environment with as little downtime as possible.

Perform Environment Checks Before Upgrading

Validate Cluster states

Validate Backups exist

Clean out old container images if desired

Before upgrading to the latest version of the Integration Service, check the local file system to see if there are any older versions taking up space that you might want to remove. These containers exist both locally on the file system and in the internal docker registry. To view any old container versions, check the /opt/iservices/images directory. ScienceLogic recommends that you keep at least the last version of the containers, so you can downgrade if necessary.

Cleaning out images is not mandatory; it is just a means of clearing out additional space on the system if necessary.

To remove old images:

  1. Delete any unwanted versions in /opt/iservices/images.
  2. Identify any unwanted images known to Docker with docker images.
  3. Remove the images by ID with docker rmi <id>.
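A small sketch of spotting older bundles in the images directory follows. The directory contents and file names here are simulated for illustration; on a real system you would run the listing against /opt/iservices/images:

```shell
# Create a simulated images directory with two versions of a container bundle.
imgdir=/tmp/is-images-demo
mkdir -p "$imgdir"
touch "$imgdir/is-worker-1.8.0.tar.gz" "$imgdir/is-worker-1.8.1.tar.gz"

# Version-sort the bundles and list everything except the newest,
# i.e. the candidates that could be deleted.
ls "$imgdir" | sort -V | head -n -1
```

Here the listing prints only is-worker-1.8.0.tar.gz, leaving the newest bundle untouched as the downgrade fallback.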

Prepare the Systems

Install the new RPM

The first step of upgrading is to install the new RPM on all systems in the stack. Doing so will ensure that the new containers are populated onto the system (if using that particular RPM), and any other host settings are changed. RPM installation does not pause any services or affect the docker system in any way, other than using some resources.

The Integration Service has two RPMs, one with containers and one without. If you have populated an internal docker registry with docker containers, you can install the RPM without containers built in. If no internal docker repository is present, you must install the RPM which has the containers built in it. Other than the containers, there is no difference between the RPMs.

Advanced users can skip installing the RPM. However, this means that the user is completely responsible for maintaining the docker-compose and host-level configurations.

To install the RPM:

  1. SSH into each node.
  2. If you are installing the RPM that contains the container images built in, you might want to upgrade each core node one by one, so that the load of extracting the images doesn't affect all core nodes at once.
  3. Run the following command:

rpm -Uvh <new-rpm-file>

 

Compare compose file changes and resolve differences

After the RPM is installed, you will notice a new docker-compose file is placed at /etc/iservices/scripts/docker-compose.yml. As long as your environment-specific changes exist solely in the compose-override file, all user changes and new version updates will be resolved into that new docker-compose.yml file.

ScienceLogic recommends that you check the differences between the two docker-compose files. You should validate that:

  1. All environment-specific and custom user settings that existed in the old docker-compose also exist in the new docker-compose file.
  2. The image tags reference the correct version in the new docker-compose. If you are using an internal docker registry, be sure these image tags represent the images from your internal registry.
  3. Make sure that any new environment variables added to services are applied to replicated services. To ensure these updates persist through the next upgrade, also make the changes in docker-compose-override.yml. In other words, if you added a new environment variable for Couchbase, make sure to apply that variable to couchbase-worker1 and couchbase-worker2 as well. If you added a new environment variable for the default steprunner, make sure to set the same environment variable on each custom worker as well.
  4. If you are using the latest tag for images, and you are using a remote repository for downloading, be sure that the latest tag refers to the images in your repository.
  5. The old docker-compose is completely unchanged, and it matches the current deployed environment. This enables the Integration Service to update services independently without restarting other services.
  6. After you resolve any differences between the compose files, proceed with the upgrade using the old docker-compose.yml (the one that matches the currently deployed environment).
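One way to surface the differences between the two compose files is a plain diff of the image lines. The file contents below are simulated stand-ins; on a real system you would compare your currently deployed compose file with the newly installed /etc/iservices/scripts/docker-compose.yml:

```shell
# Write two small stand-ins for the old (deployed) and new compose files.
printf 'services:\n  steprunner:\n    image: is-worker:1.8.0\n' > /tmp/compose.current.yml
printf 'services:\n  steprunner:\n    image: is-worker:1.8.1\n' > /tmp/compose.new.yml

# Show only the changed image lines between the two files.
diff -u /tmp/compose.current.yml /tmp/compose.new.yml | grep '^[+-] *image:'
```

Reviewing just the image-line changes makes it easy to confirm that only the expected version tags differ before you deploy.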

Make containers available to systems

After you apply the host-level updates, you should make sure that the containers are available to the system.

If you upgraded using the RPM with container images included, the containers should already be on all of the nodes. You can run docker images to validate that the new containers are present. If this is the case, you can skip to the next section.

If the upgrade was performed using the RPM which did not contain the container images, ScienceLogic recommends that you run the following command to make sure all nodes have the latest images:

docker-compose -f <new_docker_compose_file> pull

This command validates that the containers specified by your compose file can be pulled and reached from the nodes. While not required, you might want to make sure that the images can be pulled before starting the upgrade. If the images are not pulled manually, Docker automatically pulls them when the new image is called for by the stack.

Perform the Upgrade

To perform the upgrade on a clustered system with little downtime, the Integration Service re-deploys services to the stack in groups. To do this, the Integration Service gradually makes the updates to groups of services and re-runs docker stack deploy for each change. To ensure that no unintended services are updated, start off using the same docker-compose file that was previously used to deploy. Reusing the same docker-compose file and updating only sections at a time ensures that only the intended services to be updated are affected at any given time.

Avoid putting all the changes in a single docker-compose file and doing a new docker stack deploy with all changes at once. If downtime is not a concern, you can update all services that way, but updating services gradually allows you to have little or no downtime.

Before upgrading any group of services, be sure that the docker-compose file you are deploying from is exactly identical to the currently deployed stack (the previous version). Start with the same docker-compose file and update it for each group of services as needed.

Upgrade Redis, Scheduler, and Flower

The first group to update includes the Redis, Scheduler, and Flower services. If desired, you can upgrade this group along with any other group.

To update:

  1. Copy the service entries for Redis, Scheduler, and Flower from the new compose file into the old docker-compose file (the file that matches the currently deployed environment). Copying these entries ensures that the only changes in the docker-compose file, compared to the deployed stack, are for Redis, Scheduler, and Flower.
  2. Run the following command:

docker stack deploy -c <old_compose_with_small_changes> iservices

 

  3. Monitor the update, and wait until all services are up and running before proceeding.

Example image definition of this upgrade group:

services:
  contentapi:
    image: repository.auto.sciencelogic.local:5000/is-api:1.8.1
  couchbase:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
  couchbase-worker:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
  flower:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
  gui:
    image: repository.auto.sciencelogic.local:5000/is-gui:1.8.1
  rabbitmq:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
  rabbitmq2:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
  rabbitmq3:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
  redis:
    image: repository.auto.sciencelogic.local:5000/is-redis:4.0.11-2
  scheduler:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
  steprunner:
    image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1
  couchbase-worker2:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
  steprunner2:
    image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1

 

Redis Version

As the Redis version might not change with every release of the Integration Service, there might not be any changes needed in the upgrade for Redis. This can be expected and is not an issue.

Flower Dashboard

Due to a known issue addressed in version 1.8.3 of the Integration Service, the Flower Dashboard might not display any workers. Flower eventually picks up the new workers when they are restarted in the worker group. If this is a concern, you can perform the Flower upgrade in the same group as the workers.

Upgrade Core Services (Rabbit and Couchbase)

The next group of services to update together includes the RabbitMQ and Couchbase database services, as well as the GUI. Because the core services are individually defined and "pinned" to specific nodes, upgrade these services at the same time, on a node-by-node basis. In between each node upgrade, wait and validate that the node rejoins the Couchbase and RabbitMQ clusters and rebalances appropriately.

Because there will always be two out of three nodes running these core services, this group should not cause any downtime for the system.

Rabbit/Couchbase Versions

The Couchbase and RabbitMQ versions used might not change with every release of the Integration Service. If there is no update or change to be made to the services, you can ignore this section for RabbitMQ or Couchbase upgrades, or both. Assess the differences between the old and new docker-compose files to check if there is an image or environment change necessary for the new version. If not, you can move on to the next section.

Update Actions (assuming three core nodes)

To update first node services:

  1. Update just core node01 by copying the service entries for couchbase and rabbitmq from the new compose file (compared and resolved as part of the preparation steps above) into the old docker-compose file. At this point, the compose file you use to deploy should also contain the updates for the previous groups.
  2. Before deploying, access the Couchbase user interface, select the first server node, and click "Failover". Select graceful failover. Manually failing over before updating ensures that the system remains operational while the container is down.
  3. If the user interface is not available, see the Manual Failover section for the equivalent failover command that you can run from the command-line interface.
  4. Run the following command:

docker stack deploy -c <compose_file> iservices

 

  5. Monitor the process to make sure the service updates and restarts with the new version. To keep the database update window as short as possible, the database images should already be available on the core nodes.
  6. After the node is back up, return to the Couchbase user interface, add the node back, and rebalance the cluster to make it whole again.
  7. If the user interface is not available, see the Manual Failover section for more information on how to re-add the node and rebalance the cluster.

First node Couchbase update considerations:

Special GUI consideration with 1.8.3

In the upgrade to version 1.8.3 of the Integration Service, the Couchbase and RabbitMQ user interfaces are exposed through the Integration Service user interface with HTTPS. To prevent a port conflict between these services and the Integration Service user interface, remove the Couchbase and RabbitMQ user interface port mappings, or modify them from the default (8091) admin port, and make sure the new Integration Service user interface definition does not conflict with the Couchbase or RabbitMQ definitions.

The Integration Service user interface will not update until all port conflicts are resolved. You can upgrade the Integration Service user interface at any time after this has been done, but be sure to first review the Update the GUI topic, below.

You can manually remove port mappings from a service with the following command (note that this command restarts the service):

docker service update --publish-rm published=8091,target=8091 iservices_couchbase
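Alternatively, you can resolve the conflict in the compose file itself by remapping the published admin port before deploying. The fragment below is a hypothetical sketch only; the published port 18091 is an arbitrary example, not a ScienceLogic-prescribed value.

```yaml
couchbase:
  image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
  ports:
    # Publish the Couchbase admin UI on 18091 (arbitrary example) instead of
    # the default 8091, which the Integration Service GUI now uses.
    - "18091:8091"
```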

Example docker-compose with images and JOIN_ON for updating the first node:

services:
  contentapi:
    image: repository.auto.sciencelogic.local:5000/is-api:1.8.1

  couchbase:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
    environment:
      JOIN_ON: "couchbase-worker2"

  couchbase-worker:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1

  flower:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3

  gui:
    image: repository.auto.sciencelogic.local:5000/is-gui:hotfix-1.8.3

  rabbitmq:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2

  rabbitmq2:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2

  rabbitmq3:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2

  redis:
    image: repository.auto.sciencelogic.local:5000/is-redis:4.0.11-2

  scheduler:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3

  steprunner:
    image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1

  couchbase-worker2:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1

  steprunner2:
    image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1

Update Second and Third Node Services

To update the second and third node services, repeat the steps from the first node on each node until all nodes are re-clustered and available. Be sure to check the service port mappings to ensure that there are no conflicts (as described above), and remove any HTTP ports if you choose.

Update the GUI

You can update the GUI service along with any other group, but due to the port mapping changes in version 1.8.3 of the Integration Service, you should update this service after the databases and RabbitMQ nodes have been updated, and their port mappings no longer conflict.

Because the GUI service provides all ingress proxy routing to the services, there might be a brief window during which the Integration Service cannot receive API requests. This downtime is limited to the time it takes for the GUI container to restart.

To update the user interface:

  1. Make sure that any conflicting port mappings are handled and addressed.
  2. Replace the docker-compose GUI service definition with the new one.
  3. Re-deploy the docker-compose file, and validate that the new GUI container is up and running.
  4. Make sure that the HTTPS ports are accessible for Couchbase/RabbitMQ.

Update Workers and contentapi

You should update the workers and contentapi last. Because these services use multiple replicas (multiple steprunner or contentapi containers running per service), you can rely on Docker to incrementally update each replica of the service individually. By default, when a service is updated, Docker updates one container of the service at a time, and only after the previous container is up and stable does it deploy the next container.

You can use additional Docker options in docker-compose to control how many containers are updated at once, when to bring down the old container, and what happens if a container upgrade fails. See the update_config and rollback_config options in the Docker documentation: https://docs.docker.com/compose/compose-file/.

ScienceLogic performed upgrade testing using the default options. One example where these settings are helpful is increasing the parallelism of update_config so that all worker containers of a service update at the same time.
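As an illustration, a steprunner service definition with explicit rolling-update behavior might look like the following sketch. The values shown are examples of the standard Docker Compose deploy options, not ScienceLogic-tested recommendations.

```yaml
steprunner:
  image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
  deploy:
    replicas: 2
    update_config:
      parallelism: 2        # update all replicas at once instead of one at a time
      order: start-first    # start the new container before stopping the old one
      failure_action: rollback
    rollback_config:
      parallelism: 1
      delay: 10s
```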

The update scenario described below takes extra precautions and updates the workers of only one node per customer at a time. If you prefer, you can also safely update all workers at once.

To update the workers and contentapi:

  1. In the docker-compose file, modify the contentapi service and the "worker_node1" services of all customers to use the new service definitions.
  2. Run a docker stack deploy of the new compose file. Monitor the update, which should update the API containers one instance at a time, always leaving a container available to service requests. By default, the process also updates the workers of node1 one container instance at a time.
  3. After the workers are back up and the API is fully updated, modify the docker-compose file and update the second node's worker service definitions.
  4. Monitor the upgrade, and validate as needed.
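To monitor and validate the rolling update from a core node, the standard Docker service commands are sufficient. The commands below are a sketch; the service names assume the iservices stack naming shown elsewhere in this appendix.

```shell
# Watch the rolling update: a new contentapi container should start,
# become healthy, and only then replace the next old one.
docker service ps iservices_contentapi

# Confirm each service now runs the expected image tag.
docker service ls --format '{{.Name}}\t{{.Image}}'

# Check a worker service's update state (completed / updating / paused).
docker service inspect --format '{{.UpdateStatus.State}}' iservices_steprunner
```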

Example docker-compose definition with one of two worker nodes and contentapi updated:

 

services:
  contentapi:
    image: repository.auto.sciencelogic.local:5000/is-api:hotfix-1.8.3
    deploy:
      replicas: 3

  couchbase:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
    environment:
      JOIN_ON: "couchbase-worker2"

  couchbase-worker:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3

  flower:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3

  gui:
    image: repository.auto.sciencelogic.local:5000/is-gui:hotfix-1.8.3

  rabbitmq:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2

  rabbitmq2:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2

  rabbitmq3:
    image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2

  redis:
    image: repository.auto.sciencelogic.local:5000/is-redis:4.0.11-2

  scheduler:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3

  steprunner:
    image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3

  couchbase-worker2:
    image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3

  steprunner2:
    image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1