This appendix describes the best practices and troubleshooting solutions for deploying the Integration Service in a multi-tenant environment that supports multiple customers in a highly available fashion. This document also covers how to perform an upgrade of the Integration Service with minimal downtime.
This document covers the following topics:
The following sections describe how to deploy the Integration Service in a multi-tenant environment. After the initial High Availability (HA) core services are deployed, the multi-tenant environment differs in the deployment and placement of workers and use of custom queues.
For a multi-tenant deployment, ScienceLogic recommends that you dedicate at least three nodes to the core Integration Service services. These core Integration Service services are shared by all workers and customers. As a result, it is essential that these services are clustered to handle failovers.
Because these core services are critical, ScienceLogic recommends that you initially allocate a fairly large amount of resources to these services. Allocating more resources than necessary to these nodes allows you to further scale workers in the future. If these nodes become overly taxed, you can add another node dedicated to the core services in the cluster.
These core service nodes are dedicated to the following services:
It is critical to monitor these core service nodes, and to always make sure these nodes have enough resources for new customers and workers as they are on-boarded.
To ensure proper failover and persistence of volumes and cluster information, the core services must be pinned to each of the nodes. For more information, see Configuring Core Service Nodes, below.
3 nodes (or more for additional failover support) with 6 CPUs and 56 GB memory each
Separate from the core services are the worker services. These worker services are intended to be deployed on nodes separate from the core services and from other workers, and they provide processing only for specified dedicated queues. Separating the VMs or nodes where worker services are deployed ensures that one customer's workload, no matter how heavy it gets, will not negatively affect the core services or other customer workloads.
The resources allocated to the worker nodes depend on the worker sizing chosen; the more resources provided to a worker, the faster its throughput. The table below provides a brief sizing guideline. Please note that even if you exceed the number of event syncs per minute, events will be queued, so the sizing does not have to be exact. The sizing below is only a suggested guideline.
CPU | Memory | Worker count | Time to sync a queue full of 10,000 events | Events Synced per second |
---|---|---|---|---|
2 | 16 GB | 6 | 90 minutes | 1.3 |
4 | 32 GB | 12 | 46 minutes | 3.6 |
8 | 54 GB | 25 | 16.5 minutes | 10.1 |
At least one worker instance must always be listening on the default queue for proper functionality. The default worker can run on any node.
When deploying a new worker, especially one dedicated to a custom queue, it is wise to consider deploying an extra worker listening on the same queues. If only a single worker node is listening to a dedicated customer queue, processing for that queue can stop completely if that single node fails.
For this reason, ScienceLogic recommends that for each customer dedicated worker you deploy, you deploy a second one as well. This way there are two nodes listening to the customer dedicated queue, and if one node fails, the other node will continue processing from the queue with no interruptions.
When deciding on worker sizing, it's important to take this into consideration. For example, if you have a customer that requires a four-CPU node for optimal throughput, an option would be to deploy two nodes with two CPUs, so that there is failover if one node fails.
Monitoring the memory, CPU and pending integrations in queue can give you an indication of whether more resources are needed for the worker. Generally, when queue times start to build up, and tickets are not synced over in an acceptable time frame, more workers for task processing are required.
Although more workers will process more tasks, they cannot do so if the memory or CPU required by the additional workers is not available. When adding workers, watch the memory and CPU utilization: as long as utilization is under 75%, it should be safe to add another worker. If utilization is consistently over 80%, add more resources to the system before adding additional workers.
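As a quick spot check before adding workers, you can view current per-container CPU and memory utilization on a node with docker stats. This is a minimal sketch using standard Docker commands; compare the output against the 75-80% thresholds described above.

# one-time snapshot of CPU and memory utilization for the containers on this node
docker stats --no-stream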
Even if you have multiple workers dedicated to a single customer, there are still scenarios in which a particular customer queue spikes in load and you need an immediate increase in throughput to handle it. In this scenario you do not have time to deploy a new Integration Service node and configure it to distribute the load, because you need the increased throughput immediately.
You can handle this by keeping a node on standby. This node has the same Integration Service RPM version installed and sits idle in the stack (or is turned off completely). When a spike happens and you need more resources to distribute the load, apply the node label corresponding to the customer whose queues spiked to the standby node. After setting the label on the standby node, scale up the worker count for that particular customer. With the standby node now labeled for that customer's work, the additional worker instances will be distributed to and started on the standby node.
When the spike has passed, you can return the node to standby by reversing the above process: decrease the worker count to its earlier value, and then remove the customer-specific label from the node, as shown in the sketch below.
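The following is a minimal sketch of that workflow, assuming the customer=acme label and the steprunner-acme worker service used in the examples later in this section; the node ID and replica counts are placeholders.

# label the standby node for the customer whose queues spiked (run from a manager node)
docker node update --label-add customer=acme <standby node id>

# scale up the customer's dedicated workers; new replicas start on the newly labeled standby node
docker service scale iservices_steprunner-acme=30

# when the spike is over, return to the original worker count and remove the label
docker service scale iservices_steprunner-acme=15
docker node update --label-rm customer <standby node id>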
This section describes how multi-tenant deployments can use separate virtual hosts and users for each tenant.
In certain scenarios, you might not want to use the default RabbitMQ queue that is prepackaged with the Integration Service. For example, you might already have a RabbitMQ production cluster available that you want to connect with the Integration Service. You can do this by defining a new virtual host in RabbitMQ and then configuring the Integration Service broker URL for the contentapi, steprunner, and scheduler services so that they point to the new virtual host.
Any use of an external RabbitMQ server will not be officially supported by ScienceLogic if there are issues in the external RabbitMQ instance.
By default, RabbitMQ ships with the default credentials guest/guest. The Integration Service uses these credentials by default when communicating with RabbitMQ in the swarm cluster. All communication is encrypted and secured within the overlay Docker network.
To add another user, or to change the user that the Integration Service uses when communicating with the queues:
When using an external RabbitMQ system, or if you are using credentials other than guest/guest to authenticate, you need to update the broker_url environment variable in the contentapi, steprunner, and scheduler services. You can do this by modifying the environment section of the services in docker-compose and changing broker_url. The following line is an example:
broker_url: 'pyamqp://username:password@rabbitmq-hostname/v-host'
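As an illustrative sketch, a docker-compose-override.yml fragment that points these services at an external RabbitMQ virtual host might look like the following; the hostname, credentials, and virtual host name are placeholders.

services:
  contentapi:
    environment:
      broker_url: 'pyamqp://isuser:ispassword@external-rabbit.example.com/customer-vhost'
  scheduler:
    environment:
      broker_url: 'pyamqp://isuser:ispassword@external-rabbit.example.com/customer-vhost'
  steprunner:
    environment:
      broker_url: 'pyamqp://isuser:ispassword@external-rabbit.example.com/customer-vhost'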
When a new SL1 system is onboarded into the Integration Service, its integrations are executed on the default queue by default. In large multi-tenant environments, ScienceLogic recommends separate queues for each customer. If desired, each customer can also have its own specific queues.
The first step in onboarding a new customer is to create a configuration with variables that will satisfy all integrations. The values of these variables should be specific to the new customer you are onboarding (such as the SL1 IP address, username, and password).
See the example configuration for a template you can fill out for each customer.
Because integrations might update their variable names from EM7 to SL1 in the future, ScienceLogic recommends covering variables with both the em7_ and sl1_ prefixes. The example configuration contains this information.
For example, if you want a worker node to be dedicated to a customer called "acme", you could create a node label called "customer" and set the value of the label to "acme". Setting this label now makes it easier to cluster in additional workers and distribute load dynamically in the future.
This topic outlines creating a label for a node. Labels provide the ability to deploy a service to specific nodes (determined by labels) and to categorize the nodes for the work they will be performing. Take the following actions to set a node label:
# get the list of nodes available in this cluster (must run from a manager node)
docker node ls
# example of adding a label to a docker swarm node
docker node update --label-add customer=acme <node id>
After you create a node label, refer to the example below for updating your docker-compose-override.yml file and ensuring the desired services deploy to the matching labeled nodes:
# example of placing workers on a specific labeled node
steprunner-acme:
...
deploy:
placement:
constraints:
- node.labels.customer == acme
resources:
limits:
memory: 1.5G
replicas: 15
...
Dedicating a separate queue for each customer is beneficial because work and events created from one system will not affect or slow down work created from another system, provided the multi-tenant system has enough resources allocated. In the example below, we created two new queues in addition to the default queue and allocated workers to them. Both of these worker services use separate queues as described below, but run on the same labeled worker node.
Example Queues to Deploy:
First, define additional workers in our stack that are responsible for handling the new queues. All modifications are made in docker-compose-override.yml:
After these changes have been made, your docker-compose entries for the new steprunners should look similar to the following:
steprunner-acme-catchup:
image: sciencelogic/is-worker:latest
depends_on:
- couchbase
- rabbitmq
- redis
hostname: "acme-catchup-{{.Task.ID}}"
deploy:
placement:
constraints:
- node.labels.customer == acme
resources:
limits:
memory: 2G
replicas: 3
environment:
user_queues: 'acmequeue-catchup'
..
..
..
steprunner-acme:
image: sciencelogic/is-worker:latest
depends_on:
- couchbase
- rabbitmq
- redis
hostname: "acmequeue-{{.Task.ID}}"
deploy:
placement:
constraints:
- node.labels.customer == acme
resources:
limits:
memory: 2G
replicas: 15
environment:
user_queues: 'acmequeue'
..
..
..
Once deployed via docker stack deploy, you should see the new workers in Flower, as in the following image:
You can verify the queues being listened to by looking at the "broker" section of Flower, or by clicking into a worker and viewing the queues it is consuming from.
After the workers have been configured with specific queue assignments, schedule your integrations to run on those queues, and configure Run Book Automations (RBAs) to place the integrations on those queues.
To run an integration on a specific queue using a configuration for a particular system, you can use the "params" override available in the scheduler. Below is an example of a scheduled integration that uses the acmequeue-catchup queue:
In the example above, the cisco_correlation_queue_manager is scheduled to run every 300 seconds, using the acme configuration, and runs on the acmequeue-catchup queue. You can have any number of scheduled integration runs per integration. If you add additional customers, add a new schedule entry with a different configuration and queue for each.
The last step in onboarding your newly added SL1 system is to update the Run Book Automations in SL1 to provide the configuration and queue to use when the Run Book Automation triggers an event.
Modify the Event-correlation policy with the following changes:
This topic covers how the Integration Service handles situations where certain services fail.
In case of failure, when can the worker containers be expected to restart?
What happens when a worker container fails?
What processing is affected when service is down?
What data can be lost?
When can the API be expected to restart?
What happens when it fails?
What processing is affected when service is down?
What data can be lost?
If a core service node running Couchbase fails, the database should continue to work normally and continue processing events, as long as a suitable number of clustered nodes are still up and running. Three core service nodes provides automatic failover handling of one node, five core service nodes provides automatic failover handling of two nodes, and so on. See the High Availability section for more information.
If there are enough clustered core nodes still running, the failover will occur with no interruptions, and the failing node can be added back at any time with no interruptions.
NOTE: For optimal performance and data distribution after rejoining a cluster, you can click the button from the Couchbase user interface, if needed.
If there are not enough clustered core nodes still running, you will have to manually fail over the Couchbase Server. In this scenario, because automatic failover could not be performed (due to too few nodes being available), there will be disruption in event processing. For more information, see the Manual Failover section.
In case of failure, when can Couchbase be expected to restart?
What happens when it fails?
What processing is affected when service is down?
What data can be lost?
RabbitMQ clustered among all core service nodes provides full mirroring to each node. As long as at least one node running RabbitMQ is available, the queues should exist and be reachable. This means that a multiple-node failure will have no effect on the RabbitMQ services, and they should continue to operate normally.
In case of failure, when can RabbitMQ be expected to restart?
What happens when RabbitMQ fails?
What processing is affected when service is down?
What data can be lost?
In case of failure, when can the user interface be expected to restart?
What happens when it fails?
What data can be lost?
If the Redis service fails, it will automatically be restarted and will be available again in a few minutes. The impact is that task processing in the Integration Service is delayed slightly, as the worker services pause themselves and wait for the Redis service to become available again.
Consistent Redis failures
Consistent failures and restarts in Redis typically indicate that your system has too little memory, or that the Redis service memory limit is set too low or not set at all. Integration Service version 1.8.1 and later ships with a default memory limit of 8 GB, which ensures that the Redis service never uses more than 8 GB of memory and ejects entries if it would exceed that limit. This limit is typically sufficient, but if you have enough workers running large enough integrations to fill the memory, you might need to increase the limit.
Before increasing Redis memory limit, be sure that there is suitable memory available to the system.
If you see multiple containers restarting at the same time on the same node, it indicates over-provisioning of resources on that node. This occurs only on Swarm manager nodes, because those nodes are responsible not only for the services they run, but also for maintaining the Swarm cluster and communicating with other manager nodes.
If resources become over-provisioned on one of those manager nodes (as they were on the core nodes when this failure was observed), the Swarm manager will not be able to perform its duties and may cause a Docker restart on that particular node. This failure is indicated by "context deadline exceeded" and "heartbeat failure" errors in the journalctl --no-page | grep docker | grep err logs.
This is one of the reasons why docker recommends running “manager-only” nodes, in which the manager nodes are only responsible for maintaining the Swarm, and not responsible for running other services. If any nodes that are running Integration Service services are also a Swarm manager, make sure that the nodes are not over-provisioned, otherwise the containers on that node may restart. For this reason, ScienceLogic recommends monitoring and placing thresholds at 80% utilization.
To combat the risk of over-provisioning affecting the docker Swarm manager, apply resource constraints on the services for the nodes that are also Swarm managers, so that docker operations always have some extra memory or CPU on the host to do what they need to do. Alternatively, you can only use drained nodes, which are not running any services, as Swarm managers, and not apply any extra constraints.
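For example, a docker-compose-override.yml fragment that caps a service scheduled on a manager node might look like the following; the limits shown are illustrative placeholders, not ScienceLogic-recommended values.

steprunner:
  deploy:
    resources:
      limits:
        cpus: "4"
        memory: 12G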
For more information about Swarm management, see https://docs.docker.com/engine/swarm/admin_guide/.
[
{
"encrypted": false,
"name": "em7_host",
"value": "<ip address>"
},
{
"encrypted": false,
"name": "sl1_host",
"value": "${config.em7_host}"
},
{
"encrypted": false,
"name": "sl1_id",
"value": "${config.em7_id}"
},
{
"encrypted": false,
"name": "sl1_db_port",
"value": 7706
},
{
"encrypted": false,
"name": "snow_host",
"value": "<arecord>.service-now.com"
},
{
"encrypted": true,
"name": "em7_password",
"value": "<password>"
},
{
"encrypted": false,
"name": "sl1_user",
"value": "${config.em7_user}"
},
{
"encrypted": false,
"name": "sl1_password",
"value": "${config.em7_password}"
},
{
"encrypted": false,
"name": "sl1_db_user",
"value": "${config.em7_db_user}"
},
{
"encrypted": false,
"name": "sl1_db_password",
"value": "${config.em7_db_password}"
},
{
"encrypted": false,
"name": "em7_user",
"value": "<username>"
},
{
"encrypted": false,
"name": "em7_db_user",
"value": "root"
},
{
"encrypted": false,
"name": "em7_db_password",
"value": "<password>"
},
{
"encrypted": false,
"name": "snow_user",
"value": "<username>"
},
{
"encrypted": true,
"name": "snow_password",
"value": "<password>"
},
{
"encrypted": false,
"name": "Domain_Credentials",
"value": {
"c9818d2c4a36231201624433851894bb": {
"password": "3m7Admin!",
"user": "is4DomainUser2"
}
}
},
{
"name": "region",
"value": "ACMEScaleStack"
},
{
"encrypted": false,
"name": "em7_id",
"value": "${config.region}"
},
{
"encrypted": false,
"name": "generate_report",
"value": "true"
}
]
[
{
"application_id": "device_sync_sciencelogic_to_servicenow",
"entry_id": "dsync every 13 hrs acme",
"last_run": null,
"params": {
"configuration": "acme-scale-config",
"mappings": {
"cmbd_ci_ip_router": [
"Cisco Systems | 12410 GSR",
"Cisco Systems | AIR-AP1141N",
"Cisco Systems | AP 1200-IOS",
"Cisco Systems | Catalyst 5505"
],
"cmdb_ci_esx_resource_pool": [
"VMware | Resource Pool"
],
"cmdb_ci_esx_server": [
"VMware | ESXi 5.1 w/HR",
"VMware | Host Server",
"VMware | ESX(i) 4.0",
"VMware | ESX(i) w/HR",
"VMware | ESX(i) 4.0 w/HR",
"VMware | ESX(i)",
"VMware | ESX(i) 4.1 w/HR",
"VMware | ESXi 5.1 w/HR",
"VMware | ESXi 5.0 w/HR",
"VMware | ESX(i) 4.1",
"VMware | ESXi 5.1",
"VMware | ESXi 5.0"
],
"cmdb_ci_linux_server": [
"ScienceLogic, Inc. | EM7 Message Collector",
"ScienceLogic, Inc. | EM7 Customer Portal",
"ScienceLogic, Inc. | EM7 All-In-One",
"ScienceLogic, Inc. | EM7 Integration Server",
"ScienceLogic, Inc. | EM7 Admin Portal",
"ScienceLogic, Inc. | EM7 Database",
"ScienceLogic, Inc. | OEM",
"ScienceLogic, Inc. | EM7 Data Collector",
"NET-SNMP | Linux",
"RHEL | Redhat 5.5",
"Virtual Device | Content Verification"
],
"cmdb_ci_vcenter": [
"VMware | vCenter",
"Virtual Device | Windows Services"
],
"cmdb_ci_vcenter_cluster": [
"VMware | Cluster"
],
"cmdb_ci_vcenter_datacenter": [
"VMware | Datacenter"
],
"cmdb_ci_vcenter_datastore": [
"VMware | Datastore",
"VMware | Datastore Cluser"
],
"cmdb_ci_vcenter_dv_port_group": [
"VMware | Distributed Virtual Portgroup"
],
"cmdb_ci_vcenter_dvs": [
"VMware | Distributed Virtual Switch"
],
"cmdb_ci_vcenter_folder": [
"VMware | Folder"
],
"cmdb_ci_vcenter_network": [
"VMware | Network"
],
"cmdb_ci_vmware_instance": [
"VMware | Virtual Machine"
]
},
"queue": "acmequeue",
"region": "ACMEScaleStack"
},
"schedule": {
"schedule_info": {
"run_every": 47200
},
"schedule_type": "frequency"
},
"total_runs": 0
},
{
"application_id": "device_sync_sciencelogic_to_servicenow",
"entry_id": "dsync every 12 hrs on .223",
"last_run": null,
"params": {
"configuration": "em7-host-223",
"mappings": {
"cmdb_ci_esx_resource_pool": [
"VMware | Resource Pool"
],
"cmdb_ci_esx_server": [
"VMware | ESXi 5.1 w/HR",
"VMware | Host Server",
"VMware | ESX(i) 4.0",
"VMware | ESX(i) w/HR",
"VMware | ESX(i) 4.0 w/HR",
"VMware | ESX(i)",
"VMware | ESX(i) 4.1 w/HR",
"VMware | ESXi 5.1 w/HR",
"VMware | ESXi 5.0 w/HR",
"VMware | ESX(i) 4.1",
"VMware | ESXi 5.1",
"VMware | ESXi 5.0"
],
"cmdb_ci_linux_server": [
"ScienceLogic, Inc. | EM7 Message Collector",
"ScienceLogic, Inc. | EM7 Customer Portal",
"ScienceLogic, Inc. | EM7 All-In-One",
"ScienceLogic, Inc. | EM7 Integration Server",
"ScienceLogic, Inc. | EM7 Admin Portal",
"ScienceLogic, Inc. | EM7 Database",
"ScienceLogic, Inc. | OEM",
"ScienceLogic, Inc. | EM7 Data Collector",
"NET-SNMP | Linux",
"RHEL | Redhat 5.5",
"Virtual Device | Content Verification"
],
"cmdb_ci_vcenter": [
"VMware | vCenter",
"Virtual Device | Windows Services"
],
"cmdb_ci_vcenter_cluster": [
"VMware | Cluster"
],
"cmdb_ci_vcenter_datacenter": [
"VMware | Datacenter"
],
"cmdb_ci_vcenter_datastore": [
"VMware | Datastore",
"VMware | Datastore Cluser"
],
"cmdb_ci_vcenter_dv_port_group": [
"VMware | Distributed Virtual Portgroup"
],
"cmdb_ci_vcenter_dvs": [
"VMware | Distributed Virtual Switch"
],
"cmdb_ci_vcenter_folder": [
"VMware | Folder"
],
"cmdb_ci_vcenter_network": [
"VMware | Network"
],
"cmdb_ci_vmware_instance": [
"VMware | Virtual Machine"
]
}
},
"schedule": {
"schedule_info": {
"run_every": 43200
},
"schedule_type": "frequency"
},
"total_runs": 0
},
{
"application_id": "cisco_correlation_queue_manager",
"entry_id": "acme catchup events",
"last_run": {
"href": "/api/v1/tasks/isapp-a20d5e08-a802-4437-92ef-32d643c6b777",
"start_time": 1544474203
},
"params": {
"configuration": "acme-scale-config",
"queue": "acmequeue-catchup"
},
"schedule": {
"schedule_info": {
"run_every": 300
},
"schedule_type": "frequency"
},
"total_runs": 33
},
{
"application_id": "cisco_incident_state_sync",
"entry_id": "incident sync every 5 mins on .223",
"last_run": {
"href": "/api/v1/tasks/isapp-52b19097-e0bf-450b-948c-487aff33fc3b",
"start_time": 1544474203
},
"params": {
"configuration": "em7-host-223"
},
"schedule": {
"schedule_info": {
"run_every": 300
},
"schedule_type": "frequency"
},
"total_runs": 2815
},
{
"application_id": "cisco_incident_state_sync",
"entry_id": "incident sync every 5 mins acme",
"last_run": {
"href": "/api/v1/tasks/isapp-dde1dba5-2343-4026-8801-35a02e4e57a1",
"start_time": 1544474202
},
"params": {
"configuration": "acme-scale-config",
"queue": "acmequeue"
},
"schedule": {
"schedule_info": {
"run_every": 300
},
"schedule_type": "frequency"
},
"total_runs": 1587
},
{
"application_id": "cisco_correlation_queue_manager",
"entry_id": "qmanager .223",
"last_run": {
"href": "/api/v1/tasks/isapp-cb7cc2e5-eab1-474a-907a-055f26dbc36d",
"start_time": 1544474203
},
"params": {
"configuration": "em7-host-223"
},
"schedule": {
"schedule_info": {
"run_every": 300
},
"schedule_type": "frequency"
},
"total_runs": 1589
}
]
Event throughput testing with the Integration Service only:
The following test cases can be attempted with any number of dedicated customer queues. The expectation is that each customer queue will be filled with 10,000 events, and then you can time how long it takes to process through all 10,000 events in each queue.
Event throughput testing with SL1 triggering the Integration Service:
This test is executed in the same manner as the event throughput test described above, but in this scenario you never disable the steprunners, and you let the events process through the Integration Service as SL1 alerts it about them.
The difference in timing between this test and the previous test shows how much of a delay SL1 adds in alerting the Integration Service about an event and subsequently syncing it.
Separated queue test scenarios
This section covers the items you should back up in your Integration Service system, and how to restore backups.
When taking backups of the Integration Service environment, collect the following information from the host level of your primary manager node (this is the node from which you control the stack):
Files in /opt/iservices/scripts:
All files in /etc/iservices/
In addition to the above files, make sure you are storing Couchbase dumps somewhere by using the cbbackup command, or the "Integration Service Backup" integration application.
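A minimal sketch of archiving those host-level files from the primary manager node is shown below; the destination path is an arbitrary placeholder.

# archive the compose scripts and Integration Service configuration
tar -czvf /tmp/is-host-backup.tar.gz /opt/iservices/scripts /etc/iservices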
You should do a data-only restore if:
You should do a full restore if:
Once failed over, be sure to disable the "Integration Service Backup" integration application from being scheduled.
If multiple RabbitMQ cluster nodes are lost at once, the cluster might enter a "Network Partition" or "Split-brain" state. In this state, the queues will become paused if there is no auto-handling policy applied. The cluster will remain paused until a user takes manual action. To ensure that the cluster knows how to handle this scenario as the user would want, and not pause waiting for manual intervention, it is essential to set a partition handling policy.
For more information on RabbitMQ Network partition (split-brain) state, how it can occur, and what happens, see: http://www.rabbitmq.com/partitions.html.
By default, ScienceLogic sets the partition policy to autoheal in favor of continued service if any nodes go down. However, depending on the environment, you might wish to change this setting.
For more information about the automatic split-brain handling strategies that RabbitMQ provides, see: http://www.rabbitmq.com/partitions.html#automatic-handling.
autoheal is the default setting set by SL1, and as such, queues should always be available, though if multiple nodes fail, some messages may be lost.
If you are using pause_minority mode and a "split-brain" scenario occurs for RabbitMQ in a single cluster, when the split-brain situation is resolved, new messages that are queued will be mirrored (replicated between all nodes once again).
ScienceLogic's recommendations for applying changes to the default policy include the following:
The best way to change the SL1 default split-brain strategy is to make a copy of the RabbitMQ config file from a running RabbitMQ system, add your change, and then mount that config back into the appropriate place to apply your overrides.
docker cp <container-id>:/etc/rabbitmq/rabbitmq.conf /destination/on/host
change cluster_partition_handling value
mount "[/path/to/config]/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf"
To maintain Swarm health, ScienceLogic recommends that you deploy some swarm managers that do not take any of the workload of the application. The only purpose for these managers is to maintain the health of the swarm. Separating these workloads ensures that a spike in application activity will not affect the swarm clustering management services.
ScienceLogic recommends that these systems have 2 CPU and 4 GB of memory.
To deploy a drained manager node:
docker node update --availability drain <node>
Draining the node ensures that no containers will be deployed to it.
For more information, see https://docs.docker.com/engine/swarm/admin_guide/.
There are two potential update workflows for updating the Integration Service cluster. The first workflow involves using a Docker registry that is connectable to swarm nodes on the network. The second workflow requires manually copying the Integration Service RPM or containers to each individual node.
This section covers the sizing considerations for the Couchbase, RabbitMQ, Redis, contentapi, and GUI services.
The initial sizing provided for Couchbase nodes in the multi-tenant cluster (6 CPUs and 56 GB memory) should be more than enough to handle multiple customer event syncing workloads.
ScienceLogic recommends monitoring the CPU and memory utilization percentages of the Couchbase nodes to determine when to increase resources, such as when memory and CPU are consistently above 80%.
The only special consideration for RabbitMQ sizing is how many events you plan to have in the queue at once.
Every 10,000 events populated in the Integration Service queue will consume approximately 1.5 GB of memory.
This memory usage is drained as soon as the events leave the queue.
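For example, if you expect a customer queue to hold as many as 50,000 events at once, plan for roughly 7.5 GB of RabbitMQ memory for that backlog (50,000 / 10,000 x 1.5 GB).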
The initial sizing deployment for Redis should be sufficient for multiple customer event syncing.
The only time you might need to increase the memory allocated to Redis is if you are attempting to view logs from a previous run and the logs are not available. A lack of run logs from a recently executed integration indicates that the Redis cache does not have enough room to store all of the step and log data from recently executed runs.
The contentapi services sizing should remain limited at 2 GB memory, as is set by default.
If you notice timeouts or HTTP 500 errors when there is a large load going through the Integration Service system, you might want to increase the number of contentapi replicas.
For more information, see the placement considerations, and ensure that the API replicas are deployed in the same location as the Redis instance.
The GUI service should not need to be scaled up at all, as it merely acts as an ingress proxy to the rest of the Integration Service services.
Refer to the worker sizing charts provided by ScienceLogic for the recommended steprunner sizes.
Flower and Scheduler do not need to be scaled up at all.
An issue exists where, if there is latency between contentapi and Redis, the Integrations page might not load. This issue is caused by the API making too many calls before returning. The added latency for each individual call can cause the overall endpoint to take longer to load than the designated timeout window of thirty seconds.
The only impact of this issue is that the applications page will not load. There is no operational impact on the integrations as a whole, even if the workers are in a separate geography from Redis.
There is also no risk to High Availability (HA) by placing the API and Redis services in the same geography. If that geography drops out for any reason, the containers will be restarted automatically in the other location.
Tool | Issue | Symptoms | Cause | Solution |
---|---|---|---|---|
Docker Visualizer | Docker Visualizer shows some services as "undefined". |
When viewing the Docker Visualizer user interface, some services are displayed as "undefined", and states aren't accurate.
Impact: Cannot use Visualizer to get the current state of the stack. |
Failing docker stack deployment:
https://github.com/ |
Ensure your stack is healthy, and services are deployed correctly. If no services are failing and things are still showing as undefined, elect a new swarm leader.
To prevent: Ensure your configuration is valid before deploying. |
RabbitMQ | RabbitMQ queues encountered a node failure and are in a "Network partition state" (split-brain scenario). |
The workers are able to connect to the queue, and there are messages on the queue, but the messages are not being distributed to the workers.
Log in to the RabbitMQ admin user interface, which displays a message similar to "RabbitMQ experienced a network partition and the cluster is paused".
Impact: The RabbitMQ cluster is paused and waiting for user intervention to clean the split-brain state.
|
Multi-node failure occurred, and rabbit wasn't able to determine who the new master should be. This also will only occur if there is NO partition handling policy in place (see the resiliency section for more information)
Note: ScienceLogic sets the autoheal policy by default |
Handle the split-brain partition state and resynchronize your RabbitMQ queues.
Note: This is enabled by default.
To prevent: Set a partition handling policy.
See the Resiliency section for more information. |
RabbitMQ, continued |
Execing into the RabbitMQ container and running rabbitmqctl cluster_status shows nodes in a partition state like the following:
[{nodes, [{disc, ['rabbit@rabbit_node1.isnet','rabbit@rabbit_node2.isnet', 'rabbit@rabbit_node3.isnet','rabbit@rabbit_node4.isnet', 'rabbit@rabbit_node5.isnet','rabbit@rabbit_node6.isnet']}]}, {running_nodes,['rabbit@rabbit_node4.isnet']}, {cluster_name,<<"rabbit@rabbit_node1">>}, {partitions, [{'rabbit@rabbit_node4.isnet', ['rabbit@rabbit_node1.isnet','rabbit@rabbit_node2.isnet', 'rabbit@rabbit_node3.isnet','rabbit@rabbit_node5.isnet', 'rabbit@rabbit_node6.isnet']}]}, {alarms,[{'rabbit@rabbit_node4.isnet',[]}]}] |
|||
Integration Service steprunners and RabbitMQ | Workers constantly restarting, no real error message. |
Workers of a particular queue are not stable and constantly restart.
Impact: One queue's workers will not be processing. |
Multi-node failure in RabbitMQ, when it loses majority and can not failover.
Queues go out of sync because of broken swarm. |
Recreate queues for the particular worker.
To prevent: Deploy enough nodes to ensure quorum for failover. |
Couchbase | Couchbase node is unable to restart due to indexer error. |
This issue can be monitored in the Couchbase logs:
Service 'indexer' exited with status 134. Restarting. Messages: sync.runtime_Semacquire(0xc4236dd33c)
Impact: One couchbase node becomes corrupt. |
Memory is removed from the database while it is in operation (memory must be dedicated to the VM running Couchbase).
The Couchbase node encounters a failure, which causes the corruption. |
Ensure that the memory allocated to your database nodes is dedicated and not shared among other VMs.
To prevent: Ensure that the memory allocated to your database nodes is dedicated and not shared among other VMs. |
Couchbase | Couchbase is unable to rebalance. |
Couchbase nodes will not rebalance, usually with an error saying "exited by janitor".
Impact: Couchbase nodes cannot rebalance and provide even replication. |
Network issues: missing firewall rules or blocked ports.
The Docker swarm network is stale because of a stack failure. |
Validate that all firewall rules are in place, and that no external firewalls are blocking ports.
Reset the Docker swarm network status by electing a new swarm leader.
To prevent: Validate the firewall rules before deployment.
|
Integration Service steprunners to Couchbase | Steprunners unable to communicate to Couchbase |
Steprunners unable to communicate to Couchbase database, with errors like "client side timeout", or "connection reset by peer".
Impact: Steprunners cannot access the database. |
Missing Environment variables in compose:
Check the db_host setting for the steprunner and make sure it specifies all available Couchbase hosts.
Validate couchbase settings, ensure that the proper aliases, hostname, and environment variables are set.
Stale docker network. |
Validate the deployment configuration and network settings of your docker-compose. Redeploy with valid settings.
In the event of a swarm failure, or stale swarm network, reset the Docker swarm network status by electing a new swarm leader.
To prevent: Validate hostnames, aliases, and environment settings before deployment.
|
Flower | Worker display in flower is not organized and hard to read, and it shows many old workers in an offline state. |
Flower shows all containers that previously existed, even if they failed, cluttering the dashboard.
Impact: Flower dashboard is not organized and hard to read. |
Flower running for a long time while workers are restarted or coming up/coming down, maintaining the history of all the old workers.
Another possibility is a known issue in task processing due to the --max-tasks-per-child setting. At high CPU workloads, the max-tasks-per-child setting causes workers to exit prematurely. |
Restart the flower service by running the following command:
docker service update --force iservices_flower
You can also remove the --max-tasks-per-child setting in the steprunners. |
All containers on a particular node | All containers on a particular node do not deploy. |
Services are not deploying to a particular node, but instead they are getting moved to other nodes.
Impact: The node is not running anything. |
One of the following situations could cause this issue:
Invalid label deployment configuration.
The node does not have the containers you are telling it to deploy.
The node is missing a required directory to mount into the container. |
Make sure the node that you are deploying to is labeled correctly, and that the services you expect to be deployed there are properly constrained to that system.
Go through the troubleshooting steps of "When a docker service doesn't deploy" to check that the service is not missing a requirement on the host.
Check the node status for errors: docker node ls
To prevent: Validate your configuration before deploying. |
All containers on a particular node | All containers on a particular node periodically restart at the same time. |
All containers on a particular node restart at the same time.
The system logs indicate an error like:
“error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
Impact: All containers restart on a node. |
This issue only occurs in single-node deployments when the only manager allocates too many resources to its containers, and the containers all restart since the swarm drops.
The manager node gets overloaded by container workloads and is not able to handle swarm management, and the swarm loses quorum. |
Use some drained manager nodes for swarm management to separate the workloads.
To prevent: |
General Docker service | Docker service does not deploy. Replicas remain at 0/3 | Docker service does not deploy. | There are a variety of reasons for this issue, and you can reveal most causes by checking the service logs to address the issue. | Identify the cause of the service not deploying. |
Integration Service user interface | The Timeline or the Integrations page do not appear in the user interface. | The Timeline is not showing accurate information, or the Integrations page is not rendering. |
One of the following situations could cause these issues:
Indexes do not exist on a particular Couchbase node.
Latency between the API and the redis service is too great for the API to collect all the data it needs before the 30-second timeout is reached.
The indexer can't keep up to a large number of requests, and Couchbase requires additional resources to service the requests. |
Solutions:
Verify that indexes exist.
Place the API and Redis containers in the same geography so there is little latency. This issue will be fixed in a future IS release.
Increase the amount of memory allocated to the Couchbase indexer service. |
This section contains a set of solutions and explanations for a variety of issues.
Sometimes when managers lose connection to each other, either through latency or a workload spike, there are instances when the swarm needs to be reset or refreshed. By electing a new leader, you can effectively force the swarm to redo service discovery and refresh the metadata for the swarm. This procedure is highly preferred over removing and re-deploying the whole stack.
To elect a new swarm leader:
docker node ls
docker node demote <node>
docker node ls
docker node promote <node>
If you do not want to retain any messages in the queue, the following procedure is the best method for recreating the queues. If you do have data that you want to retain, you can resynchronize RabbitMQ queues.
To recreate RabbitMQ queues:
rabbitmqadmin delete queue name=name_of_queue
rabbitmqadmin delete exchange name=name_of_queue
After you delete the queues, the queues will be recreated the next time a worker connects.
If your RabbitMQ cluster ends up in a "split-brain" or partitioned state, you might need to manually decide which node should become the master. For more information, see http://www.rabbitmq.com/partitions.html#recovering.
To resynchronize RabbitMQ queues:
docker service scale iservices_rabbitmq<x>=0
Step 1: Obtain the ID of the failed container for the service
Run the following command for the service that failed previously:
docker service ps --no-trunc <servicename>
For example:
docker service ps --no-trunc iservices_redis
From the command result above, we see that one container with the ID 3s7s86n45skf failed previously running on node is-scale-03 (non-zero exit) and another container was restarted in its place.
At this point, you can ask the following questions:
At this point, the cause of the issue is not a deploy configuration issue, and it is not an entire node failure. The problem exists within the service itself. Continue to Step 2 if this is the case.
Step 2: Check for any interesting error messages or logs indicating an error
Using the ID obtained in Step 1, collect the logs from the failed container with the following command:
docker service logs <failed-id>
For example:
docker service logs 3s7s86n45skf
Review the service logs for any explicit errors or warning messages that might indicate why the failure occurred.
Index stuck in “created” (not ready) state
This situation usually occurs when a node starts creating an index, but another index creation was performed at the same time by another node. After the index is created, you can run a simple query to build the index which will change it from created to “ready”:
BUILD INDEX ON `content`(`idx_content_content_type_config_a3f867db_7430_4c4b_b1b6_138f06109edb`) USING GSI
Deleting an index
If you encounter duplicate indexes, such as a situation where indexes were manually created more than once, you can delete an index:
DROP index content.idx_content_content_type_config_d8a45ead_4bbb_4952_b0b0_2fe227702260
Recreating all indexes on a particular node
To recreate all indexes on a particular Couchbase node, exec into the couchbase container and run the following command:
Initialize_couchbase -s
Running this command recreates all indexes, even if the indexes already exist.
To remove a Couchbase node and re-add it to the cluster:
rm -rf /var/data/couchbase/*
Backup
cbbackup http://couchbase.isnet:8091 /opt/couchbase/var/backup -u [user] -p [password] -x data_only=1
Delete Couchbase
rm -f /var/data/couchbase/*
Restore
cbrestore /opt/couchbase/var/backup http://couchbase.isnet:8091 -b content -u <user> -p <password>
cbrestore /opt/couchbase/var/backup http://couchbase.isnet:8091 -b logs -u <user> -p <password>
This section describes how to upgrade the Integration Service in a multi-tenant environment with as little downtime as possible.
Validate Cluster states
Validate Backups exist
Clean out old container images if desired
Before upgrading to the latest version of the Integration Service, check the local file system to see if there are any older versions taking up space that you might want to remove. These containers exist both locally on the file system and in the internal docker registry. To view any old container versions, check the /opt/iservices/images directory. ScienceLogic recommends that you keep at least the last version of containers so that you can downgrade if necessary.
Cleaning out images is not mandatory, but it is just a means of clearing out additional space on the system if necessary.
To remove old images:
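The exact cleanup steps depend on your environment, but a minimal sketch is to delete unneeded image archives from /opt/iservices/images and remove unused images from Docker; the file and image names below are placeholders.

# list the container image archives shipped with previous RPM versions
ls -la /opt/iservices/images

# remove an archive you no longer need
rm /opt/iservices/images/<old-image-archive>.tar

# remove the corresponding old image from the local docker cache
docker rmi <repository>/<image>:<old-tag>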
Install the new RPM
The first step of upgrading is to install the new RPM on all systems in the stack. Doing so will ensure that the new containers are populated onto the system (if using that particular RPM), and any other host settings are changed. RPM installation does not pause any services or affect the docker system in any way, other than using some resources.
The Integration Service has two RPMs, one with containers and one without. If you have populated an internal docker registry with docker containers, you can install the RPM without containers built in. If no internal docker repository is present, you must install the RPM which has the containers built in it. Other than the containers, there is no difference between the RPMs.
For advanced users, installing the RPM can be skipped. However, this means that the user is completely responsible for maintaining the docker-compose and host-level configurations.
To install the RPM:
rpm -Uvh <new-rpm-file>
Compare compose file changes and resolve differences
After the RPM is installed, you will notice a new docker-compose file is placed at /etc/iservices/scripts/docker-compose.yml. As long as your environment-specific changes exist solely in the compose-override file, all user changes and new version updates will be resolved into that new docker-compose.yml file.
ScienceLogic recommends that you check the differences between the two docker-compose files. You should validate that:
Make containers available to systems
After you apply the host-level updates, you should make sure that the containers are available to the system.
If you upgraded using the RPM with container images included, the containers should already be on all of the nodes; you can run docker images to validate that the new containers are present. If this is the case, you can skip to the next section.
If the upgrade was performed using the RPM which did not contain the container images, ScienceLogic recommends that you run the following command to make sure all nodes have the latest images:
docker-compose -f <new_docker_compose_file> pull
This command validates that the containers specified by your compose file can be pulled and reached from the nodes. While not required, you might want to make sure that the images can be pulled before starting the upgrade. If the images are not pulled manually, they will automatically be pulled by Docker when the new image is called for by the stack.
To perform the upgrade on a clustered system with little downtime, the Integration Service re-deploys services to the stack in groups. To do this, the Integration Service gradually makes the updates to groups of services and re-runs docker stack deploy for each change. To ensure that no unintended services are updated, start off using the same docker-compose file that was previously used to deploy. Reusing the same docker-compose file and updating only sections at a time ensures that only the intended services to be updated are affected at any given time.
Avoid putting all the changes in a single docker-compose file and doing a new docker stack deploy with all changes at once. If downtime is not a concern, you can update all services at once, but updating services gradually allows you to have little or no downtime.
Before upgrading any group of services, be sure that the docker-compose file you are deploying from is exactly identical to the currently deployed stack (the previous version). Start with the same docker-compose file and update it for each group of services as needed.
Upgrade Redis, Scheduler, and Flower
The first group to update includes the Redis, Scheduler, and Flower services. If desired, this group can be upgraded along with any other group.
To update:
docker stack deploy -c <old_compose_with_small_changes> iservices
Example image definition of this upgrade group:
services:
contentapi:
image: repository.auto.sciencelogic.local:5000/is-api:1.8.1
couchbase:
image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
couchbase-worker:
image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
flower:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
gui:
image: repository.auto.sciencelogic.local:5000/is-gui:1.8.1
rabbitmq:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
rabbitmq2:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
rabbitmq3:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
redis:
image: repository.auto.sciencelogic.local:5000/is-redis:4.0.11-2
scheduler:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
steprunner:
image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1
couchbase-worker2:
image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
steprunner2:
image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1
Redis Version
As the Redis version might not change with every release of the Integration Service, there might not be any changes needed in the upgrade for Redis. This can be expected and is not an issue.
Flower Dashboard
Due to a known issue addressed in version 1.8.3 of the Integration Service, the Flower Dashboard might not display any workers. Flower eventually picks up the new workers when they are restarted in the worker group. If this is a concern, you can perform the Flower upgrade in the same group as the workers.
The next group of services to update together are the RabbitMQ/Couchbase database services, as well as the GUI. Because the core services are individually defined and "pinned" to specific nodes, upgrade these two services at the same time, on a node-by-node basis. In between each node upgrade, wait and validate that the node rejoins the Couchbase and Rabbit clusters and re-balances appropriately.
Because there will always be two out of three nodes running these core services, this group should not cause any downtime for the system.
Rabbit/Couchbase Versions
The Couchbase and RabbitMQ versions used might not change with every release of the Integration Service. If there is no update or change to be made to the services, you can ignore this section for RabbitMQ or Couchbase upgrades, or both. Assess the differences between the old and new docker-compose files to check if there is an image or environment change necessary for the new version. If not, you can move on to the next section.
Update Actions (assuming three core nodes)
To update first node services:
docker stack deploy -c <compose_file>
First node Couchbase update considerations:
Special GUI consideration with 1.8.3
In the upgrade to version 1.8.3 of the Integration Service, the Couchbase and RabbitMQ user interface ports will be exposed through the Integration Service user interface with HTTPS. To ensure there is no port conflict between services and the Integration Service user interface, ensure that the Couchbase and RabbitMQ user interface port mappings are removed or modified from the default (8091) admin port. To avoid conflicts, make sure the new Integration Service user interface definition does not conflict with the Couchbase or RabbitMQ definitions.
The Integration Service user interface will not update until all port conflicts are resolved. You can upgrade the Integration Service user interface at any time after this has been done, but be sure to first review the Update the GUI topic, below.
You can manually remove port mappings from a service with the following command, though the command will restart the service: docker service update --publish-rm published=8091,target=8091 iservices_couchbase
Example docker-compose with images and JOIN_ON for updating the first node:
services:
contentapi:
image: repository.auto.sciencelogic.local:5000/is-api:1.8.1
couchbase:
image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
environment:
JOIN_ON: "couchbase-worker2"
couchbase-worker:
image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
flower:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
gui:
image: repository.auto.sciencelogic.local:5000/is-gui:hotfix-1.8.3
rabbitmq:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
rabbitmq2:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
rabbitmq3:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
redis:
image: repository.auto.sciencelogic.local:5000/is-redis:4.0.11-2
scheduler:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
steprunner:
image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1
couchbase-worker2:
image: repository.auto.sciencelogic.local:5000/is-couchbase:1.8.1
steprunner2:
image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1
Update second, and third node services
To update the second and third node services, repeat the steps from the first node on each node until all nodes are re-clustered and available. Be sure to check the service port mappings to ensure that there are no conflicts (as described above), and remove any HTTP ports if you choose.
You can update the GUI service along with any other group, but due to the port mapping changes in version 1.8.3 of the Integration Service, you should update this service after the databases and RabbitMQ nodes have been updated, and their port mappings no longer conflict.
Since the GUI service provides all ingress proxy routing to the services, there might be a very small window where the Integration Service might not receive API requests as the GUI (proxy) is not running. This downtime is limited to the time it takes for the GUI container to restart.
To update the user interface:
You should update the workers and contentapi last. Because these services use multiple replicas (multiple steprunner or contentapi containers running per service), you can rely on Docker to incrementally update each replica of the service individually. By default, when a service is updated, it will update one container of the service at a time, and only after the previous container is up and stable will the next container be deployed.
You can utilize additional Docker options in docker-compose to set the behavior of how many containers to update at once, when to bring down the old container, and what happens if a container upgrade fails. See the update_config and rollback_config options available in Docker documentation: https://docs.docker.com/compose/compose-file/.
Upgrade testing was performed by ScienceLogic using default options. An example where these settings are helpful is to change the parallelism of update_config so that all worker containers of a service update at the same time.
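As an illustrative sketch, an update_config block that updates all replicas of a worker service at the same time might look like the following; parallelism: 0 tells Docker to update an unlimited number of containers at once, and the service name matches the earlier steprunner-acme example.

steprunner-acme:
  deploy:
    replicas: 15
    update_config:
      parallelism: 0
      order: stop-first
    rollback_config:
      parallelism: 0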
The update scenario described below takes extra precautions and only updates one node of workers per customer at a time. If you decide, you can also safely update all workers at once.
To update the workers and contentapi:
Example docker-compose definition with one of two worker nodes and contentapi updated:
services:
contentapi:
image: repository.auto.sciencelogic.local:5000/is-api:hotfix-1.8.3
deploy:
replicas: 3
couchbase:
image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
environment:
JOIN_ON: "couchbase-worker2"
couchbase-worker:
image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
flower:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
gui:
image: repository.auto.sciencelogic.local:5000/is-gui:hotfix-1.8.3
rabbitmq:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
rabbitmq2:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
rabbitmq3:
image: repository.auto.sciencelogic.local:5000/is-rabbit:3.7.7-2
redis:
image: repository.auto.sciencelogic.local:5000/is-redis:4.0.11-2
scheduler:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
steprunner:
image: repository.auto.sciencelogic.local:5000/is-worker:hotfix-1.8.3
couchbase-worker2:
image: repository.auto.sciencelogic.local:5000/is-couchbase:hotfix-1.8.3
steprunner2:
image: repository.auto.sciencelogic.local:5000/is-worker:1.8.1