Appendix A: Configuring the Skylar Automation System for High Availability


This appendix describes how to create High Availability deployments to protect the data in Skylar Automation (formerly PowerFlow).

Types of High Availability Deployments for Skylar Automation

The following table contains a set of ratings that depict the level of resiliency enabled by various Skylar Automation deployment types. The higher the rating, the more resilient the Skylar Automation system, not just from a node failure perspective, but also from a throughput and load-balancing perspective.

Deployment Type | Resiliency Rating | Typical Audience
Single-node deployment | F | Users who want Skylar Automation running, but do not care about failover.
Three-node cluster | B+ | Users who want Skylar Automation running, and also want support for automatic failover for one-node failure.
3+ node cluster with separate workers (at least 4 nodes) | A- | Users who want automatic failover for one-node failure, and intend to have very CPU- or memory-intensive tasks executing on the workers constantly.
3+ node cluster with separate workers, and drained manager nodes (at least 6 nodes) | A | Users who want automatic failover for one-node failure, intend to have very CPU- or memory-intensive tasks executing on the workers, and want to completely mitigate risks of resource contention between services.

You can start with any deployment type, and at a later time scale up to any other deployment type as needed. For example, you can start with a single-node deployment, and then at a later date add three more nodes to enable a 3+ node cluster with separate workers.

The deployments listed in the table are just the standards for deployment. For very high-scale customers, a more advanced deployment might be necessary. For deployment requirements like this, please contact ScienceLogic Support.

If you are deploying Skylar Automation without a load balancer, you can only use the deployed IP address as the management user interface. If you use another node to log in to the Skylar Automation system, you will get an internal server error. Also, if the deployed node is down, you must redeploy the system using the IP address for another active node to access the management user interface.

There is no support for active or passive Disaster Recovery. ScienceLogic recommends that your Skylar Automation Disaster Recovery plans include regular backups and restoring from backup. For more information, see Backing up Data.

The standard deployments are listed below in the following topics:

You can use a command-line utility called skyautocontrol (skyautoctl) that performs multiple administrator-level actions on either the node or the cluster. You can use this script to automate the configuration of a three-node cluster. For more information, see Automating the Configuration of a Three-Node Cluster.

Standard Single-node Deployment (1 Node)

Single-node deployment is the standard deployment that comes along with the ISO and RPM installation. This is the default deployment if you install the ISO and run the pull_start_iservices.sh script.

This deployment provides a single node running the Skylar Automation system. If this node fails, the system will not be operational.

Requirements

One node, 8 CPU, 24 GB memory minimum, preferably 34 GB to 56 GB memory, depending on workload sizes. For more information, see System Requirements.
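As a rough pre-install check, the minimums above (8 CPUs, 24 GB of memory) can be compared against a host. This is only a sketch: the function names are hypothetical, it assumes a Linux host with nproc and /proc/meminfo, and it does not replace the System Requirements topic.

```shell
#!/bin/sh
# Minimums taken from the Requirements text above: 8 CPUs and 24 GB of memory.
MIN_CPUS=8
MIN_MEM_GB=24

# meets_minimum CPUS MEM_GB
# Succeeds (exit 0) only when both values reach the documented minimums.
meets_minimum() {
    [ "$1" -ge "$MIN_CPUS" ] && [ "$2" -ge "$MIN_MEM_GB" ]
}

# detect_cpus prints the number of CPUs available on the local host.
detect_cpus() {
    nproc
}

# detect_mem_gb prints total memory in GB, rounded to the nearest integer.
# MemTotal is reported in kB and is usually slightly below the provisioned
# size, which is why the value is rounded rather than truncated.
detect_mem_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo
}
```

On a candidate node, you could run: meets_minimum "$(detect_cpus)" "$(detect_mem_gb)" && echo "meets minimum"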

Risks

A single node supports no data replication, no queue mirroring, and no failover capabilities.

Configuration

This configuration is available as a default deployment with the docker-compose included in the Skylar Automation 2.0.0 or later ISO or RPM.

Standard Three-node Cluster (3 Nodes)

The following High Availability deployment is an example of a three-node cluster:

  • Each node in the Swarm is a Swarm Manager.
  • All Swarm nodes are located within the same data center.

The three-node cluster is the most basic option providing full High Availability and data replication support among three nodes. In this deployment, each of the three nodes runs the same services in a clustered environment, which provides failover and data loss prevention capabilities. This deployment option will satisfy most High Availability needs, but it does not mitigate the risk that worker operations affect and degrade the database and queue services, because all services are running on the same nodes.

This deployment provides:

  • Automatic failover for the failure of one out of three nodes: If one node in the cluster fails, automatic failover occurs, and Skylar Automation continues to operate on the remaining two nodes.
  • Full data replication between all three nodes. All nodes have a copy of the same data replicated across all three nodes. If one or two nodes fail, you will not experience data loss in the database or in the queues.
  • Full queue mirroring between all three nodes. All nodes have a mirror of the queues defined in the Skylar Automation environment. If one or two nodes fail, the system still retains messages in queues using the autoheal policy by default. For more information about autoheal behavior in RabbitMQ, see The RabbitMQ Split-brain Handling Strategy.

Requirements

Three nodes, 8 CPU, 24 GB memory minimum, preferably 34 GB to 56 GB memory, depending on workload sizes. For more information, see System Requirements.

Risks

When only three nodes are allocated for High Availability, the following risks are present:

  • Over-utilization of nodes causing clustering issues. In a three-node cluster, worker containers and Docker Swarm Managers run on the same nodes as the database and queue services. As a result, if a node is not provisioned correctly, there could be some resource contention. If a node reaches 100% CPU, Docker Swarm cluster operations might fail, causing the node to completely restart, and causing a failover or other unexpected behavior.
  • Over-utilization of worker nodes causing database or queue issues. Because all services share the same nodes in this configuration, if worker operations become extremely CPU- or memory-intensive, the system might try to use resources needed by the database or queue. If this happens, you might encounter failures when querying the database or using the queues.

Mitigating Risks

The above risks can be mitigated by ensuring that the node is deployed with adequate CPU and memory for the workloads that you plan to run on the node. Memory limits are placed on containers by default. If needed, you could also add CPU limits to worker containers to further prevent resource contention.

Configuration

Skylar Automation uses a docker-compose-override.yml file to persistently store user-specific configurations for containers, such as proxy settings, replica settings, additional node settings, and deploy constraints. The user-specific changes are kept in this file so that they can be re-applied when the /opt/iservices/scripts/docker-compose.yml file is completely replaced on an RPM upgrade, ensuring that no user-specific configurations are lost. By default, only the main core services are included in the docker-compose-override.yml file. If you need extra services, add them to the file as needed.

Below is an example docker-compose-override.yml file for Skylar Automation:

3+ Node Cluster with Separate Workers (4 or More Nodes)

The three-node cluster with separate workers is a slight variation of the standard three-node cluster. With this deployment strategy, all worker operation load is run by a separate independent node. This is preferable over the standard three-node deployment, because it completely prevents worker operations from stealing resources from the databases or queues.

Since steprunner workload is entirely on dedicated servers, you have greater ability to scale up to more workers, or even add additional nodes of workers to the system, without affecting critical database or queue operations.

This deployment provides a complete separation of worker processing from the database and queue processing, which is very helpful for users who have very CPU-intensive tasks that execute frequently.

The following High Availability deployment adds Docker Swarm worker nodes where steprunners can be constrained. This lets you continue to scale out new worker nodes as the load increases. This also lets you distribute steprunners based on workloads. Core services include ContentAPI, RabbitMQ, and Couchbase.

You can add drained Docker Swarm Manager nodes to increase fault tolerance of the Swarm, and to ensure that the orchestration of the Swarm is not impeded by large workloads on the core nodes.

The maximum Couchbase cluster with fully replicated nodes is four. Anything greater than four will not have a full replica set and will auto-shard data across additional nodes. There is no way as of this version of Couchbase to set the placement of the replicas. Redis replication and clustering is not currently supported in this version of Skylar Automation.

Requirements

Three nodes, 8 CPU, 24 GB memory minimum, preferably 34 GB to 56 GB memory, depending on workload sizes. For more information, see System Requirements.

One or more worker nodes with your choice of sizing.

Worker Node Sizing

Worker nodes can be sized to any CPU or memory constraints, though the greater the memory and CPU, the more workers the node can run. The minimum size of a worker node is 2 CPU, 4 GB memory.

Risks

Core node over-utilization could cause Swarm clustering problems. Because the Swarm managers run on the same nodes as the core services, heavily loaded databases and queues might contend with the Swarm hosts for resources. In this case, the Swarm might restart itself and the services running on that node. This is less likely to occur when workers run on their own dedicated nodes.

Mitigating Risks

The above risks can easily be mitigated by ensuring the node is deployed with adequate CPU and memory for the workloads it is expected to run. Additionally, you can apply CPU and memory limits to the database or queue containers so that there will always be enough resources allocated to the host to prevent this scenario. For more information, see Configuring Additional Elements of Skylar Automation.

Configuration

Using this configuration consists of:

  • Joining the standard three-node Swarm cluster with one or more nodes as a Swarm worker.
  • Labeling each additional "worker" node with a Swarm label "worker". For more information, see Creating a Node Label. You can also use the worker node role to restrict the steprunners to run only in the Swarm worker nodes using node.role==worker in the constraints section in the docker-compose file.
  • In addition to the standard three-node deployment, you should update the steprunners to run on a dedicated node in the docker-compose file:

    steprunner3:
      deploy:
        placement:
          constraints:
          - node.labels.types == worker
  • You can edit the value of max_replicas_per_node in the docker-compose-override file to restrict the number of replicas that will run on each Swarm node. The default value is 5:

    steprunner:
      deploy:
        replicas: 15
        ... 
        placement:
          max_replicas_per_node: 5
      environment:
        ... 

    The max_replicas_per_node option is available with Compose file format version 3.8 or later. Add version: '3.8' at the start of the docker-compose file to ensure compatibility.
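When you combine replicas with max_replicas_per_node, it helps to know how many eligible nodes the steprunner service needs. The sketch below is plain ceiling division; the function name is hypothetical and is not part of Skylar Automation or Docker.

```shell
#!/bin/sh
# min_nodes_for_replicas REPLICAS MAX_PER_NODE
# Prints the smallest number of eligible nodes that can hold REPLICAS tasks
# when each node may run at most MAX_PER_NODE of them (ceiling division).
min_nodes_for_replicas() {
    replicas=$1
    max_per_node=$2
    echo $(( (replicas + max_per_node - 1) / max_per_node ))
}
```

With the values from the example above (replicas: 15, max_replicas_per_node: 5), min_nodes_for_replicas 15 5 prints 3. If fewer eligible nodes are available, some replicas should be expected to remain unscheduled.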

3+ Node Cluster with Separate Workers and Drained Manager Nodes (6 or More Nodes)

This deployment option is the most robust of the one-node auto-failover deployments, and completely mitigates known risks for resource contention in clusters.

This configuration provides everything that the 3+ node cluster with dedicated workers provides, with the addition of drained Swarm Managers. The drained Swarm Managers mitigate the risk of database or queue processing causing contention of resources for the Swarm clustering operations at the host level.

This deployment should only be used for large deployments of Skylar Automation. This deployment separates out all the core services onto their own dedicated worker node and lets you distribute steprunners based on workloads:

You can add drained Docker Swarm Manager nodes to increase fault tolerance of the Swarm, and to ensure that the orchestration of the Swarm is not impeded by large workloads on the core nodes.

The maximum Couchbase cluster with fully replicated nodes is four. Anything greater than four will not have a full replica set and will auto-shard data across additional nodes. There is no way as of this version of Couchbase to set the placement of the replicas. Redis replication and clustering is not currently supported in this version of Skylar Automation.

Requirements

Three nodes, 8 CPU, 24 GB memory minimum, preferably 34 GB to 56 GB memory, depending on workload sizes. For more information, see System Requirements.

Also, three nodes, 2 CPU, 4 GB memory for the Swarm Manager.

Risks

None.

Configuration

Use the same docker-compose-override.yml file found in Standard Three-node Cluster (3 Nodes).

Next, add the additional three nodes to the cluster as managers, and drain them of all services (see Using Drained Managers to Maintain Swarm Health). Promote the drained nodes to Swarm Managers, and make all other nodes workers.
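The promote, drain, and demote actions described above map to standard Docker CLI commands. The following sketch only prints the commands so that you can review them before running anything. The function names and the node names used with them are hypothetical, and the authoritative procedure remains the one in Using Drained Managers to Maintain Swarm Health.

```shell
#!/bin/sh
# print_drained_manager_commands NODE...
# Prints (does not run) the commands that promote each named node to a Swarm
# manager and then drain it so that no service tasks are scheduled on it.
print_drained_manager_commands() {
    for node in "$@"; do
        echo "docker node promote $node"
        echo "docker node update --availability drain $node"
    done
}

# print_demote_commands NODE...
# Prints (does not run) the commands that turn the named nodes into workers.
# Review the order carefully: demoting managers before the new managers are
# in place reduces the number of managers available for quorum.
print_demote_commands() {
    for node in "$@"; do
        echo "docker node demote $node"
    done
}
```

For example, print_drained_manager_commands mgr1 mgr2 mgr3 prints six commands, two for each of the three dedicated manager nodes.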

Additional Deployment Options

The following diagrams show additional High Availability deployment architectures that are supported for Skylar Automation.

Cross-Data Center Swarm Configuration

Docker Swarm requires three data centers to maintain quorum of the swarm in the event of a full data center outage. The data centers must be connected to one another by low-latency links.

Implementing clustering across links with a latency that is greater than 80 ms is not supported, and may cause one or more of the following situations: nodes dropping out of the cluster or automatically failing over, failed data replication, and cluster communication issues resulting in timeouts and significantly increased overhead.
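One way to sanity-check the 80 ms limit is to measure the average round-trip time between data centers with ping and compare it to the threshold. This is a minimal sketch: the function names are hypothetical, it assumes the Linux ping summary line format (rtt min/avg/max/mdev = ...), and a short ping sample is only an approximation of link latency under load.

```shell
#!/bin/sh
# Threshold from the text above: links slower than 80 ms are not supported.
MAX_LATENCY_MS=80

# avg_rtt_from_ping
# Reads ping output on stdin and prints the average round-trip time in ms.
# Splitting the summary line on "/" puts the average in the fifth field.
avg_rtt_from_ping() {
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# latency_supported RTT_MS
# Succeeds (exit 0) when RTT_MS is at or below the supported maximum.
latency_supported() {
    awk -v rtt="$1" -v max="$MAX_LATENCY_MS" 'BEGIN { exit !(rtt + 0 <= max + 0) }'
}
```

For example, from a node in one data center: latency_supported "$(ping -c 5 <remote node IP> | avg_rtt_from_ping)" && echo "within limit"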

The cross-data center configuration has the following limitation: the Redis service cannot be deployed in High Availability. As a result, all task results saved by any steprunner will have to be saved within that data center. Upon a failure of that data center, a new Redis service will be created, but an application in the middle of its run would have to retry.

The following High Availability deployment shows a cross-data center swarm configuration:

Additional Notes

Tagging and constraints in the Docker compose file should be used to ensure proper placement. Example compose files are not available at this time.

Configuration management solutions such as Ansible should be used to update and manage large swarm environments.

For an easy upgrade of Skylar Automation, use Docker Hub to pull the latest images or use an internal Docker registry.

Requirements Overview

Because Skylar Automation uses the Docker Swarm tool to maintain its cluster and automatically re-balance services across nodes, ScienceLogic strongly recommends that you implement the following best practices from Docker, Couchbase, and RabbitMQ. The topics in this section describe those best practices, along with requirements and frequently asked questions.

IMPORTANT: To support automatic failover of the Couchbase database without manual intervention, you must set up at least three nodes for automatic failover of a single node, five nodes for automatic failover of two nodes, and so on.

For a clustered Skylar Automation environment, you must install the Skylar Automation RPM on every server that you plan to include in the cluster. You can load the Docker images for the services onto each server locally by running /opt/iservices/scripts/pull_start_iservices.sh. Installing the RPM onto each server ensures that the Skylar Automation containers and necessary data are available on all servers in the cluster. For more information, see Installing Skylar Automation via RPM.

You can use a command-line utility called skyautocontrol (skyautoctl) that performs multiple administrator-level actions on either the node or the cluster. You can use this script to automate the configuration of a three-node cluster. For more information, see Automating the Configuration of a Three-Node Cluster.

Docker Swarm Requirements for High Availability

After implementing Docker Swarm High Availability, if a node goes down, all the services on that failed node can be dynamically re-provisioned and orchestrated among the other nodes in the cluster. High Availability for Swarm also facilitates network connections with the various other High Availability components.

Docker Swarm requires the following:

  • The cluster contains at least three nodes running as managers. With three nodes, there can be a quorum vote between managers when a node is failed over.
  • A load balancer with a virtual IP running in front of all nodes in the cluster. The load balancer distributes user interface requests among the hosts on ports 443 (HTTPS), 3141 (Devpi), and 5556 (Dex), so that requests are still served if one of the hosts fails.

An example of why a load balancer is needed in front of the virtual IP is the ServiceNow ticketing workflow. If you’re only directing the request to a single node and that node goes down, your ticketing will stop even if the other Skylar Automation nodes are still up and functional. The load balancer will account for the downed node and automatically route to the other nodes in the cluster.

For more information, see the Docker High Availability Documentation.

What happens if I use three nodes and two of the nodes fail?

Docker fault tolerance is limited to one failure in a three-node cluster. If more than one node goes down in a three-node cluster, automatic High Availability and failover cannot be guaranteed, and manual intervention may be required. Adding more nodes is the only way to increase the fault tolerance.

In the event of a two out of three failure, after you perform manual failover actions, the Skylar Automation system will be back up and running.

For more information about the manual failover steps, see the Failover section.
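The one-failure limit for three nodes follows from the majority rule that Swarm managers use for quorum. As an illustration (the function names are hypothetical), the arithmetic can be sketched as:

```shell
#!/bin/sh
# swarm_quorum MANAGERS
# Prints the number of managers that must be reachable to keep a majority:
# floor(N / 2) + 1.
swarm_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# swarm_fault_tolerance MANAGERS
# Prints how many managers can fail while a majority remains:
# floor((N - 1) / 2).
swarm_fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}
```

With three managers the quorum is two and the tolerance is one failure. A fourth manager does not raise the tolerance, which is why odd manager counts (3, 5, 7) are normally used.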

Couchbase Database Requirements for High Availability

Couchbase High Availability ensures that no application, configuration, or step data from the Skylar Automation system will be lost in the event of a node failure.

To support automatic failover, Couchbase requires at least three nodes in the high availability cluster.

Each node will have an independent and persistent storage volume that is replicated throughout the cluster. Alternatively, shared storage can be used instead of independent persistent volumes. This replication ensures that data is replicated in all places, and if a single node goes down, no data will be lost.

For more information, see the Couchbase documentation.

What if I have three nodes and two of them fail?

In the event of a failure of two out of three nodes, no data will be lost, because the data is being replicated.

If multiple Couchbase data nodes go down at the same time, automatic failover might not occur, because not enough nodes remain to form the quorum needed to fail over. You will then need to perform manual failover steps. After you perform these manual actions, the Skylar Automation system will be operational again. For more information about the manual failover steps, see the Failover section.

RabbitMQ Clustering and Persistence for High Availability

Implementing RabbitMQ High Availability ensures that if any integrations or tasks are waiting in the Rabbit queue, those tasks will not be lost if a node containing the Rabbit queue fails.

NOTE: You can switch between the single-node and cluster options at any time during deployment.

RabbitMQ clustering requires a Docker Swarm configuration with multiple nodes. For more information, see Configuring Docker Swarm.

As a best practice for security, enable the user interface only temporarily during cluster configuration.

RabbitMQ Option 1: Persisting Queue to Disk on a Single Node (Default Configuration)

With this configuration, the Skylar Automation queue runs on a single node, and the queue is persisted on disk. As a result, if the Skylar Automation stack is removed and re-deployed, no messages are lost during the downtime. Any messages that exist in the queue before the stack is stopped continue to exist after the stack is re-deployed.

Potential Risks and Mitigations

Because the queue runs on a single node, if that node fails, then the queue and its related data might be lost.

You can mitigate data loss by persisting the queues on your choice of network shared storage, so that if the queue fails on one node, the queue and its messages can be brought back up on another node.

Requirements/Setup (Enabled by Default)

  • You must define a static hostname for the RabbitMQ host in the docker-compose file. The default is rabbit_node1.isnet.
  • You must mount a volume to /var/lib/rabbitmq in the docker-compose file.

Example docker-compose Definition

rabbitmq:
  image: sciencelogic/is-rabbit:3.7.7-1
  hostname: rabbit_node1.isnet
  volumes:
    - "rabbitdb:/var/lib/rabbitmq"
  networks:
    isnet:
      aliases:
        - rabbit
        - rabbit_node1.isnet

RabbitMQ Option 2: Clustering Nodes with Persistent Queues on Each Node

This configuration lets multiple nodes join a RabbitMQ cluster. When you include multiple nodes in the RabbitMQ cluster, all queue data, messages, and other necessary information are automatically replicated and persisted on all nodes in the cluster. If any node fails, then the remaining nodes in the cluster continue maintaining and processing the queue.

Because the RabbitMQ cluster includes disk-persisted queues, if all nodes in the Rabbit cluster fail, or if the service is removed entirely, then no data loss should occur. Upon restart, the nodes will resume with the same cluster configuration and with the previously saved data.

If you include multiple nodes in a RabbitMQ cluster, Skylar Automation automatically applies an HA policy of all-node replication, with retroactive queue synchronization disabled. For more information, refer to the RabbitMQ documentation.

Potential Risks and Mitigations

If you create a Docker Swarm cluster with only two nodes, the cluster might stop functioning if a single node fails. To prevent this situation, include at least three nodes in each cluster.

Requirements/Setup

For a Docker Swarm configuration with multiple independent nodes:

  • Each of the two RabbitMQ services must be "pinned" to its own node. See the Example Compose Definition below.
  • You must add a new RabbitMQ service to the docker-compose.yml file. This new service should have a hostname and alias following the designated pattern. The designated pattern is: rabbit_nodex.isnet, where x is the node number. This configuration supports up to 20 clustered nodes by default.
  • After you update the docker-compose.yml file, the nodes will auto-cluster when you perform a deployment.
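To keep hostnames and aliases consistent with the rabbit_nodex.isnet pattern when you add services, you can generate the names rather than type them. This is a small sketch with a hypothetical function name. It uses the default limit of 20 clustered nodes mentioned above.

```shell
#!/bin/sh
# Default limit stated in the text above.
MAX_RABBIT_NODES=20

# rabbit_hostnames COUNT
# Prints one hostname per line following the rabbit_nodex.isnet pattern,
# where x runs from 1 to COUNT. Fails if COUNT exceeds the default limit.
rabbit_hostnames() {
    count=$1
    if [ "$count" -gt "$MAX_RABBIT_NODES" ]; then
        echo "more than $MAX_RABBIT_NODES nodes requested" >&2
        return 1
    fi
    i=1
    while [ "$i" -le "$count" ]; do
        echo "rabbit_node${i}.isnet"
        i=$(( i + 1 ))
    done
}
```

For a three-node cluster, rabbit_hostnames 3 prints rabbit_node1.isnet, rabbit_node2.isnet, and rabbit_node3.isnet, one per line.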

Example Code: docker-compose Definition of Two Clustered Rabbit Services

Checking the Status of a RabbitMQ Cluster

This section contains commands and additional resources for administering your clusters.

To check the status of your clustered RabbitMQ environment:

  1. Run docker ps and locate the iservices_rabbit container.
  2. Run the following command on the RabbitMQ container:

    docker exec -it [container_id] /bin/bash

You can run the following commands for more information:

  • rabbitmqctl cluster_status. Returns information about the current cluster status, including nodes in the cluster, and failed nodes.
  • rabbitmqctl list_policies. Returns information about current policies. Ensure that the ha-all policy is automatically set for your cluster.

For additional cluster-related administrative commands, see the RabbitMQ Cluster Management documentation page.
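If you check the cluster often, the steps above can be wrapped in a small helper that finds the container and runs both rabbitmqctl commands without an interactive shell. This is a sketch with hypothetical function names. It assumes that the container name contains iservices_rabbit, as in step 1, and that it runs on a node that hosts a RabbitMQ container.

```shell
#!/bin/sh
# rabbit_container_id
# Prints the ID of the first running container whose name matches
# "iservices_rabbit" on the local node.
rabbit_container_id() {
    docker ps --filter "name=iservices_rabbit" --quiet | head -n 1
}

# rabbit_cluster_report
# Runs the two status commands from this section inside that container.
rabbit_cluster_report() {
    id=$(rabbit_container_id)
    if [ -z "$id" ]; then
        echo "no iservices_rabbit container found on this node" >&2
        return 1
    fi
    docker exec "$id" rabbitmqctl cluster_status
    docker exec "$id" rabbitmqctl list_policies
}
```

Call rabbit_cluster_report on a node that runs a RabbitMQ container, and check that the ha-all policy appears in the list_policies output.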

Preparing the Skylar Automation System for High Availability

You need to prepare your Skylar Automation system in the following ways before configuring the High Availability solution:

  1. Make sure that your Skylar Automation system has been updated with yum upgrade.

  2. Run the following commands to open up the proper firewall ports for Docker Swarm on each swarm node:

    firewall-cmd --add-port=2376/tcp --permanent

    firewall-cmd --add-port=2377/tcp --permanent

    firewall-cmd --add-port=7946/tcp --permanent

    firewall-cmd --add-port=7946/udp --permanent

    firewall-cmd --add-port=4789/udp --permanent

    firewall-cmd --add-protocol=esp --permanent

    If your system is fully yum-updated, you only need to run the following commands:

    firewall-cmd --add-service docker-swarm --permanent

    firewall-cmd --reload

    systemctl restart docker

    To view a list of all ports, run the following command: firewall-cmd --list-all

  3. Make sure that the /etc/iservices/is_pass and /etc/iservices/encryption_key are identical on all clustered nodes.
  4. Make sure that NTP is properly configured on all nodes:
  • Edit the /etc/chrony.conf file to add NTP servers. If you want to use the pool.ntp.org NTP servers, remove the .ol. from the domain names.

  • Enable chronyd by running the following commands:

    systemctl start chronyd

    systemctl enable chronyd

    timedatectl #ensure ntp is enabled is yes and ntp sync is yes
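To confirm that is_pass and encryption_key are identical on all nodes without displaying their contents, you can compare checksums. A minimal sketch follows. The function name is hypothetical, and it assumes that sha256sum is installed and that you run it with permission to read the files.

```shell
#!/bin/sh
# secret_fingerprints [DIR]
# Prints the SHA-256 hash of is_pass and encryption_key (one per line) from
# DIR, which defaults to /etc/iservices. Only the hash column is printed so
# that output collected from different nodes can be compared directly.
secret_fingerprints() {
    dir=${1:-/etc/iservices}
    sha256sum "$dir/is_pass" "$dir/encryption_key" | awk '{ print $1 }'
}
```

Run secret_fingerprints on each node (for example, over SSH) and verify that every node prints the same two lines.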

Troubleshooting Ports and Protocols

If you have trouble with cluster configuration, make sure that all of the following ports and protocols are enabled between each of the nodes in the Skylar Automation cluster within your network:

  • 2376/tcp
  • 2377/tcp
  • 7946/tcp
  • 7946/udp
  • 4789/udp
  • protocol=esp (IP Protocol 50)

Firewall rules are automatically configured within the Skylar Automation operating system. Failing to allow the required ports between nodes blocks Skylar Automation clustering and networking.

Additionally, when using a load balancer, ensure that each of the Skylar Automation cluster nodes is able to communicate with the load balancer listening ports (443, 5556, 3141).
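A quick way to test the TCP ports in the list above from one node to another is to attempt a connection with bash's /dev/tcp feature. This is only a sketch: the function names are hypothetical, it assumes that bash and the timeout utility are installed, and it cannot verify the UDP ports or the ESP protocol.

```shell
#!/bin/sh
# TCP ports from the list above. 7946/udp, 4789/udp, and ESP are not covered.
SWARM_TCP_PORTS="2376 2377 7946"

# tcp_port_open HOST PORT
# Succeeds (exit 0) if a TCP connection to HOST:PORT opens within 3 seconds.
tcp_port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_swarm_tcp_ports HOST
# Prints one line per TCP port stating whether HOST accepted the connection.
check_swarm_tcp_ports() {
    host=$1
    for port in $SWARM_TCP_PORTS; do
        if tcp_port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port unreachable"
        fi
    done
}
```

Run check_swarm_tcp_ports <other node IP> from each node in turn. A port is reported as unreachable both when a firewall blocks it and when nothing is listening on it yet.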

Configuring Clustering and High Availability

This section describes how to configure clustering and High Availability with Docker Swarm and the Couchbase database, using three or more nodes.

This topic assumes you are using Skylar Automation ISOs for each node, which includes an initial Docker Swarm node configuration. The use of the Skylar Automation ISO is not required, however. You could instead deploy another node (without using the Skylar Automation ISO) and configure a Linux operating system based on Red Hat. You could then add that system to the swarm.

When configuring a three-node clustered environment, you can set the OPEN_SECONDARY_CB_PORTS configuration variable to "true" to expose Couchbase secondary ports through the main node IP or host name. You can set this configuration variable as a GUI environment variable in the docker-compose.yml file, or you can set it in the isconfig.yml file on the host. If OPEN_SECONDARY_CB_PORTS is set to "true", the GUI service exposes the Couchbase secondary ports in the compose file. The autocluster cluster-action in the skyautocontrol (skyautoctl) utility was updated to automatically expose Couchbase secondary ports when creating a three-node clustered environment.

For more information about troubleshooting issues with clustering, see Troubleshooting Clustering and Node Failover.

Automating the Configuration of a Three-Node Cluster

You can use the skyautocontrol (skyautoctl) command-line utility to perform multiple administrator-level actions on your Skylar Automation cluster. You can use the autocluster action with the skyautocontrol (skyautoctl) command to automate the configuration of a three-node cluster.

If you are using another cluster configuration, the deployment process should be manual, because the skyautocontrol (skyautoctl) utility only supports the automated configuration of a three-node cluster.

The autocluster action will completely reset and remove all data from the system. When you run this action, you will get a prompt verifying that you want to run the action and delete all data.

To automate the configuration of a three-node cluster, run the following command:

skyautoctl --host <sa_host1> <username>:<host_password> --host <sa_host2> <username>:<host_password> --host <sa_host3> <username>:<host_password> autocluster

For example:

skyautoctl --host 192.11.1.1 isadmin:passw0rd --host 192.11.1.2 isadmin:passw0rd --host 192.11.1.3 isadmin:passw0rd autocluster

Running this command will configure your Skylar Automation three-node cluster without any additional manual steps required.
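Because the autocluster action deletes all data, it can help to assemble and review the full command before you run it. The sketch below builds the command string in the format shown above and only prints it. The function name is hypothetical, and it assumes that the same username and password apply to all three hosts.

```shell
#!/bin/sh
# build_autocluster_command USER PASSWORD HOST1 HOST2 HOST3
# Prints (does not run) the skyautoctl autocluster command for three hosts.
# Fails unless exactly three hosts are given, because the utility only
# automates three-node clusters.
build_autocluster_command() {
    user=$1
    password=$2
    shift 2
    if [ "$#" -ne 3 ]; then
        echo "autocluster expects exactly three hosts" >&2
        return 1
    fi
    cmd="skyautoctl"
    for host in "$@"; do
        cmd="$cmd --host $host $user:$password"
    done
    echo "$cmd autocluster"
}
```

For example, build_autocluster_command isadmin passw0rd 192.11.1.1 192.11.1.2 192.11.1.3 prints the same command as the example above.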

You can use the generate_haproxy_config cluster-action in the skyautocontrol (skyautoctl) utility to create an HAProxy configuration template that lets you easily set up an HAProxy load balancer for a three-node cluster.

For example: 

skyautoctl --host <host_IP_1> user:host_password --host <host_IP_2> user:host_password --host <host_IP_3> user:host_password cluster-action --action generate_haproxy_config

or

skyautoctl --config skyautoctl.yml cluster-action --action generate_haproxy_config

For more information about other actions you can perform with the skyautocontrol (skyautoctl) utility, see Using the skyautocontrol Command-line Utility.

Configuring Docker Swarm

To configure Docker Swarm for clustering (three or more nodes) and High Availability:

Two-Node High Availability is not possible because Docker Swarm requires an odd number of nodes (3+) for quorum and consensus.

  1. If you do not already have Skylar Automation running in your environment, install Skylar Automation on a single node. Doing this creates a single-node Docker Swarm manager.

  2. Ensure that NTP is configured on all swarm nodes. For more information, see Preparing Skylar Automation System for High Availability.

  3. SSH to the Docker Swarm manager (leader) and run the following command to retrieve the join token. Make note of the token, because you will need it to join a node to the swarm in step 4, below:

    docker swarm join-token manager

  4. Run the following commands on each Docker Swarm node that you want to join to the cluster:

    docker swarm init

    docker swarm join --token <join token> <swarm manager ip>:<port>

    where <join token> is the value from step 3. For example:

    docker swarm join --token SWMTKN-1-5e8skxby61cthkfkv6gzhhil89v0og2m7lx014tvvv42n7m0rz-an0fdam5zj0v7d471co57d09h 10.7.3.21:2377

  5. Run the following command to verify that the nodes have been added:

    docker node ls

  6. If you are using local images and not connecting to Docker Hub, load the Docker images on the other swarm nodes:

    for image in /opt/iservices/images/*; do docker load -i "$image"; done
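After the nodes have joined, you can confirm that the swarm has the three managers that High Availability requires by counting the manager entries in docker node ls. This is a minimal sketch with hypothetical function names. Run it on a manager node, because worker nodes cannot list the swarm.

```shell
#!/bin/sh
# count_swarm_managers
# Prints the number of nodes that currently hold the manager role.
count_swarm_managers() {
    docker node ls --filter "role=manager" --quiet | wc -l | tr -d ' '
}

# swarm_has_ha_managers
# Succeeds (exit 0) when at least three managers are present.
swarm_has_ha_managers() {
    [ "$(count_swarm_managers)" -ge 3 ]
}
```

For example: swarm_has_ha_managers && echo "three or more managers" || echo "fewer than three managers"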

Configuring the Couchbase Database

To add a Couchbase worker node:

  1. In the docker-compose-override.yml file, add the following line to constrain the Couchbase container to a single Docker Swarm node at the bottom of the couchbase section:

    couchbase:
      ...
      hostname: couchbase.isnet
      deploy:
        placement:
          constraints:
            - node.hostname == <name of Docker Swarm node>
      networks:
        isnet:
          aliases:
            - couchbase
            - couchbase.isnet
      environment:
        db_host: couchbase.isnet

  2. Add the couchbase-worker and couchbase-worker2 sections. Set deploy > replicas on the workers to 0:

    couchbase-worker:
     image: repository.auto.sciencelogic.local:5000/is-couchbase:feature-INT-1208-HA-IS-Services
     container_name: couchbase-worker.isnet
     volumes:
       - "/var/data/couchbase:/opt/couchbase/var"
     deploy:
       placement:
         constraints:
           - node.hostname == <name of Docker Swarm node>
     networks: 
       isnet:
         aliases:
           - couchbase-worker
           - couchbase-worker.isnet
     hostname: couchbase-worker.isnet
    
     ports:
       - "8095:8091"
     secrets:
       - is_pass
       - encryption_key
     ulimits:
       nofile: 80000
       core: 100000000
       memlock: 100000000
     environment:
       TYPE: 'WORKER'
       AUTO_REBALANCE: 'true'
       db_host: 'couchbase'

NOTE: This deployment makes the Couchbase worker user interface available on port 8095 of the Docker Swarm stack. If the master node goes down, or if the primary Couchbase user interface is not available on port 8091, you can still access the secondary Couchbase user interface through port 8095.

  3. Add couchbase-worker and couchbase-worker2 to the db_host setting for contentapi:

    contentapi:
    ...
      environment:
        ...
        db_host: 'couchbase,couchbase-worker,couchbase-worker2'

  4. Make sure that all db_host variables in the docker-compose file use the following format:

    db_host: 'couchbase,couchbase-worker,couchbase-worker2'

  5. If you are using the override file, run the /opt/iservices/compose_override.sh script to validate and update the docker-compose.yml file with your changes.

  6. Deploy the stack with only the Couchbase node by setting the replicas on couchbase-worker to 1 and running the following command:

    docker stack deploy -c <location of compose file> iservices

  7. After the two-node Couchbase cluster has been successfully deployed and the secondary indexes have been added, set the replicas on couchbase-worker2 to 1 and run the following command:

    docker stack deploy -c <location of compose file> iservices

  8. Set the replicas in the docker-compose-override.yml file as well.

  9. After the second worker is added, set the number of replicas to "2" on each bucket (content and logs) in the Couchbase Administrator user interface and click Save Changes.

  10. Rebalance the cluster by navigating to the Servers section of the Couchbase Administrator user interface and clicking the Rebalance button.
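
The db_host format described in the procedure above can be checked before you deploy the stack. The following is a minimal sketch under stated assumptions: check_db_host is a hypothetical helper, not a Skylar Automation utility, and it simply flags every db_host line in a compose file that does not list all three Couchbase nodes so that you can review it.

```shell
# Print every db_host line that does not list all three Couchbase nodes.
# Returns 0 when every db_host line matches, 1 otherwise.
# check_db_host is a hypothetical helper name used for illustration.
check_db_host() {
  local file="$1"
  local expected="db_host: 'couchbase,couchbase-worker,couchbase-worker2'"
  local bad
  # grep -n keeps the line numbers so that mismatches are easy to find.
  bad=$(grep -n 'db_host:' "$file" | grep -vF "$expected" || true)
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}

# Usage (not executed here):
#   check_db_host /opt/iservices/scripts/docker-compose.yml
```

The helper only reports lines; whether a flagged line needs to change is still a decision for the administrator.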

Code Example: docker-compose-override.yml

Skylar Automation uses a docker-compose-override.yml file to persistently store user-specific configurations for containers, such as proxy settings, replica settings, additional node settings, and deploy constraints. The user-specific changes are kept in this file so that they can be re-applied when the /opt/iservices/scripts/docker-compose.yml file is completely replaced on an RPM upgrade, ensuring that no user-specific configurations are lost. By default, only the main core services are included in the docker-compose-override.yml file. If you need extra services, add them to the file as needed.

If you are running Skylar Automation in a cluster, these files should always be the same between all manager nodes. With this in place, if any manager node dies, you can re-deploy with the same settings from any other manager node.

The following section includes a complete example of the /opt/iservices/scripts/docker-compose-override.yml file for a three-node Couchbase and RabbitMQ clustered deployment:

NOTE: If shared volumes are available in the cluster, the deploy placement can be omitted and removed.

Scaling iservices_contentapi

To scale out iservices_contentapi to distribute the service across the three nodes, run the following command:

docker service scale iservices_contentapi=3

Single Manager Failure - Automatic Failover

When one node in the cluster fails (Node1, Node2, or Node3), the remaining nodes maintain quorum and failover happens automatically.

  1. Log in to Skylar Automation.

  2. Access the Couchbase Administrator user interface using one of the Couchbase Node Ports that is still up:

    • Couchbase: 8091

    • Couchbase Worker 1: 8100 or Couchbase Worker 2: 8100

    https://<IP of Skylar Automation>:Port

  3. Once the downed node comes back, it should rejoin the cluster automatically. If it does not, and needs to be reset, follow the steps below:

    1. Force the node to leave the swarm by running the following command on that node:

    docker swarm leave --force

    2. Get the swarm join token from one of the other nodes.

    3. Join the node to the swarm again, as described in Configuring Docker Swarm.

In a three-node cluster, a single failed node will be automatically removed. You will still need to perform a re-balance.
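
The quorum behavior described in this section follows from majority voting among Docker Swarm managers: a swarm with N managers keeps quorum as long as more than half of them are reachable. The following minimal sketch computes how many manager failures a cluster can tolerate. The tolerated_failures function name is hypothetical.

```shell
# Print how many manager nodes can fail while the swarm keeps quorum.
# A majority of managers must stay online, so the result is (N - 1) / 2,
# rounded down. tolerated_failures is a hypothetical helper name.
tolerated_failures() {
  local managers="$1"
  echo $(( (managers - 1) / 2 ))
}

# Usage:
#   tolerated_failures 3
```

With three managers the result is 1, which is why a three-node cluster survives a single node failure but requires manual recovery when two nodes are lost.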

Manual Failover

If you have a cluster with three or more nodes that is not configured with automatic failover, you must perform the following manual failover steps.

NOTE: If you can access the Couchbase Administrator user interface (Couchbase: 8091, Couchbase Worker 1: 8100, or Couchbase Worker 2: 8100) on the node that is still running, you can simply click the Failover button in the Couchbase Administrator user interface instead of manually running the couchbase-cli commands below.

In a three-node cluster, a single failed node will be automatically removed. You will still need to perform a re-balance.

Initiating Manual Failover

To initiate a manual failover and promote the only Couchbase node that is up:

  1. Log in to the Docker Swarm node where the node that is running resides.

  2. Run the following command on that node to see which node IDs exist:

    docker node ls

If this command fails with the error message "The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.", run the following command to re-create the cluster from the current node. This command restarts the running services. If you ran this command, wait until the Couchbase container starts, and then go to step 4.

docker swarm init --force-new-cluster

  3. Remove any failed manager nodes from the cluster by running the following Docker command:

    docker node rm <failed node id>

  4. Run the following command to identify the Container ID of the running Couchbase container:

    docker ps

  5. Connect to the Docker container that is still up:

    docker exec -u root -i -t $(docker ps -q -n 1 --filter name=iservices_couchbase) /bin/bash

  6. Use the instance of Couchbase that is up by running the following command:

    couchbase-cli server-list -c <operating-couchbase-node> -u isadmin -p <password>

    where <operating-couchbase-node> could be one of the following:

    • couchbase.isnet
    • couchbase-worker.isnet
    • couchbase-worker2.isnet

    and the password is the Skylar Automation Administrator user interface password.

  7. The output of the previous command shows one or two failed nodes. Copy the Couchbase node names for use in step 8.

  8. Use the name of the currently running node (operating-couchbase-node) and the names of the failed nodes to run the following failover command:

    couchbase-cli failover -c <operating-couchbase-node>:8091 -u isadmin -p <password> --server-failover <failed-couchbase-node>:8091 --force

    For example, if the operating node is couchbase-worker, and the isnet name:port of the failed service is couchbase.isnet:8091, then the command would be:

    couchbase-cli failover -c couchbase-worker:8091 -u isadmin -p <password> --server-failover couchbase.isnet:8091 --force

    If the command fails and suggests using the --hard argument, then the command would be:

    couchbase-cli failover -c couchbase-worker:8091 -u isadmin -p <password> --server-failover couchbase.isnet:8091 --force --hard

    Execute the command above for both failed nodes. If the command returns an error, run the following commands to send both failed nodes as part of the command. This is likely to happen if both unhealthy nodes went down at the same time.

    couchbase-cli failover -c <operating-couchbase-node>:8091 -u isadmin -p <password> --server-failover <failed-couchbase-node>:8091,<second-failed-couchbase-node>:8091 --force

    couchbase-cli failover -c <operating-couchbase-node>:8091 -u isadmin -p <password> --server-failover <failed-couchbase-node>:8091,<second-failed-couchbase-node>:8091 --force --hard

  9. Rebalance the cluster using the functioning container name:

    couchbase-cli rebalance -c <operating-couchbase-node>:8091 --server-remove <failed-couchbase-node>:8091 --server-remove <second-failed-couchbase-node>:8091 -u isadmin -p <password>

  10. Remove and recreate the indexes to avoid issues:

    couchcontrol -c <operating-couchbase-node> index remove-secondary

    couchcontrol -c <operating-couchbase-node> index create-secondary -f /tmp/scripts/couchbase_index.json -b

For example, if the operating node is couchbase-worker, then the commands would be:

couchcontrol -c couchbase-worker index remove-secondary

couchcontrol -c couchbase-worker index create-secondary -f /tmp/scripts/couchbase_index.json -b

  11. Scale down the unhealthy and inactive Couchbase services that were just removed from the cluster:

    docker service scale iservices_<couchbase-failed-node>=0

    docker service scale iservices_<second-couchbase-failed-node>=0

  12. If the contentapi service is in a waiting state, restart the Couchbase node that was just promoted to reset the connection and resolve the API waiting state. Run the following command:

    docker service update --force iservices_<couchbase-node-name>

  13. Force the Dex server service to restart:

    docker service update --force iservices_dexserver

  14. Log in to the Skylar Automation user interface and validate that your data still exists.

Some documents may be lost. If Couchbase lost its quorum, multiple documents, including the scheduler document, might be lost. The applications that were still queued to run should still be able to run if the applications and their configurations were not affected.

  15. Go to the Couchbase user interface, which should be available on one of the following ports, depending on the active node: 8091, 8100, or 8101. If the removed nodes are still there and waiting for a rebalance action, click the Rebalance button.
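
The couchbase-cli failover command used earlier in this procedure takes a comma-separated list of failed nodes, which is easy to mistype. The following minimal sketch only builds and prints the command so that you can review it before running it inside the Couchbase container. The build_failover_cmd function name is hypothetical, and the <password> placeholder is left for you to replace.

```shell
# Print (but do not run) a couchbase-cli failover command.
# First argument: the operating Couchbase node.
# Remaining arguments: one or more failed Couchbase nodes.
# build_failover_cmd is a hypothetical helper name used for illustration.
build_failover_cmd() {
  local operating="$1"
  shift
  local failed=""
  local node
  for node in "$@"; do
    # Append ":8091" to each node and join the entries with commas.
    failed="${failed:+$failed,}${node}:8091"
  done
  printf 'couchbase-cli failover -c %s:8091 -u isadmin -p <password> --server-failover %s --force\n' \
    "$operating" "$failed"
}

# Usage:
#   build_failover_cmd couchbase-worker couchbase.isnet couchbase-worker2.isnet
```

Printing the command instead of executing it keeps the decision to fail over a node, and the choice of whether to add --hard, with the operator.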

Recovering a Docker Swarm Node

If a node does not join the swarm automatically, follow the steps below to recover a Docker Swarm node:

  1. Restart the node.

  2. If manual failover actions were taken while this node was offline, run the following command to force the node to leave the swarm now that the node is back online:

    docker swarm leave --force

  3. Follow the steps in Configuring Docker Swarm to add the node back to the existing swarm by obtaining the join-token from the manager.

Restoring a Couchbase Node

You should take the restoration actions in this topic only after a manual or automatic failover has been performed and the node has been completely removed from the cluster (the node should not be visible in the user interface or in the server-list output).

If the logs bucket has more than 10,000 documents, the rebalance actions could take more time when the new nodes rejoin the cluster. To avoid this, if the logs documents are not critical, you can flush the logs bucket in the Couchbase user interface.

To restore the failed Couchbase node:

  1. Log in to the Couchbase Administrator user interface using the port of the node that is still up (8091 or 8100).

  2. If the Docker Swarm node was restored and not rebuilt, remove files from the old container:

    rm -rf /var/data/couchbase/*

    docker volume rm iservices_tmp_couchbase # only for MUD environments

    docker service scale iservices_couchbase=1

    A new node is added to the Couchbase cluster. If the rebalance environment variable is set, the balancer process will start automatically. If not, click the Rebalance button in the Couchbase user interface so the reset node can be added to the cluster.

    If the server is not completely removed from the cluster and is just waiting to be added back, you may do so using the Couchbase user interface, or by running healthcheck and autoheal actions with the skyautocontrol (skyautoctl) command-line utility.

  3. After all nodes in a cluster are running, be sure to perform healthcheck and autoheal actions with the skyautocontrol (skyautoctl) command-line utility to re-validate the cluster and reset configurations such as replication and index counts. For more information, see healthcheck and autoheal.

If the master node goes down, the SyncPacks for the Skylar Automation system might not display. This is because the pypiserver is constrained by default to one master node, so it does not start on workers if that master node goes down. To address this issue after completing the failover steps, above, you can re-import the SyncPacks.

If two Couchbase nodes were reset, documents from the database, such as applications, configurations, the scheduler, and others, may be lost. ScienceLogic recommends restoring from a recent backup to have all documents in place.

Restoring RabbitMQ

RabbitMQ nodes automatically join the cluster and sync data, but if big workloads were desynchronized, there could be some issues and the unhealthy RabbitMQ nodes may need to be reset.

The skyautocontrol (skyautoctl) healthcheck executes the actions below, but ScienceLogic also recommends checking manually:

  1. Log in to the RabbitMQ user interface and check that all the nodes are clustered together.
  2. Go to the Queues tab and check that the queues for the application are synchronized. If there is a red "+1" in the Node list, open that queue and click the Synchronize button.
  3. If the synchronization above fails, try clearing out the volumes of the nodes that fail to synchronize. For more information, see the Troubleshooting SL1 Skylar Automation chapter in the SL1 Skylar Automation Platform manual.
  4. Run skyautocontrol (skyautoctl) healthcheck and autoheal to make sure the system is healthy and has the corresponding configurations. For more information, see healthcheck and autoheal.

Additional Configuration Information

Load Balancer Recommended Settings

Configurations to Improve Load Balancer Compatibility

You can use the following configurations in the /etc/iservices/isconfig.yml file to improve load balancer compatibility if the load balancer sends requests to the client in proxy protocol format, as AWS ELB does:

  • LOAD_BALANCED: true. Setting this value to true specifies that the load balancer will send requests to the client in proxy protocol format. This value is false by default.

  • RATE_LIMITED. Setting this value to true enables rate limiting. This value is false by default.

  • RATE_LIMIT_REQUESTS_PER_SECOND. This value specifies the number of requests per second allowed by rate limiting. The default is '50'.

  • RATE_LIMIT_BURST. This value specifies the rate limit burst. The default is '100'.

You will need to re-deploy the Skylar Automation stack for any changes to the docker-compose.yml file to take effect.

In addition, the exposed ports in the docker-compose.yml file are set to mode: host to let Skylar Automation capture the proper client IP address of the requests being sent into Skylar Automation. This setting lets Skylar Automation set the proper rate limits and log transactions. This setting does not allow the use of the Swarm ingress; as a result, you will need to scale the gui container and place the container on the nodes that will receive ingress traffic.

Recommended Load Balancer Modes

Use TCP mode or HTTP mode, plus the recommended healthcheck endpoints listed below. ScienceLogic recommends that you use TCP instead of HTTP (which requires specific endpoints).

If you use HTTP mode, make sure that the SSL cipher configurations are in place to work with OpenSSL 1.1.1k FIPS ciphers. If needed, verify the ciphers by running openssl commands, such as openssl ciphers ..., against the Skylar Automation system.

Recommended HealthCheck Endpoints

When using HTTP mode with the load balancer, configure the following healthcheck endpoints to make sure that the Skylar Automation nodes are responding correctly.

PowerFlow 2.5.0 or later

This version includes healthcheck endpoints for easy verification.

https://pf-node:5556/healthcheck. Should respond with a "200" status code and an "ok" response.

or

https://pf-node:5556/dex/theme/styles.css. Should respond with a "200" status code.

https://pf-node/discovery. Should respond with a "200" status code and an "ok" response.

https://pf-node:15672/healthcheck. Should respond with a "200" status code and an "ok" response.

https://pf-node:8091/healthcheck. Should respond with a "200" status code and an "ok" response.

PowerFlow 2.4.1

https://pf-node:5556/dex/theme/styles.css. Should respond with a "200" status code.

https://pf-node/discovery. Should respond with a "200" status code.

https://pf-node:15672. Should respond with a "302" status code.

https://pf-node:8091. Should respond with a "301" status code.

cURL Commands

To verify that the Skylar Automation nodes can reach the load balancer, you can execute the following cURL commands from the Skylar Automation nodes to the load balancer, and from the load balancer to the Skylar Automation nodes:

curl https://IP:5556/dex/theme/styles.css. Should respond with a "200" status code.

curl https://IP/discovery. Should respond with a "200" status code.

curl https://IP:15672. Should respond with a "302" status code.

curl https://IP:8091. Should respond with a "301" status code.
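
The four cURL checks above can be combined into one script that compares each response code with the expected value. The following is a minimal sketch under stated assumptions: the expected_status and check_endpoints function names are hypothetical, and the -k option (skip certificate verification) is an assumption for systems that use self-signed certificates. Remove -k if your certificates are trusted.

```shell
# Map each endpoint suffix to the status code listed above.
# expected_status and check_endpoints are hypothetical helper names.
expected_status() {
  case "$1" in
    :5556/dex/theme/styles.css) echo 200 ;;
    /discovery)                 echo 200 ;;
    :15672)                     echo 302 ;;
    :8091)                      echo 301 ;;
    *)                          return 1 ;;
  esac
}

# Request every endpoint on the given host or IP address and report mismatches.
# Returns 0 when all endpoints answer with the expected code, 1 otherwise.
check_endpoints() {
  local host="$1"
  local suffix
  local actual
  local rc=0
  for suffix in :5556/dex/theme/styles.css /discovery :15672 :8091; do
    # -s silences progress output; -o /dev/null discards the response body;
    # -w prints only the HTTP status code.
    actual=$(curl -k -s -o /dev/null -w '%{http_code}' "https://${host}${suffix}") || true
    if [ "$actual" != "$(expected_status "$suffix")" ]; then
      echo "FAIL https://${host}${suffix}: got ${actual}"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (not executed here):
#   check_endpoints <IP of load balancer or Skylar Automation node>
```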

Optimization Settings to Improve RabbitMQ Reclustering

To avoid a potential race condition between three RabbitMQ nodes, and to improve how the nodes recluster after the Skylar Automation stack is redeployed, ScienceLogic recommends setting the following configurations:

  • Skylar Automation RabbitMQ node 1 = 20-second grace period. The configuration stop_grace_period: 20s is required in the rabbitmq service definition in docker-compose.yml. For example:

      rabbitmq:
        stop_grace_period: 20s
  • Skylar Automation RabbitMQ node 2 = 10-second grace period. No configuration is required as this is the default value.

  • Skylar Automation RabbitMQ node 3 = 10-second grace period. No configuration is required as this is the default value.

In environments where the latency between nodes is higher, you can use the following settings:

  • Skylar Automation RabbitMQ node 1 = 30-second grace period. The configuration stop_grace_period: 30s is required in the rabbitmq service definition in docker-compose.yml.

  • Skylar Automation RabbitMQ node 2 = 20-second grace period. The configuration stop_grace_period: 20s is required in the rabbitmq service definition in docker-compose.yml.

  • Skylar Automation RabbitMQ node 3 = 10-second grace period. No configuration is required as this is the default value.

The stop_grace_period setting allows Docker Swarm to stop the container after the configured time, which is why the first node must have a longer grace period. For more information, see https://docs.docker.com/compose/compose-file/05-services/#stop_grace_period.

Optimization Settings to Improve Performance of Large-Scale Clusters

In large-scale clusters, one of the root causes of abnormal memory and CPU usage is inter-worker communication overhead from overly "chatty" workers and their event queues. You can completely disable inter-worker eventing to significantly reduce overhead on the queuing system and prevent the symptoms associated with abnormal memory usage.

Also, to improve the performance of large-scale clusters by default, the following optimization settings were added to the docker-compose.yml file for all workers in version 2.0.1 of the PowerFlow platform:

steprunner-<worker-x>:
  environment:
    additional_worker_args: "--max-tasks-per-child 1 --without-gossip --without-mingle"

In addition to the default optimization settings above, you can further reduce system overhead by adding the --without-heartbeat argument to additional_worker_args. Please note that this setting will reduce the memory and CPU utilization of the system, but it will come at the cost of preventing the Flower service from getting an accurate depiction of current worker states.

If you want to disable these new configuration settings, set the environment variable "disable_default_optimizations" to "True" for all workers.

Workers will continue to generate events for consumption by monitoring tools like Flower even with the new default configuration settings. In some extremely large clusters, you might want to completely disable worker eventing, especially if Flower is not in use. To completely disable worker eventing, set the environment variable "disable_events" to "True".

For more information, see https://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker-without-gossip

Additional suggestions for improving performance in large-scale clusters:

  • Assess the impact of using Flower before keeping it enabled for a long period of time. Running Flower can cause increased overhead on the RabbitMQ nodes, but the overhead is not significant initially. However, the overhead generated by Flower will continue to increase as more workers are added to the stack, and those workers send events to Flower.

  • ScienceLogic recommends that you monitor memory and queue utilization before and after running Flower with your current environment size to determine whether the extra overhead is worth the task information that Flower provides.

  • If a system event causes workers to restart, all of the workers might restart at the same time, every 0 seconds, which generates increased load on the system and makes it difficult for other services to start up. To prevent this, ScienceLogic recommends adding a restart delay (restart_policy > delay) to the workers to prevent a "rush" of hundreds of workers trying to reconnect over the network all at once. For example:

    steprunner-<worker-x>:
      deploy:
        restart_policy:
          delay: 30s

Exposing Additional Couchbase Cluster Node Management Interfaces over TLS

The is_gui container acts as a reverse proxy to the internal services and their individual management interfaces. This container is configured in /etc/nginx/conf.d/default.conf.

To expose the management interfaces of additional Couchbase nodes within a cluster:

  1. Copy the configuration from the gui container:

    docker cp <container id>:/etc/nginx/conf.d/default.conf ./

  2. Edit the configuration to include the desired services:

    server {
        listen       8092 ssl;
        server_name  couchbase-worker;

        location = / {
            return 301 https://$host:8092/ui/index.html;
        }

        location / {
            resolver 127.0.0.11 valid=5s;
            set $upstream couchbase-worker.isnet;
            proxy_pass http://$upstream:8092$request_uri;
            proxy_pass_header Server;
            proxy_pass_header Cache-Control;
            proxy_pass_header Content-Length;
            proxy_pass_header Connection;
            proxy_pass_header Pragma;
            proxy_pass_header ns-server-ui;
            proxy_pass_header invalid-auth-response;
        }

        ssl_certificate     /etc/iservices/is_cert.pem;
        ssl_certificate_key /etc/iservices/is_key.pem;
        ssl_protocols       TLSv1.2;
        ssl_ciphers         'ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256';
        ssl_prefer_server_ciphers on;
        ssl_session_cache shared:SSL:20m;
        ssl_session_timeout 180m;
        add_header Strict-Transport-Security "max-age=31536000" always;
    }

  3. Create the following Dockerfile:

    FROM sciencelogic/is_gui

    COPY ./default.conf /etc/nginx/conf.d/default.conf

  4. Build the container with the new configurations. The trailing period sets the build context to the current directory, which must contain the Dockerfile and the default.conf file:

    docker build -t <customer>/is_gui:<Skylar Automation version>-1 -f Dockerfile .

  5. Add the image name to the is_gui section in the docker-compose-override.yml file, and do a Docker stack deploy to enable the new is_gui container.

Restricting the Number of Replicas

If you use the max_replicas_per_node option in the docker-compose-override file to restrict the number of replicas that will run on each Swarm node, you should also specify more than one replica for the replicas option.

ScienceLogic recommends using three replicas for the gui service: one running on each of the core nodes. The default max_replicas_per_node value is 1 for the gui service.

For the gui service, the max_replicas_per_node option is set, but the gui service is not pinned to the core nodes using labels. If there are more than three nodes (core and worker nodes), the gui service replicas can run on any node, so you will need to add a restriction to the docker-compose-override file, such as the following:

placement:
  constraints:
    - node.labels.types == master

If you update the docker-compose-override file, you will need to redeploy the corresponding service. In the example below, you will need to redeploy the gui service after you update the override file:

gui:
  deploy:
    replicas: 3
    ... 
    placement:
      max_replicas_per_node: 1
  environment:
    ... 

You can also configure the max_replicas_per_node option for the contentapi, dexserver, and steprunners services, and any other Skylar Automation services that have more than one replica.

The max_replicas_per_node option is available with Compose file format version 3.8 or later. Add version: '3.8' at the start of the docker-compose file to ensure compatibility.
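
When you combine replicas with max_replicas_per_node, the number of eligible nodes multiplied by the per-node limit is the most replicas that Docker Swarm can place; any replicas beyond that stay pending. The following minimal sketch checks a planned configuration. Both function names are hypothetical.

```shell
# Print the largest number of replicas that can be scheduled.
# Arguments: <eligible nodes> <max_replicas_per_node>
# max_schedulable_replicas and replicas_fit are hypothetical helper names.
max_schedulable_replicas() {
  echo $(( $1 * $2 ))
}

# Succeed when the requested replicas fit on the eligible nodes.
# Arguments: <replicas> <eligible nodes> <max_replicas_per_node>
replicas_fit() {
  [ "$1" -le "$(max_schedulable_replicas "$2" "$3")" ]
}

# Usage:
#   replicas_fit 3 3 1 && echo "gui replicas fit on the core nodes"
```

For the gui example above, three replicas with a limit of one per node need at least three eligible nodes.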

HAProxy Configuration (Optional)

As a convenience, ScienceLogic provides an example configuration for the HAProxy load balancer below. Please note that it is your responsibility to configure the load balancer. ScienceLogic cannot be held responsible for any deployments that deviate from the example HAProxy load balancer configuration.

The following example configuration describes using HAProxy as a load balancer:

Code Example: HAProxy as Load Balancer

Known Issues

The following section describes the known issues you might encounter with the High Availability solution and how to address those issues.

Docker container on last swarm node cannot communicate with other swarm nodes

This is an issue with the Encapsulating Security Payload (ESP) protocol not being enabled in firewalld. You can enable the ESP protocol with the firewalld docker-swarm script.

To address this issue, add the following firewall rule to each node:

firewall-cmd --add-protocol=esp --permanent

firewall-cmd --reload

Couchbase service does not start, remains at nc -z localhost

To address this issue, stop the container where this is happening and remove its persistent volume:

rm -rf /var/data/couchbase

Couchbase-worker fails to connect to master

A connection failure might happen a few times when a stack is freshly deployed. You can ignore these messages, and the worker should eventually connect to the master.

Couchbase database stops unexpectedly and the disk is full

If you are running large or customized workloads, you might encounter a situation where Couchbase stops unexpectedly because the disk is full. To prevent this situation, review the considerations in What should I do if the Couchbase disk is full, indexer is crashing, and the database is unusable?.

Couchbase rebalance fails with "Rebalance exited" error

In this situation, you receive the following error:

Rebalance exited with reason {service_rebalance_failed,index,{linked_process_died,<12807.821.0>,{no_connection,"index-service_api"}}}

If the Couchbase rebalance fails on the initial rejoin of a failed node into a cluster, you should check the index states and wait until the indexes are no longer in a warmup state. After the indexes are created on that node, the rebalance should succeed.

When setting up a three-node High Availability Couchbase cluster, the second node does not appear

In this situation, if you have cloned any of the nodes, the nodes might think that there is a split-brain condition.

To address this issue, delete the Couchbase data on the newly added nodes by running the following command on each node:

rm -rf /var/data/couchbase/*

The Skylar Automation user interface fails to start after a manual failover of the swarm node

To address this issue, run the following commands on the relevant node:

docker stack rm iservices

systemctl restart docker

docker stack deploy -c docker-compose.yml iservices

The Skylar Automation user interface returns 504 errors

Ensure that your Skylar Automation systems have been updated with yum upgrade.

NTP should be used, and all node times should be in sync

If the times on all nodes are not in sync, you might experience issues with the iservices_steprunners.

The following is an example of a Docker Swarm error caused by the time not being in sync:

Error response from daemon: certificate (1 - 2v4umws4pxag6kbxaelwfl3vf) not valid before Fri, 30 Nov 2018 13:47:00 UTC, and it is currently Fri, 30 Nov 2018 06:41:24 UTC: x509: certificate has expired or is not yet valid

For more information, see Preparing the Skylar Automation System for High Availability.

Example Logs from Flower

iservices_flower.1.jg6glaf298d2@is-scale-05 | [W 181023 20:17:40 state:113] Substantial drift from celery@1ee384863e37 may mean clocks are out of sync. Current drift is
iservices_flower.1.jg6glaf298d2@is-scale-05 | 18 seconds. [orig: 2018-10-23 20:17:40.090473 recv: 2018-10-23 20:17:58.486666]
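
The drift reported in a log entry like the one above is the difference between the clocks of two nodes. The following minimal sketch compares two timestamps expressed in seconds since the epoch, such as the output of date +%s collected from two nodes. The function names and the threshold argument are hypothetical; this document does not define an acceptable drift value.

```shell
# Print the absolute difference, in seconds, between two epoch timestamps.
# drift_seconds and drift_within are hypothetical helper names.
drift_seconds() {
  local first="$1"
  local second="$2"
  local diff=$(( second - first ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  echo "$diff"
}

# Succeed when the drift does not exceed the given number of seconds.
# Arguments: <first timestamp> <second timestamp> <maximum drift>
drift_within() {
  [ "$(drift_seconds "$1" "$2")" -le "$3" ]
}

# Usage (not executed here), with timestamps gathered from two nodes:
#   drift_within "$node1_epoch" "$node2_epoch" 1 || echo "Check NTP on these nodes"
```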