Appendix B: Configuring the SL1 PowerFlow System for Multi-tenant Environments

This appendix describes the best practices and troubleshooting solutions for deploying PowerFlow in a multi-tenant environment that supports multiple customers in a highly available fashion. This section also covers how to perform an upgrade of PowerFlow with minimal downtime.

Quick Start Checklist for Deployment

  1. Deploy and cluster the initial High Availability stack. Label these nodes as "core".
  2. Create the PowerFlow configuration object for the new PowerFlow systems. The configuration object includes the SL1 IP address, the ServiceNow user and domain, and other related information.
  3. Deploy and cluster the worker node or nodes for the customer.
  4. Label the worker node or nodes specifically for the customer.
  5. Update the docker-compose.yml file on a core node:
  • Add two steprunner services for each customer, one for real-time eventing, and one for backlogged events, labeled based on the organization name: acme and acme-catchups.
  • Update the new steprunner hostnames to indicate who the steprunner works for.
  • Update the new steprunner deploy constraints to deploy only to the designated labels.
  • Update the new steprunner user_queues environment variable to only listen on the desired queues.
  6. Schedule the required PowerFlow integrations:
  • Run Device Sync daily, if desired
  • Correlation queue manager running on the catchup queue
  7. Modify the Run Book Automations in SL1 to trigger the integration to run on the queue for this customer:
  • Modify the IS_PASSTHROUGH dictionary to include the "queue" setting.
  • Specify the configuration object to use in PowerFlow for this SL1 instance.

Deployment

The following sections describe how to deploy PowerFlow in a multi-tenant environment. After the initial High Availability (HA) core services are deployed, the multi-tenant environment differs in the deployment and placement of workers and use of custom queues.

Core Service Nodes

For a multi-tenant deployment, ScienceLogic recommends that you dedicate at least three nodes to the core PowerFlow services. These core PowerFlow services are shared by all workers and customers. As a result, it is essential that these services are clustered to handle failovers.

Because these core services are critical, ScienceLogic recommends that you initially allocate a fairly large amount of resources to these services. Allocating more resources than necessary to these nodes allows you to further scale workers in the future. If these nodes become overly taxed, you can add another node dedicated to the core services in the cluster.

These core services nodes are dedicated to the following services:

  • API
  • UI
  • RabbitMQ
  • Couchbase
  • Redis

It is critical to monitor these core service nodes, and to always make sure these nodes have enough resources for new customers and workers as they are on-boarded.

To ensure proper failover and persistence of volumes and cluster information, the core services must be pinned to each of the nodes. For more information, see Configuring Core Service Nodes, below.

Requirements

Three nodes (or more for additional failover support) with six CPUs and 56 GB memory each.

Configuring Core Service Nodes

Critical Elements to Monitor on Core Nodes

  • Memory utilization: Warnings at 80%
  • CPU utilization: Warnings at 80%
  • RabbitMQ queue sizes (can also be monitored from the Flower API or the PowerFlow user interface; see the example command after this list)
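
For a quick manual check of queue depths, you can also list the queues from inside the RabbitMQ container on a core node. This is a sketch only; the iservices_rabbit name filter is an assumption based on the stack's service naming and might differ in your deployment:

    # list queue names, message counts, and consumer counts from the RabbitMQ container
    docker exec $(docker ps -q --filter name=iservices_rabbit | head -n 1) \
      rabbitmqctl list_queues name messages consumers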

Worker Service Nodes

Separate from the core services are the worker services. These worker services are intended to be deployed on nodes separate from the core services and from other workers, and they provide processing only for specified dedicated queues. Separating the VMs or nodes where worker services are deployed ensures that one customer's workload, no matter how heavy it gets, will not negatively affect the core services or other customer workloads.

Requirements

The resources allocated to the worker nodes depend on the worker sizing chosen: the more resources provided to a worker, the faster its throughput. Below is a brief guideline for sizing. Note that even if you exceed the number of event syncs per minute, events will simply be queued, so the sizing does not have to be exact. The sizing below is only a suggested guideline.

Event Sync Throughput Node Sizing

CPU | Memory | Worker count | Time to sync a queue full of 10,000 events | Events synced per second
2 | 16 GB | 6 | 90 minutes | 1.3
4 | 32 GB | 12 | 46 minutes | 3.6
8 | 54 GB | 25 | 16.5 minutes | 10.1

Test Environment and Scenario

  • Each Event Sync consists of PowerFlow workers reading from the pre-populated queue of 10,000 events. The sync interprets, transforms, and then POSTs the new event as a correlated ServiceNow incident into ServiceNow. The process then queries ServiceNow for the new sys_id generated for the incident, transforms it, and POSTs it back to SL1 as an external ticket to complete the process.
  • Tests were performed on a node running workers only.
  • Tests were performed with a 2.6 GHz virtualized CPU in a vCenter VM. Both SL1 and ServiceNow were responding quickly during the tests.
  • Tests were performed with a pre-populated queue of 10,000 events.
  • Tests were performed with the currently deployed version of the Cisco custom integration. Data will be gathered again for the next version when it is completed by Professional Services.
  • Each event on the queue consisted of a single correlated event.

Configuring the Worker Node

  • Install the PowerFlow RPM on the new node.
  • See the High Availability section for information about how to join the cluster as a manager or worker, and copy the /etc/iservices/encryption_key and /etc/iservices/is_pass files from a core service node to the new worker node (same location and permissions). Example commands are provided after this list.
  • By default, the worker will listen on and accept work from the default queue, which is used primarily by the user interface, and any integration run without a custom queue.
  • To configure this worker to run customer-specific workloads with custom queues, see Onboarding a Customer.
  • Modify the docker-compose.yml on a core service node accordingly.
  • If you just want the node to accept default work, the only change necessary is to increase the worker count using the table provided in the Requirements section.
  • If you want the node to be customer specific, be sure to add the proper labels and set up custom queues for the worker in the docker-compose file when deploying. This information is contained in the Onboarding a Customer section.
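
At the host level, the setup of a new worker node looks roughly like the following. This is a sketch only: the RPM file name, node addresses, and join token are placeholders, and the join token must be generated on an existing manager node with docker swarm join-token worker.

    # on the new worker node: install the PowerFlow RPM
    sudo rpm -ivh sl1-powerflow-<version>.rpm

    # copy the shared secrets from a core service node (same location and permissions)
    scp <core-node>:/etc/iservices/encryption_key /etc/iservices/encryption_key
    scp <core-node>:/etc/iservices/is_pass /etc/iservices/is_pass

    # join the existing swarm as a worker (token comes from a manager node)
    docker swarm join --token <worker-join-token> <manager-ip>:2377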

Initial Worker Node Deployment Settings

For proper functionality, at least one worker instance must always be listening on the default queue. The default worker can run on any node.

Worker Failover Considerations and Additional Sizing

When deploying a new worker, especially if it is going to be dedicated to a custom queue, it is wise to consider deploying an extra worker listening on the same queues. If you have only a single worker node listening on a dedicated customer queue, there is potential for that queue's processing to stop completely if that single node fails.

For this reason, ScienceLogic recommends that for each customer dedicated worker you deploy, you deploy a second one as well. This way there are two nodes listening to the customer dedicated queue, and if one node fails, the other node will continue processing from the queue with no interruptions.

When deciding on worker sizing, it's important to take this into consideration. For example, if you have a customer that requires a four-CPU node for optimal throughput, an option would be to deploy two nodes with two CPUs, so that there is failover if one node fails.

  • How to know when more resources are necessary
  • Extra worker nodes ready for additional load or failover

Knowing When More Resources are Necessary for a Worker

Monitoring the memory, CPU and pending integrations in queue can give you an indication of whether more resources are needed for the worker. Generally, when queue times start to build up, and tickets are not synced over in an acceptable time frame, more workers for task processing are required.

Although more workers will process more tasks, they will be unable to do so if the memory or CPU required by the additional workers is not present. When adding additional workers, it is important to watch the memory and CPU utilization: as long as utilization is under 75%, it should be okay to add another worker. If utilization is consistently over 80%, then you should add more resources to the system before adding additional workers.

Keeping a Worker Node on Standby for Excess Load Distribution

Even if you have multiple workers dedicated to a single customer, there are still scenarios in which a particular customer queue spikes in load and you'd like an immediate increase in throughput to handle it. In this scenario you don't have the time to deploy a new PowerFlow node and configure it to distribute the load, as you need the increased throughput immediately.

This can be handled by having a node on standby. This node has the same PowerFlow RPM version installed and sits idle in the stack (or is turned off completely). When a spike happens and you need more resources to distribute the load, you can apply the label corresponding to the customer whose queues spiked. After setting the label on the standby node, you can scale up the worker count for that particular customer. With the standby node now labeled for work for that customer, additional worker instances will be distributed to and started on it.

When the spike has completed, you can return the node to standby by reversing the above process. Decrease the worker count to what it was earlier, and then remove the customer specific label from the node.

Critical Elements to Monitor in a Steprunner

  • Memory utilization: Warnings at 80%
  • CPU utilization: Warnings at 80%
  • Successful, failed, active tasks executed by steprunner (retrievable from Flower API or PowerPack)
  • Pending tasks in queue for the worker (retrievable by Flower API or PowerPack)
  • Integrations in queue (similar information here as in pending tasks in queue, but this is retrievable from the PowerFlow API).

Advanced RabbitMQ Administration and Maintenance

This section describes how multi-tenant deployments can use separate virtual hosts and users for each tenant.

Using an External RabbitMQ Instance

In certain scenarios, you might not want to use the default RabbitMQ queue that is prepackaged with PowerFlow. For example, you might already have a RabbitMQ production cluster available that you want to connect with PowerFlow. You can do this by defining a new virtual host in RabbitMQ and then configuring the PowerFlow broker URL for the contentapi, steprunner, and scheduler services so that they point to the new virtual host.

Any use of an external RabbitMQ server will not be officially supported by ScienceLogic if there are issues in the external RabbitMQ instance.

Setting a User other than Guest for Queue Connections

When communicating with RabbitMQ in the swarm cluster, all communication is encrypted and secured within the overlay Docker network.

To add another user, or to change the user that PowerFlow uses when communicating with the queues:

  1. Create a new user in RabbitMQ that has full permissions to a virtual host (see the example commands after these steps). For more information, see the RabbitMQ documentation.
  2. Update the broker_url environment variable with the new credentials in the docker-compose file and then re-deploy.
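
A minimal sketch of step 1 using rabbitmqctl from inside the RabbitMQ container (the user, password, and virtual host names below are placeholders):

    # create a dedicated virtual host and user, and grant the user full permissions on it
    rabbitmqctl add_vhost customer1-vhost
    rabbitmqctl add_user customer1-user <password>
    rabbitmqctl set_permissions -p customer1-vhost customer1-user ".*" ".*" ".*"

The broker_url from step 2 would then reference customer1-user and customer1-vhost, as shown in the next section.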

Configuring the Broker (Queue) URL

When using an external RabbitMQ system, you need to update the broker_url environment variable in the contentapi, steprunner, and scheduler services. You can do this by modifying the environment section of the services in docker-compose and changing broker_url. The following line is an example:

broker_url: 'pyamqp://username:password@rabbitmq-hostname/v-host'
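
For context, this setting sits under each affected service's environment section in docker-compose-override.yml. The snippet below is a sketch with placeholder credentials and virtual host; repeat the same change for the contentapi and scheduler services:

    steprunner:
      environment:
        broker_url: 'pyamqp://username:password@rabbitmq-hostname/v-host'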

Creating Specific Queues for Customers

When a new SL1 system is onboarded into PowerFlow, its integrations are executed on the default queue by default. In large multi-tenant environments, ScienceLogic recommends separate queues for each customer. If desired, each customer can also have additional dedicated queues.

For more information about queues, see PowerFlow Queue FAQs.

Create the Configuration Object

The first step to setting up a new PowerFlow system is to create a configuration object with variables that will satisfy all PowerFlow applications. The values of these should be specific to the new system (such as SL1 IP address, username, password).

See the example configuration for a template you can fill out for a new system.

Because integrations might update their variable names from EM7 to SL1 in the future, ScienceLogic recommends covering variables with both the em7_ and sl1_ prefixes. The example configuration contains this information.

Label the Worker Node Specific to the Customer

For example, if you want a worker node to be dedicated to a customer called "acme", you could create a node label called "customer" and set the value of the label to "acme". Setting this label now makes it easier to cluster in additional workers and distribute load dynamically in the future.

Creating a Node Label

This topic outlines creating a label for a node. Labels provide the ability to deploy a service to specific nodes (determined by labels) and to categorize the nodes for the work they will be performing. Take the following actions to set a node label:

# get the list of nodes available in this cluster (must run from a manager node)
docker node ls

# example of adding a label to a docker swarm node
docker node update --label-add customer=acme <node id>
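
To confirm the label was applied, you can inspect the node from a manager (a hedged example; the output format can vary by Docker version):

# verify the labels currently set on the node
docker node inspect <node id> --format '{{ .Spec.Labels }}'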

Placing a Service on a Labeled Node

After you create a node label, refer to the example below for updating your docker-compose-override.yml file and ensuring the desired services deploy to the matching labeled nodes:

# example of placing workers on a specific labeled node

steprunner-acme:
  ...
  deploy:
    placement:
      constraints:
        - node.labels.customer == acme
    resources:
      limits:
        memory: 1.5G
    replicas: 15
  ...

Creating a Queue Dedicated to a Specific Application or Customer

You can create a new queue that is specific to a PowerFlow application or to a customer to ensure that work and events created from one system will not affect or slow down work created from another system, provided the multi-tenant system has enough resources allocated.

In the example below, we created two new queues in addition to the default queue and allocated workers to them. Both of these worker services use separate queues as described below, but run on the same labeled worker node.

New Queues to Deploy:

  • acmequeue. The queue used to sync events from a customer called "acme". Only event syncs and other integrations for "acme" will run on this queue.
  • acmequeue-catchup. The queue where any old events that should have already synced (but failed to, due to PowerFlow not being available or another reason) will run. Running these catchup integrations on a separate queue ensures that real-time event syncing isn't delayed in favor of an older event catching up.

Add Workers for the New Queues

First, define additional workers in our stack that are responsible for handling the new queues. All modifications are made in docker-compose-override.yml:

  1. Copy an existing steprunner service definition.
  2. Change the steprunner service name to something unique for the stack. For this example, use the following names:
  • steprunner-acmequeue
  • steprunner-acmequeue-catchup
  3. Modify the replicas value to specify how many workers should be listening to this queue:
  • steprunner-acmequeue will get 15 workers because it is expecting a very heavy load
  • steprunner-acmequeue-catchup will get three workers because it will not run very often
  4. Add a new environment variable labeled user_queues. This environment variable tells the worker which queues to listen to:
  • steprunner-acmequeue will set user_queues="acmequeue"
  • steprunner-acmequeue-catchup will set user_queues="acmequeue-catchup"
  5. To ensure that these workers can be easily identified for the queue to which they are assigned, modify the hostname setting:
  • hostname: "acmequeue-{{.Task.ID}}"
  • hostname: "acmequeue-catchup-{{.Task.ID}}"
  6. After the changes have been made, run /opt/iservices/scripts/compose-override.sh to validate that the syntax is correct.
  7. When you are ready to deploy, re-run docker stack deploy with the new compose file.

Code Example: docker-compose entries for new steprunners

After these changes have been made, your docker-compose entries for the new steprunners should look similar to the following.
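
The entries below are an illustrative sketch only; copy the image, volumes, networks, and any other settings from your existing steprunner definition rather than from this example:

    steprunner-acmequeue:
      image: <same image as your existing steprunner service>
      hostname: "acmequeue-{{.Task.ID}}"
      environment:
        user_queues: "acmequeue"
      deploy:
        replicas: 15
        placement:
          constraints:
            - node.labels.customer == acme

    steprunner-acmequeue-catchup:
      image: <same image as your existing steprunner service>
      hostname: "acmequeue-catchup-{{.Task.ID}}"
      environment:
        user_queues: "acmequeue-catchup"
      deploy:
        replicas: 3
        placement:
          constraints:
            - node.labels.customer == acme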

Once deployed via docker stack deploy, you should see the new workers in Flower.

You can verify the queues being listened to by looking at the "broker" section of Flower, or by clicking into a worker and clicking the Queues tab.

Adding a PowerFlow Application to a Specific Queue

To add a PowerFlow application to a specific queue:

  1. Use Postman or cURL to do a GET to load the list of PowerFlow applications:

    GET <PowerFlow_hostname>/api/v1/applications

    where <PowerFlow_hostname> is the IP address or URL for your PowerFlow system.

  2. Locate the "id" value for the PowerFlow application name you want to use, and include that value to load the specific application by name:

    GET <PowerFlow_hostname>/api/v1/applications/<application_name>

    For example:

    GET 10.1.1.22/api/v1/applications/interface_sync_sciencelogic_to_servicenow

  3. Copy the entire JSON code and save it to a file with the same name as the application from step 2.

  4. Edit the JSON code for the application by adding the following line to the initial code block, after "id" or "progress":

    "queue": "<queue_name>"

    For example:

    "queue": "acmequeue"

  5. Upload the updated application using the iscli tool:

    iscli -uaf <application-name> -H <PowerFlow_hostname> -p <password>
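
Putting these steps together, the workflow looks roughly like the following. This is a sketch only; basic authentication with the isadmin user is an assumption, so substitute whatever credentials and hostname your system uses:

    # steps 1-3: download the application JSON
    curl -k -u isadmin:<password> \
      "https://<PowerFlow_hostname>/api/v1/applications/interface_sync_sciencelogic_to_servicenow" \
      -o interface_sync_sciencelogic_to_servicenow.json

    # step 4: add the queue to the initial code block of the JSON, for example:
    #   "queue": "acmequeue",

    # step 5: upload the edited application with the iscli tool
    iscli -uaf interface_sync_sciencelogic_to_servicenow -H <PowerFlow_hostname> -p <password>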

Create Application Schedules and Automation Settings to Utilize Separate Queues

After the workers have been configured with specific queue assignments, schedule your PowerFlow applications to run on those queues, and configure Run Book Automations (RBAs) to place the applications on those queues.

Scheduling an Application with a Specific Queue and Configuration

To run an application on a specific queue using a configuration for a particular system, you can use the Custom Parameters override available in the scheduler. Below is an example of a scheduled application that utilizes the acmequeue-catchup queue:

In the example above, the cisco_correlation_queue_manager is scheduled to run every 300 seconds, using the acme configuration, and will run on the acmequeue-catchup queue. You can have any number of scheduled runs per application. If you add additional customers, you would add a new schedule entry with a different configuration and queue for each.

Configuring Applications to Utilize a Specific Queue and Configuration

The last step to ensure that integrations run properly for your newly onboarded SL1 system is to update the Run Book Automations in SL1 to provide the configuration and queue to use when the Run Book Automation triggers an event.

Modify the Event-correlation policy with the following changes:

  1. IS4_PASSTHROUGH= {"queue":"acmequeue"}
  2. CONFIG_OVERRIDE= 'acme-scale-config'

PowerFlow Queue FAQs

This section contains a list of frequently asked questions and answers about queues in PowerFlow.

What is RabbitMQ, and what messages are placed in it?

RabbitMQ is the queueing service used for most PowerFlow deployments. When a PowerFlow application is run, each step of the application (and a few other internal steps) gets placed in the queue for processing.

Some customers might have multiple queues corresponding to specific workloads.

What does it mean when the queue reports a high message count?

When a queue reports a high message count, it means that there are many tasks in the queue waiting to be processed. When this occurs, additional tasks placed in the queue might be delayed in their execution until the previous tasks have completed. A high message count is not always an actionable concern.

When should I be concerned about a high message count?

A high message count can simply mean that there is a large workflow being processed on a periodic schedule for a short period of time, like syncing many devices. If the queue goes up and then back down periodically, especially at the same time every day, it could be normal behavior depending on the customer's schedules, and as a result should not be a cause for concern.

A high message count could be an issue if:

  • The queue count has been increasing steadily over a period of time and never decreasing.
  • There is no established pattern of a high queue count at that time of day for that system.

How can I tell what is currently in queue to be processed?

In the PowerFlow user interface, go to the Control Tower page and scroll down to the All Applications circle graph. If you click the Pending (yellow) or Started (blue) elements on the circle graph, a list of applications in that state displays below the graphs:

Alternatively, you can check the steprunner logs to see exactly what they are processing, or you can access Flower (the tool used for monitoring PowerFlow tasks and workers) at https://<IP of PowerFlow>/flower/dashboard. When viewing tasks on the Tasks tab, look in the kwargs column for the sn (step name) and an (app name) values to see what the task is for.

How can I tell what caused the queue backlog?

A queue backlog might be caused by two possible scenarios:

  1. Over-scheduling applications in PowerFlow (most likely). In this situation, PowerFlow applications have been scheduled more frequently than they actually take to complete a run. In other words, if an application is scheduled to run every minute, but the app actually takes five minutes to run, there will inevitably be a backlog.

    To check if this is the case, review the current schedules for your PowerFlow applications. If there are any schedules that are set to run frequently (multiple times an hour), check how long they are taking to run by adding up the steps' run times (visible in the step logs).

  2. Event Flood triggered from SL1. In this situation, an event flood with more events than typical coming from SL1 causes a high message count.

    To check if this is the case, log in to SL1 and check to make sure that the run book automation policies are reasonable, and ensure that any run book automations for ServiceNow are not configured to run on notice events, for example.

    The following is an example of an SL1 database query that can be used to show run book automation triggers over time:

    select date_format(notify_stamp, '%Y-%m-%d %H') as date, count(*)
    from master_events.events_cleared
    where notify_count > 0 and notify_stamp > now() - interval 34 hour
    group by date order by date desc;

What do I do if the high message count was caused by over-scheduling?

If the cause of the queue backlog is due to over-scheduling, then you must assess which PowerFlow applications are taking too long, and then correct the schedule for those applications. Additionally, you should try to understand why these applications were scheduled so frequently in the first place:

  • Did you simply make a mistake and over-schedule unintentionally?
  • Is there a reason that your apps are now taking longer to run than before? A typical cause of this is a large amount of orphaned open incidents in ServiceNow; reducing the number of these events will reduce app run time. Another potential cause is if you recently onboarded many additional devices, and your schedules need to be adjusted for that.
  • If you are using custom applications that need to be run often, you should investigate why these applications need to run so frequently, and you should assess whether your PowerFlow instance needs to be upsized.

What do I do if the high message count was caused by an SL1 event flood?

If the cause of the queue backlog was due to an SL1 event flood, the best thing to do is to determine what events caused it, and whether policies need to be changed:

  • Assess the run book automation policies that triggered the event flood to see if there is a recommended update, such as not triggering an event for notice level events
  • If you feel that your automation policies need to be extremely granular and should not change, you might need to assess whether your PowerFlow instance needs to be upsized.

After the event flood has been reconciled, you can clear the queue to return to fast processing.

How can I clear messages from the queue?

You can clear messages from the queue in two ways (eventually this will be possible from the Control Tower page):

  1. From the RabbitMQ user interface:

    • Click the Queues tab and select the queue you would like to purge.
    • Scroll down to the Purge drop-down.
    • Click the Purge Messages button.
  2. From the PowerFlow node, run the following command:

    docker exec $(docker ps -q --filter name=iservices_steprunner*|head -n 1) celery --app=ipaascommon.celeryapp:app purge -f -Q celery

    Change -Q celery to -Q <other-queue-name> to purge a different queue than the default queue.

Why are PowerFlow applications still showing as "Pending" after I cleared the queue?

PowerFlow applications that have been placed in the queue but have not yet run are in the "Pending" status. If those tasks are forcefully purged from the queue, the application never gets the chance to move to another state.

After clearing a queue, it is expected that previously queued applications (which are now cleared) will display as "Pending", and they will never run.

Rather than looking at the previous runs, you should validate processing behavior by triggering a new run of any application. If the queue backlog is cleared, you should see the application go from "Pending" to "Started" almost instantaneously.

Why are messages stuck in the broadcast queue in RabbitMQ?

The syncpack_steprunners use the broadcast queue only to install SyncPacks across all nodes. Every syncpack_steprunner (which runs on every node) creates its own broadcast queue.

If a syncpack_steprunner is restarted, it creates a new broadcast queue, leaving the old one from the previous replica around. When those broadcast queues from old syncpack_steprunner containers are left around, and a user clicks "activate/install syncpack", that message gets placed on all broadcast queues, including the ones left around from old workers.

You can use the pfctl remove_rabbit_non_consumer_queues action to clean up these unwanted queues; this action is also called by the autoheal action.

Failure Scenarios

This topic covers how PowerFlow handles situations where certain services fail.

Worker Containers

In case of failure, when can the worker containers be expected to restart?

  • The worker containers have a strict memory limit of 2 GB. These containers may be restarted if the service requests more memory than the 2 GB limit.
  • The restart is done by a SIGKILL from the OOM killer on the Linux system.

What happens when a worker container fails?

  • Worker containers are ephemeral, and simply execute the tasks allotted to them. There is no impact to a worker instance restarting.

What processing is affected when service is down?

What data can be lost?

  • Workers contain no persistent data, so there is no data to lose, other than the data from the current task being executed on that worker when it shuts down (if there is one).
  • Any PowerFlow application that fails can be replayed (and re-executed by the workers) on demand with the /api/v1/tasks/<task-id>/replay endpoint.
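
For example, a failed run could be replayed with a call like the following (the POST method and basic authentication shown here are assumptions; adjust them to match your API access):

    # re-queue a failed PowerFlow application run by task ID
    curl -k -u isadmin:<password> -X POST "https://<PowerFlow_hostname>/api/v1/tasks/<task-id>/replay"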

API

When can the API be expected to restart?

  • The API also has a default memory limit. As with the steprunners (worker containers), if the memory limit is reached, the API is restarted by a SIGKILL from the OOM killer on the Linux system to prevent a memory leak.

What happens when it fails?

  • On the clustered system, there are always three contentapi services, so if one of the API containers fails, API requests will still be routed to the functioning containers through the internal load balancer.

What processing is affected when service is down?

  • If none of the API services are up and running, any Run Book Automation calls to sync an incident through PowerFlow results in an error. The next time that scheduled integration runs, the integration recognizes the events that failed to send to PowerFlow, and the integration will process them so that the events sync.
  • Even if the API is down, the events that were generated while it was down will be synced by the scheduled application. PowerFlow will reach out to SL1 for those items that SL1 failed to post to PowerFlow.

What data can be lost?

  • The API contains no persistent data, so there is no risk of data loss.

Couchbase

If a core service node running Couchbase fails, the database should continue to work normally and continue processing events, as long as a suitable number of clustered nodes are still up and running. Three core service nodes provides automatic failover handling of one node, five core service nodes provides automatic failover handling of two nodes, and so on. See the High Availability section for more information.

If there are enough clustered core nodes still running, the failover will occur with no interruptions, and the failing node can be added back at any time with no interruptions.

NOTE: For optimal performance and data distribution after rejoining a cluster, you can click the Re-balance button from the Couchbase user interface, if needed.

If there are not enough clustered core nodes still running, then you will have to manually fail over the Couchbase Server. In this scenario, since automatic failover could not be performed (due to too few nodes being available), there will be a disruption in event processing. For more information, see the Manual Failover section.

In case of failure, when can Couchbase be expected to restart?

  • In ideal conditions, the Couchbase database should not restart, although Couchbase might be restarted when the node it is running on is over-provisioned. For more information, see the known issue.

What happens when it fails?

  • Each Couchbase node in the cluster contains a fully replicated set of data. If any single node fails, automatic failover will occur after the designated time (120 seconds by default). Automatic failover, processing, and queries to the database will continue without issue.
  • If the database is simply restarted and is not down for longer than the failover timeout (120 seconds by default), then the system will not automatically fail over, and the cluster will still be maintained.
  • If two out of three of the database nodes fail for a period of time, processing may be paused until a user takes manual failover action. These manual actions are documented in the Manual Failover section.

What processing is affected when service is down?

  • In the event of an automatic failover (1/3 node failure), no processing will be affected and queries to the database will still be functional.
  • In the event of a larger failure (2/3 node failure), automatic failover will not happen, and manual intervention may be needed before you can query the database again.

What data can be lost?

  • Every Couchbase node has full data replication between each of the three nodes. In the event of a failure of any of the nodes, no data is lost, as a replicated copy exists across the cluster.

RabbitMQ

RabbitMQ clustered among all core service nodes provides full mirroring to each node. So long as there is at least one node available running RabbitMQ, the queues should exist and be reachable. This means that a multiple node failure will have no effect on the RabbitMQ services, and it should continue to operate normally.

In case of failure, when can RabbitMQ be expected to restart?

  • Similar to the Couchbase database, in a smooth-running system, RabbitMQ should never really restart.
  • As with other containers, RabbitMQ might be restarted when the node it is running on is over-provisioned. For more information, see the known issue.

What happens when RabbitMQ fails?

  • All RabbitMQ nodes in the cluster mirror each other's queues and completely replicate the data between them. The data is also persisted.
  • In the event of any RabbitMQ node failure, the other nodes in the cluster will take over responsibility for processing its queues.
  • If all RabbitMQ nodes are restarted, their messages are persisted to disk, so any tasks or messages sitting in queue at the time of the failure are not lost, and are reloaded once the system comes back up.
  • In the event of a network partition ("split-brain" state) RabbitMQ will follow the configured partition handling strategy (default autoheal).
  • For more information, see https://www.rabbitmq.com/partitions.html#automatic-handling.

What processing is affected when service is down?

  • When this service is down completely (no nodes running), the API may fail to place event sync tasks onto the queue. As such, any Run Book Automation calls to sync an incident through PowerFlow will result in an error.
  • These failed event syncs are then placed in a database table in SL1 which PowerFlow queries on a schedule every few minutes. The next time that scheduled integration runs, the integration recognizes the events that failed to send to PowerFlow, and the integration will process them so that the events sync.

What data can be lost?

  • All data is replicated between nodes, so there is little risk of data loss.
  • The only time there may be loss of tasks in queues is if there is a network partition, also called a "split-brain" state.

PowerFlow User Interface

In case of failure, when can the user interface be expected to restart?

  • The PowerFlow user interface (GUI) should not restart unless a user forcefully restarts it.
  • The PowerFlow user interface might be restarted when the node it is running on is over-provisioned. For more information, see the known issue.

What happens when it fails?

  • The GUI service provides the proxy routing throughout the stack, so if the GUI service is not available, Run Book Automation POSTs to PowerFlow will fail. However, as with an API failure, if the Run Book Actions cannot POST to PowerFlow, those events will be placed in a database table in SL1 that PowerFlow queries on a schedule every few minutes. The next time that scheduled integration runs, the integration recognizes the events that failed to send to PowerFlow, and the integration will process them so that the events sync.
  • When the GUI service is down and SL1 cannot POST to it, the syncing of the events might be slightly delayed, as the events will be pulled in and created with the next run of the scheduled integration.

What data can be lost?

  • The PowerFlow user interface persists no data, so there is no risk of any data loss.

Redis

If the Redis service fails, it will automatically be restarted and will be available again in a few minutes. The impact is that task processing in PowerFlow is delayed slightly, as the worker services pause themselves and wait for the Redis service to become available again.

Consistent Redis failures

Consistent failures and restarts in Redis typically indicate that your system has too little memory, or that the Redis service memory limit is set too low or not set at all. PowerFlow ships with a default memory limit of 8 GB to ensure that the Redis service only ever uses 8 GB of memory, and it ejects entries if it is going to go over that limit. This limit is typically sufficient, though if you have enough workers running large enough integrations to overfill the memory, you may need to increase the limit.

Before increasing Redis memory limit, be sure that there is suitable memory available to the system.

Known Issue for Groups of Containers

If you see multiple containers restarting at the same time on the same node, it indicates an over-provisioning of resources on that node. This only occurs on Swarm manager nodes, as the nodes are not only responsible for the services they are running, but also for maintaining the Swarm cluster and communicating with other manager nodes.

If resources become over-provisioned on one of those manager nodes (as they were with the core nodes when this failure was observed), the Swarm manager will not be able to perform its duties and may cause a Docker restart on that particular node. This failure is indicated by "context deadline exceeded" and "heartbeat failures" in the logs from running journalctl --no-pager | grep docker | grep err.

This is one of the reasons why docker recommends running “manager-only” nodes, in which the manager nodes are only responsible for maintaining the Swarm, and not responsible for running other services. If any nodes that are running PowerFlow services are also a Swarm manager, make sure that the nodes are not over-provisioned, otherwise the containers on that node may restart. For this reason, ScienceLogic recommends monitoring and placing thresholds at 80% utilization.

To combat the risk of over-provisioning affecting the docker Swarm manager, apply resource constraints on the services for the nodes that are also Swarm managers, so that docker operations always have some extra memory or CPU on the host to do what they need to do. Alternatively, you can only use drained nodes, which are not running any services, as Swarm managers, and not apply any extra constraints.

For more information about Swarm management, see https://docs.docker.com/engine/Swarm/admin_guide/.

Examples and Reference

Code Example: A Configuration Object
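
The full template ships with PowerFlow; the fragment below is only an illustrative sketch of the variable entries such a configuration object contains, with both em7_ and sl1_ prefixed names as recommended earlier in this appendix. The field and variable names here are assumptions, not an authoritative schema:

    {
      "encrypted": false,
      "name": "sl1_host",
      "value": "<SL1 IP address>"
    },
    {
      "encrypted": false,
      "name": "em7_host",
      "value": "<SL1 IP address>"
    },
    {
      "encrypted": true,
      "name": "sl1_password",
      "value": "<encrypted password>"
    }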

Code Example: A Schedule Configuration Object

Test Cases

Load Throughput Test Cases

Event throughput testing with PowerFlow only:

The following test cases can be attempted with any number of dedicated customer queues. The expectation is that each customer queue will be filled with 10,000 events, and then you can time how long it takes to process through all 10,000 events in each queue.

  1. Disable any steprunners.
  2. Trigger 10,000 events through SL1, and let them build up in the PowerFlow queue.
  3. After all 10,000 events are waiting in queue, enable the steprunners to begin processing.
  4. Time the throughput of all event processing to get an estimate of how many events per second and per minute that PowerFlow will handle.
  5. The results from the ScienceLogic test system are listed in the sizing section of worker nodes.

Event throughput testing with SL1 triggering PowerFlow:

This test is executed in the same manner as the event throughput test described above, but in this scenario you never disable the steprunners, and you let the events process through PowerFlow as they are alerted to by SL1.

  1. Steprunners are running.
  2. Trigger 10,000 events through SL1, and let the steprunners handle the events as they come in.
  3. Time the throughput of all event processing to get an estimate of how many events per second and per minute that PowerFlow will handle.

The difference between the timing of this test and the previous test shows how much of a delay SL1 adds in alerting PowerFlow about an event and subsequently syncing it.

Failure Test Cases

  1. Validate that bringing one of the core nodes down does not impact the overall functionality of the PowerFlow system. Also, validate that bringing the core node back up rejoins the cluster and the system continues to be operational.
  2. Validate that bringing down a dedicated worker node only affects that specific worker's processing. Also validate that adding a new "standby" node is able to pick up the work where the failed worker left off.
  3. Validate that when the Redis service fails on any node, it is redistributed and is functional on another node.
  4. Validate that when a PowerFlow application fails, you can view the failure in the PowerFlow Timeline.
  5. Validate that you can query for and filter only for failing tasks for a specific customer.

Separated queue test scenarios

  1. Validate that scheduling two runs of the same application with differing configurations and queues works as expected:
  • Each scheduled run should be placed on the designated queue and configuration for that schedule.
  • The runs, their queues, and configurations should be viewable from the PowerFlow Timeline, or can be queried from the logs endpoint.
  2. Validate that each SL1 triggering event correctly sends the appropriate queue and configuration that the event sync should be run on:
  • This data should be viewable from the PowerFlow Timeline.
  • The queue and configuration should be correctly recognized by PowerFlow and executed by the corresponding worker.
  3. Validate the behavior of a node left "on standby" waiting for a label to start picking up work. As soon as a label is assigned and workers are scaled, that node should begin processing the designated work.

Backup Considerations

This section covers the items you should back up in your PowerFlow system, and how to restore backups. For more information, see Backing up Data.

What to Back Up

When taking backups of the PowerFlow environment, collect the following information from the host level of your primary manager node (this is the node from which you control the stack):

Files in /opt/iservices/scripts:

  • /opt/iservices/scripts/docker-compose.yml
  • /opt/iservices/scripts/docker-compose-override.yml

All files in /etc/iservices/:

  • /etc/iservices/is_pass
  • /etc/iservices/encryption_key

In addition to the above files, make sure you are storing Couchbase dumps somewhere by using the cbbackup command, or the "PowerFlow Backup" application.

Fall Back and Restore to a Disaster Recovery (Passive) System

You should do a data-only restore if:

  • The system you are restoring to is a different configuration or cluster setup than the system where you made the backup.
  • The backup system has all the indexes added and already up to date.

You should do a full restore if:

  • The deployment where the backup was made is identical to the deployment where it is being restored (same amount of nodes).
  • There are no indexes defined on the system you're backing up.

Once failed over, be sure to disable the "PowerFlow Backup" application from being scheduled.

Resiliency Considerations

The RabbitMQ Split-brain Handling Strategy (SL1 Default Set to Autoheal)

If multiple RabbitMQ cluster nodes are lost at once, the cluster might enter a "Network Partition" or "Split-brain" state. In this state, the queues will become paused if there is no auto-handling policy applied. The cluster will remain paused until a user takes manual action. To ensure that the cluster knows how to handle this scenario as the user would want, and not pause waiting for manual intervention, it is essential to set a partition handling policy.

For more information on RabbitMQ Network partition (split-brain) state, how it can occur, and what happens, see: http://www.rabbitmq.com/partitions.html.

By default, ScienceLogic sets the partition policy to autoheal in favor of continued service if any nodes go down. However, depending on the environment, you might wish to change this setting.

For more information about the automatic split-brain handling strategies that RabbitMQ provides, see: http://www.rabbitmq.com/partitions.html#automatic-handling.

autoheal is the default setting set by SL1, and as such, queues should always be available, though if multiple nodes fail, some messages may be lost.

If you are using pause_minority mode and a "split-brain" scenario occurs for RabbitMQ in a single cluster, when the split-brain situation is resolved, new messages that are queued will be mirrored (replicated between all nodes once again).

ScienceLogic Policy Recommendation

ScienceLogic's recommendations for applying changes to the default policy include the following:

  • If you care more about continuity of service in a data center outage, with queues always available, even if the system doesn't retain some messages from a failed data center, use autoheal. This is the SL1 default setting.
  • If you care more about retaining message data in a data center outage, with queues that might not be available until the nodes are back, but will recover themselves once nodes are back online to ensure that no messages are lost, use pause_minority.
  • If you prefer not to have RabbitMQ handle this scenario automatically, and you want to manually recover your queues and data, where queues will be paused and unusable until manual intervention determines where to fail back, use ignore.

Changing the RabbitMQ Default Split-brain Handling Policy

The best way to change the SL1 default split-brain strategy is to make a copy of the RabbitMQ config file from a running rabbit system, add your change, and then mount that config back into the appropriate place to apply your overrides.

  1. Copy the config file from a currently running container:

    docker cp <container-id>:/etc/rabbitmq/rabbitmq.conf /destination/on/host

  2. Modify the config file:

    Change the cluster_partition_handling value to the desired strategy (autoheal, pause_minority, or ignore).

  3. Update your docker-compose.yml file to mount that file over the config for all rabbitmq nodes:

    mount "<path/to/config>/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf"

Using Drained Managers to Maintain Swarm Health

To maintain Swarm health, ScienceLogic recommends that you deploy some swarm managers that do not take any of the workload of the application. The only purpose for these managers is to maintain the health of the swarm. Separating these workloads ensures that a spike in application activity will not affect the swarm clustering management services.

ScienceLogic recommends that these systems have 2 CPU and 4 GB of memory.

To deploy a drained manager node:

  1. Add your new manager node into the swarm.

  2. Drain it with the following command:

    docker node update --availability drain <node>

Draining the node ensures that no containers will be deployed to it. For more information, see https://docs.docker.com/engine/swarm/admin_guide/.

Updating the PowerFlow Cluster with Little to No Downtime

There are two potential update workflows for updating the PowerFlow cluster. The first workflow involves using a Docker registry that is connectable to swarm nodes on the network. The second workflow requires manually copying the PowerFlow RPM or containers to each individual node.

Updating Offline (No Connection to a Docker Registry)

  1. Copy the PowerFlow RPM over to all swarm nodes.
  2. Install the RPM on all nodes, but do not run a stack deploy yet. This RPM installation automatically extracts the latest PowerFlow containers, making them available to each node in the cluster.
  3. From the primary manager node, make sure your docker-compose file has been updated, and is now using the appropriate version tag: either latest for the latest version on the system, or 2.x.x.
  4. If all swarm nodes have the RPM installed, the container images should be runnable and the stack should update itself. If the RPM was missed installing on any of the nodes, it may not have the required images, and as a result, services might not deploy to that node.

Updating Online (All Nodes Have a Connection to a Docker Registry)

  1. Install the PowerFlow RPM only onto the master node.
  2. Make sure the RPM doesn't contain any host-level changes, such as Docker daemon configuration updates. If there are host-level updates, you might want to make those updates on the other nodes in the cluster as well.
  3. Populate your Docker registry with the latest PowerFlow images.
  4. From the primary manager node, make sure your docker-compose file has been updated, and is now using the appropriate version tag: either latest for the latest version on the system, or 2.x.x.
  5. Run docker stack deploy to update the services (see the example command below). Because all nodes have access to the same Docker registry, which has the designated images, all nodes will download the images automatically and update with the latest versions as defined by the docker-compose file.
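
A sketch of the deploy command; the stack name iservices and the compose file path are assumptions based on the default installation layout:

    # re-deploy the stack from the primary manager node
    docker stack deploy -c /opt/iservices/scripts/docker-compose.yml iservices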

Additional Sizing Considerations

This section covers the sizing considerations for the Couchbase, RabbitMQ, Redis, contentapi, and GUI services.

Sizing for Couchbase Services

The initial sizing provided for Couchbase nodes in the multi-tenant cluster (6 CPUs and 56 GB memory) should be more than enough to handle multiple customer event syncing workloads.

ScienceLogic recommends monitoring the CPU and memory utilization percentages of the Couchbase nodes to know when to increase resources, such as when memory and CPU are consistently above 80%.

Sizing for RabbitMQ Services

The only special consideration for RabbitMQ sizing is how many events you plan to have in the queue at once.

Every 10,000 events populated in the PowerFlow queue will consume approximately 1.5 GB of memory.

This memory usage is drained as soon as the events leave the queue.

Sizing for Redis Services

The initial sizing deployment for Redis should be sufficient for multiple customer event syncing.

The only time memory might need to be increased to Redis is if you are attempting to view logs from a previous run, and the logs are not available. A lack of run logs from a recently run integration indicates that the Redis cache does not have enough room to store all the step and log data from recently executed runs.

Sizing for contentapi Services

The contentapi services sizing should remain limited at 2 GB memory, as is set by default.

If you notice timeouts or HTTP 500 responses when there is a large load going through the PowerFlow system, you may want to increase the number of contentapi replicas.

For more information, see Node Placement Considerations, and ensure the API replicas are deployed in the same location as the Redis instance.

Sizing for the GUI Service

The GUI service should not need to be scaled up at all, as it merely acts as an ingress proxy to the rest of the PowerFlow services.

Sizing for Workers: Scheduler, Steprunner, Flower

Refer to the worker sizing charts provided by ScienceLogic for the recommended steprunner sizes.

Flower and Scheduler do not need to be scaled up at all.

Scaling the PowerFlow Devpi Server

For large environments, you can replicate the PowerFlow Devpi Server, which is the internal Python package repository. Creating Devpi Server replicas prevents multiple syncpacks_steprunners from attempting to access a single Devpi Server at the same time, which might cause failures when creating or recreating SyncPack virtual environments.

The Devpi Server is deployed as the pypiserver service on a PowerFlow stack.

When to Add a New Devpi Server Replica to the PowerFlow Stack

ScienceLogic recommends that you add replicas if you have more than 75 syncpack_steprunners, or add retries to the SyncPack installation process.

Number of syncpack_steprunners | Devpi Server Replicas
75 or more | 0
100 or more | 1
150 or more | 2

Adding a New Devpi Server Replica to the Stack

If you want to add a Devpi Server replica to the PowerFlow stack, you will need to add a new service to the docker-compose-override file, using the same configuration as the code block, below.

You can add any number of replicas to the stack, but each replica must have its own unique alias and volume, as the Devpi Server master volume cannot be used by a replica.

Code Example: docker-compose-override file
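
A sketch of a single replica definition, based on the replica settings shown later in this section. The image, network alias, and volume names are assumptions; each replica needs its own unique alias and volume:

    pypiserver_replica:
      image: <same image as your pypiserver service>
      container_name: devpi_replica
      environment:
        devpi_role: 'replica'
      deploy:
        replicas: 1
        placement:
          constraints:
            - node.hostname == <node to run this replica on>
      networks:
        isnet:
          aliases:
            - pypiserver_replica.isnet
      volumes:
        - devpi_replica_volume:/data # volume name and mount path are assumptions; use a unique volume per replica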

Considerations

  • To initialize Devpi Server replicas, the master Devpi Server service should be running and healthy. Replicas have the same information as the master, because the replicas are constantly syncing with their master.
  • To allow the Devpi Server and its replicas to receive more than 200 concurrent requests, you can increase the number of threads by setting the devpi_threads environment variable in the Devpi Server and its replicas.
  • When a Devpi Server replica is running, the replica makes a request to the Devpi Server service every 30 seconds to sync SyncPacks and their dependencies, which means that the Devpi Server can be busier than its replicas receiving requests from the steprunners.

Configuring Steprunners to Consume Data from Devpi Server Replicas

To allow steprunner and syncpacks_steprunner services to use Devpi Server replicas:

  1. Set the devpi_trusted_host environment variable for syncpacks_steprunner and steprunner services with a string that contains the aliases of the Devpi Server and its replicas separated by a comma.

    Other custom configurations related to the devpi_trusted_host include the following:

    • devpi_trusted_host. The default value is pypiserver.isnet.
    • devpi_random_order. The default value is false. This configuration lets you randomize the order of the devpi_trusted_host list.
    • devpi_random_host_number. The default value is the devpi_trusted_host length. This configuration defines how many hosts will be chosen randomly from the devpi_trusted_host list.

    The following example uses two Devpi Server replicas:

    syncpacks_steprunner:
      environment:
        devpi_trusted_host: pypiserver.isnet,pypiserver_replica.isnet,pypiserver_replica2.isnet
        devpi_random_order: true
        devpi_random_host_number: 2
  2. To set a Devpi Server replica as the main resource for a syncpack_steprunner, define the following environment variable:

    syncpacks_steprunner:
      environment:
        devpi_host: pypiserver_replica.isnet

    You only need to do this step if you want to completely restrict a steprunner from calling the master Devpi Server service.

  3. To assign custom Devpi Server replicas to steprunners in different nodes, you can use a pip.conf file. The following example shows how to mount the custom pip.conf file as a volume.

    syncpacks_steprunner:
      environment:
        PIP_CONFIG_FILE: /usr/tmp/pip.conf
      ...
      volumes:
        - /tmp/pip.conf:/usr/tmp/pip.conf

    Because volumes are owned by each individual node, this file can contain a different configuration depending on the node where the syncpack_steprunners are running. This is not recommended, as managing different versions of pip.conf on different nodes can be difficult.

Additional Considerations

In environments where more than 75 syncpack_steprunners are running, ScienceLogic recommends the following configurations:

  • The number of Devpi Server threads (devpi_threads) should be increased from the default value of 200. Start with 500 and increase it to 1000 if needed:

    pypiserver_replica:
      container_name: devpi_replica
      deploy:
        replicas: 1
        placement:
          constraints:
            - node.hostname == pf-node2 # name of the node where this replica is running
      environment:
        devpi_role: 'replica'
        devpi_threads: 1000

    When the number of Devpi Server threads is increased, that service’s memory consumption is also slightly increased.

  • When PowerFlow is running offline, more calls can occur to the Devpi Server and its replicas, so take that into account when setting replicas and their threads.

  • Retries for pip should be set by setting the PIP_RETRIES environment variable to 3. This configuration should be set on the syncpack_steprunners.

  • Retries for the SyncPack installation application are configured using the sp_installation_retries environment variable, which has a default value of 3.

    syncpacks_steprunner:
      environment:
        devpi_trusted_host: pypiserver.isnet,pypiserver_replica.isnet,pypiserver_replica2.isnet
        PIP_RETRIES: 3 # default value is 0
        sp_installation_retries: 5
        PIP_TIMEOUT: 10 # default value is 5

Node Placement Considerations

Preventing a Known Issue: Place contentapi and Redis services in the Same Physical Location

An issue exists where, if there is latency between contentapi and Redis, the Applications page may not load. This issue is caused by the API making too many calls before returning. The added latency for each individual call can cause the overall endpoint to take longer to load than the designated timeout window of thirty seconds.

The only impact of this issue is that the Applications page will not load. There is no operational impact on the integrations as a whole, even if workers are in separate geos from Redis.

There is also no risk to High Availability (HA) when placing the API and Redis services in the same geo. If for whatever reason that geo drops out, the containers are restarted automatically in the other location.
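
As a hedged illustration, the following docker-compose placement sketch pins contentapi and redis to nodes in the same location; the label name geo and the value east are assumptions used only for this example:

    contentapi:
      deploy:
        placement:
          constraints:
            - node.labels.geo == east
    redis:
      deploy:
        placement:
          constraints:
            - node.labels.geo == east

Label the nodes in that location accordingly, for example with docker node update --label-add geo=east <node>.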

Common Problems, Symptoms, and Solutions

Each entry below lists the tool, the issue, its symptoms and impact, the cause, and the solution, including steps to prevent the issue where applicable.

Docker Visualizer: Docker Visualizer shows some services as "undefined".

When viewing the Docker Visualizer user interface, some services are displayed as "undefined", and states aren't accurate.

Impact:

Cannot use Visualizer to get the current state of the stack.

Failing docker stack deployment: https://github.com/dockersamples/docker-swarm-visualizer/issues/110

Ensure your stack is healthy, and services are deployed correctly. If no services are failing and things are still showing as undefined, elect a new swarm leader.

To prevent:

Ensure your configuration is valid before deploying.

RabbitMQ: RabbitMQ queues encountered a node failure and are in a "network partition state" (split-brain scenario).

The workers are able to connect to the queue, and there are messages on the queue, but the messages are not being distributed to the workers.

Log in to the RabbitMQ admin user interface, which displays a message similar to "RabbitMQ experienced a network partition and the cluster is paused".

Impact:

The RabbitMQ cluster is paused and waiting for user intervention to clean the split-brain state.

A multi-node failure occurred, and RabbitMQ was not able to determine which node should become the new master. This only occurs if there is NO partition handling policy in place (see the Resiliency section for more information).

Note: ScienceLogic sets the autoheal policy by default.

Handle the split-brain partition state and resynchronize your RabbitMQ queues.

Note: The autoheal policy is enabled by default.

To prevent:

Set a partition handling policy.

See the Resiliency section for more information.

RabbitMQ, continued

Execing into the RabbitMQ container and running rabbitmqctl cluster_status shows nodes in a partition state like the following:

[{nodes,
     [{disc,
          ['rabbit@rabbit_node1.isnet','rabbit@rabbit_node2.isnet',
           'rabbit@rabbit_node3.isnet','rabbit@rabbit_node4.isnet',
           'rabbit@rabbit_node5.isnet','rabbit@rabbit_node6.isnet']}]},
 {running_nodes,['rabbit@rabbit_node4.isnet']},
 {cluster_name,<<"rabbit@rabbit_node1">>},
 {partitions,
     [{'rabbit@rabbit_node4.isnet',
          ['rabbit@rabbit_node1.isnet','rabbit@rabbit_node2.isnet',
           'rabbit@rabbit_node3.isnet','rabbit@rabbit_node5.isnet',
           'rabbit@rabbit_node6.isnet']}]},
 {alarms,[{'rabbit@rabbit_node4.isnet',[]}]}]

PowerFlow steprunners and RabbitMQ: Workers constantly restart, with no real error message.

Workers of a particular queue are not stable and constantly restart.

Impact:

One queue's workers will not be processing.

A multi-node failure occurred in RabbitMQ, where it lost majority and could not fail over.

Queues go out of sync because of a broken swarm.

Recreate queues for the particular worker.

Resynchronize queues.

To prevent:

Deploy enough nodes to ensure quorum for failover.

Couchbase: A Couchbase node is unable to restart due to an indexer error.

This issue can be monitored in the Couchbase logs:

Service 'indexer' exited with status 134. Restarting. Messages:

sync.runtime_Semacquire(0xc4236dd33c)

Impact:

One couchbase node becomes corrupt.

Memory is removed from the database while it is in operation (memory must be dedicated to the VM running Couchbase).

The Couchbase node encounters a failure, which causes the corruption.

Ensure that the memory allocated to your database nodes is dedicated and not shared among other VMs.

To prevent:

Ensure that the memory allocated to your database nodes is dedicated and not shared among other VMs.

Couchbase: Couchbase is unable to rebalance.

Couchbase nodes will not rebalance, usually with an error saying "exited by janitor".

Impact:

Couchbase nodes cannot rebalance and provide even replication.

Network issues: missing firewall rules or blocked ports.

The Docker swarm network is stale because of a stack failure.

Validate that all firewall rules are in place, and that no external firewalls are blocking ports.

Reset the Docker swarm network status by electing a new swarm leader.

To prevent:

Validate the firewall rules before deployment.

Use drained managers to maintain the swarm.

PowerFlow steprunners to Couchbase: Steprunners are unable to communicate with Couchbase.

Steprunners are unable to communicate with the Couchbase database, with errors like "client side timeout" or "connection reset by peer".

Impact:

Steprunners cannot access the database.

Missing environment variables in the compose file:

Check the db_host setting for the steprunner and make sure it specifies all available Couchbase hosts.

Validate Couchbase settings; ensure that the proper aliases, hostnames, and environment variables are set.

Stale docker network.

Validate the deployment configuration and network settings of your docker-compose. Redeploy with valid settings.

In the event of a swarm failure, or stale swarm network, reset the Docker swarm network status by electing a new swarm leader.

To prevent:

Validate hostnames, aliases, and environment settings before deployment.

Use drained managers to maintain the swarm.

Flower: The worker display in Flower is not organized and is hard to read, and it shows many old workers in an offline state.

Flower shows all containers that previously existed, even if they failed, cluttering the dashboard.

Impact:

Flower dashboard is not organized and hard to read.

Flower has been running for a long time while workers restarted or came up and down, so it maintains the history of all the old workers.

Another possibility is a known issue in task processing due to the --max-tasks-per-child setting. At high CPU workloads, the max-tasks-per-child setting causes workers to exit prematurely.

Restart the flower service by running the following command:

docker service update --force iservices_flower

You can also remove the --max-tasks-per-child setting in the steprunners.

All containers on a particular node: All containers on a particular node do not deploy.

Services are not deploying to a particular node, but instead they are getting moved to other nodes.

Impact:

The node is not running anything.

One of the following situations could cause this issue:

Invalid label deployment configuration.

The node does not have the containers you are telling it to deploy.

The node is missing a required directory to mount into the container.

Make sure the node that you are deploying to is labeled correctly, and that the services you expect to be deployed there are properly constrained to that system.

Go through the troubleshooting steps in Identify the Cause of a Service not Deploying, below, to check that the service is not missing a requirement on the host.

Check the node status for errors:

docker node ls

To prevent:

Validate your configuration before deploying.

All containers on a particular node: All containers on a particular node periodically restart at the same time.

All containers on a particular node restart at the same time.

The system logs indicate an error like:

error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Impact:

All containers restart on a node.

This issue only occurs in single-node deployments when the only manager allocates too many resources to its containers, and the containers all restart since the swarm drops.

The manager node gets overloaded by container workloads and is not able to handle swarm management, and the swarm loses quorum.

Use some drained manager nodes for swarm management to separate the workloads.

To prevent:

Use drained managers to maintain swarm.

General Docker service: Docker service does not deploy.

Replicas remain at 0/3, and the Docker service does not deploy.

There are a variety of reasons for this issue, and you can reveal most causes by checking the service logs.

Identify the cause of the service not deploying.
PowerFlow user interface: The Timeline or the Applications page does not appear in the user interface.

The Timeline is not showing accurate information, or the Applications page is not rendering.

One of the following situations could cause these issues:

Indexes do not exist on a particular Couchbase node.

Latency between the API and the redis service is too great for the API to collect all the data it needs before the 30-second timeout is reached.

The indexer cannot keep up with a large number of requests, and Couchbase requires additional resources to service the requests.

Solutions:

Verify that indexes exist.

Place the API and Redis containers in the same geography so there is little latency. This issue will be addressed in a future PowerFlow release.

Increase the amount of memory allocated to the Couchbase indexer service.

Common Resolution Explanations

This section contains a set of solutions and explanations for a variety of issues.

Elect a New Swarm Leader

Sometimes when managers lose connection to each other, either through latency or a workload spike, there are instances when the swarm needs to be reset or refreshed. By electing a new leader, you can effectively force the swarm to redo service discovery and refresh the metadata for the swarm. This procedure is highly preferred over removing and re-deploying the whole stack.

To elect a new swarm leader:

  1. Make sure there are at least three swarm managers in your stack.

  2. To identify which node is the current leader, run the following command:

    docker node ls

  3. Demote the current leader with the following command:

    docker node demote <node>

  4. Wait until a new node is elected leader:

    docker node ls

  5. After a new node is elected leader, promote the old node back to a swarm manager:

    docker node promote <node>

Recreate RabbitMQ Queues and Exchanges

If you do not want to retain any messages in the queue, the following procedure is the best method for recreating the queues. If you do have data that you want to retain, you can resynchronize RabbitMQ queues.

To recreate RabbitMQ queues:

  1. Identify the queue or queues you need to delete:
  • If default workers are restarting, you need to delete queues celery and priority.high.
  • If a custom worker cannot connect to the queue, simply delete that worker's queue.
  2. Delete the queue and exchange through the RabbitMQ admin console:
  • Log in to the RabbitMQ admin console and go to the Queues tab.
  • Find the queue you want to delete and click it for more details.
  • Scroll down and click the Delete Queue button.
  • Go to the Exchanges tab and delete the exchange with the same name as the queue you just deleted.
  3. Alternatively, delete the queue and exchange through the command-line interface:
  • Exec into a RabbitMQ container.

  • Delete the queue needed:

    rabbitmqadmin delete queue name=name_of_queue

  • Delete the exchange needed:

    rabbitmqadmin delete exchange name=name_of_queue

After you delete the queues, the queues will be recreated the next time a worker connects.

Resynchronize RabbitMQ Queues

If your RabbitMQ cluster ends up in a "split-brain" or partitioned state, you might need to manually decide which node should become the master. For more information, see http://www.rabbitmq.com/partitions.html#recovering.

To resynchronize RabbitMQ queues:

  1. Identify which node you want to be the master. In most cases, the master is the node with the most messages in its queue.

  2. After you have identified which node should be master, scale down all other RabbitMQ services:

    docker service scale iservices_rabbitmq<x>=0

    where <x> is the number of each RabbitMQ node other than the chosen master.

  3. After all RabbitMQ services except the master have been scaled down, wait a few seconds, and then scale those services back up to 1. Bringing all nodes but your new master down and back up again forces all nodes to sync to the state of the master that you chose (see the sketch below).
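
For example, if rabbitmq1 is the chosen master, the sequence might look like the following; the service names are assumptions that follow the iservices_rabbitmq<x> naming convention:

    docker service scale iservices_rabbitmq2=0 iservices_rabbitmq3=0

Wait a few seconds, and then scale the services back up:

    docker service scale iservices_rabbitmq2=1 iservices_rabbitmq3=1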

Identify the Cause of a Service not Deploying

Step 1: Obtain the ID of the failed container for the service

Run the following command for the service that failed previously:

docker service ps --no-trunc <servicename>

For example:

docker service ps --no-trunc iservices_redis
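
The actual output is not reproduced here. An illustrative reconstruction, based on the description below, might look like the following (IDs, states, and timestamps are placeholders, and the IMAGE column is omitted for brevity):

    ID             NAME                   NODE          DESIRED STATE   CURRENT STATE           ERROR
    xxxxxxxxxxxx   iservices_redis.1      is-scale-04   Running         Running 2 minutes ago
    3s7s86n45skf    \_ iservices_redis.1  is-scale-03   Shutdown        Failed 3 minutes ago    "task: non-zero exit (1)"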

From the command result above, we see that the container with the ID 3s7s86n45skf, previously running on node is-scale-03, failed with a non-zero exit, and another container was started in its place.

At this point, you can ask the following questions:

  • Is the error when using docker service ps --no-trunc something obvious? Does the error say that it cannot mount a volume, or that the image was not found? If so, that is most likely the root cause of the issue and needs to be addressed.
  • Did the node on which that container was running go down? Or is that node still up?
  • Are the other services running on that node running fine, and was only this service affected? If other services are running fine on that same node, it is probably a problem with the service itself. If all services on that node are not functional, it could mean a node failure.

If the answers to these questions indicate that the cause is not a deploy configuration issue and not an entire node failure, the problem most likely exists within the service itself. If so, continue to Step 2.

Step 2: Check for any interesting error messages or logs indicating an error

Using the ID obtained in Step 1, collect the logs from the failed container with the following command:

docker service logs <failed-id>

For example:

docker service logs 3s7s86n45skf

Review the service logs for any explicit errors or warning messages that might indicate why the failure occurred.

Repair Couchbase Indexes

Index stuck in “created” (not ready) state

This situation usually occurs when a node starts creating an index, but another index creation was performed at the same time by another node. After the index is created, you can run a simple query to build the index, which changes it from “created” to “ready”:

BUILD INDEX ON `content`(`idx_content_content_type_config_a3f867db_7430_4c4b_b1b6_138f06109edb`) USING GSI

Deleting an index

If you encounter duplicate indexes, such as a situation where indexes were manually created more than once, you can delete an index:

DROP index content.idx_content_content_type_config_d8a45ead_4bbb_4952_b0b0_2fe227702260

Recreating all indexes on a particular node

To recreate all indexes on a particular Couchbase node, exec into the couchbase container and run the following command:

Initialize_couchbase -s

Running this command recreates all indexes, even if the indexes already exist.

Add a Broken Couchbase Node Back into the Cluster

To remove a Couchbase node and re-add it to the cluster:

  1. Stop the node in Docker.

  2. In the Couchbase user interface, you should see the node go down. Fail over the node manually, or wait the appropriate amount of time until it automatically fails over.

  3. Clean the Couchbase data directory on the necessary host by running the following command:

    rm -rf /var/data/couchbase/*

  4. Restart the Couchbase node and watch it get added back into the cluster.

  5. Click the Rebalance button to replicate data evenly across nodes.

Restore Couchbase Manually

If you created the backup with the "PowerFlow Backup" application in PowerFlow, you will need to decompress the backup file. The Couchbase backup is in the couchbase folder; use the backup in that folder for the restore.

Backup

  1. Run the following command on each manager node:

    docker ps

  2. Find the container with the Couchbase name and make a note of that container’s ID.
  3. Exec into the Couchbase container on that manager node by running the following command, inserting the container ID for that node:

    docker exec -it <container_id> /bin/bash

  4. Inside the Couchbase container, create the backup by running the following command:

    cbbackup http://couchbase.isnet:8091 /opt/couchbase/var/backup -u <user> -p <password> -x data_only=1

  5. Exit the Couchbase shell and then copy the backup file in /var/data/couchbase/backup to a safe location, such as /home/isadmin.
  6. Repeat these steps on each PowerFlow node.

Delete Couchbase

To remove the existing Couchbase data before restoring, run the following command on the node:

rm -rf /var/data/couchbase/*

Restore

  1. Copy the backup file into /var/data/couchbase/backup.

  2. Execute into the Couchbase container.

  3. Run the following command to restore the content:

    cbrestore /opt/couchbase/var/backup http://couchbase.isnet:8091 -b content -u <user> -p <password>

  4. Run the following command to restore the logs:

    cbrestore /opt/couchbase/var/backup http://couchbase.isnet:8091 -b logs -u <user> -p <password>

PowerFlow Multi-tenant Upgrade Process

This section describes how to upgrade PowerFlow in a multi-tenant environment with as little downtime as possible.

Performing Environment Checks Before Upgrading

Validate Cluster states

  • Validate that all Couchbase nodes in the cluster are replicated and fully balanced.
  • Validate that the RabbitMQ nodes are all clustered and queues have ha-v1-all policy applied.
  • Validate that the RabbitMQ nodes do not have a large number of messages backed up in queue.
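
The cluster-state checks above can be spot-checked from the command line. The following is a sketch only; it assumes you first find a RabbitMQ container ID with docker ps:

    docker exec -it <rabbitmq_container_id> rabbitmqctl cluster_status
    docker exec -it <rabbitmq_container_id> rabbitmqctl list_policies
    docker exec -it <rabbitmq_container_id> rabbitmqctl list_queues name messages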

Validate Backups exist

  • Ensure that you have a backup of the database before upgrading.
  • Ensure that you have a copy of your most recently deployed docker-compose file. If all user-specific changes are only populated in docker-compose-override, this is not necessary, but you might want a backup copy.
  • Make sure that each node in Couchbase is fully replicated, and no re-balancing is necessary.

Clean out old container images if desired

Before upgrading to the latest version of PowerFlow, check the local file system to see if there are any older versions taking up space that you might want to remove. These containers exist both locally on the file system and in the internal Docker registry. To view any old container versions, check the /opt/iservices/images directory. ScienceLogic recommends that you keep at least the previous version of the containers, so you can downgrade if necessary.

Cleaning out images is not mandatory; it is simply a means of freeing additional space on the system if necessary.

To remove old images:

  1. Delete any unwanted versions in /opt/iservices/images.
  2. Identify any unwanted images known to Docker with docker images.
  3. Remove the images by ID with docker rmi <id>.

Installing the PowerFlow RPM

The first step of upgrading is to install the new RPM on all nodes in the cluster. Doing so ensures that the new containers are populated onto the system (if you are using the RPM that includes containers) and that any other host-level settings are changed. RPM installation does not pause any services or affect the Docker system in any way, other than using some resources.

PowerFlow has two RPMs: one with containers and one without. If you have populated an internal Docker registry with Docker containers, you can install the RPM without containers built in. If no internal Docker registry is present, you must install the RPM that has the containers built into it. Other than the containers, there is no difference between the RPMs.

Advanced users can skip installing the RPM. However, this means that the user is completely responsible for maintaining the docker-compose and host-level configurations.

To install the RPM:

  1. SSH into each node.

  2. If you are installing the RPM that contains the container images built in, you might want to upgrade each core node one by one, so that the load of extracting the images does not affect all core nodes at once.

  3. Run the following command:

    sudo rpm -Uvh <full_path_of_rpm>

    where full_path_of_rpm is the name and path of the RPM file, such as /home/isadmin/sl1-powerflow-2.x.x-1.x86_64.

Compare docker-compose file changes and resolve differences

After the RPM is installed, you will notice a new docker-compose.yml file is placed in /opt/iservices/scripts/. As long as your environment-specific changes exist solely in the compose-override file, all user changes and new version updates will be resolved into that new docker-compose.yml file.

ScienceLogic recommends that you check the differences between the two docker-compose files. You should validate that:

  1. All environment-specific and custom user settings that existed in the old docker-compose also exist in the new docker-compose file.
  2. The image tags reference the correct version in the new docker-compose. If you are using an internal Docker registry, be sure these image tags represent the images from your internal registry.
  3. Make sure that any new environment variables added to services are applied to replicated services. To ensure these updates persist through the next upgrade, also make the changes in docker-compose-override.yml. In other words, if you added a new environment variable for Couchbase, make sure to apply that variable to couchbase-worker1 and couchbase-worker2 as well. If you added a new environment variable for the default steprunner, make sure to set the same environment variable on each custom worker as well.
  4. If you are using the latest tag for images, and you are using a remote repository for downloading, be sure that the latest tag refers to the images in your repository.
  5. The old docker-compose is completely unchanged, and it matches the current deployed environment. This enables PowerFlow to update services independently without restarting other services.
  6. After any differences between the compose files have been resolved, proceed with the upgrade using the old docker-compose.yml (the one that matches the currently deployed environment).

Make containers available to systems

After you apply the host-level updates, you should make sure that the containers are available to the system.

If you upgraded using the RPM with container images included, the containers should already be on all of the nodes. You can run docker images to validate that the new containers are present. If they are, you can skip to the next section.

If the upgrade was performed using the RPM which did not contain the container images, ScienceLogic recommends that you run the following command to make sure all nodes have the latest images:

docker-compose -f <new_Docker_compose_file> pull

This command validates that the containers specified by your compose file can be pulled and reached from the nodes. While not required, you might want to make sure that the images can be pulled before starting the upgrade. If the images are not pulled manually, Docker pulls them automatically when the stack calls for the new image.

Perform the Upgrade

To perform the upgrade on a clustered system with little downtime, re-deploy services to the stack in groups. To do this, gradually make the updates to groups of services and re-run docker stack deploy for each change. To ensure that no unintended services are updated, start with the same docker-compose file that was previously used to deploy. Reusing the same docker-compose file and updating only sections at a time ensures that only the services you intend to update are affected at any given time.

Avoid putting all the changes in a single docker-compose file and doing a new docker stack deploy with all changes at once. If downtime is not a concern, you can update all services at once, but updating services gradually allows you to have little or no downtime.

Before upgrading any group of services, be sure that the docker-compose file you are deploying from is exactly identical to the currently deployed stack (the previous version). Start with the same docker-compose file and update it for each group of services as needed.

Upgrade Redis, Scheduler, and Flower

The first group to update includes Redis, Scheduler and Flower. If desired, this group can be upgraded along with any other group.

To update:

  1. Copy the service entries for Redis, Scheduler and Flower from the new compose file into the old docker-compose file (the file that matches the currently deployed environment). Copying these entries makes it so that the only changes in the docker-compose file (compared to the deployed stack) are changes for Redis, Scheduler and Flower.

  2. Run the following command:

    docker stack deploy -c /opt/iservices/scripts/docker-compose.yml iservices

  3. Monitor the update, and wait until all services are up and running before proceeding.

Code Example: Image definition of this upgrade group
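
The original example is not reproduced here. As a hedged sketch only, the image lines for this group in the docker-compose file might look like the following; the registry host, image names, and version tags are placeholders, not the actual values:

    redis:
      image: <registry>/<redis_image>:<new_version>
    scheduler:
      image: <registry>/<scheduler_image>:<new_version>
    flower:
      image: <registry>/<flower_image>:<new_version>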

Redis Version

Because the Redis version might not change with every release of PowerFlow, the upgrade might not require any changes for Redis. This is expected and is not an issue.

You can configure Redis to let the contentapi container iterate through multiple potential Redis result stores to find the correct result ID for a task. To enable this option in the docker-compose.yml file, set the result_backend environment variable of the contentapi container to a comma-delimited list of URLs for Redis instances, such as redis://redis:6378/0,redis://redis2:6380/0. To deploy multiple Redis instances, make sure that the stack deploys the instances with different aliases, ports, and hostnames. Also, multiple backends are only supported on contentapi, not the steprunners. Steprunners can only write to a single backend.
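
A minimal compose sketch of this option follows, assuming two Redis instances reachable at the aliases redis and redis2 on the ports shown above:

    contentapi:
      environment:
        result_backend: "redis://redis:6378/0,redis://redis2:6380/0"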

Upgrade Core Services (RabbitMQ and Couchbase)

The next group of services to update together are the RabbitMQ/Couchbase database services, as well as the GUI. Because the core services are individually defined and "pinned" to specific nodes, upgrade these two services at the same time, on a node-by-node basis. In between each node upgrade, wait and validate that the node rejoins the Couchbase and Rabbit clusters and re-balances appropriately.

Because there will always be two out of three nodes running these core services, this group should not cause any downtime for the system.

Rabbit/Couchbase Versions

The Couchbase and RabbitMQ versions used might not change with every release of PowerFlow. If there is no update or change to be made to the services, you can ignore this section for RabbitMQ or Couchbase upgrades, or both. Assess the differences between the old and new docker-compose files to check if there is an image or environment change necessary for the new version. If not, you can move on to the next section.

Update Actions (assuming three core nodes)

To update first node services:

  1. Update just core node01 by copying the service entries for couchbase and rabbitmq1 from the new compose file (compared and resolved as part of the preparation steps above) into the old docker-compose file. At this point, the compose file you use to deploy should also contain the updates for the previous groups.
  2. Before deploying, access the Couchbase user interface, select the first server node, and click "failover". Select "graceful failover". Manually failing over before updating ensures that the system is still operational when the container comes down.
  3. For the failover command that can be run through the command-line interface if the user interface is not available, see the Manual Failover section.
  4. Run the following command:

docker stack deploy -c <compose_file> iservices

  5. Monitor the process to make sure the service updates and restarts with the new version. To make sure that as little time as possible is used when updating the database, the database containers should already be available on the core nodes.
  6. After the node is back up, go back to the Couchbase user interface, add the node back, and rebalance the cluster to make it whole again.
  7. For more information on how to re-add the node and rebalance the cluster if the user interface is not available, see the Manual Failover section.

First node Couchbase update considerations

  • When updating the first couchbase node, be sure to set the environment variable JOIN_ON: "couchbase-worker2", so that the couchbase master knows to rejoin the workers after restarting.
  • Keep in mind that, by default, only the primary Couchbase node's user interface is exposed. Because of this, when the first Couchbase node is restarted, the Couchbase admin user interface will be inaccessible. If you would like to have the Couchbase user interface available during the upgrade of this node, ensure that at least one other Couchbase worker service's port is exposed.

Code Example: docker-compose with images and JOIN_ON for updating the first node
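
The original example is not reproduced here. As a sketch only, the relevant parts of the first node's couchbase service definition might look like the following; the image placeholder and the node label constraint are assumptions:

    couchbase:
      image: <registry>/<couchbase_image>:<new_version>
      environment:
        JOIN_ON: "couchbase-worker2"   # rejoin the cluster through the second worker
      deploy:
        placement:
          constraints:
            - node.labels.couchbase == couchbase1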

Update second and third node services

To update the second and third node services, repeat the steps from the first node on each node until all nodes are re-clustered and available. Be sure to check the service port mappings to ensure that there are no conflicts (as described above), and remove any HTTP ports if you choose.

Update the GUI

Because the GUI service provides all ingress proxy routing to the services, there might be a very small window where PowerFlow might not receive API requests as the GUI (proxy) is not running. This downtime is limited to the time it takes for the GUI container to restart.

To update the user interface:

  1. Make sure that any conflicting port mappings are handled and addressed.
  2. Replace the docker-compose GUI service definition with the new one.
  3. Re-deploy the docker-compose file, and validate that the new GUI container is up and running.
  4. Make sure that the HTTPS ports are accessible for Couchbase/RabbitMQ.

Update Workers and contentapi

You should update the workers and contentapi last. Because these services use multiple replicas (multiple steprunner or contentapi containers running per service), you can rely on Docker to incrementally update each replica of the service individually. By default, when a service is updated, Docker updates one container of the service at a time, and only after the previous container is up and stable is the next container deployed.

You can utilize additional Docker options in docker-compose to set the behavior of how many containers to update at once, when to bring down the old container, and what happens if a container upgrade fails. See the update_config and rollback_config options available in Docker documentation: https://docs.docker.com/compose/compose-file/.

Upgrade testing was performed by ScienceLogic using the default options. One example where these settings are helpful is changing the parallelism of update_config so that all worker containers of a service update at the same time (see the sketch below).
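
A hedged sketch of such an override for a worker service follows; the service name and replica count are placeholders, and parallelism is set to match the replica count so that all replicas update at once:

    steprunner_<customer>:
      deploy:
        replicas: 3
        update_config:
          parallelism: 3
          order: start-first
        rollback_config:
          parallelism: 1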

The update scenario described below takes extra precautions and only updates one node of workers per customer at a time. If you prefer, you can also safely update all workers at once.

To update the workers and contentapi:

  1. Modify the docker-compose file so that the contentapi service and the "worker_node1" services of all customers use the new service definitions.
  2. Run a docker stack deploy of the new compose file. Monitor the update, which should update the API container one instance at a time, always leaving a container available to service requests. The process updates the workers of node1 one container instance at a time by default.
  3. After workers are back up and the API is fully updated, modify the docker-compose file and update the second node's worker's service definitions.
  4. Monitor the upgrade, and validate as needed.

Code Example: docker-compose definition with one of two worker nodes and contentapi updated:
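
The original example is not reproduced here. As a sketch only, the relevant image lines might look like the following, with contentapi and the first worker node on the new version while the second worker node remains on the old version; the service names, registry, and tags are placeholders:

    contentapi:
      image: <registry>/<contentapi_image>:<new_version>
    steprunner_<customer>_node1:
      image: <registry>/<worker_image>:<new_version>
    steprunner_<customer>_node2:
      image: <registry>/<worker_image>:<old_version>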