Troubleshooting High Availability and Disaster Recovery Configuration

This section describes how to troubleshoot various scenarios of High Availability or Disaster Recovery configurations.

This section assumes that you are comfortable using a UNIX shell session and can use the basic functions within the vi editor.

Use the following menu options to navigate the SL1 user interface:

  • To view a pop-out list of menu options, click the menu icon.
  • To view a page containing all of the menu options, click the Advanced menu icon.

Overview

Distributed Replicated Block Device (DRBD) is a distributed replicated storage system for Linux. It is implemented as a kernel module and a set of userspace management utilities, and it is generally used in high availability computer clusters.

Although DRBD plays a major role in SL1 application clustering, the cluster manager that is responsible for loading, connecting, promoting, and demoting DRBD cannot automatically resolve any issues that might occur at the DRBD application level.
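
For example, you can compare the cluster manager's view of a node with DRBD's own view. The following is a minimal sketch, assuming a standard SL1 cluster deployment and shell access as the em7admin user; if the two views disagree, the DRBD layer usually needs manual intervention:

# One-shot view of the cluster manager's resources
sudo crm_mon -1

# DRBD's own view of roles, disk states, and connections
drbdadm status

# Example of a disagreement: the cluster reports this node as active,
# but drbdadm status shows no peer line (Stand Alone), so the cluster
# manager cannot repair the connection on its own.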

There are a few scenarios in which DRBD might fail to connect:

  • Split-brain has occurred.
  • Data has diverged to the point that the nodes no longer recognize each other ("Unrelated data, aborting" errors).
  • Firewall issues occurred.
  • Proxy service issues occurred (WFReportParams).
  • The network is unstable or unreliable.
  • Kernel packages are inconsistent between nodes.

What is a Split-brain Configuration?

In SL1, DRBD resides on the Primary node regardless of whether the architecture is HA, DR, or HA and DR. The Primary node is replicated to the Secondary node. You might have two or three nodes in the SL1 cluster, but only one can be the DRBD primary at a time. A split-brain configuration occurs when two nodes of a cluster both believe they are the primary and start competing for shared resources. When a split-brain configuration occurs, both nodes disconnect.
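
You can often confirm that DRBD detected the split-brain by checking the kernel log on either node. This is a minimal sketch, assuming shell access as the em7admin user; the exact message text varies by DRBD version:

# Search the kernel journal for DRBD split-brain messages
sudo journalctl -k | grep -i 'split-brain'

# On systems that log to /var/log/messages, the same entries appear there
sudo grep -i 'split-brain' /var/log/messages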

Resolving a Split-brain Configuration on a Disaster Recovery Stack with Two Appliances

To resolve a split-brain configuration between two nodes on a Disaster Recovery stack:

  1. Check the status of both appliances. To do this:
  • Log in to the console of the current Primary appliance as the em7admin user and enter the following at the shell prompt:

drbdadm status

  • If your output looks like this, you have a split-brain configuration:

r0 role:Primary

disk:UpToDate

There is no peer section listed in the status output. This means the node is in Stand Alone mode and not attempting to reconnect.

  • On the Disaster Recovery node, enter the following command:

drbdadm status

  • Several different outputs might be returned.

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Primary and in Stand Alone mode:

      r0 role:Primary

      disk:UpToDate

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Primary and is attempting to connect to the peer:

      r0 role:Primary

      disk:UpToDate

      peer connection:Connecting

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Secondary and is in Stand Alone mode:

      r0 role:Secondary

      disk:UpToDate

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Secondary and is attempting to connect to the peer:

      r0 role:Secondary

      disk:UpToDate

      peer connection:Connecting

  • In a safe failover/failback scenario, your output would include:

    • Primary:

      r0 role:Primary

      disk:UpToDate

      peer role:Secondary

      replication:Established peer-disk:UpToDate

    • Secondary:

      r0 role:Secondary

      disk:UpToDate

      peer role:Primary

      replication:Established peer-disk:UpToDate

    • When using Oracle Linux 8, the output appears as follows:

      • Primary:

        r0 role:Primary

        disk:UpToDate

        db2 role:Secondary

        peer-disk:UpToDate

      • Secondary:

        r0 role:Secondary

        disk:UpToDate

        db1 role:Primary

        peer-disk:UpToDate

 

  2. Resolve split-brain on the passive node.
  • If the passive node is attempting to connect to its peer, enter the following command:

sudo drbdadm disconnect r0

  • If the passive node has marked itself the Primary, enter the following command:

sudo crm_mon -1

  • If you see the line Resource Group: g_em7, enter the following command to demote the system:

crm_resource --resource ms_drbd_r0 --set-parameter target-role --meta --parameter-value Slave

ScienceLogic recommends periodically checking the output of drbdadm status while you wait for the system to demote itself to Secondary.
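
For example, the following is a minimal sketch of a wait loop you could use instead of re-running the command by hand; it assumes the resource is named r0, as in the examples above:

# Poll DRBD every 10 seconds until the local role reported on the first
# line of the status output changes to Secondary
until drbdadm status r0 | head -n 1 | grep -q 'role:Secondary'; do
    sleep 10
done
echo "r0 has been demoted to Secondary"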

  • If your system is already Secondary, or has transitioned to Secondary, enter the following command:

sudo drbdadm -- --discard-my-data connect r0

 

  • Next, on the Primary node, enter the following command:

sudo drbdadm connect r0

  3. Validate that your split-brain issue has been resolved and the nodes are reconnected.
  • To check the status on both the Primary node and the Secondary node, enter the following command:

drbdadm status

Depending on your network, it can take several minutes to establish the connection and restart replication.

  • If the output of the status returns connecting, DRBD is attempting to establish the connection. When you see the peer line, connection is established and replication has resumed.

  • However, if the connecting line disappears and no peer line appears, you have failed to resolve the split-brain situation. Please contact ScienceLogic Support for further assistance.

You must always perform your resolution on the passive node first.
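
For reference, the commands above can be summarized as the following minimal sketch. It assumes the resource is named r0, that you have already demoted the passive node if it had marked itself Primary, and that you run the first block on the passive (Disaster Recovery) node before touching the Primary node:

# --- On the passive (Disaster Recovery) node ---
sudo drbdadm disconnect r0                       # stop any reconnection attempts
sudo drbdadm -- --discard-my-data connect r0     # discard divergent data, reconnect

# --- On the Primary node ---
sudo drbdadm connect r0

# --- On both nodes ---
drbdadm status                                   # wait for the peer line to appear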

Resolving a Split-brain Configuration on a High Availability Stack with Two Appliances

To resolve a split-brain configuration between two nodes on a High Availability Primary or Secondary stack:

  1. Check the status of both appliances. To do this:
  • Log in to the console of the current Primary appliance as the em7admin user and enter the following at the shell prompt:

drbdadm status

  • If your output looks like this, you have a split-brain configuration:

r0 role:Primary

disk:UpToDate

There is no peer section listed in the status output. This means the node is in Stand Alone mode and not attempting to reconnect.

  • On the Secondary node, enter the following command:

drbdadm status

  • Several different outputs might be returned.

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Primary and in Stand Alone mode:

      r0 role:Primary

      disk:UpToDate

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Primary and is attempting to connect to the peer:

      r0 role:Primary

      disk:UpToDate

      peer connection:Connecting

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Secondary and is in Stand Alone mode:

      r0 role:Secondary

      disk:UpToDate

    • If your output looks like this, you have a split-brain configuration that indicates the system is split and believes it is Secondary and is attempting to connect to the peer:

      r0 role:Secondary

      disk:UpToDate

      peer connection:Connecting

  • In a safe failover/failback scenario, your output would include:

    • Primary:

      r0 role:Primary

      disk:UpToDate

      peer role:Secondary

      replication:Established peer-disk:UpToDate

    • Secondary:

      r0 role:Secondary

      disk:UpToDate

      peer role:Primary

      replication:Established peer-disk:UpToDate

    • When using Oracle Linux 8, the output appears as follows:

      • Primary:

        r0 role:Primary

        disk:UpToDate

        db2 role:Secondary

        peer-disk:UpToDate

      • Secondary:

        r0 role:Secondary

        disk:UpToDate

        db1 role:Primary

        peer-disk:UpToDate

 

  2. Resolve split-brain on the passive node.
  • If the passive node is attempting to connect to its peer, enter the following command:

sudo drbdadm disconnect r0

  • If the passive node has marked itself the Primary (that is, both nodes are claiming Primary), this indicates that the High Availability cluster cannot establish cluster connectivity and/or cannot control resources correctly. Stop the process and contact ScienceLogic Support for assistance.
  • If your system is already Secondary, or has transitioned to Secondary, enter the following command: 

sudo drbdadm -- --discard-my-data connect r0

  • Next, on the Primary node, enter the following command:

sudo drbdadm connect r0

  3. Validate that your split-brain issue has been resolved and the nodes are reconnected.
  • To check the status on both the Primary node and the Secondary node, enter the following command:

drbdadm status

Depending on your network, it can take several minutes to establish the connection and restart replication.

  • If the output of the status returns connecting, DRBD is attempting to establish the connection. When you see the peer line, connection is established and replication has resumed.

  • However, if the connecting line disappears and no peer line appears, you have failed to resolve the split-brain situation. Please contact ScienceLogic Support for further assistance.

You must always perform your resolution on the passive node first.
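
Before and after the resolution, it can help to confirm that the High Availability cluster itself is healthy and that only one node owns the SL1 resources. This is a minimal sketch, assuming the resource group is named g_em7 as shown earlier in this section:

# One-shot view of cluster membership and resource placement
sudo crm_mon -1

# Show which node currently owns the SL1 resource group
sudo crm_mon -1 | grep -A 5 'Resource Group: g_em7'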

Resolving a Split-brain Configuration on a High Availability and Disaster Recovery Stack with Three Appliances

There are two scenarios you might encounter when working on a high availability and disaster recovery stack involving three appliances. Refer to the steps below for your specific scenario.

Resolving a Split-brain Between the Primary High Availability and Secondary High Availability Nodes

To resolve a split-brain configuration between three nodes on a High Availability Primary or Secondary stack:

  1. Check the status of all appliances. To do this:

    • Log in to the console of the current Primary appliance as the em7admin user and enter the following at the shell prompt:

    • drbdadm status

    • Your output will vary depending on which version of Oracle Linux you are running.

    • As of 12.2.0, SL1 can be deployed only on Oracle Linux 8 (OL8) operating systems. If you take no action before October 31, 2024, all older SL1 systems with OL7 will continue to run, but ScienceLogic will not support them, and the systems might not be secure. For more information, see the section on Updating SL1.

      If your output looks like this, you have a split-brain configuration:

      • On an OL8 system, your output will be similar to:
      • r0 role:Primary

        disk:UpToDate

        db2 connection:StandAlone

        db3 role:Secondary

        peer-disk:UpToDate

        Stacked resources are no longer used in OL8. Instead, all nodes are direct peers, and you have to associate each name with the correct node's role. In the example above, db2 is the high availability secondary. (A sketch for mapping these peer names to host names appears at the end of this procedure.)

      • On an OL7 system, your output will be similar to:
      • r0-L role:Primary

        disk:UpToDate

        The output for OL7 might contain a peer line listing various state information. When the nodes are disconnected, your output will either look similar to the example above or will indicate peer:Connecting.

    • On the High Availability (secondary) node, enter the following command:
    • drbdadm status

    • If your output looks like this, you have a split-brain configuration:
      • On an OL8 system, your output will be similar to:
      • r0 role:Secondary

        disk:Outdated

        db1 connection:Connecting

        db3 connection:StandAlone

      Stacked resources are no longer used in OL8. Instead, all nodes are direct peers, and you have to associate each name with the correct node's role. In the example above, db1 is the high availability primary, and the local disk is Outdated compared to the other nodes.

      • On an OL7 system, your output will be similar to:
      • r0-L role:Secondary

        disk:UpToDate

      The output for OL7 might contain a peer line listing various state information. When the nodes are disconnected, your output will either look similar to the example above or will indicate peer:Connecting.

  2. Resolve split-brain on the passive node on OL8 systems.

    • On the High Availability Secondary node enter the following command:

      drbdadm connect r0 --discard-my-data

    • On the High Availability Primary node enter the following command:

      drbdadm connect r0

      If connections to the Disaster Recovery node already exist, errors might appear in the output because those connections have already been made. Check the status with drbdadm status to make sure the nodes connect.

  3. Resolve split-brain on the passive node on OL7 systems.
    • On the High Availability (secondary) node, enter the following command:
    • drbdadm -- --discard-my-data connect r0-L

    • Next, on the High Availability Primary node, enter the following command:
    • drbdadm connect r0-L

      You must always perform your resolution on the passive node first.
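
If you are unsure which peer name (for example, db1, db2, or db3) corresponds to which appliance, you can compare the DRBD resource configuration with each node's host name. This is a minimal sketch, assuming the resource is named r0 and that the host sections in the DRBD configuration use the same names shown by drbdadm status:

# List the host sections and replication addresses defined for the resource
sudo drbdadm dump r0 | grep -E '^[[:space:]]*on |address'

# Compare with the local host name to identify which entry is this node
hostname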

Resolving a Split-brain Between the Primary High Availability and the Disaster Recovery Nodes

To resolve a split-brain configuration between three nodes on a High Availability (primary) and Disaster Recovery stack:

  1. Check the status of all appliances. To do this:
    • Log in to the console of the current Primary appliance as the em7admin user and enter the following at the shell prompt:
      • On OL8
      • drbdadm status

      • On OL7
      • drbdadm --stacked status

      As of 12.2.0, SL1 can be deployed only on Oracle Linux 8 (OL8) operating systems. If you take no action before October 31, 2024, all older SL1 systems with OL7 will continue to run, but ScienceLogic will not support them, and the systems might not be secure. For more information, see the section on Updating SL1.

    • Your output will vary depending on which version of Oracle Linux you are running.
    • If your output looks like this, you have a split-brain configuration:

      • On an OL8 system, your output will be similar to:
      • r0 role:Primary

        disk:UpToDate

        db2 role:Secondary

        peer-disk:UpToDate

        db3 connection:StandAlone

        In the example above, the db3 system is the Disaster Recovery node and the connection is StandAlone. Regardless of the db3 connection state, it is important to note that it is not connected and will not list a peer-disk status.

      • On an OL7 system, your output will be similar to:
      • r0-L role:Primary

        disk:UpToDate

        peer connection:Connecting

        The output for OL7 might contain a peer line listing various state information. When the nodes are disconnected, your output will look similar to the example above indicating replication is not established.

    • On the Disaster Recovery node, enter the following command:
    • drbdadm status

    • If your output looks like this, you have a split-brain configuration:
    • 1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown r-----

      ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0

    In a safe failback scenario, your output would include ro:Primary/Secondary ds:UpToDate/UpToDate. Also, the kernel logs will include a message indicating that split-brain is detected but unresolved.

  2. Resolve split-brain on the passive node on OL8 systems.
    • On the Disaster Recovery node, enter the following command:
    • drbdadm connect r0 --discard-my-data

    • On the High Availability Primary node enter the following command:
    • drbdsetup connect r0 2

    If connections to the Disaster Recovery node already exist, errors might appear in the output because those connections have already been made. Check the status with drbdadm status to make sure the nodes connect.

  3. Resolve split-brain on the passive node on OL7 systems.
    • On the Disaster Recovery node, enter the following command:
    • drbdadm -- --discard-my-data connect r0

    • Next, on the High Availability Primary node, enter the following command:
    • drbdadm connect --stacked r0

    You must always perform your resolution on the passive node first.

Diverged Data

Overview

In some situations, DRBD nodes will not connect. When you inspect your logs, you might find the following error message:

2020-07-14T00:01:54.831938+00:00 sl1db02 kernel: [15919.112878] block drbd1: Unrelated data, aborting!

Resolving a Diverged Data Issue

DRBD will not connect if data has diverged to the point that the peers no longer recognize each other's generation identifiers. To resolve a diverged data issue:

  1. Stop DRBD completely on the secondary (passive) node and delete the metadata:

drbdadm down r0

drbdadm wipe-md r0

  2. Recreate the DRBD metadata and restart DRBD:

drbdadm create-md r0

drbdadm up r0

  3. Check to see if the nodes are connected and synchronizing by running the following command:

drbdadm status
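
Because the metadata was recreated, a full resynchronization starts once the nodes connect. The following is a minimal sketch of a way to watch the progress from the secondary node, assuming the resource is named r0:

# Refresh the status every 5 seconds; the replication line shows the
# sync progress and ends with peer-disk:UpToDate when the resync is done
watch -n 5 drbdadm status r0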

Firewall Issues

Overview

DRBD uses TCP port 7788 for communication between Database Servers. The port must be open in both directions to allow DRBD connection and replication.

Resolving Firewall Issues

To verify there are no port issues and that data can flow bidirectionally:

  • Log in to the console of the current Primary appliance as the em7admin user and enter the following at the shell prompt:

netstat -anp | grep 7788

  • If your output looks like this, you have established connections:

tcp 0 0 10.64.70.42:54787 10.64.70.24:7788 ESTABLISHED 7181/drbd-proxy

tcp 0 0 169.254.1.6:56207 169.254.1.5:7788 ESTABLISHED -

tcp 0 0 10.64.70.42:7788 10.64.70.24:54728 ESTABLISHED 7181/drbd-proxy

tcp 0 0 169.254.1.6:7788 169.254.1.5:53315 ESTABLISHED

  • On the Disaster Recovery node, enter the following command:

netstat -anp | grep 7788

  • If your output looks like this, you have established connections:

tcp 0 0 10.64.70.24:7788 10.64.70.42:54787 ESTABLISHED 1397/drbd-proxy

tcp 0 0 10.64.70.24:54728 10.64.70.42:7788 ESTABLISHED 1397/drbd-proxy
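
If the port is not listed or no connections are established, verify that the local firewall allows TCP port 7788. This is a minimal sketch, assuming firewalld manages the firewall and the default zone applies to the replication interface; adjust the zone for your deployment if necessary:

# Check whether TCP port 7788 is already allowed in the active zone
sudo firewall-cmd --list-ports | grep 7788

# Allow the DRBD replication port permanently and reload the rules
sudo firewall-cmd --permanent --add-port=7788/tcp
sudo firewall-cmd --reload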

Ahead-Behind or Disconnect/Reconnect Issues

Overview

These conditions are always caused by a network issue between the Primary node and the Disaster Recovery node. There are several reasons why you might experience network issues, ranging from very simple to very complex, but the root cause is always an insufficiency in data transit. The most common reasons are:

  • insufficient bandwidth for the data
  • latency plus insufficient bandwidth
  • network congestion and queue drops
  • packet loss

Resolving Self-Monitoring Network Issues

ScienceLogic provides several tools that illustrate the network insufficiencies you might face. Viewing the self-monitoring graphs can help you narrow down the cause of your network issues.

  • DRBD Status Performance | Disk Write. This graph shows the amount of data DRBD is writing to the disk.
  • DRBD Status Performance | Network Send. This graph shows the amount of data DRBD has sent across the network from DRBD to the "next hop." This graph should be identical to the Disk Write graph. If not, it means that DRBD cannot send data as fast as it can write it.
  • DRBD Status Performance | Disk Write vs. Network Send. This graph presents the Disk Write and Network Send collectors in comparison to each other. This graph should be relatively flat if everything is operating normally on the network.

Another solution to these network issues is to enable compression on the DRBD proxy. For more information about enabling compression on the DRBD proxy, see the following Knowledge Base article: Using Compression with DRBD.

Other causes of network issues might require more in-depth intervention due to limitations beyond your control. The following suggestions are actions you can implement to help your situation:

  • If the bandwidth to the Disaster Recovery node is shared, implement quality of service (QoS) on the network and ensure the DRBD traffic is placed in a low latency, no-drop queue.
  • Monitor all network interfaces for all links between SL1 nodes and then enable packet, error, and discard collections on those interfaces. You will be able to see if there are drops or errors, or if the traffic is overrunning the packet switching ability of a switch or router.
  • Check for retransmissions on the SL1 system with netstat -s. If you see a high number of retransmits compared to the total number of sent packets, your network is having a problem. In this case, "high" is 0.2% or greater.
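
For example, the following is a minimal sketch that estimates the retransmission rate from the netstat -s counters; the exact counter wording can vary slightly between kernel and net-tools versions:

# Retransmitted segments as a percentage of segments sent
netstat -s | awk '
    /segments sen[dt] out/ { sent = $1 }
    /segments retrans/     { retrans = $1 }
    END { if (sent > 0) printf "retransmit rate: %.3f%%\n", 100 * retrans / sent }'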

In some cases, when the steps above do not resolve your network issue, you might need to escalate the issue to ScienceLogic Support. Before contacting Support, you should collect the following data:

  • Screenshots of all three of the DRBD Status Performance graphs referenced above. The screenshots should cover the time window in which the issues occurred.
  • The output of netstat -s and ifconfig on both the Primary and the Disaster Recovery nodes.
  • Capture the output of each of these commands after running them on the Database Server:
    • Basic ping: ping -w 60 <DR IP>
    • Large packet ping: ping -w 60 -M do -s 1472 <DR IP>
    • Rapid ping: ping -w 60 -f <DR IP>
    • Large packet rapid ping: ping -w 60 -s 1472 -f <DR IP>

Do not press Ctrl+C to stop these commands; they will expire on their own after 60 seconds.
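
If you prefer to capture all four outputs in one pass, the following is a minimal sketch of a wrapper script; the DR_IP value and the output file names are placeholders for illustration, and the flood ping tests (-f) typically require root privileges:

#!/bin/bash
# Collect the four ping tests requested by ScienceLogic Support.
DR_IP="<DR IP>"    # replace with the Disaster Recovery node's IP address

ping -w 60 "$DR_IP"                 > ping_basic.txt 2>&1
ping -w 60 -M do -s 1472 "$DR_IP"   > ping_large.txt 2>&1
sudo ping -w 60 -f "$DR_IP"         > ping_rapid.txt 2>&1
sudo ping -w 60 -s 1472 -f "$DR_IP" > ping_large_rapid.txt 2>&1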

Other Issues

WFReportParams Loops

If DRBD fails to reconnect during a failover to Disaster Recovery, DRBD will stall and you will see a repeating "WFReportParams" error message.
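
To confirm that you are in this state, you can check the kernel log for the repeating message. This is a minimal sketch; the exact wording of the log entries varies by DRBD version:

# Count recent occurrences of the WFReportParams state in the kernel log
sudo dmesg | grep -c 'WFReportParams'

# Watch for new occurrences in real time
sudo journalctl -k -f | grep 'WFReportParams'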

To reconnect the DRBD on the High Availability node, enter the following command:

drbdadm proxy-down r0 && drbdadm proxy-up r0

To reconnect the DRBD on a stacked High Availability node, enter the following command:

drbdadm proxy-down r0 -S && drbdadm proxy-up r0 -S

To reconnect the DRBD on the Disaster Recovery node, enter the following command:

drbdadm proxy-down r0 && drbdadm proxy-up r0

If you need assistance, contact ScienceLogic Support.

Network Issues

As with any network-dependent application, the DRBD proxy requires a stable network. If the network has issues such as a flapping link or fluctuating bandwidth capacity, DRBD data replication is affected. Make sure network issues are properly diagnosed and addressed.

Kernel Package Issues

Although rare, two DRBD nodes might fail to communicate due to incompatible DRBD kernel modules. This can happen when one node was not rebooted after a kernel package update, because the update takes effect only after a reboot.
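
A quick way to check for this condition is to compare the running kernel and the loaded DRBD module version on each node. This is a minimal sketch; run it on both nodes and compare the results:

# Running kernel version
uname -r

# Version of the DRBD module that is currently loaded
head -n 1 /proc/drbd

# Version of the DRBD module installed for the running kernel
modinfo -F version drbd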