High Availability and Disaster Recovery with Three Appliances

This section describes how to configure three SL1 appliances as a High Availability cluster with an additional Disaster Recovery appliance, and how to fail over and fail back between them.

This section assumes that you are comfortable using a UNIX shell session and can use the basic functions within the vi editor.

Use the following menu options to navigate the SL1 user interface:

  • To view a pop-out list of menu options, click the menu icon.
  • To view a page containing all of the menu options, click the Advanced menu icon.

Prerequisites

Before performing the steps listed in this section, you must:

  • Install and license each appliance
  • Have an Administrator account to log in to the Web Configuration Utility for each appliance
  • Have SSH or console access to each appliance
  • Know the em7admin console username and password for each appliance
  • Have identical hardware or virtual machine specifications on each appliance
  • Have configured a unique hostname on each appliance
  • Know the MariaDB username and password
  • (Option 1) ScienceLogic recommends that you connect the two appliances on a secondary interface with a crossover cable. When using a crossover cable, a virtual IP address is required for the cluster.
  • (Option 2) If a crossover cable is not possible, connect the two appliances on a secondary interface using a dedicated private network (VLAN) that is used only for cluster communication and contains only the two High Availability nodes. A virtual IP address is optional when using this method.
  • One Administration Portal must be configured as a quorum witness.
  • Latency between the nodes must be less than 1 millisecond.
  • Nodes can optionally be placed in separate availability zones, but must be in the same region.
  • Determine the virtual IP address for the cluster, if required. The virtual IP address will be associated with the primary appliance and will be transitioned between the appliances during failover and failback. The virtual IP address must be on the same network subnet as the primary network adapters of the appliances.
  • Request and configure a DRBD proxy license
  • Know the maximum link speed, in megabytes/second, between the High Availability cluster and the Disaster Recovery appliance.

See Deployment Recommendations for details regarding availability zones and regions.

Unique Host Names

You must ensure that a unique hostname is configured on each SL1 appliance. The hostname of an appliance is configured during the initial installation. To view and change the hostname of an appliance:

  1. Log in to the console of the SL1 appliance as the em7admin user. The current hostname appears before the command prompt. For example, the login prompt might look like this, with the current hostname (HADB01) shown in the prompt:

    login as: em7admin

    em7admin@10.64.68.31's password:

    Last login: Wed Apr 27 21:25:26 2016 from silo1651.sciencelogic.local

    [em7admin@HADB01 ~]$

     

  2. To change the hostname, run the following command:

    sudo hostnamectl set-hostname <new hostname>

  3. When prompted, enter the password for the em7admin user.
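To confirm the change took effect, you can query the hostname again. This is a minimal, read-only check; hostnamectl is part of systemd on SL1 appliances:

    # Show the current hostname configuration (read-only)
    hostnamectl status

    # The "Static hostname" field should show the new name, for example:
    # Static hostname: HADB01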

Licensing DRBD Proxy

DRBD Proxy buffers all data between the active and redundant appliances to compensate for any bandwidth limitations. In addition, DRBD Proxy compresses and encrypts the data sent from the active appliance to the redundant appliance.

You must use DRBD Proxy if you are:

  • Configuring three appliances for High Availability and Disaster Recovery.
  • Configuring two appliances for Disaster Recovery and will not be configuring a direct connection between your appliances with a crossover cable.

To license DRBD Proxy, copy the drbd-proxy.license file to the /etc directory on all appliances in your system.
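For example, you can push the license file from a workstation with a short shell loop. This is a minimal sketch; the hostnames db1, db2, and db3 are the example appliance names used later in this section, and it assumes SSH access as the em7admin user:

    # Copy the DRBD Proxy license to /etc on each appliance (example hostnames)
    for host in db1 db2 db3; do
      scp drbd-proxy.license em7admin@"$host":/tmp/
      ssh -t em7admin@"$host" 'sudo mv /tmp/drbd-proxy.license /etc/ && sudo chown root:root /etc/drbd-proxy.license'
    done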

Addressing the Cluster

A database cluster has three IP addresses: one for the primary interface on each database appliance, and an additional virtual IP address. The virtual IP address is shared between the two database appliances, to be used by any system requesting database services from the cluster.

The following list describes which IP address you should supply for the Database Server when you configure other SL1 appliances and external systems:

  • Administration Portal. Use the Virtual IP address when configuring the Database IP Address in the Web Configuration Utility and the Appliance Manager page (System > Settings > Appliances).
  • Data Collector or Message Collector. Include both primary interface IP addresses when configuring the ScienceLogic Central Database IP Address in the Web Configuration Utility and the Appliance Manager page (System > Settings > Appliances).
  • SNMP Monitoring. Monitor each Database Server separately using the primary interface IP addresses.
  • Database Dynamic Applications. Use the Virtual IP address in the Hostname/IP field in the Credential Editor page (System > Manage > Credentials > wrench icon).

Reconfiguring an Existing High Availability System

If you have previously configured two appliances in a High Availability cluster, perform the following steps to reconfigure the appliances as a High Availability cluster plus a Disaster Recovery appliance:

During the reconfiguration procedure, the SL1 System will be unavailable. This procedure must be performed during a maintenance window.

  1. Validate that the existing High Availability cluster and the third new appliance meet all the Prerequisites for configuring High Availability plus Disaster Recovery.
  2. Log in to the console of the current secondary appliance as the em7admin user.
  3. Run the following commands, entering the password for the em7admin user when prompted:

    sudo service pacemaker stop

    sudo service corosync stop

  4. Log in to the console of the current Primary appliance as the em7admin user.
  5. Run the following commands, entering the password for the em7admin user when prompted:

    sudo service pacemaker stop

    sudo service corosync stop

  6. Perform the steps listed in the Configuring Three Appliances for High Availability and Disaster Recovery section.

Configuring Heartbeat IP Addresses

To cluster two databases, you must first configure a heartbeat network between the appliances. The databases use the heartbeat network to determine whether failover conditions have occurred. A heartbeat network consists of a crossover Ethernet cable attached to an interface on each database appliance.

After attaching the network cable, you must complete the steps described in this section to configure the heartbeat network.

Perform the following steps on each appliance to configure the heartbeat network:

  1. Log in to the console of the Primary appliance as the em7admin user.
  2. Navigate to the following directory: /etc/sysconfig/network-scripts.
  3. Identify the file corresponding to the heartbeat adapter. Adapter files are named ifcfg-<if name>.
  4. Edit or add the following lines in the file you identified (a complete example file appears after these steps):

    IPADDR="169.254.1.1"
    PREFIX="30"
    BOOTPROTO="none"
    ONBOOT="yes"
  5. Run the following command:

    ifup <heartbeat adapter>

  6. Log in to the console of the Secondary appliance as the em7admin user and repeat steps 2-5, using 169.254.1.2 as the IP address.
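For reference, a complete heartbeat adapter file on the Primary appliance might look like the following. This is a sketch that assumes the heartbeat adapter is named ens192; substitute your actual adapter name:

    # /etc/sysconfig/network-scripts/ifcfg-ens192 (adapter name is an example)
    TYPE="Ethernet"
    NAME="ens192"
    DEVICE="ens192"
    IPADDR="169.254.1.1"
    PREFIX="30"
    BOOTPROTO="none"
    ONBOOT="yes"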

Testing the Heartbeat Network

After you configure the heartbeat network, perform the following steps to test the connection:

  1. Log in to the console of the Primary appliance as the em7admin user.
  2. Run the following command:

    ping -c4 169.254.1.2

    If the heartbeat network is configured correctly, the output looks like this:

    PING 169.254.1.2 (169.254.1.2) 56(84) bytes of data.

    64 bytes from 169.254.1.2: icmp_seq=1 ttl=64 time=0.657 ms

    64 bytes from 169.254.1.2: icmp_seq=2 ttl=64 time=0.512 ms

    64 bytes from 169.254.1.2: icmp_seq=3 ttl=64 time=0.595 ms

    64 bytes from 169.254.1.2: icmp_seq=4 ttl=64 time=0.464 ms

    If the heartbeat network is not configured correctly, the output looks like this:

    PING 169.254.1.2 (169.254.1.2) 56(84) bytes of data.

    From 169.254.1.1 icmp_seq=1 Destination Host Unreachable

    From 169.254.1.1 icmp_seq=2 Destination Host Unreachable

    From 169.254.1.1 icmp_seq=3 Destination Host Unreachable

    From 169.254.1.1 icmp_seq=4 Destination Host Unreachable

Deployment Recommendations

ScienceLogic strongly recommends that you use a 10 Gbps network connection for the cluster communication link. A 1 Gbps link may not be sufficient and can lead to reduced database performance.

When deploying in a cloud environment, place the High Availability servers in separate availability zones, and place the quorum witness in a different availability zone from both High Availability servers.

Disaster Recovery replication is sensitive to packet loss and to variation in network delay (jitter). Use Quality of Service (QoS) to classify the Disaster Recovery network traffic, prevent drops, and ensure reliable transmission of replication traffic.
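Because the cluster requires sub-millisecond latency between nodes, it is worth spot-checking the link before deploying. A simple sketch using ping; 169.254.1.2 is the example heartbeat address from this section:

    # Send 100 pings and review the rtt summary line
    ping -c 100 169.254.1.2
    # In "rtt min/avg/max/mdev", avg should be well under 1 ms;
    # a large mdev indicates jitter that can disrupt replication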

Configuring Three Appliances for High Availability and Disaster Recovery

To configure three appliances for High Availability and Disaster Recovery, you must configure the appliances in the following order:

  1. The Quorum Witness
  2. Primary appliance in the High Availability cluster
  3. Secondary appliance in the High Availability cluster
  4. Disaster Recovery appliance

Configuring the Quorum Witness

A quorum witness is required in any setup that does not use a crossover cable. Two-node clusters without a crossover cable are known to be unstable. The quorum witness provides the needed "tie-breaking" vote to avoid split-brain scenarios.

Only one Administration Portal can be configured as a quorum witness. Review the requirements for network placement and availability zones in the Deployment Recommendations section.

Prior to starting this procedure, ensure that the Admin Portal is connected to the primary database and licensed. To configure the quorum witness:

  1. Log in to the console of the Admin Portal appliance that will be the quorum witness as the em7admin user.

  2. Run the following command:

    sudo -s

  3. Enter the em7admin password when you are prompted.

  4. Run the following command:

    silo-quorum-install

    The following prompt appears:

    This wizard will assist you in configuring this admin portal to be a quorum device for a corosync/pacemaker HA or HA+DR cluster. When forming a cluster that does not have a cross over cable a quorum device is required to prevent split-brain scenarios. Please consult ScienceLogic documentation on the requirements for a cluster setup.

    Do you want to configure this node a quorum device for an HA or HA+DR cluster? (y/n)

  5. Enter yes and the following prompt appears:

    Please enter the PRIMARY IP address for the Primary HA server:

  6. Enter the primary IP address of the primary High Availability server and press Enter. The following prompt appears:

    Please enter the PRIMARY IP address for the Secondary HA server:

  7. Enter the primary IP address of the secondary High Availability server and press Enter. The following confirmation appears:

    Please check the following IP addresses are correct

    Node 1 Primary IP: <primary HA primary IP>

    Node 2 Primary IP: <secondary HA primary IP>

    Is this architecture correct? (y/n)

  8. Enter yes if your architecture is correct. The following messages indicate that the quorum witness is being set up:

    Updating firewalld configuration, please be patient...

    Waiting for quorum service to start...

    Service has started successfully.

    silo-quorum-install has exited

After this setup is complete, you can configure the primary appliance.
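Assuming silo-quorum-install configures the standard corosync-qnetd service (an assumption based on the wizard's description of a corosync/pacemaker quorum device), you can later confirm that the witness sees the cluster:

    # List clusters registered with this quorum device (run on the witness)
    sudo corosync-qnetd-tool -l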

Configuring the Primary High Availability Appliance

To configure the Primary High Availability appliance for High Availability, perform the following steps:

  1. Log in to the console of the Primary High Availability appliance as the em7admin user.

  2. Run the following command:

    sudo -s

  3. When prompted, enter the password for the em7admin user.

  4. Run the following command:

    silo-cluster-install

    The following prompt appears:

    v3.7

    This wizard will assist you in setting up clustering for a ScienceLogic appliance. Please be sure to consult the ScienceLogic documentation before running this script to ensure that all prerequisites have been met.

    1) HA

    2) DR

    3) HA+DR

    4) Quit

    Please select the architecture you'd like to setup:

  6. Enter "3". The following prompt appears:
  7. Will this system be the primary/active node in a new HA+DR cluster? (y/n)

  6. Enter yes. The following prompt appears:

    Is there a cross-over cable between HA nodes? (y/n)

  7. Enter yes if there is a physical crossover cable connecting the HA nodes, and skip ahead to the link speed selection in step 11. Otherwise, enter no. If you enter no, the following prompt appears:

    Is there a dedicated/private network link between the HA nodes? (y/n)

  8. A dedicated private network link is required; if you answer no, the script will exit. Answer yes and the following prompt appears:

    Have you already configured an AP to act as a quorum device? (y/n)

  9. You must have already configured an Admin Portal to act as a quorum witness; if you answer no, the script will exit. Answer yes and the following prompt appears:

    Please enter the IP for the quorum device:

  10. Enter the IP address of the Admin Portal that you configured as the quorum witness and press Enter. The following prompt appears:

    Choose the closest value for the speed of the dedicated link between HA nodes.

    1) 10Gbps

    2) 1 Gbps

    What is the physical link speed between HA nodes?

  11. Select the response that most closely matches the speed of the private link between the HA nodes. The following confirmation appears:

    Review the following architecture selections to ensure they are correct

    Architecture: HA+DR

    Current node hostname: <hostname>

    Crossover Cable: No

    Dedicated link speed: <the value of the speed you chose>

    Quorum Node IP: <IP address of your quorum witness>

    Virtual IP (VIP) used: Yes

    DRBD Proxy: yes

    Is this architecture correct? (y/n)

  12. Verify that your architecture is correct. If it is not, enter no to re-enter the data; otherwise, enter yes. The following prompt appears:

    Primary node information:

    Please enter the IP used for HEARTBEAT traffic for this server:

    1)10.255.255.1

    ...

    Number:

  13. Select the heartbeat IP address. The following prompt appears:

    Secondary node information:

    What is the hostname of the Secondary HA server:

  14. Enter the hostname of the secondary server and press Enter. The following prompt appears:

    Please enter the IP used for HEARTBEAT traffic for the secondary high availability server that corresponds to 10.255.255.1:

  15. Enter the corresponding IP address and press Enter. The following prompt appears:

    Please enter the PRIMARY IP address for the Secondary HA server:

  16. Enter the primary IP address for the secondary high availability server and press Enter. The following prompt appears:

    Tertiary node information:

    Please enter the hostname of the DR server

  17. Enter the hostname for the Disaster Recovery server and press Enter. The following prompt appears:

    Please enter the DRBD IP for the DR server:

  18. Enter the DRBD IP address for the DR server and press Enter. The following prompt appears:

    Virtual IP address information:

    Please enter the Virtual IP Address:

    Please enter the CIDR for the Virtual IP without the / (example: 24):

  19. Enter the VIP address and CIDR mask. The following prompt appears:

    Disaster recovery bandwidth information:

    Please enter the max link speed to the DR system in megabytes/second:

  20. Enter the maximum speed of the link in megabytes per second. The following confirmation appears:

    You have selected the following settings, please confirm they are correct:

    Architecture: HA+DR

    This Node: PRIMARY (Node 1)

     

    Node 1 Hostname: db1

    Node 1 DRBD/Heartbeat IP: 10.255.255.1

    Node 1 Primary IP: 10.64.166.251

    Node 2 Hostname: db2

    Node 2 DRBD/Heartbeat IP: 10.255.255.2

    Node 2 Primary IP: 10.64.166.252

    Node 3 Hostname: db3

    Node 3 DRBD IP: 10.64.166.253

     

    Virtual IP: 10.64.167.0/23

    DRBD Disk: /dev/mapper/em7vg-db

    DRBD Proxy: Yes

    Max DR DRBD Sync Speed: 120

     

    Is this information correct? (y/n)

  21. Review your information to confirm that it is correct. Once confirmed, the system configures itself, and you can expect output similar to the following:

    Setting up the environment...

    Pausing SL1 services

    No adjustment needed to /dev/mapper/em7vg-db for DRBD metadata

    Setting up DRBD...

    Setting up and starting Corosync...

    Setting up and starting Pacemaker...

    Waiting on cluster services to come up...

    Waiting on cluster services to come up...

    Waiting on cluster services to come up...

    Waiting on cluster services to come up...

    Cluster services detected as active

    Configuring silo.conf for clustering

    Unpausing SL1

    Setup is complete

     

    Current cluster status

    Stack: corosync

    Current DC: db1 (version 1.1.19.linbit-8+20181129.el7.2-c3c624ea3d) - partition with quorum

    Last updated: <date and time you last updated>

    Last change: <date and time you last changed> by hacluster via crmd on db1

     

    2 nodes configured

    7 resources configured

     

    Online: [ db1 ]

    OFFLINE: [ db2 ]

     

    Active resources:

    Resource Group: g_em7

    p_fs_drbd1 (ocf::heartbeat:Filesystem): Started db1

    mysql (ocf::sciencelogic:mysql-systemd): Started db1

    Resource Group: g_vip

    virtual_ip (ocf::heartbeat:IPaddr2): Started db1

    drbdproxy (lsb:drbdproxy): Started db1

    Master/Slave Set: ms_drbd_r0 [p_drbd_r0]

    Masters: [ db1 ]

    Master/Slave Set: ms_drbd_r0-L [p_drbd_r0-L]

    Masters: [ db1 ]

     

    Current DRBD status

    r0-L role:Primary

    disk:UpToDate

    peer connection:Connecting

     

    silo-cluster-install has exited

  22. The messages above are examples of a successful setup. The important messages that indicate a successful setup are:

    Unpausing SL1

    Setup is complete

     

    Current cluster status

    These messages indicate that the system has been configured and the required clustering services have started successfully. You can review the output of the cluster status to see whether all services have started as expected.

  23. If your setup did not complete successfully with the cluster services started, review the log to determine the cause of the failure. After you correct the issues found in the log, you can run the setup again.
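After setup completes on the primary node, the DRBD peer connection remains in Connecting until the other nodes are configured. A convenient way to keep an eye on replication, using the r0 resource name from the sample output:

    # Refresh the DRBD status every 5 seconds (Ctrl+C to exit)
    watch -n 5 'drbdadm status r0'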

Configuring the Secondary High Availability Appliance

To configure the Secondary High Availability appliance for High Availability, perform the following steps:

  1. Log in to the console of the Secondary High Availability appliance as the em7admin user.

  2. Run the following command to assume root user privileges:

    sudo -s

  3. When prompted, enter the password for the em7admin user.

  4. Run the following command:

    silo-cluster-install

    The following prompt appears:

    v3.7

    This wizard will assist you in setting up clustering for a ScienceLogic appliance. Please be sure to consult the ScienceLogic documentation before running this script to ensure that all prerequisites have been met.

    1) HA

    2) DR

    3) HA+DR

    4) Quit

    Please select the architecture you'd like to setup:

  6. Enter "3". The following prompt appears:
  7. Will this system be the primary/active node in a new HA+DR cluster? (y/n)

  6. Enter no. The following prompt appears:

    Has the primary/active system already been configured and is it running? (y/n)

  7. Enter yes if you have already configured the primary node and it is active and running; otherwise, choose no and complete the setup on the primary node. After entering yes, the following prompt appears:

    Enter an IP for the primary system that is reachable by this node:

  8. Enter the IP address of the primary system and press Enter. This node will attempt to connect to the primary system and retrieve the cluster configuration information. If a successful connection cannot be established, you will be prompted again to enter the IP address, a username, and a password. The username and password should be the same ones used for connecting to MariaDB.
  9. When a successful connection is made, you will be prompted to confirm that the architecture matches the primary server:

    Review the following architecture selections to ensure they are correct

     

    Architecture: HA+DR

    Current node hostname: db2

    Crossover Cable: No

    Dedicated link speed: 10Gbps

    Quorum Node IP: 10.64.166.250

    Virtual IP (VIP) used: Yes

    DRBD Proxy: Yes

    Is this architecture correct? (y/n)

  10. Review that the information is correct and enter yes. Enter no if it is incorrect and regenerate the configuration on the primary/active system before proceeding.
  11. After you enter yes, the architecture will be validated for any errors and you will be prompted to confirm the configuration. Review that the information is correct. Your output should be similar to the following:

    You have selected the following settings, please confirm if they are correct:

    Architecture: HA+DR

    This Node: SECONDARY (Node 2)

     

    Node 1 Hostname: db1

    Node 1 DRBD/Heartbeat IP: 10.255.255.1

    Node 1 Primary IP: 10.64.166.251

    Node 2 Hostname: db2

    Node 2 DRBD/Heartbeat IP: 10.255.255.2

    Node 2 Primary IP: 10.64.166.252

    Node 3 Hostname: db3

    Node 3 DRBD IP: 10.64.166.253

     

    Virtual IP: 10.64.167.0/23

    DRBD Disk: /dev/mapper/em7vg-db

    DRBD Proxy: Yes

    Max DR DRBD Sync Speed: 120

     

    Is this information correct? (y/n)

  12. Review that the information is correct and enter yes. Enter no if it is incorrect and regenerate the configuration on the primary/active system before proceeding.
  13. After you enter yes, the following message indicates the system is being configured:

    Setting up the environment...

    MariaDB running, sending shutdown

    Pausing SL1 services

    Adjusting /dev/mapper/em7vg-db by extending to 39464 extents to accommodate filesystems and metadata

    Updating firewalld configuration, please be patient...

    Setting up DRBD...

    Setting up and starting Corosync...

    Setting up and starting Pacemaker...

    Waiting on cluster services to come up...

    Waiting on cluster services to come up...

    Cluster services detected as active

    Configuring silo.conf for clustering

    Unpausing SL1

    Setup is complete. Please monitor the DRBD synchronization status (drbdadm status)

    Failover cannot occur until DRBD is fully synced

     

    Current cluster status

    Stack: corosync

    Current DC: db1(version 1.1.19.linbit-8+20181129.el7.2-c3c624ea3d) - partition with quorum

    Last updated:<date and time you last updated>

    Last change:<date and time you last changed> by hacluster via crmd on db1

     

    2 nodes configured

    7 resources configured

     

    Online: [ db1 db2 ]

     

    Active resources:

    Resource Group: g_em7

    p_fs_drbd1 (ocf::heartbeat:Filesystem): Started db1

    mysql (ocf::sciencelogic:mysql-systemd): Started db1

    Resource Group: g_vip

    virtual_ip (ocf::heartbeat:IPaddr2): Started db1

    drbdproxy (lsb:drbdproxy): Started db1

    Master/Slave Set: ms_drbd_r0 [p_drbd_r0]

    Masters: [ db1 ]

    Master/Slave Set: ms_drbd_r0-L [p_drbd_r0-L]

    Masters: [ db1 ]

    Slaves: [ db2 ]

     

    Current DRBD status

    r0-L role:Secondary

    disk:Inconsistent blocked:lower

    peer role:Primary

    replication:SyncTarget peer-disk:UpToDate done:9.89

     

    silo-cluster-install has exited
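Once both nodes are joined, the cluster status shows Online: [ db1 db2 ], as in the sample output above. To re-check the cluster at any time without rerunning the wizard, a one-shot status query is enough:

    # Print the cluster status once and exit (crm_mon ships with Pacemaker)
    sudo crm_mon -1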

Configuring the Disaster Recovery Appliance

To configure the Disaster Recovery appliance, perform the following steps:

  1. Log in to the console of the Disaster Recovery appliance as the em7admin user.

  2. Run the following command:

    sudo -s

  3. When prompted, enter the password for the em7admin user.

  4. Run the following command:

    silo-cluster-install

    The following prompt appears:

    v3.7

    This wizard will assist you in setting up clustering for a ScienceLogic appliance. Please be sure to consult the ScienceLogic documentation before running this script to ensure that all prerequisites have been met.

    1) HA

    2) DR

    3) HA+DR

    4) Quit

    Please select the architecture you'd like to set up:

  6. Enter "3". The following prompt appears:
  7. Will this system be the primary/active node in a new HA+DR cluster? (y/n)

  6. Enter no. The following prompt appears:

    Has the primary/active system already been configured and is it running? (y/n)

  7. Enter yes if you have already configured the primary node and it is active and running; otherwise, choose no and complete the setup on the primary node. After entering yes, the following prompt appears:

    Enter an IP for the primary system that is reachable by this node:

  8. Enter the IP address of the primary system and press Enter. This node will attempt to connect to the primary system and retrieve the cluster configuration information. If a successful connection cannot be established, you will be prompted again to enter the IP address, a username, and a password. The username and password should be the same ones used for connecting to MariaDB.
  9. When a successful connection is made, you will be prompted to confirm that the architecture matches the primary server:

    Review the following architecture selections to ensure they are correct

    Architecture: HA+DR

    Current node hostname: db3

    Crossover Cable: No

    Dedicated link speed: 10Gbps

    Quorum Node IP: 10.64.166.250

    Virtual IP (VIP) used: Yes

    DRBD Proxy: Yes

    Is this architecture correct? (y/n)

  10. Review that the information is correct and enter yes. Enter no if it is incorrect and regenerate the configuration on the primary/active system before proceeding.
  11. After you enter yes, the architecture will be validated for any errors and you will be prompted to confirm the configuration. Review that the information is correct. Your output should be similar to the following:

    You have selected the following settings, please confirm if they are correct:

    Architecture: HA+DR

    Node 1 Hostname: db1

    Node 1 DRBD/Heartbeat IP: 10.255.255.1

    Node 1 Primary IP: 10.64.166.251

    Node 2 Hostname: db2

    Node 2 DRBD/Heartbeat IP: 10.255.255.2

    Node 2 Primary IP: 10.64.166.252

    Node 3 Hostname: db3

    Node 3 DRBD IP: 10.64.166.253

     

    Virtual IP: 10.64.167.0/23

    DRBD Disk: /dev/mapper/em7vg-db

    DRBD Proxy: Yes

    Max DR DRBD Sync Speed: 120

     

    Is this architecture correct? (y/n)

  12. Review that the information is correct and enter yes. Enter no if it is incorrect and regenerate the configuration on the primary/active system before proceeding.
  13. After you enter yes, the following message indicates the system is being configured:

    Setting up the environment...

    MariaDB running, sending shutdown

    Pausing SL1 services

    Adjusting /dev/mapper/em7vg-db by extending to 39464 extents to accommodate filesystem and metadata

    Updating firewalld configuration, please be patient...

    Setting up DRBD...

    Setting up and starting Corosync...

    Setting up and starting Pacemaker...

    Waiting on cluster services to come up...

    Waiting on cluster services to come up...

    Waiting on cluster services to come up...

    Cluster services detected as active

    Configuring silo.conf for clustering

    Unpausing SL1

    Setup is complete. Please monitor the DRBD synchronization status (drbdadm status)

    Failover cannot occur until DRBD is fully synced

    Current cluster status

    Stack: corosync

    Current DC: db3 (version 1.1.19.linbit-8+20181129.el7.2-c3c624ea3d) - partition with quorum

    Last updated: <date and time you last updated>

    Last change:<date and time you last changed> by hacluster via crmd on db3

     

    1 node configured

    4 resources configured (2 DISABLED)

     

    Online: [ db3 ]

     

    Active resources:

    Master/Slave Set: ms_drbd_r0 [p_drbd_r0]

    Slaves (target-role): [ db3 ]

     

    Current DRBD status

    r0 role:Secondary

    disk:Inconsistent

    peer role:Primary

    replication:SyncTarget peer-disk:UpToDate done:1.36

     

    silo-cluster-install has exited
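Failover to the DR node cannot occur until DRBD is fully synced, so it can be useful to wait on the synchronization from a script. A minimal sketch, keyed to the Inconsistent local-disk state shown in the sample output above:

    # Block until the local DR disk is no longer Inconsistent
    while drbdadm status r0 | grep -q 'disk:Inconsistent'; do
      sleep 60
    done
    echo "DR disk is in sync; failover is now possible"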

Licensing the Secondary High Availability and Disaster Recovery Appliances

Perform the following steps to license the Secondary appliance:

  1. You can log in to the Web Configuration Utility using any web browser supported by SL1. The address of the Web Configuration Utility is in the following format:

    https://<ip-address-of-appliance>:7700

    Enter the address of the Web Configuration Utility in the address bar of your browser, replacing "ip-address-of-appliance" with the IP address of the Secondary appliance.

  2. You will be prompted to enter your username and password. Log in as the "em7admin" user with the password you configured using the Setup Wizard.
  3. The Configuration Utilities page appears. Click the Licensing button. The Licensing Step 1 page appears:

The Licensing page step 1

  4. Click the Generate a Registration Key button.
  5. When prompted, save the Registration Key file to your local disk.
  6. Log in to the ScienceLogic Support Site at https://support.sciencelogic.com/s/. Click the License Request tab and follow the instructions for requesting a license key. ScienceLogic will provide you with a License Key file that corresponds to the Registration Key file.
  7. Return to the Web Configuration Utility:

The Licensing page step 2

  8. On the Licensing Step 2 page, click the Upload button to upload the license file. After navigating to and selecting the license file, click the Submit button to finalize the license. The Success message appears:

The Licensing page step 2 with success

Upon login, SL1 will display a warning message if your license is 30 days or less from expiration, or if it has already expired. If you see this message, take action to update your license immediately.

  9. Repeat steps 1-8 for the Disaster Recovery appliance.

Configuring Data Collection Servers and Message Collection Servers

If you are using a distributed system, you must configure the Data Collectors and Message Collectors to use the new multi-Database Server configuration.

To configure a Data Collector or Message Collector to use the new configuration:

  1. You can log in to the Web Configuration Utility using any web browser supported by SL1. The address of the Web Configuration Utility is in the following format:

    https://<ip-address-of-appliance>:7700

    Enter the address of the Web Configuration Utility in the address bar of your browser, replacing "ip-address-of-appliance" with the IP address of the Data Collector or Message Collector.

  2. You will be prompted to enter your username and password. Log in as the em7admin user with the password you configured using the Setup Wizard.
  3. On the Configuration Utilities page, click the Device Settings button. The Settings page appears:

The Settings page

  4. On the Settings page, enter the following:
    • Database IP Address. Enter the IP addresses of all the Database Servers, separated by commas.
  5. Click the Save button. You may now log out of the Web Configuration Utility for that collector.
  6. Repeat steps 1-5 for each Data Collector and Message Collector in your system.

Starting with SL1 version 11.3.0, SL1 raises a "Collector Outage" event if a Data Collector or Message Collector cannot be reached from either the primary or secondary Database Server.

Failover

Failover is the process by which database services are transferred from the active database to the passive database. You can manually perform failover for testing purposes.

If the active database server in the High Availability cluster stops responding to the secondary database server in the High Availability cluster over both network paths, SL1 will automatically perform failover. After failover completes successfully, the previously active database is now passive, and the previously passive database is now active. There is no automatic failback process; the newly active database will remain active until a failure occurs, or failover is performed manually.

If both appliances in the High Availability cluster fail, you can manually failover to the Disaster Recovery appliance. When the High Availability cluster is restored, you can manually failback from the Disaster Recovery appliance to the High Availability cluster.

Manual Failover Between the Appliances in the High Availability Cluster

To manually failover a High Availability cluster, perform the following steps:

  1. Log in to the console of the Primary appliance as the em7admin user.

    Upon login, SL1 will display a warning message if your license is 30 days or less from expiration, or if it has already expired. If you see this message, take action to update your license immediately.

  2. Check the status of the appliances. To do this, enter the following at the shell prompt (a scripted version of this check appears after these steps):

    cat /proc/drbd

    Your output will look like this:

    1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----

    ns:17567744 al:0 bm:1072 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:12521012

    To failover safely, the output should include "ro:Primary/Secondary ds:UpToDate/UpToDate".

    If your appliances cannot communicate, your output will include "ro:Primary/Unknown ds:UpToDate/DUnknown". Before proceeding with failover, troubleshoot and resolve the communication problem.

    If your output includes "ro:Primary/Secondary", but does not include "UpToDate/UpToDate", data is being synchronized between the two appliances. You must wait until data synchronization has finished before performing failover.

  3. Run the following command, assuming root user privileges:

    sudo systemctl stop pacemaker

  4. When prompted, enter the password for the em7admin user.
  5. The Primary database must be shut down before promoting the Secondary database. To verify that MariaDB is shut down, check the MariaDB log file: /var/log/mysql/mysqld.log. The log file should contain the line "Shutdown complete".
  6. To verify that all services have started on the newly promoted primary database, run the following command:

    crm_mon

    If all services have started, each one should be marked "started".

  7. After all services have started on the newly promoted primary database, run the following command to complete the failover process:

    sudo systemctl start pacemaker

  8. To verify that there are two nodes in the cluster, run the following command:

    crm_mon
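The checks in step 2 can also be scripted. A sketch that pulls the role and disk-state fields out of /proc/drbd so you can confirm "Primary/Secondary" and "UpToDate/UpToDate" before stopping Pacemaker:

    # Show the connection state, roles, and disk states
    grep -oE '(cs|ro|ds):[^ ]+' /proc/drbd
    # Safe to fail over only when output includes ro:Primary/Secondary
    # and ds:UpToDate/UpToDate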

For more information about manual failover for High Availability clusters, see the section on Troubleshooting High Availability and Disaster Recovery Configuration.

Manual Failover Between the High Availability Cluster and the Disaster Recovery Appliance

To perform failover between the High Availability cluster and the Disaster Recovery appliance when the High Availability cluster is available, such as to test the failover process:

  1. Log in to the console of the primary appliance in the High Availability cluster as the em7admin user.
  2. Assume root privileges:

    sudo -s
  3. When prompted, enter the password for the em7admin user.
  4. Run the following command:

    coro_config

    The following prompt appears:

    1) Enable Maintenance
    2) Option Disabled
    3) Demote Cluster
    4) Stop Pacemaker			
    5) Resource Status
    6) Quit
  5. Enter "3". The following prompt appears:

    Cluster currently Primary, would you like to make it Secondary? (y/n) y
  6. Enter "y". The following output appears:
  7. Issuing command: crm_resource --resource ms_drbd_r0 --set-parameter target-role --meta --parameter-value Slave

  7. The primary database must be shut down before promoting the secondary database. To verify that MariaDB is shut down, check the MariaDB log file: /var/log/mysql/mysqld.log. The log file should contain the line "Shutdown complete".

     

  8. Run the following command and review the output:

    sudo crm status

    After the demotion, crm status shows that the VIP has its own resource group (g_vip), which is not demoted when you issue coro_config option 3 in step 5:

    2 nodes configured

    7 resources configured (2 DISABLED)

    Online: [ sl1support-ha1 sl1support-ha2 ]

    Full list of resources:

    Resource Group: g_em7

    p_fs_drbd1 (ocf::heartbeat:Filesystem):    Stopped

    mysql      (ocf::sciencelogic:mysql-systemd):      Stopped

    Resource Group: g_vip

    virtual_ip (ocf::heartbeat:IPaddr2):       Started sl1support-ha1

    drbdproxy  (lsb:drbdproxy):        Started sl1support-ha1

    Master/Slave Set: ms_drbd_r0 [p_drbd_r0]

    Slaves (target-role): [ sl1support-ha1 ]

    Master/Slave Set: ms_drbd_r0-L [p_drbd_r0-L]

    Masters: [ sl1support-ha1 ]

    Slaves: [ sl1support-ha2 ]

  9. Log in to the console of the Disaster Recovery appliance as the em7admin user.
  10. Run the following command:

    sudo -s

  11. When prompted, enter the password for the em7admin user.
  12. Run the following command:

    coro_config

    The following prompt appears:

    1) Enable Maintenance

    2) Option Disabled

    3) Promote DRBD

    4) Stop Pacemaker

    5) Resource Status

    6) Quit

    Please enter the number of your choice:

  1. Enter "3". The following prompt appears:
  2. Node currently Secondary, would you like to make it Primary? (y/n)

  1. Enter "y". The following output appears:
  2. Issuing command: crm_resource --resource ms_drbd_r0 --set-parameter target-role --meta --parameter-value Master

  3. To verify that an appliance is active after failover, ScienceLogic recommends checking the status of MariaDB, which is one of the primary processes on Database Servers. To verify the status of MariaDB, execute the following command on the newly promoted Database Server:
  4. silo_mysql -e "select 1"

    If MariaDB is running normally, you will see a '1' in the console output.

    TIP: Because larger systems can take more time to start the database, verify that MariaDB has started successfully before running the above command. To verify MariaDB has started successfully, check the MariaDB log file: /var/log/mysql/mysqld.log. The log file should contain the line: "/usr/sbin/mysqld: ready for connections.".

  16. If you are using a distributed SL1 system, you must reconfigure all Administration Portals in your system to use the new Database Server. To do this, follow the steps listed in the Reconfiguring Administration Portals section.
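The TIP above suggests confirming that MariaDB finished starting before running the liveness query; both checks can be done in one pass, as a sketch:

    # Confirm MariaDB reported readiness, then run the liveness query
    sudo grep 'ready for connections' /var/log/mysql/mysqld.log | tail -1
    silo_mysql -e "select 1"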

Failover when the High Availability Cluster is Inaccessible

To perform failover when both appliances in the High Availability cluster are inaccessible:

  1. Make sure to power down the inaccessible Database Servers. This step is required to avoid a split-brain configuration (two primary appliances). A split-brain configuration will cause your data to become corrupted.
  2. The primary database must be shut down before promoting the secondary database. To verify that MariaDB is shut down, check the MariaDB log file: /var/log/mysql/mysqld.log. The log file should contain the line "Shutdown complete".
  3. Log in to the console of the Disaster Recovery appliance as the em7admin user.
  4. Run the following command:

    sudo -s

  5. When prompted, enter the password for the em7admin user.
  6. Run the following command:

    coro_config

    The following prompt appears:

    1) Enable Maintenance

    2) Option Disabled

    3) Promote DRBD

    4) Stop Pacemaker

    5) Resource Status

    6) Quit

    Please enter the number of your choice:

  1. Enter "3". The following prompt appears:
  2. Node currently Secondary, would you like to make it Primary? (y/n)

  1. Enter "y". The following output appears:
  2. Issuing command: crm_resource --resource ms_drbd_r0 --set-parameter target-role --meta --parameter-value Master

  3. To verify that an appliance is active after failover, ScienceLogic recommends checking the status of MariaDB, which is one of the primary processes on Database Servers. To verify the status of MariaDB, execute the following command on the newly promoted Database Server:

    silo_mysql -e "select 1"

    If MariaDB is running normally, you will see a '1' in the console output.

    TIP: Because larger systems can take more time to start the database, verify that MariaDB has started successfully before running the above command. To verify MariaDB has started successfully, check the MariaDB log file: /var/log/mysql/mysqld.log. The log file should contain the line: "/usr/sbin/mysqld: ready for connections.".

  10. If you are using a distributed SL1 system, you must reconfigure all Administration Portals in your system to use the new Database Server. To do this, follow the steps listed in the Reconfiguring Administration Portals section.

For more information about manual failover on a High Availability and Disaster Recovery Stack with Three Appliances, see the section on Troubleshooting High Availability and Disaster Recovery Configuration.

Manual Failback Between the Disaster Recovery Appliance and the High Availability Cluster

To perform failback between the Disaster Recovery appliance and the High Availability cluster, perform the following steps:

  1. Log in to the console of the Disaster Recovery appliance as the em7admin user.
  2. Check the status of the appliances. To do this, enter the following at the shell prompt:

    drbdadm status

    Your output will look like this:

    r0 role:Primary

    disk:UpToDate

    peer role:Secondary

    replication:Established peer-disk:UpToDate

    To failback safely, the output should include:

    peer role:Secondary replication: Established peer-disk:UpToDate

    If your two appliances cannot communicate, your output will include "ro:Primary/Unknown ds:UpToDate/DUnknown". Before proceeding with failback, troubleshoot and resolve the communication problem.

    If your output includes "ro:Primary/Secondary", but does not include "UpToDate/UpToDate", data is being synchronized between the two appliances. You must wait until data synchronization has finished before performing failback.

  3. Assume root privileges:

    sudo -s
  4. When prompted, enter the password for the em7admin user.
  5. Run the following command:

    coro_config

    The following prompt appears:

    1) Enable Maintenance		
    2) Option Disabled
    3) Demote DRBD
    4) Stop Pacemaker			
    5) Resource Status
    6) Quit
    
    Please enter the number of your choice:
  7. Enter "3". The following prompt appears:

    Node currently Primary, would you like to make it Secondary? (y/n) y

  1. Enter "y". The following output appears:
  2. Issuing command: crm_resource --resource ms_drbd_r0 --set-parameter target-role --meta --parameter-value Slave

    The primary database must be shutdown before promoting the secondary database. To verify MariaDB is shutdown, check the MariaDB log file: /var/log/mysql/mysql.log. The log file should contain the line: "Shutdown complete".

    Before promoting the High Availability cluster to primary, ensure that the Disaster Recovery node is completely demoted. DRBD must be connected between the appliance that was formerly the HA primary and the DR node.

  8. Log in to the console of the High Availability node that you want to promote to Primary as the em7admin user. This should be the current High Availability primary node; check the output of crm status and verify that the node you are on is currently running the g_vip resources.
  9. Run the following command to verify that disk replication is connected and up to date between the High Availability nodes:

    drbdadm status

    If you are using Oracle Linux 7, the output should read:

    r0-L role:Primary

    disk:UpToDate

    peer role:Secondary

    replication:Established peer-disk:UpToDate

    On Oracle Linux 7, the resources will be r0-L and the role: should be Primary with the peer role: Secondary with replication Established and UpToDate.

    On the Oracle Linux 7 you must verify that the stacked resource is in the correct state by running the following command:

    drbdadm status --stacked

    Check that the output is similar to:

    r0 role:Secondary

    disk:UpToDate

    peer role:Secondary

    replication:Established peer-disk:UpToDate

    On a stacked resource on Oracle Linux 7, the resource will be r0 and both roles should be Secondary with replication Established and peer-disk UpToDate.


    If you are using Oracle Linux 8, the output should be similar to:

    r0 role: Secondary

    disk:UpToDate

    db2 role:Secondary

    peer-disk:UpToDate

    db3 role:Secondary

    peer-disk:UpToDate

On Oracle Linux 8, the resources will be r0 and both the secondary High Availability node and Disaster Recovery node will be listed. All roles should be Secondary and UpToDate.

 

  10. Assume root privileges:

    sudo -s
  11. When prompted, enter the password for the em7admin user.
  12. Run the following command:

    coro_config

    The following prompt appears:

    1) Enable Maintenance
    2) Option Disabled
    3) Promote DRBD
    4) Stop Pacemaker
    5) Resource Status
    6) Quit
    			
    Please enter the number of your choice:
  4. Enter "3". The following prompt appears:

    Node currently Secondary, would you like to make it Primary? (y/n)
  1. Enter "y". The following output appears:
  2. Issuing command: crm_resource --resource ms_drbd_r0 --set-parameter target-role --meta --parameter-value Master

  15. To verify that an appliance is active after failback, ScienceLogic recommends checking the status of MariaDB, which is one of the primary processes on Database Servers. To verify the status of MariaDB, enter the following command on the newly promoted Database Server:

     

    silo_mysql -e "select 1"

    If MariaDB is running normally, you will see a '1' in the console output.

    TIP: Because larger systems can take more time to start the database, verify that MariaDB started successfully before running the above command. To verify MariaDB started successfully, check the MariaDB log file: /var/log/mysql/mysqld.log. The log file should contain the line: "/usr/sbin/mysqld: ready for connections.".

  16. If you are using a distributed SL1 system, you must reconfigure all Administration Portals in your system to use the new Database Server. To do this, follow the steps listed in the Reconfiguring Administration Portals section.

Reconfiguring Administration Portals

If you are using a Distributed system and you did not configure a virtual IP address, you must configure all Administration Portals in your system to use the new Primary Database Server after performing failover or failback. To configure an Administration Portal to use the new Database Server, perform the following steps in the Web Configuration Utility:

  1. You can log in to the Web Configuration Utility using any web browser supported by SL1. The address of the Web Configuration Utility is in the following format:

    https://<ip-address-of-appliance>:7700

    Enter the address of the Web Configuration Utility in the address bar of your browser, replacing ip-address-of-appliance with the IP address of the Administration Portal.

  1. Log in as the "em7admin" user with the password you configured using the Setup Wizard. The Configuration Utility page appears.
  2. Click the Device Settings button. The Settings page appears:

The Settings page

  4. On the Settings page, enter the following:
    • Database IP Address. The IP address of the new Primary ScienceLogic Database Server.
  5. Click the Save button. You may now log out of the Web Configuration Utility.
  6. Repeat these steps for each Administration Portal in your system.

Verifying that a Database Server is Primary

To verify that your network is configured correctly and will allow the newly active Database Server to operate correctly, check the following system functions:

  • If you use Active Directory or LDAP authentication, log in to the user interface using a user account that uses Active Directory or LDAP authentication.

  • In the user interface, verify that new data is being collected.
  • If your system is configured to send notification emails, confirm that emails are being received as expected. To test outbound email, create or update a ticket and ensure that the ticket watchers receive an email.

NOTE: On the Behavior Settings page (System > Settings > Behavior), if the field Automatic Ticketing Emails is set to Disabled, no assignees or watchers will receive automatic email notifications about any tickets. By default, the field is set to Enabled.

  • If your system is configured to receive emails, confirm that emails are being received correctly. To test inbound email, send a test email that will trigger a "tickets from Email" policy or an "events from Email" policy.