Troubleshoot HA/DR clusters

This topic describes common troubleshooting issues and how to solve them.

In this topic:

DR clusters: primary appliance failed
DR clusters: secondary appliance failed
Split-brain situations
Current synchronization operation canceled
Manage nodes automatically removed from clusters
Forcibly remove a node from a cluster
Collect cluster logs for AlgoSec technical support

DR clusters: primary appliance failed

If you have a DR cluster and your primary appliance has failed, perform a manual failover to the secondary appliance by switching appliance roles.

For details, see Switch appliance roles.

DR clusters: secondary appliance failed

If you have a DR cluster and your secondary appliance has failed, do the following:

Fix the secondary appliance.
Re-build your cluster. For details, see Build a cluster.

Split-brain situations

If you've received an email notification that a split-brain situation was detected, do the following:

Break the cluster. For details, see Break a cluster.
Examine any FireFlow tickets and AFA reports on each appliance, and determine which appliance has the most recent data.

Note: If the data on both appliances seem to be equally recent, we recommend choosing the primary appliance.
Re-build the cluster from the appliance with the most recent data. For details, see Build a cluster.
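
The choice in the steps above can be sketched as a small comparison. This is an illustration only and not part of ASMS: the function name, the sample timestamps, and the tolerance parameter are hypothetical, and the sketch assumes you have already noted by hand the time of the most recent FireFlow ticket or AFA report on each appliance.

```python
from datetime import datetime, timedelta


def choose_rebuild_source(primary_latest, secondary_latest,
                          tolerance=timedelta(0)):
    """Return which appliance to re-build the cluster from.

    Both arguments are timestamps of the most recent data found on each
    appliance (collected manually; hypothetical inputs). If the two are
    within `tolerance` of each other, they are treated as equally recent
    and the primary is chosen, following the note above.
    """
    if abs(primary_latest - secondary_latest) <= tolerance:
        return "primary"
    return "primary" if primary_latest > secondary_latest else "secondary"


if __name__ == "__main__":
    # Sample values only; replace them with the times you observed.
    primary = datetime(2024, 5, 1, 9, 30)
    secondary = datetime(2024, 5, 1, 11, 0)
    print(choose_rebuild_source(primary, secondary))
```

A non-zero tolerance is one possible way to express "equally recent" when the latest items on the two appliances are only minutes apart.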

Current synchronization operation canceled

If a new synchronization starts while the previous one is still running, the new synchronization is automatically canceled, and the system sends an email notification.

To resolve this issue, configure synchronizations to run less frequently. For details, see Configure HA/DR parameters.
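
To see why a longer interval helps, the behavior described above can be modeled in a few lines. This is a simplified sketch and not ASMS code: it assumes every synchronization takes the same fixed time, and the function name and the minute values are hypothetical.

```python
def canceled_sync_starts(interval, duration, runs):
    """Return the scheduled start times that would be canceled.

    A synchronization is scheduled every `interval` minutes and is
    assumed to take `duration` minutes. A scheduled run is canceled
    when the previous one is still running at its start time.
    """
    canceled = []
    busy_until = 0
    for i in range(runs):
        start = i * interval
        if start < busy_until:
            # The previous synchronization has not finished yet.
            canceled.append(start)
        else:
            busy_until = start + duration
    return canceled


if __name__ == "__main__":
    # A 25-minute synchronization scheduled every 10 minutes overlaps.
    print(canceled_sync_starts(interval=10, duration=25, runs=6))
    # The same synchronization scheduled every 30 minutes does not.
    print(canceled_sync_starts(interval=30, duration=25, runs=6))
```

In this model, cancellations stop once the interval is at least as long as the synchronization itself, which is the effect of running synchronizations less frequently.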

Manage nodes automatically removed from clusters

ASMS automatically removes a secondary node from the cluster in the following scenarios:

If less than 10% free disk space remains on the Primary data partition.

In this case, a warning message will have been sent by email and to the Issues Center when the Primary was found to have less than 20% free disk space.
If the secondary node is unresponsive for more than 12 hours.

In this case, a warning message will have been sent by email and to the Issues Center when the secondary node had been unresponsive for 6 hours.

When the node is removed, the Central Manager is left as a single-node cluster.
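
The thresholds above can be summarized in a short sketch for your own monitoring scripts. This is not how ASMS implements the checks: the function names and state labels are hypothetical, and the path passed to free_percent is a placeholder that you would replace with the mount point of the data partition.

```python
import shutil

# Thresholds taken from the scenarios described above.
DISK_WARNING_PERCENT = 20         # warning when free space drops below this
DISK_REMOVAL_PERCENT = 10         # secondary removed when free space drops below this
UNRESPONSIVE_WARNING_HOURS = 6    # warning once unresponsive for this long
UNRESPONSIVE_REMOVAL_HOURS = 12   # secondary removed when unresponsive for longer


def disk_state(free_percent_value):
    """Classify free disk space as 'ok', 'warning', or 'removal'."""
    if free_percent_value < DISK_REMOVAL_PERCENT:
        return "removal"
    if free_percent_value < DISK_WARNING_PERCENT:
        return "warning"
    return "ok"


def connectivity_state(hours_unresponsive):
    """Classify secondary node downtime as 'ok', 'warning', or 'removal'."""
    if hours_unresponsive > UNRESPONSIVE_REMOVAL_HOURS:
        return "removal"
    if hours_unresponsive >= UNRESPONSIVE_WARNING_HOURS:
        return "warning"
    return "ok"


def free_percent(path):
    """Return the percentage of free space on the file system holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    # "/" is a placeholder; use the data partition's mount point instead.
    print(disk_state(free_percent("/")))
    print(connectivity_state(7))
```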

To continue with your cluster, first handle your disk space or connectivity issue, and then re-build the cluster as follows:

Disk space issues

If your node was removed for a disk space issue, do the following:

Log in to the Central Manager and access the Administration menu.
Enter 13 to re-build your cluster and enter the details for your secondary node.

For more details, see Connect to the Administration Interface and Build a cluster.

Connectivity issues

If your node was removed because of a connectivity issue, the secondary node is still configured to send data to the primary node when it becomes available again.

Do the following:

Forcibly remove the cluster configuration from the secondary node, and from any other nodes in the cluster.

For more details, see Forcibly remove a node from a cluster.
Access the Central Manager node and rebuild the cluster.

Forcibly remove a node from a cluster

This procedure describes how to forcibly remove a node from a cluster, which is sometimes recommended after system or connectivity errors have occurred.

Note: Before you start, we recommend gathering any logs you may need, because they are overwritten when the cluster configuration is removed.

If you have been advised to forcibly remove a node from a cluster, do the following:

Log in to the node you want to remove and access the Administration (algosec_conf) menu.
Enter 13 to access the HA/DR configuration.
Enter 1 to forcibly remove the cluster configuration from the node.

Note: This option appears only when the system detects that an error has occurred.

If this option does not appear, try breaking the cluster using the standard procedure. For details, see Break a cluster.
Repeat the previous steps on all nodes in the cluster, including the Central Manager.
Log in to the Central Manager and access the Administration (algosec_conf) menu.
Enter 13 to access the HA/DR configuration and rebuild your cluster.

For more details, see Connect to the Administration Interface and Build a cluster.

Collect cluster logs for AlgoSec technical support

If you've been requested to send cluster logs to AlgoSec technical support for further analysis, do the following:

From the primary or secondary appliance's administration interface main menu, select option 13.
In the HA/DR sub-menu, select Collect HA logs. This is option 3 when there is no cluster configured and option 5 when a cluster is configured.

A *.tar file containing all of the relevant logs is created in the appliance’s /tmp directory.
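
If several archives have accumulated in /tmp, a short script can help locate the one that was just created. This is a minimal sketch that assumes only what is stated above, namely that the logs are written as a *.tar file under /tmp. The exact file name is not known here, so the hypothetical newest_tar function simply picks the most recently modified match.

```python
from pathlib import Path


def newest_tar(directory="/tmp"):
    """Return the most recently modified *.tar file in `directory`, or None."""
    candidates = [p for p in Path(directory).glob("*.tar") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    archive = newest_tar()
    print(archive if archive else "No *.tar file found in /tmp")
```

Check the file's timestamp against the time you ran the log collection before sending it, because other tools may also write *.tar files to the same location.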