Troubleshoot HA/DR clusters
This topic describes common HA/DR cluster issues and how to resolve them.
DR clusters: primary appliance failed
If you have a DR cluster and your primary appliance has failed, perform a manual failover to the secondary appliance by switching appliance roles.
For details, see Manual Failover for DR.
DR clusters: secondary appliance failed
If you have a DR cluster and your secondary appliance has failed, do the following:
- Fix the secondary appliance.
- Re-build your cluster. For details, see Build a cluster.
Split-brain situations
If you've received an email notification that a split-brain situation was detected, do the following:
- Break the cluster. For details, see Break a cluster.
- Examine any FireFlow tickets and AFA reports on each appliance, and determine which appliance has the most recent data.
  Note: If the data on both appliances seems to be equally recent, we recommend choosing the primary appliance.
- Re-build the cluster from the appliance with the most recent data. For details, see Build a cluster.
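The decision rule in the steps above can be sketched as follows. This is only an illustration: the function name and its inputs are hypothetical, and the timestamps are assumed to be values you noted manually while examining the FireFlow tickets and AFA reports on each appliance. ASMS does not provide this function.

```python
from datetime import datetime


def rebuild_source(primary_latest, secondary_latest):
    """Pick the appliance to rebuild the cluster from.

    Both arguments are datetimes of the most recent data found on each
    appliance (gathered manually). A tie goes to the primary appliance,
    as recommended in the note above.
    """
    if secondary_latest > primary_latest:
        return "secondary"
    return "primary"
```

For example, `rebuild_source(datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 0))` returns `"primary"`, because equally recent data favors the primary appliance.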
Current synchronization operation canceled
If a new synchronization starts while the previous one is still running, the system automatically cancels the new synchronization and sends an email notification.
To resolve this issue, configure synchronizations to run less frequently. For details, see Configure HA/DR parameters.
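To see why a longer interval resolves the issue, the following minimal sketch simulates a schedule in which each synchronization takes a fixed time. The function name, its parameters, and the fixed duration are assumptions made for illustration. They do not correspond to actual ASMS settings or internals.

```python
def simulate_schedule(interval_minutes, duration_minutes, runs):
    """Return (start_minute, status) for each scheduled synchronization.

    A synchronization that is due to start while the previous one is
    still running is marked "canceled"; otherwise it is marked "run".
    """
    results = []
    busy_until = 0  # minute at which the currently running sync finishes
    for i in range(runs):
        start = i * interval_minutes
        if start < busy_until:
            results.append((start, "canceled"))
        else:
            results.append((start, "run"))
            busy_until = start + duration_minutes
    return results
```

With a 30-minute interval and a 45-minute synchronization, every second run is canceled. With a 60-minute interval, no run is canceled.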
Manage nodes automatically removed from clusters
ASMS automatically removes the secondary node from a cluster in the following scenarios:
- The Primary data partition has less than 10% free disk space.
  In this case, a warning message was already sent by email and to the Notification Center when the Primary was found to have less than 20% free disk space.
- The secondary node is unresponsive for more than 12 hours.
  In this case, a warning message was already sent by email and to the Notification Center when the secondary node had been unresponsive for 6 hours.
When the node is removed, the Central Manager is left as a single-node cluster.
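The thresholds above can be summarized in a short sketch. It only restates the documented limits and is not how ASMS implements the check. The function and constant names are hypothetical, and the mount point of the data partition is not specified in this topic, so you must supply it yourself.

```python
import shutil

# Thresholds as stated in this topic; all names here are illustrative only.
DISK_WARN_PERCENT = 20
DISK_REMOVE_PERCENT = 10
UNRESPONSIVE_WARN_HOURS = 6
UNRESPONSIVE_REMOVE_HOURS = 12


def free_percent(path):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def expected_action(primary_free_percent, secondary_unresponsive_hours):
    """Return 'remove', 'warn', or 'ok' according to the documented thresholds."""
    if (primary_free_percent < DISK_REMOVE_PERCENT
            or secondary_unresponsive_hours > UNRESPONSIVE_REMOVE_HOURS):
        return "remove"
    if (primary_free_percent < DISK_WARN_PERCENT
            or secondary_unresponsive_hours >= UNRESPONSIVE_WARN_HOURS):
        return "warn"
    return "ok"
```

For example, `expected_action(15, 0)` returns `"warn"`, and `expected_action(50, 13)` returns `"remove"`. Running `free_percent` on the Primary against its data partition gives a value you can compare with the 20% and 10% limits before ASMS acts.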
To continue with your cluster, first handle your disk space or connectivity issue, and then re-build the cluster as follows:
Disk space issues
If your node was removed for a disk space issue, do the following:
For more details, see
Connectivity issues
If your node was removed for a connectivity issue, when the secondary node is available again, it will still be configured to send data to the primary node. Do the following:
Forcibly remove a node from a cluster
This procedure describes how to forcibly remove a node from a cluster, which is sometimes recommended after system or connectivity errors have occurred.
Note: Before you start, we recommend gathering any logs you may need, because they are overwritten when the cluster configuration is removed.
If you have been advised to forcibly remove a node from a cluster, do the following:
- Log in to the node you want to remove and access the Administration (algosec_conf) menu.
- Enter 13 to access the HA/DR configuration.
- Enter 1 to forcibly remove the cluster configuration from the node.
  Note: This option appears only when the system detects that an error has occurred.
  If this option does not appear, you might be trying to break the cluster using the standard procedure. For details, see Break a cluster.
- Repeat steps 2-4 on all nodes in the cluster, including the Central Manager.
- Log in to the Central Manager and access the Administration (algosec_conf) menu.
- Enter 13 to access the HA/DR configuration and rebuild your cluster.
For more details, see
Collect cluster logs for AlgoSec technical support
If you've been requested to send cluster logs to AlgoSec technical support for further analysis, do the following:
- From the primary or secondary appliance's administration interface main menu, select option 13.
- In the HA/DR sub-menu, select Collect HA logs. This is option 3 when there is no cluster configured and option 5 when a cluster is configured.
A *.tar file containing all of the relevant logs is created in the appliance’s /tmp directory.
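Because this topic does not state the name of the generated file, the following minimal sketch may help you locate it. It assumes only that the log archive is the most recently modified *.tar file in /tmp. The function name is hypothetical and is not part of ASMS.

```python
from pathlib import Path


def newest_tar(directory="/tmp"):
    """Return the most recently modified *.tar file in directory, or None."""
    candidates = [p for p in Path(directory).glob("*.tar") if p.is_file()]
    if not candidates:
        return None
    # Assumption: the archive just created by "Collect HA logs" is the newest one.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Run `newest_tar()` on the appliance right after collecting the logs, and confirm the file's timestamp before sending it to AlgoSec technical support.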