Control Node: Troubleshooting
Intended for use with Cassatt Active Response V5.0.
The following material outlines problems you may encounter with your system's control nodes, along with the steps to solve those problems.
Failover continually reboots control nodes
Description
Failover continually reboots the control nodes, alternating between the two.
Resolution
A Cassatt Active Response service is corrupt, so the control nodes begin failover. As the control nodes fail over, the corrupt service is restarted and the control node initiates another failover. The result is repeated failover back and forth between the two nodes.
To fix this problem, you will have to intervene and boot the control nodes in single-user mode via the GRUB Boot Screen, use chkconfig to disable clumanager, and then continue with a normal boot. You will then have to resolve the problem that was causing the critical service to fail. Finally, use chkconfig to enable clumanager and verify the cluster status.
The following procedure will show you how to boot the control nodes in single user mode, correct the problem, and restart the clumanager service:
- Shut down and reboot the control nodes.
- When booting the control nodes, the GRUB Boot Control Screen will be displayed.
- Quickly press the 'e' key to edit the boot parameters.
You must press the 'e' key before the default system boot timeout expires. If you miss it, you will have to shut down and reboot the control node to perform this step again.
- GRUB will display the lines and boot parameters for the boot of the default system. Using the arrow keys, highlight the boot line that begins with 'kernel'.
- Press the 'e' key to edit this line.
- Modify this line to boot the kernel in single user mode by adding '-s' to the end of the line. For example, you might see the following on the screen:
grub edit> kernel /vmlinuz-2.4.21-20.ELsmp ro root=LABEL=/ -s
- Press the 'Enter' key when you have successfully modified the kernel line.
- The GRUB Boot Screen will now display the lines and boot parameters for booting the default system with your modification.
- Press the 'b' key to boot the kernel in single user mode.
At this point, normal boot messages will be displayed on the screen until the system is booted in single user mode. You will see a prompt similar to the following:
sh-2.05b#
- Use chkconfig to disable clumanager for runlevels 2, 3, 4, and 5. Enter the following at the single user prompt:
sh-2.05b# chkconfig --level 2345 clumanager off
- Then use the init command to switch to runlevel 5:
sh-2.05b# init 5
The system will boot to runlevel 5 but the clumanager will not be started. Boot the second control node following these same steps. Once the control nodes have booted, resolve the problem that led to the continual reboots or failovers.
- When the problem has been resolved, use chkconfig to enable clumanager and then restart the clumanager service.
- Use chkconfig to enable clumanager for runlevels 2, 3, 4, and 5:
[root@localhost]# chkconfig --level 2345 clumanager on
- Start the clumanager service:
[root@localhost]# service clumanager start
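Before rebooting or restarting anything, it can help to confirm that the chkconfig change took effect. The following is a minimal sketch, not part of the Cassatt procedure: it assumes the usual one-line format printed by `chkconfig --list <service>`, and the helper name `enabled_runlevels` is hypothetical.

```shell
# Sketch: print the runlevels in which a service is enabled.
# Reads one line of `chkconfig --list <service>` output on stdin, e.g.
#   clumanager  0:off  1:off  2:on  3:on  4:on  5:on  6:off
# enabled_runlevels is a hypothetical helper name, not a chkconfig feature.
enabled_runlevels() {
    awk '{
        # Field 1 is the service name; the rest are "level:state" pairs.
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[2] == "on") printf "%s", kv[1]
        }
    } END { print "" }'
}
```

For example, `chkconfig --list clumanager | enabled_runlevels` should print an empty line after clumanager has been disabled and `2345` after it has been enabled again. To verify the cluster status once clumanager is running, the `clustat` utility that normally ships with clumanager can be used, assuming it is present on your installation.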
Control node cluster remains disabled after lost network connection
Description
The control node cluster is unable to return to its normal state after both control nodes lose network connectivity.
Resolution
When both control nodes lose connectivity, neither node is recognized as being part of the Cassatt Active Response cluster. Because the control nodes can no longer communicate with one another, clumanager isolates both nodes to prevent more than one node from being active at a time. By design, clumanager marks a service as disabled if it cannot bring the service up on either node, so it then shows collage-core in a disabled state. To correct this problem, follow these steps:
- Fix the network connection problem.
- Bring both nodes out of isolation by manually enabling the service with the "clusvcadm -e" command. Specifically, use:
clusvcadm -e collage-core
Each control node will return to the state it was in before the network connection was lost, either active or standby, and the Cassatt Active Response cluster will return to a normal state.
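The two steps above can be combined into a small guard so that the service is only re-enabled once the network is actually back. This is a sketch under stated assumptions: the function name `enable_core_when_reachable` and its peer-host argument are hypothetical, and a single successful ping to the other control node is used as a rough stand-in for "the network problem is fixed". Only `clusvcadm -e collage-core` comes from the procedure itself.

```shell
# Sketch: re-enable collage-core only once the peer control node answers.
# enable_core_when_reachable and its <peer-host> argument are hypothetical;
# only `clusvcadm -e collage-core` is taken from the procedure above.
enable_core_when_reachable() {
    peer="$1"
    if [ -z "$peer" ]; then
        echo "usage: enable_core_when_reachable <peer-host>" >&2
        return 2
    fi
    # Send one ICMP echo request; failure suggests the network is still down.
    if ping -c 1 "$peer" >/dev/null 2>&1; then
        clusvcadm -e collage-core
    else
        echo "peer $peer is unreachable; fix the network connection first" >&2
        return 1
    fi
}
```

Run it as root on a control node with the other control node's host name or address as the argument. A reachable peer does not guarantee that every network path the cluster depends on is healthy, so treat a successful ping as a necessary check rather than a sufficient one.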