know-how:

Control Node: Troubleshooting

Intended for use with Cassatt Active Response Standard Edition, Premium Edition and Data Center Edition V5.1.

The following material outlines problems you may encounter with your system's control nodes, along with the steps to solve those problems.

Failover continually reboots control nodes

Description

Failover continually reboots the control nodes, alternating between the two.

Resolution

This problem occurs when a Cassatt Active Response service is corrupt. The corrupt service causes the control nodes to begin failover; however, as the control nodes fail over, the corrupt service is restarted and the control node that took over initiates another failover. The result is repeated failover back and forth between the two nodes.

To fix this problem, boot each control node in single-user mode from the GRUB Boot Screen, use chkconfig to disable clumanager, and then continue with a normal boot. Next, resolve the problem that was causing the critical service to fail. Finally, use chkconfig to enable clumanager, start the service, and verify the cluster status.

The following procedure shows how to boot the control nodes in single-user mode, correct the problem, and restart the clumanager service:

  1. Shut down and reboot the control nodes.
  2. Wait for the GRUB Boot Screen to appear as the control node boots.
  3. Quickly press the 'e' key to edit the boot parameters.

    You must press the 'e' key before the default system boot timeout expires. If you miss the timeout, shut down the control node and reboot it to try this step again.

  4. GRUB displays the lines and boot parameters for the boot of the default system. Using the arrow keys, highlight the boot line that begins with 'kernel'.
  5. Press the 'e' key to edit this line.
  6. Modify this line to boot the kernel in single-user mode by adding '-s' to the end of the line. For example, you might see the following on the screen:

    grub edit> kernel /vmlinuz-2.4.21-20.ELsmp ro root=LABEL=/ -s
  7. Press the 'Enter' key when you have successfully modified the kernel line.
  8. The GRUB Boot Screen will now display the lines and boot parameters for booting the default system with your modification.
  9. Press the 'b' key to boot the kernel in single-user mode.

    Normal boot messages are displayed on the screen until the system reaches single-user mode. You will then see a prompt similar to the following:

    sh-2.05b#
  10. Use chkconfig to disable clumanager for runlevels 2, 3, 4, and 5. Enter the following at the single-user prompt:

    sh-2.05b# chkconfig --level 2345 clumanager off
  11. Use the init command to switch to runlevel 5:

    sh-2.05b# init 5

    The system enters runlevel 5, but clumanager is not started. Boot the second control node by following these same steps. Once both control nodes have booted, resolve the problem that led to the continual failovers.
  12. When the problem has been resolved, use chkconfig to enable clumanager and then start the clumanager service.
    1. Use chkconfig to enable clumanager for runlevels 2, 3, 4, and 5:

      [root@localhost]# chkconfig --level 2345 clumanager on

    2. Start the clumanager service:

      [root@localhost]# service clumanager start
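
Before rebooting or restarting the service, it can help to confirm that the chkconfig change actually took effect. The following is a minimal sketch, not part of the Cassatt tooling: `runlevels_enabled` is a hypothetical helper that inspects one line of `chkconfig --list` output (assumed to use the usual `N:on` / `N:off` columns) and succeeds only if the service is on for runlevels 2 through 5.

```shell
# Hypothetical helper (not shipped with Cassatt Active Response).
# Takes one line of `chkconfig --list <service>` output, for example:
#   clumanager  0:off 1:off 2:on 3:on 4:on 5:on 6:off
# Returns 0 only if runlevels 2, 3, 4, and 5 are all marked "on".
runlevels_enabled() {
  line=$1
  for lvl in 2 3 4 5; do
    case "$line" in
      *"${lvl}:on"*) ;;      # this runlevel is enabled, keep checking
      *) return 1 ;;         # this runlevel is off or missing
    esac
  done
  return 0
}

# Example use on a control node (left commented out so the block has no side effects):
#   if runlevels_enabled "$(chkconfig --list clumanager)"; then
#     echo "clumanager is enabled for runlevels 2-5"
#   else
#     echo "clumanager is still disabled for at least one runlevel"
#   fi
```

After step 10 the helper should report failure (the service is off), and after step 12 it should report success. It checks only the boot-time configuration; cluster status itself still has to be verified with the cluster manager's own status tooling.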


Control node cluster remains disabled after lost network connection

Description

The control node cluster is unable to return to its normal state after both control nodes lose network connectivity.

Resolution

When both control nodes lose network connectivity, neither node is recognized as being part of the Cassatt Active Response cluster. Because the control nodes can no longer communicate with one another, Cassatt Active Response shows collage-core in a disabled state if it cannot bring up the service on either node. To correct this problem, follow these steps:

  1. Fix the network connection problem.
  2. Bring both nodes out of isolation by manually enabling the service:

    /opt/cassatt/bin/cccoreservice start

Each control node returns to the state it was in before the network connection was lost, either active or standby, and the Cassatt Active Response cluster returns to a normal state.
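
Because the service should only be started once the network is working again, the two steps above can be combined into a small guard. This is a rough sketch under stated assumptions: `retry_until` is a hypothetical helper written for this example, `PEER` is a placeholder for the other control node's hostname, and `ping -c 1` is assumed to be an adequate connectivity check in your environment.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Pause between attempts, but not after the last one.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example use on each control node (commented out; PEER is a placeholder):
#   PEER=other-control-node
#   if retry_until 5 10 ping -c 1 "$PEER"; then
#     /opt/cassatt/bin/cccoreservice start
#   else
#     echo "Peer still unreachable; fix the network before starting the service" >&2
#   fi
```

The guard only avoids starting the service while the peer is still unreachable; it does not replace step 1, and the attempt count and delay shown are arbitrary values.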
