Understanding Control Node Failover

Intended for use with Cassatt Active Response Standard Edition, Premium Edition and Data Center Edition V5.1.

While Cassatt Active Response requires only one control node to operate, using dual-control nodes provides built-in failover capability: if one node goes down, the other assumes operational control. For this reason, most enterprise environments opt for a dual-control node configuration. Once your Cassatt Active Response system is up and running, you might want to know how control-node failover works. If so, you've come to the right article. If you want the low-level setup details, click the doc index link on the navigation bar and find the Control Node Setup documentation for your make and model of control node.

Physical layout

To begin, let's look at the layout of physical components. To support failover, dual-control nodes require two network switches and access to a common, shared disk location, as follows:

[Figure: physical components]

Let's take a closer look at each component.

Network switches

The key point about the network switches is the redundant connections, so that if one network connection goes down, another is in place to pick up the traffic. For a couple of reasons, my friends in Quality Engineering recommend a gigabit-ethernet (1000BASE-T) switch that supports link aggregation and that enables you to switch off IGMP snooping:

  1. The gigabit network reduces latency in the network connection between the two nodes.
  2. Switching off IGMP snooping (a network optimization technique) is especially important. The underlying failover software (I'll say more about this soon) uses multicast to communicate between the two control nodes. IGMP snooping limits multicast packets in such a way that the two control nodes may not be able to communicate with each other. This could leave the failover system in a confused state that is best avoided.


Control node hardware and software

Your dual-control node configuration should have the following:

  • 2 control nodes of the same make/model
  • 2 NICs per node
  • Remote management controllers with supported firmware (required to power nodes on and off)
  • SCSI disk drives

In terms of software, the control nodes need to run Red Hat Enterprise Linux AS (RHELAS) 4 Update 5. (As new versions of RHELAS become available, check back to see if the Cassatt Quality team has verified they work as expected on the control nodes.)

Failover is based on the GNU General Public License (GPL) clumanager software. Cassatt adds plugins to the GPL clumanager and repackages it as cc_clumanager for use with Cassatt Active Response's dual-control node failover. These plugins enable clumanager to communicate with and manage different types of remote management controllers that are supported in Cassatt Active Response. For the sake of discussion, I'll generically refer to clumanager in this article, but remember that I mean Cassatt's repackaged version, cc_clumanager.

Red Hat also includes a version of the GPL clumanager. If you are doing an OS upgrade on the control nodes, do not install the Red Hat clumanager packages. These conflict with the special Cassatt Active Response cc_clumanager functionality.

You don't need to do any special clumanager software configuration to set up control node failover. The Cassatt Active Response installation program takes care of that for you by configuring the two control nodes into a two-node clumanager cluster.

In case you're thinking that Cassatt uses the clumanager software to manage the entire Cassatt Active Response environment, stop. The use of clumanager has nothing to do with the way Cassatt Active Response supervises the many nodes that operate in the larger Cassatt Active Response environment. The clumanager software is used only for control node failover.


Shared disk

Cassatt Active Response has two distinct needs for shared disk storage. One is for storing Cassatt Active Response data, the system database, and software images. The other is for storing state information about the two control nodes. As you might have guessed, the latter is the focus of this discussion.

You need to be explicitly aware of a few details about this shared disk. First of all, you can use either a SAN or a dedicated dual-ported disk:

[Figure: supported disk devices for failover]

About NAS

If your site storage solution is NAS, then you must also have a dedicated dual-ported disk to support dual-control node failover.

Whichever disk solution you use, you need to configure two raw partitions (that is, character-based disk device files rather than block-based disk device files).

Raw device access

These raw partitions are used for control node state information, service state information, and configuration information. The clumanager software periodically records state information to the shared disk and ensures data is consistent on both of the raw partitions.

You define these raw partitions in the /etc/sysconfig/rawdevices file on each control node like this:

/dev/raw/raw1 /dev/device1 
/dev/raw/raw2 /dev/device2

Substitute real device names for device1 and device2. A sample /etc/sysconfig/rawdevices file looks something like this:

# raw device bindings
# format: <rawdev> <major> <minor>
#         <rawdev> <blockdev>
/dev/raw/raw1 /dev/sdb1
/dev/raw/raw2 /dev/sdb2

These entries should be the same on both control nodes. For all the nitty-gritty details, look at Installing Cassatt Active Response and Control Node Setup.
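
Because the bindings must match on both control nodes, a quick comparison can catch drift between the two files. The following is a minimal sketch, not part of Cassatt Active Response: the function names raw_bindings and same_raw_bindings are hypothetical, and it assumes you have copied the other node's file to a local path of your choosing.

```shell
#!/bin/sh
# Print the active bindings from a file in /etc/sysconfig/rawdevices format,
# skipping comment and blank lines and normalizing whitespace.
raw_bindings() {
    # $1: path to the rawdevices-style file
    awk 'NF && $1 !~ /^#/ { $1 = $1; print }' "$1" | sort
}

# Succeed (exit 0) only if both files define the same set of bindings.
same_raw_bindings() {
    [ "$(raw_bindings "$1")" = "$(raw_bindings "$2")" ]
}

# Example, where /tmp/rawdevices.node2 is a hypothetical copy of the
# other control node's file:
#   same_raw_bindings /etc/sysconfig/rawdevices /tmp/rawdevices.node2 \
#       || echo "raw device bindings differ between control nodes"
```

The comparison ignores comments, blank lines, spacing, and line order, so only a real difference in a binding is reported.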

These partitions are often referred to as the primary and the shadow partitions, and collectively as the quorum partitions, where quorum indicates the members of the failover cluster. Each raw partition needs to be a minimum of 100 Mbytes. (If you are really pinched for disk space, clumanager says 10-Mbyte partitions are sufficient, but I prefer to err on the high side.)
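
To confirm that a partition meets the 100-Mbyte guideline, you can compare its size in bytes against the minimum. This is a small sketch under two assumptions: the function name quorum_size_ok is hypothetical, and 100 Mbytes is read as 100 x 1024 x 1024 bytes.

```shell
#!/bin/sh
# Recommended minimum size for each quorum partition.
# Assumption: "100 Mbytes" is taken to mean 100 * 1024 * 1024 bytes.
MIN_QUORUM_BYTES=$((100 * 1024 * 1024))

# Succeed (exit 0) if a partition of $1 bytes meets the recommended minimum.
quorum_size_ok() {
    [ "$1" -ge "$MIN_QUORUM_BYTES" ]
}

# Example, run as root, using the sample device name from above and
# assuming your blockdev supports --getsize64:
#   quorum_size_ok "$(blockdev --getsize64 /dev/sdb1)" \
#       || echo "/dev/sdb1 is smaller than the recommended minimum"
```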

Oh, and did I say that it's best if this shared disk is infrequently accessed for any other use? That's just to ensure that an I/O bottleneck doesn't get in the way of the failover system maintaining its state information.


The cluster service: collage-core

The clumanager software on the active control node monitors the collage-core service:

[Figure: clumanager - collage-core service relationship]

In clumanager terms, the collage-core service is known as the cluster service—the applications and services you want to guarantee are running. The collage-core service is automatically configured during the Cassatt Active Response installation to reference the operating system and Cassatt Active Response software and database, DNS services, DHCP, the Controller, the control node virtual IP address, and the mounted /cassatt file system. The clumanager software records collage-core service state to the shared disk:

[Figure: clumanager - collage-core service - raw device relationship]

The clumanager software runs in a hot-standby configuration, in which the primary node runs the cluster service and the standby node takes over only if the primary node fails. If a hardware or software failure occurs, clumanager automatically restarts collage-core on the standby node.


Failure process

In general, failure is detected by one of the following means:

  • Via a collage-core service failure
  • Via a heartbeat daemon failure
  • Via a quorum daemon failure

Let's take a closer look at each.

Collage-core service failure

The clumanager software regularly checks on the status of the collage-core service by running the /etc/init.d/collage-core script. If the script indicates collage-core is failing, then:

  • The clumanager program stops the collage-core service and tries to restart it on the currently active control node.
  • If the service is still failing, the failure detection daemon initiates failover and starts the collage-core service on the standby control node.
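
The two-step escalation above can be sketched as a small decision function. This is only an illustration of the logic as described in this article, not clumanager's actual code; the function name next_action and its output labels are hypothetical.

```shell
#!/bin/sh
# Decide the next step from two status-check results
# (0 = healthy, nonzero = failing).
# $1: result of the regular status check
# $2: result of the check made after a restart on the active node
#     (ignored when $1 is 0)
next_action() {
    if [ "$1" -eq 0 ]; then
        echo "none"                  # service is healthy, nothing to do
    elif [ "$2" -eq 0 ]; then
        echo "restarted-locally"     # restart on the active node was enough
    else
        echo "failover-to-standby"   # still failing, move to the standby
    fi
}
```

The point of the ordering is that a local restart is always tried first, so a transient service problem does not cost you a full failover.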

Heartbeat daemon failure

The Cassatt Active Response dual-control node configuration of clumanager uses a watchdog timer to determine whether the active control node has failed. Every 10 seconds, a heartbeat daemon on the active control node sends a packet over the Ethernet interface to the standby control node to indicate there's a pulse. If the standby control node does not receive a packet for three successive periods (30 seconds), the failure detection daemon starts triage and initiates failover.
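
The timing works out to 3 x 10 = 30 seconds of silence before failover starts. Here is a minimal sketch of that watchdog check, assuming timestamps in epoch seconds; the function and variable names are hypothetical and this is not how clumanager itself is implemented.

```shell
#!/bin/sh
HEARTBEAT_INTERVAL=10   # seconds between heartbeat packets
MISSED_LIMIT=3          # successive missed periods before failover

# Succeed (exit 0) if the active node should be considered failed.
# $1: time the last heartbeat packet was received, in epoch seconds
# $2: current time, in epoch seconds
heartbeat_expired() {
    elapsed=$(( $2 - $1 ))
    [ "$elapsed" -ge $(( HEARTBEAT_INTERVAL * MISSED_LIMIT )) ]
}

# Example, where last_seen holds the receive time of the latest packet:
#   heartbeat_expired "$last_seen" "$(date +%s)" && echo "initiate failover"
```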

Quorum daemon failure

The clumanager software regularly monitors access to the shared disk via the quorum daemon. Any time communication with the shared disk is interrupted, a failure is indicated and the failure detection daemon initiates failover.

What conditions cause these failures?

Several conditions can cause a heartbeat daemon or quorum daemon failure, either of which initiates a failover:

  • Panic—while rare on Linux systems, a panic is a software error that causes the system to shut down. During a panic, the control node does not update its timestamp on the raw partition and does not communicate with the standby via the heartbeat daemon.
  • Hang—a crash that prevents input to the node or that warrants a reboot to free it up. During a hang, the control node does not update its timestamp on the raw partition and does not communicate with the standby via the heartbeat daemon.
  • Shared disk is inaccessible—a problem with a SCSI adapter connected to the shared disk or with a cable that is disconnected. If the raw partitions are inaccessible, the quorum daemon fails.

A note about total network failure

In the event of a total network failure, in which network cables are disconnected and all the heartbeat network connections between the dual-control nodes fail, both nodes detect the problem. However, they also detect that the SCSI disk connections to the shared partition are still active. Therefore, services continue to run and are not interrupted. For details about how to recover from this scenario, take a look at this troubleshooting topic.
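
Taken together, the conditions and the network-failure note suggest that the heartbeat and the shared disk are evaluated jointly. The sketch below is a simplified reading of this article, not clumanager's real algorithm; the function name failover_decision and the state labels (ok, lost, fresh, stale, inaccessible) are hypothetical.

```shell
#!/bin/sh
# Combine heartbeat and shared-disk observations into one outcome.
# $1: heartbeat state            ok | lost
# $2: active node's disk state   fresh | stale | inaccessible
#     fresh = timestamp on the raw partition is still being updated
#     stale = timestamp is no longer being updated
failover_decision() {
    case "$1:$2" in
        *:inaccessible) echo "failover" ;;  # quorum daemon failure
        lost:stale)     echo "failover" ;;  # panic or hang
        lost:fresh)     echo "continue" ;;  # total network failure, disk still active
        ok:fresh)       echo "continue" ;;  # healthy
        *)              echo "not-covered"; return 1 ;;  # case the article does not describe
    esac
}
```

The lost:fresh case is the one that matters here: losing every heartbeat link is not enough to trigger failover as long as the active node keeps updating the shared disk.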


What happens during failover?

When one of these conditions initiates a failover, the failure detection daemon powers off the failed control node via the node's remote management controller. This ensures that all Cassatt Active Response services are stopped.

The clumanager software configures the control node virtual IP address onto the standby control node using IP aliasing and then starts up the collage-core service on that control node.

The failure detection daemon powers on the failed control node. If that node starts up successfully and is able to communicate to the raw partitions, it becomes the standby control node.


Tips for managing dual-control nodes

If you find yourself in the situation where you need to do hands-on management of the control nodes, here are a few random bits of advice.

  • This may sound obvious, but the clumanager software depends on Cassatt Active Response's ability to start itself, as defined in /etc/init.d/collage-core. In the unlikely event that Cassatt Active Response does not start up, look at /var/log/messages and /var/log/cluster to determine what's causing the start-up problem and fix any issues.
  • To the extent you need to manage the dual-control nodes, you should be familiar with a few commands and operations:
    • cccoreservice start | stop - these commands start or stop the collage-core service. When you use the stop argument, the clumanager software continues to run on both control nodes.
    • cccoreservice status - this command shows the status of the failover cluster, the configured services, and the IP address of the active control node. The first part of the output shows that clumanager software is running (PIDs will vary):

    clumembd (pid 5144) is running...
    cluquorumd (pid 5138) is running...
    clulockd (pid 5155) is running...
    clusvcmgrd (pid 5209) is running...

    The next part of the output shows the IP addresses of the control nodes and shows which one is currently providing collage-core services:

    [Figure: sample clustat output]
    • cccoreservice failover - assuming both control nodes are up and running, this command forces a failover to the standby control node.

    The cccoreservice command resides in /opt/cassatt/bin.

    • clusvcadm - this command also allows you to enable, disable, relocate, and restart services in the failover cluster. Using clusvcadm requires that the failover cluster is operational (that is, the daemons are running and able to access the shared disk) from the node on which the command is invoked. A service can have one of the following states:

      - Pending – the service is transitioning to running or disabled state.
      - Running – the service is online and being actively monitored.
      - Disabled – the service is not online and has been stopped. This state warrants your attention.
      - Stopped – the service is disabled, but will start when the failover cluster processes are started up.
      - Failed – the service is not online. Again, look into this one.
    • cluforce - this command causes a single functioning control node in a cluster to take charge. Use only as a last resort, for example, when you know one control node is going to be down for an extended period of time but the collage-core service needs to be running on the available control node.
  • If clumanager fails over too many times in succession (failing back and forth between the two control nodes), it automatically stops the collage-core service and puts it in a disabled state. After you debug and correct any problems with collage-core, you have to restart the service manually. To do so on a dual-control node system, use the following command on both control nodes:

    cccoreservice start
  • If for some reason you are in a shell and doing something on /cassatt (the mounted file system that houses all of the Cassatt Active Response–specific files, the system database, and so on), a failover will pull the rug out from under you. So, if your shell session just disappears, a failover might have occurred. This could leave your editing session or whatever you were doing on /cassatt in an indeterminate state.
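
Of the service states listed under clusvcadm above, this article singles out two as needing a closer look: Disabled and Failed. If you script any monitoring around those states, a small helper like the following can do the classification. It is a sketch only; the function name needs_attention is hypothetical, and how you obtain the state string is up to you.

```shell
#!/bin/sh
# Succeed (exit 0) if a service state calls for operator attention.
# $1: service state name, in any letter case (for example, Running)
needs_attention() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        disabled|failed) return 0 ;;   # flagged in the state list above
        *)               return 1 ;;   # pending, running, stopped
    esac
}

# Example, where state holds a state name you read from the cluster status:
#   needs_attention "$state" && echo "collage-core needs a look: $state"
```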

Best practice—use separate power supplies

To ensure the highest level of failover, it is recommended that you use two power supplies, each on a different circuit. Connect one to the control node and one to the control node's integrated power controller. Otherwise, a power failure that takes down both a control node and its power controller prevents automatic failover to the standby node. That's because the clumanager software on the standby node needs to access the failed node's power controller to guarantee that node is shut down.

If this situation occurs and you don't have the control node and power controller on separate power supplies, you can manually fail over to the standby node. However, simply using cccoreservice failover does not work in this scenario. Instead, you need to verify that the shared storage device cannot be mounted by the off-line control node, and then follow these steps on the standby control node:

# cccoreservice status
___Status of clumanager___
clumembd (pid 20016) is running...
cluquorumd (pid 20006) is running...
clulockd (pid 20023) is running...
clusvcmgrd (pid 32581) is running...
...

# kill 20016 20006 20023 32581
# service clumanager start
# cluforce

If you have to do this, substitute the real PIDs of those clumanager daemons that are returned by the cccoreservice status command.
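
To avoid retyping PIDs by hand, you can pull them out of the status output. The following is a minimal sketch that assumes the status lines look exactly like the sample above (daemon name, then "(pid N) is running..."); the function name clumanager_pids is hypothetical.

```shell
#!/bin/sh
# Read status text on stdin and print the PID of each running clumanager
# daemon, one per line. Matches lines such as:
#   clumembd (pid 20016) is running...
clumanager_pids() {
    sed -n 's/^[[:space:]]*clu[a-z]*d (pid \([0-9][0-9]*\)) is running.*$/\1/p'
}

# Example, on the standby control node:
#   cccoreservice status | clumanager_pids
# Review the printed PIDs before passing them to kill.
```

Printing the PIDs first, rather than piping them straight into kill, gives you a chance to confirm that all four daemons were found.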

Remember, though, that the way to ensure automatic failover is to use two distinct power supplies.


Troubleshooting

See Control Node: Troubleshooting.

Summary

There's a lot going on under the covers to enable failover on a dual-control node system. To support failover, Cassatt Active Response enhances the GPL clumanager software and repackages it as cc_clumanager. Cassatt Active Response does the failover setup for you, configuring clumanager during the Cassatt Active Response installation. A handful of things can trigger a failover, but it should be rare for you to have to interact directly with clumanager. If you do, I hope the guidelines and tips in this article help you through that scenario.
