Understanding Control Node Failover
Intended for use with Cassatt Active Response V5.0.
While Cassatt Active Response minimally requires one control node
to operate, using dual-control nodes provides
built-in failover capability, so that if one node goes down,
the other assumes operational control. For this reason,
most enterprise environments opt for a dual-control node
configuration. Once your Cassatt Active Response system is up and running,
you might care about how control-node failover works. If
so, you've come to the right article. If you want the low-level
setup details, see the info central doc
index and find the Control
Node Setup documentation
for your make/model control nodes.
Physical layout
To begin, let's look at the layout of physical
components. To support failover, dual-control nodes require
two network switches and access to a common, shared disk
location.

Let's take a closer look at each component.
Network switches
The key point about the network switches is the redundant
connections, so that if one network connection goes down,
another is in place to pick up the traffic. For a couple
of reasons, my friends in Quality Engineering recommend
a gigabit-ethernet (1000BASE-T) switch that supports link
aggregation and that enables you to switch off IGMP snooping:
- The gigabit
network reduces latency in the network connection between
the two nodes.
- Switching
off IGMP snooping (a network
optimization technique) is especially important. The underlying
failover software (I'll say more about this soon) uses
multicast to communicate between the two control nodes.
IGMP snooping limits multicast packets in such a way that
the two control nodes may not be able to communicate with
each other. This could leave the failover system in a
confused state that is best avoided.
Control node hardware and software
For details about control node hardware, see the article, Understanding Hardware: What Works Best with Cassatt Active Response. The Cliffs Notes version,
however, is that your dual-control node configuration should
have the following:
- 2 control nodes of the same make/model
- 2 NICs per node
- Remote management controllers with supported firmware
(required to power nodes on and off)
- SCSI disk drives
In terms of software, the control nodes need to run Red
Hat Enterprise Linux AS (RHELAS) 4 Update 4 or later.
(As new versions of RHELAS become available, check
back to see if the Cassatt Quality team has verified they
work as expected on the control nodes.)
Failover is based on the GNU General Public License (GPL)
clumanager software. Cassatt adds plugins to the GPL clumanager
and repackages it as cc_clumanager for use with Cassatt Active Response's
dual-control node failover. These plugins enable clumanager
to communicate with and manage different types of remote
management controllers that are supported in Cassatt Active Response. For the
sake of discussion, I'll generically refer to clumanager
in this article, but remember that I mean Cassatt's repackaged
version, cc_clumanager.
Red Hat also includes
a version of the GPL clumanager. If you are doing an OS
upgrade on the control nodes, do not install the Red Hat
clumanager packages. These conflict with the special Cassatt Active Response
cc_clumanager functionality.
You don't need to do any special clumanager software configuration
to set up control node failover. The Cassatt Active Response installation
program takes care of that for you by configuring the control
nodes into a clumanager cluster of 2.
In case you're thinking that Cassatt uses the clumanager
software to manage the entire Cassatt Active Response environment, stop.
The use of clumanager has nothing to do with the way Cassatt Active Response
supervises the many nodes that operate in the larger Cassatt Active Response
environment. The clumanager software is used only for
control node failover.
Shared disk
Cassatt Active Response has two distinct needs for shared disk storage.
One is for storing Cassatt Active Response data, the system database,
and software images. The other is for storing state information
about the two control nodes. As you might have guessed, the
latter is the focus of this discussion. (To learn about shared
storage in general, read Understanding Storage
Hardware Options.)
You need to be explicitly aware of a few details
about this shared disk. First of all, you can use either
a SAN or a dedicated dual-ported disk.

About NAS
If your site storage solution is NAS, then you must
also have a dedicated dual-ported disk to support
dual-control node failover.
Whichever disk solution you use, you need
to configure two raw partitions
(that is, character-based disk device files rather than
block-based disk device files).

These raw partitions are used for control node state information,
service state information, and configuration information.
The clumanager software periodically records state
information to the shared disk and ensures data is consistent
on both of the raw partitions.
You define these raw partitions in the /etc/sysconfig/rawdevices file on each control node like this:
/dev/raw/raw1 /dev/device1
/dev/raw/raw2 /dev/device2
Substitute real device names for device1 and device2.
A sample /etc/sysconfig/rawdevices file looks something like
this:
# raw device bindings
# format: <rawdev> <major> <minor>
#         <rawdev> <blockdev>
/dev/raw/raw1 /dev/sdb1
/dev/raw/raw2 /dev/sdb2
These entries should be the same on both control nodes. For all the
nitty-gritty details, look at the Setting
Up Control Node documentation
for your hardware make and model, which you can find in
the info central doc index.
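If you want a quick sanity check of those entries, here is a minimal sketch in Python. It is not part of Cassatt Active Response or clumanager; the function names are made up for illustration, and it assumes the simple two-field binding form shown in the sample above (it rejects any other layout rather than guessing).

```python
def parse_rawdevices(text):
    """Return {raw_device: block_device} for '<rawdev> <blockdev>' lines."""
    bindings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            # This sketch only understands the two-field form.
            raise ValueError(f"unexpected rawdevices line: {line!r}")
        raw_dev, block_dev = fields
        bindings[raw_dev] = block_dev
    return bindings


def check_quorum_bindings(node_a_text, node_b_text):
    """Return a list of problems; an empty list means the files look consistent."""
    a = parse_rawdevices(node_a_text)
    b = parse_rawdevices(node_b_text)
    problems = []
    if len(a) != 2:
        problems.append(f"expected 2 raw partitions, found {len(a)}")
    if a != b:
        problems.append("bindings differ between the two control nodes")
    return problems


SAMPLE = """\
# raw device bindings
# format: <rawdev> <major> <minor>
/dev/raw/raw1 /dev/sdb1
/dev/raw/raw2 /dev/sdb2
"""

print(check_quorum_bindings(SAMPLE, SAMPLE))  # prints []
```

In practice you would read the file contents from each control node and pass both strings in; how you fetch them is up to you.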
These partitions are often referred to as the primary and
the shadow partitions, and collectively as the quorum partitions,
where quorum indicates the members of the failover cluster.
Each raw partition needs to be a minimum of 100 Mbytes. (If
you are really pinched for disk space, clumanager says
10-Mbyte partitions are sufficient, but I prefer to err on
the high side.)
Oh, and did I say that it's best if this shared disk is infrequently
accessed for any other use? That's just to ensure that an
I/O bottleneck doesn't get in the way of the failover system
maintaining its state information.
The cluster service: collage-core
The clumanager software on the active control node monitors
the collage-core service.

In clumanager terms, the collage-core service is
known as the cluster service: the
applications and services you want to guarantee are running.
The collage-core service is automatically configured during
the Cassatt Active Response installation to reference the operating system
and Cassatt Active Response software and database, DNS services, DHCP,
the Controller, the control node virtual IP address,
and the mounted /cassatt file
system. The clumanager software records collage-core service
state to the shared disk.

The clumanager software is configured in a hot-standby
configuration in which the primary node runs the cluster
service, and the standby node takes over
only if the primary node fails. If a hardware or software
failure occurs, clumanager automatically restarts collage-core
on the standby
node.
Failure process
In general, failure is detected by one of the following
means:
- Via a collage-core service failure
- Via a heartbeat daemon failure
- Via a quorum daemon failure
Let's take a closer look at each.
Collage-core service failure
The clumanager software regularly checks on the status
of the collage-core service by running the /etc/init.d/collage-core
script. If the script indicates collage-core
is failing, then:
- The clumanager program stops
the collage-core service and tries to restart it on
the currently active control node.
- If the service is still failing,
the failure detection daemon initiates failover and starts
the collage-core service on the standby control node.
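The two-step escalation above can be sketched in a few lines of Python. This is only an illustration of the ordering, not clumanager code: check, restart_local, and fail_over are hypothetical callables standing in for the status script, the local stop and restart, and the failover action.

```python
def handle_service_check(check, restart_local, fail_over):
    """Escalate a failing service: local restart first, failover second.

    Returns "healthy", "restarted", or "failed_over".
    """
    if check():
        return "healthy"
    restart_local()          # first attempt: restart on the active node
    if check():
        return "restarted"
    fail_over()              # still failing: start on the standby node
    return "failed_over"


# Simulate a service that fails once and comes back after a local restart.
results = iter([False, True])
actions = []
outcome = handle_service_check(
    check=lambda: next(results),
    restart_local=lambda: actions.append("restart on active node"),
    fail_over=lambda: actions.append("start on standby node"),
)
print(outcome, actions)  # prints: restarted ['restart on active node']
```

The point of the ordering is that a failover is the more disruptive action, so it happens only after the cheaper local restart has been tried.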
Heartbeat daemon failure
The Cassatt Active Response dual-control node configuration of clumanager
uses a watchdog timer to determine that
the active control node has failed. Every 10 seconds, a heartbeat daemon
on the active control node sends a packet over the Ethernet
interface to the standby control node to indicate there's
a pulse. If the standby control node does not receive a packet
for three successive periods, the failure detection daemon
starts triage and initiates failover.
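To make the timing concrete, here is a small Python sketch of the watchdog idea from the standby node's point of view. The 10-second interval and the three-period limit come from the description above; the class, its method names, and the exact way missed periods are counted are my own assumptions, not how clumanager implements it.

```python
HEARTBEAT_INTERVAL_S = 10   # from the article: one packet every 10 seconds
MISSED_PERIODS_LIMIT = 3    # from the article: three successive missed periods


class HeartbeatWatchdog:
    """Tracks the last heartbeat seen by the standby control node."""

    def __init__(self, now):
        self.last_seen = now

    def packet_received(self, now):
        self.last_seen = now

    def missed_periods(self, now):
        # Whole heartbeat periods that have elapsed since the last packet.
        return int((now - self.last_seen) // HEARTBEAT_INTERVAL_S)

    def active_node_suspect(self, now):
        return self.missed_periods(now) >= MISSED_PERIODS_LIMIT


watchdog = HeartbeatWatchdog(now=0)
watchdog.packet_received(now=10)            # last packet seen at t = 10 s
print(watchdog.active_node_suspect(now=35))  # prints False (2 periods missed)
print(watchdog.active_node_suspect(now=40))  # prints True  (3 periods missed)
```

Under these assumptions, the standby node would suspect the active node roughly 30 seconds after the last packet it received.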
Quorum daemon failure
The clumanager software regularly monitors access to the
shared disk via the quorum daemon. Any time communication
with the shared disk is interrupted, a failure is indicated and the
failure detection daemon initiates failover.
What conditions cause these failures?
Several conditions can cause a heartbeat daemon or
quorum daemon failure, either of which initiates a failover:
- Panic—while rare on Linux systems, a panic is a software
error that causes the system to shut down. During a panic,
the control node does not update its timestamp on the
raw partition and does not communicate with the standby
via the heartbeat daemon.
- Hang—a crash that prevents input to
the node or that warrants a reboot to free it up. During
a hang, the control node does not update its timestamp
on the raw partition and does not communicate with the
standby via the heartbeat daemon.
- Shared disk is inaccessible—a problem with a SCSI adapter
connected to the shared disk or with a cable that is disconnected.
If the raw partitions are inaccessible, the quorum daemon
fails.
A note about total network failure
In the event of a total network failure, in which
network cables are disconnected and
all the heartbeat network connections between the
dual-control nodes fail,
both nodes detect the problem. However, they
also detect that the SCSI disk connections to the
shared partition are still active. Therefore, services
continue to run and are not interrupted. For details
about how to recover from this scenario, take a look
at this troubleshooting topic.
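The cases above boil down to two signals: whether heartbeats are arriving and whether the active node is still updating its timestamp on the raw partition. The following Python sketch shows one way to read them together from the standby node's side. It is a simplified model I put together from the descriptions in this article, not the actual clumanager decision logic, and the function and parameter names are invented.

```python
def standby_decision(heartbeat_ok, active_disk_timestamp_fresh):
    """Classify what the standby node sees, following the cases in the article.

    heartbeat_ok: heartbeat packets are arriving from the active node.
    active_disk_timestamp_fresh: the active node is still updating its
        timestamp on the shared raw partition.
    """
    if heartbeat_ok and active_disk_timestamp_fresh:
        return "healthy"
    if active_disk_timestamp_fresh:
        # Network is down but the active node is still writing to shared disk.
        return "network failure: keep services running"
    # Panic, hang, or lost disk access: the timestamp has gone stale.
    return "fail over"


for heartbeat in (True, False):
    for timestamp in (True, False):
        print(heartbeat, timestamp, "->", standby_decision(heartbeat, timestamp))
```

The middle case is the total network failure described in the note: the heartbeat is gone, but the shared disk shows the active node is alive, so services stay where they are.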
What happens during failover?
When one of these conditions initiates a failover, the
failure detection daemon powers off the failed control node
via the node's remote management controller. This ensures
that all Cassatt Active Response services are stopped.
The clumanager software configures
the control node virtual IP address onto the standby
control node using IP aliasing and then starts up the collage-core
service on that control node.
The failure detection daemon powers on the failed control
node. If that node starts up successfully and is able to
communicate to the raw partitions, it becomes the standby
control node.
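Here is the same sequence as a short Python sketch, mainly to show the ordering and the resulting roles. Nothing in it talks to real hardware; the node names cn1 and cn2, the address 192.0.2.10, and the function itself are placeholders I made up for illustration.

```python
def run_failover(failed_node, standby_node, virtual_ip, failed_node_recovers):
    """Return (ordered steps, resulting roles) for one failover."""
    steps = [
        f"power off {failed_node} via its remote management controller",
        f"alias {virtual_ip} onto {standby_node}",
        f"start collage-core on {standby_node}",
        f"power on {failed_node}",
    ]
    roles = {standby_node: "active"}
    # The restarted node rejoins as standby only if it boots and can reach
    # the raw partitions; otherwise this sketch marks it offline.
    roles[failed_node] = "standby" if failed_node_recovers else "offline"
    return steps, roles


steps, roles = run_failover("cn1", "cn2", "192.0.2.10", failed_node_recovers=True)
for number, step in enumerate(steps, start=1):
    print(number, step)
print(roles)  # prints {'cn2': 'active', 'cn1': 'standby'}
```

Note that the power-off comes first. Taking over the virtual IP address and the service only after the failed node is known to be down is what keeps both nodes from running collage-core at the same time.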
Tips for managing dual-control nodes
If you find yourself in the situation where you need to
do hands-on management of the control nodes, here are a
few random bits of advice.
- This may sound obvious, but the clumanager software
depends on Cassatt Active Response's
ability to start itself, as defined in /etc/init.d/collage-core.
In the unlikely event that Cassatt Active Response does not start up,
look at /var/log/messages and /var/log/cluster to
determine what's causing the start-up problem and fix
any issues.
- To the extent you need to manage the dual-control nodes,
you should be familiar with a few commands and operations:
- cccoreservice start | stop -
these commands start or stop the collage-core
service. When used with the stop argument, the clumanager
software continues to run on both control nodes.
- cccoreservice status - this
command shows the status of the failover cluster, the
configured services, and the IP address of the active
control node. The first part of the output shows that
clumanager software is running (PIDs will vary):
clumembd (pid 5144) is running...
cluquorumd (pid 5138) is running...
clulockd (pid 5155) is running...
clusvcmgrd (pid 5209) is running...
The next part of the output shows the IP addresses of the control
nodes and shows which one is currently providing
collage-core services.
Best practice: use separate power supplies
To ensure the highest level of failover, it is
recommended that you use two power supplies, each
on a different circuit. Connect one to the control
node and one to the control node's integrated power
controller. Otherwise, a power failure that takes
down both a control node and its power controller
prevents automatic failover to the standby node.
That's because the clumanager software on the standby
node needs to access the failed node's power controller
to guarantee that node is shut down.
If this situation occurs and you don't have the
control node and power controller on separate power
supplies, you can manually fail over to
the standby node. However, simply using cccoreservice
failover does not work in this scenario.
Instead, you need to verify that the shared storage
device cannot be mounted by the off-line control
node, and then follow these steps on the standby
control node:
# cccoreservice status
___Status of clumanager___
clumembd (pid 20016) is running...
cluquorumd (pid 20006) is running...
clulockd (pid 20023) is running...
clusvcmgrd (pid 32581) is running...
...
# kill 20016 20006 20023 32581
# service clumanager start
# cluforce
If you have to do this, substitute
the real PIDs of those clumanager daemons that
are returned by the cccoreservice
status command.
Remember, though, that the way
to ensure automatic failover
is to use two distinct power supplies.
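If you would rather not copy the PIDs by hand during the manual failover steps above, a few lines of Python can pull them out of the status text. This is a sketch that assumes the output lines look exactly like the sample shown in this article; it only prints the kill command and does not run anything.

```python
import re

STATUS_OUTPUT = """\
___Status of clumanager___
clumembd (pid 20016) is running...
cluquorumd (pid 20006) is running...
clulockd (pid 20023) is running...
clusvcmgrd (pid 32581) is running...
"""


def daemon_pids(status_text):
    """Map each daemon name to its PID from '<name> (pid N) is running' lines."""
    pattern = re.compile(r"^(\w+) \(pid (\d+)\) is running", re.MULTILINE)
    return {name: int(pid) for name, pid in pattern.findall(status_text)}


pids = daemon_pids(STATUS_OUTPUT)
print("kill " + " ".join(str(pid) for pid in pids.values()))
# prints: kill 20016 20006 20023 32581
```

Review the printed command before you run it, since the daemon names and PIDs on your control node will differ from this sample.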
Summary
There's a lot going on under the covers to enable failover
on a dual-control node system. To support failover, Cassatt Active Response
enhances the GPL clumanager software and repackages it as
cc_clumanager. Cassatt Active Response does
the failover setup for you, configuring clumanager during
the Cassatt Active Response installation. A handful of things can trigger
a failover, but it should be rare for you to have to directly
interact with clumanager. If you do, I hope the guidelines and
tips in this article help you through that scenario.
Was this article useful? Tell us what you think.
Email infocentral@cassatt.com.