Application Nodes: Troubleshooting
Intended for use with Cassatt Active Response Standard Edition, Premium Edition and Data Center Edition V5.1.
The following material outlines problems you may encounter with your system's application nodes, along with the steps to solve those problems.
Nodes with external power controllers not booting
Description
Nodes with external power controllers are not booting, or are inconsistently booting.
Resolution
Boot problems can mean that you did not validate the BIOS power settings (Wake on LAN and/or "Boot when power is restored") on the application nodes. See External Power Controllers: Qualifying Application Nodes and Other Best Practices.
If you get alerts on several nodes attached to an external power controller, the problem could be with 1) the external power controller itself or 2) the configuration of the external power controller.
- Validate the configuration of the external power controller including type, IP address, network, and nodes.
- If the configuration is correct, log into the external power controller and check that the outlets are working.
- Try restarting power on the external power controller. If this doesn't work, follow the instructions to replace the failed external power controller.
Nodes in a tier with its own application network do not boot
Description
Application tiers running on networks other than the base Cassatt Active Response network require BOOTP relays (sometimes referred to as DHCP forwarding) on their network so their DHCP requests are routed back to the Cassatt Active Response control node (the DHCP server). On some routers and switches, there are IP helper features that you can use to specify the address of the DHCP server. However, some of these devices (for example, Cisco 3750 switches) modify the gateway address of the DHCP response to be the IP address of the BOOTP relay device. This is fine if the device is a valid gateway, in that it correctly routes to the Cassatt Active Response control node and the NFS server. However, if the device does not route correctly to the Cassatt Active Response control node and the NFS server, application nodes will not boot.
Resolution
If you use BOOTP relays on devices that modify the DHCP response to make the BOOTP relay device the gateway, ensure that that device properly routes to the Cassatt Active Response control node and NFS server.
HP DL 360s intermittently fail on boot
Description
HP DL 360s with a Smart Array RAID controller experience intermittent node failures on reboot.
When the boot process tries to initialize the Smart Array controller, it hangs with a spinning \ for approximately ten minutes. Eventually the boot process gets past this point, but it then hangs in other sections of the boot process for 2–3 minutes. It takes approximately 25 minutes before the node boots and comes up to a login prompt. The hang lasts long enough that Cassatt Active Response declares the node failed and takes appropriate measures with it based on tier or pool settings. However, when you look at the node, it is sitting at a login prompt. If you reboot the node itself, in all likelihood it reboots without problems.
Resolution
Disable the Smart Array RAID controller.
Power operation fails on properly configured device
Description
The power operation failed even though the device seems to be configured correctly.
Resolution
This is likely due to a failure in the power controller. Reset the power controller by shutting off its power, waiting a few seconds, and then turning on its power. If the problem persists, run the command line tool ccpower against the power controller, using the verifyConfiguration, verifyPartialOperation, or verifyFullOperation options in order to further diagnose the problem.
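The escalating checks described above can be scripted. The following is a minimal sketch, not a Cassatt-supplied tool: the function name verify_power_controller and the CCPOWER override variable are hypothetical, and it assumes the three options are passed with -v in the same way as the ccpower command line shown in the firmware step of "Automatic discovery problems". Adjust the option spelling if your ccpower version expects different capitalization.

```shell
#!/bin/sh
# Run the three ccpower verification levels in order, lightest first,
# and stop at the first level that fails.
# Assumption: the options are passed with -v, as in the command shown
# elsewhere in this article. CCPOWER can be overridden for a dry run.
verify_power_controller() {
    # $1 = power controller type, $2 = IP address, $3 = user name, $4 = password
    ccpower="${CCPOWER:-/opt/cassatt/bin/ccpower}"
    for level in verifyConfiguration verifyPartialOperation verifyFullOperation; do
        echo "== $level =="
        "$ccpower" -t "$1" -a "$2" -u "$3" -p "$4" -v "$level" || {
            echo "ccpower $level failed; diagnose this level before going further" >&2
            return 1
        }
    done
}
```

To preview the commands without touching the power controller, run the function with CCPOWER=echo, for example: CCPOWER=echo verify_power_controller <power_controller_type> <IP_Address> <power_controller_username> <power_controller_password>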
The application freezes or fails to respond
Description
After running without problems and with no obvious changes to the application configuration, the application freezes or fails to respond.
Resolution
One possible reason for this behavior is that the application uses /dev/random to seed encryption functions on which the application relies. On Red Hat Enterprise Linux 4, /dev/random leverages things like mouse and keyboard events to guarantee randomness, but applications running in a Cassatt Active Response environment are unlikely to have a mouse or keyboard attached. When an application attempts to read /dev/random when it has run out of random numbers, /dev/random blocks the application. The symptom is that the application freezes or fails to respond.
To work around this problem, you can link /dev/random to /dev/urandom, which avoids the problem because /dev/urandom produces unlimited random numbers. To do this, use the following commands:
rm /dev/random
mknod /dev/random c 1 9
(The c parameter indicates to create a character device and the 1 and 9 specify the major and minor device number for /dev/urandom.) After executing these commands, restart the application. (Note, however, that using /dev/urandom has the inherent problem that the numbers it generates may be less and less random over time.)
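Before and after applying the workaround, it can help to confirm which device numbers /dev/random actually has and how much entropy the kernel reports. The following is a sketch that assumes GNU stat (which prints device numbers in hexadecimal) and the Linux /proc/sys/kernel/random/entropy_avail file. The function names are hypothetical.

```shell
#!/bin/sh
# Convert GNU stat's hexadecimal "major minor" pair to decimal "major:minor".
hex_pair_to_dec() {
    printf '%d:%d\n' "0x${1% *}" "0x${1#* }"
}

# Print the decimal major:minor numbers of a device node.
dev_numbers() {
    pair=$(stat -c '%t %T' "$1") || return 1
    hex_pair_to_dec "$pair"
}

# Succeed if /dev/random carries the same device numbers as /dev/urandom,
# which is the state the mknod workaround is meant to produce.
random_is_urandom() {
    [ "$(dev_numbers /dev/random)" = "$(dev_numbers /dev/urandom)" ]
}

# Print the kernel's current entropy estimate (Linux only).
entropy_available() {
    cat /proc/sys/kernel/random/entropy_avail
}
```

A value from entropy_available that stays near zero while the application hangs is consistent with the blocking behavior described above. After the workaround, random_is_urandom should succeed.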
Deleting IBM blades causes them to rediscover
Description
Blades take a long time to delete and ultimately get rediscovered.
Resolution
Because operations on the BCMM are serialized, blades can take a long time to delete. To ensure blades are not rediscovered, on the Delete confirmation screen, check the Quarantine checkbox at the bottom of the page.
Inventory fails after fixing switch problems on BCMM
Description
After fixing switch problems on the BCMM, some IBM blades do not complete inventory.
Resolution
Whenever you change switch settings on a BCMM, you are changing the node's NIC connectivity. You must delete the node and allow it to rediscover.
SPARC/Solaris application nodes fail with permission problems
Description
Using ssh to access SPARC/Solaris 8, 9, or 10 application nodes fails with permission problems. This occurs when the Cassatt Active Response control node is the NFS server for the Cassatt Active Response image matrix. There is a known Solaris bug (4171523) with the chown command that modifies permissions such that ssh access fails.
Resolution
To resolve this problem, use a NAS device to serve the Cassatt Active Response image matrix.
SPARC hardware may not support VLAN interfaces
Description
Some SPARC hardware does not support VLAN interfaces.
Resolution
Check your Sun documentation to verify whether your hardware is appropriate for use with Network Manager–actuated options.
Sun booting errors with multiple NICs
Description
You have Sun booting errors and have multiple NICs installed.
Resolution
Cassatt recommends you run IPMP to control the NICs.
Sun - IPMP not supported for Solaris 8
Description
IP Multipathing (IPMP) is not supported for Solaris 8.
Resolution
Do not check the 'IP Multipathing' box when creating tiers in Cassatt Active Response.
Automatic discovery problems
Description
Application node hardware is not being discovered.
Resolution
If you encounter problems where Cassatt Active Response is not discovering application node hardware with supported power controllers, try the following paths to isolate, debug, and correct the problem:
- Power?
Unplug the power from the node and then plug it back in (to reset the power controller).
- Network problems?
Make sure that the network interface card (NIC) light is working properly on any node that is not being discovered by Cassatt Active Response. If not, diagnose and correct the network problem.
Make sure that all control nodes, power controllers, and application nodes are on the same VLAN.
- Enough IP addresses?
Make sure Cassatt Active Response has not already used all of the IP addresses that were specified during the install process. Cassatt Active Response is not able to discover more nodes than it has IP addresses to dole out. If Cassatt Active Response has already used all available IP addresses, you'll need to add more. See Network Addresses: Calculating Requirements.
- Node settings?
Make sure that the boot order in the node's BIOS is set to boot from the network first.
Make sure that the Ethernet interfaces are set to enable PXE boot.
Make sure the power controller is set to DHCP.
See the setup document for your hardware to check that all other settings are correct, especially those that relate to communication with the power controller:
Dell: Nodes with Integrated Power Controllers
Dell Blade Server: Nodes in a Blade Enclosure
HP: Nodes with Integrated Power Controllers
IBM Blade Center: Nodes in a Blade Enclosure
Sun: Nodes with Integrated Power Controllers
- Power controller authentication?
To discover power controllers and their attached nodes, Cassatt Active Response must know the power controller user names and passwords. Verify that the default power controller user names and passwords are set on each node and in Cassatt Active Response: Discovered Pool > Properties > Power Controller Authentication.
- I/O operations interrupting the automatic discovery process?
Make sure there aren't any I/O intensive operations such as an image clone, image import (ccimport), or tier creation that might interrupt the discovery process. If so, let these operations complete and then restart the nodes that were not discovered.
- Firmware version?
Contact support@cassatt.com to check that your firmware version is supported. If not, do one of the following:
- Change to a supported version.
- Field qualify your version by running the ccpower command from the control node:
/opt/cassatt/bin/ccpower -t <power_controller_type> -a <IP_Address> -u <power_controller_username> -p <power_controller_password> -v verifyfulloperation
If the command completes successfully, then your firmware version is added to your local supported firmware file, and your problem lies elsewhere.
If the command fails, use the output in the next step.
- Exceptions in the ccpower output?
From the top of the output file, search down for the first Java exception. The exception usually indicates what issue ccpower is encountering (network problem, timeout, firmware revision). When you've identified the exception, copy and paste the output and email it to support@cassatt.com.
- Errors in the logs?
Try the following:
- Look on the control node in /opt/cassatt/logs. Use the grep command to find any strings with the term ERROR in them. Email the results to support@cassatt.com.
- Look on the control node in /var/log/messages for problems with the node Cassatt Active Response is not discovering. Use the grep command to find any strings with the power controller's MAC address. Email the results to support@cassatt.com.
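The two log searches in the last step can be wrapped in small helpers. This is a sketch only, and the function names find_errors and find_mac are hypothetical. The MAC search ignores case because log entries may record the address in lowercase even when you have it written in uppercase.

```shell
#!/bin/sh
# List every line containing ERROR under a log directory,
# prefixed with the file name and line number.
find_errors() {
    # $1 = log directory (this article uses /opt/cassatt/logs)
    grep -rn 'ERROR' "$1"
}

# List every line in a log file that mentions a power controller's
# MAC address, matching the address literally and ignoring case.
find_mac() {
    # $1 = MAC address, $2 = log file (this article uses /var/log/messages)
    grep -i -F -- "$1" "$2"
}
```

For example, find_errors /opt/cassatt/logs > errors.txt collects the results in a file that you can attach to the email to support@cassatt.com.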
Automatic discovery problems with IBM BladeCenter hardware
Description
IBM BladeCenter hardware is not being discovered.
Resolution
If Cassatt Active Response fails to discover IBM BladeCenter hardware, try the following paths to isolate, debug, and correct the problem:
- Follow the guidelines in General problems.
- Validate that the BCMM is set up as described in IBM Blade Center: Nodes in a Blade Enclosure.
- Verify that the BCMM has not timed out and is still issuing DHCP requests. DHCP is required to trigger discovery. Reset the BCMM by one of these methods:
- Access the BCMM web administration interface:
- Open a web browser and enter the BCMM's default IP address (the default BCMM IP address is 192.168.70.125) in the URL field.
- Log into the BCMM web interface. The default IBM login and password are:
Login: USERID
PASSWORD: PASSW0RD
Note that in "PASSW0RD," the 0 is a zero and not a capital O.
- Restart the BCMM.
- Telnet into the BCMM using the "USERID" user name and the "PASSW0RD" password, and then issue the reset command:
telnet 192.168.70.125
reset -T system:mm[1]
The command will not provide any feedback. Eventually, when the telnet connection fails you will see the "Connection closed by foreign host" message, indicating that the BCMM is restarting.
If the BCMM does not respond to the telnet command, the BCMM is probably not set up to use its default IP address. Determine the BCMM's IP address by looking at the control node's /etc/dhcpd.conf file. Use the BCMM's MAC address to look up the IP address in the /etc/dhcpd.conf file. For example, assume the BCMM MAC address is 00:0D:60:F6:34:2A. The /etc/dhcpd.conf file will have an entry like the following:
host 10.0.84.100 {
hardware ethernet 00:0D:60:F6:34:2A;
option routers 10.0.84.1;
fixed-address 10.0.84.100;
}
For the above example, the BCMM can be reached by using the IP address of 10.0.84.100 instead of 192.168.70.125.
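The lookup can also be done with a short awk filter instead of reading the file by eye. This is a sketch under two assumptions: the host entries are laid out with one statement per line, as in the example above, and the function name ip_for_mac is hypothetical.

```shell
#!/bin/sh
# Print the fixed-address that a dhcpd.conf-style file assigns to a MAC address.
# The MAC comparison ignores case. Nothing is printed if no entry matches.
ip_for_mac() {
    # $1 = MAC address, $2 = configuration file (defaults to /etc/dhcpd.conf)
    awk -v mac="$1" '
        BEGIN { mac = tolower(mac) }
        # An opening brace starts a new block, so forget earlier values.
        index($0, "{") { hw = ""; ip = "" }
        $1 == "hardware" && $2 == "ethernet" { hw = tolower($3); sub(/;$/, "", hw) }
        $1 == "fixed-address" { ip = $2; sub(/;$/, "", ip) }
        # A closing brace ends the block: report the address if the MAC matched.
        index($0, "}") {
            if (hw == mac && ip != "") { print ip; exit }
            hw = ""; ip = ""
        }
    ' "${2:-/etc/dhcpd.conf}"
}
```

With the example entry above, ip_for_mac 00:0D:60:F6:34:2A prints 10.0.84.100.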
- Verify that the "USERID"/"PASSW0RD" account on the BCMM has sufficient privileges. Cassatt Active Response uses this username/password to communicate with the BCMM.
To test if the BCMM's user account is working, use the ccpower command. If you see authorization or permission problems when issuing the ccpower command you may have to access the BCMM and give privileges to the "USERID" account.
The following are examples of ccpower commands that test the BCMM:
# cd /opt/cassatt/bin
# ./ccpower -u USERID -p PASSWORD -a 10.0.84.100 -t ibm_bcmm version
Driver firmware: BRET82H.16.BRRG82H.16
# ./ccpower -u USERID -p PASSWORD -a 10.0.84.100 -t ibm_bcmm status
SSH key propagation fails
Description
When adding power-only nodes, you get the error, "Error propogating the SSH key for <nodename>."
Resolution
Check the following: Is SSH installed? Is the SSH daemon running? Does the user exist? Do the passwords match? Are the SSH parameters properly configured for Cassatt Active Response?
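The first three questions can be checked quickly from a shell on the node. The following is a rough sketch with hypothetical function names. It assumes the pgrep utility is available and that the SSH daemon process is named sshd. It does not check passwords or the SSH parameters that Cassatt Active Response requires.

```shell
#!/bin/sh
# Run a command quietly and report whether it succeeded.
check() {
    # $1 = label; remaining arguments = command to run
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED"
    fi
}

# Report on the SSH client, the SSH daemon, and the user account.
# A missing pgrep utility is also reported as FAILED for the daemon check.
check_ssh_prereqs() {
    # $1 = user name that the key is propagated for
    check "ssh client installed" command -v ssh
    check "sshd running" pgrep -x sshd
    check "user $1 exists" id "$1"
}
```

Run check_ssh_prereqs with the user name in question. Any line that ends in FAILED points to the item to fix first.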