
Application Nodes: Troubleshooting

Intended for use with Cassatt Active Response Standard Edition, Premium Edition and Data Center Edition V5.1.

The following material outlines problems you may encounter with your system's application nodes, along with the steps to solve those problems.

Nodes with external power controllers not booting

Description

Nodes with external power controllers are not booting, or are inconsistently booting.

Resolution

Problems with nodes booting could mean that you did not validate the BIOS power settings (Wake on LAN and/or "Boot when power is restored") on the application nodes. See External Power Controllers: Qualifying Application Nodes and Other Best Practices.

If you get alerts on several nodes attached to an external power controller, the problem could be with 1) the external power controller itself or 2) the configuration of the external power controller.

  1. Validate the configuration of the external power controller including type, IP address, network, and nodes.
  2. If the configuration is correct, log into the external power controller and check that the outlets are working.
  3. Try restarting power on the external power controller. If this doesn't work, follow the instructions to replace the failed external power controller.

Nodes in a tier with its own application network do not boot

Description

Application tiers running on networks other than the base Cassatt Active Response network require BOOTP relays (sometimes referred to as DHCP forwarding) on their network so their DHCP requests are routed back to the Cassatt Active Response control node (the DHCP server). On some routers and switches, there are IP helper features that you can use to specify the address of the DHCP server. However, some of these devices (for example, Cisco 3750 switches) modify the gateway address of the DHCP response to be the IP address of the BOOTP relay device. This is fine if the device is a valid gateway, in that it correctly routes to the Cassatt Active Response control node and the NFS server. However, if the device does not route correctly to the Cassatt Active Response control node and the NFS server, application nodes will not boot.

Resolution

If you use BOOTP relays on devices that modify the DHCP response to make the BOOTP relay device the gateway, ensure that that device properly routes to the Cassatt Active Response control node and NFS server.


HP DL 360s intermittently fail on boot

Description

HP DL 360s with a Smart Array RAID controller experience intermittent node failures on reboot.

When the boot process tries to initialize the Smart Array controller, it hangs with a spinning \ for approximately ten minutes. Eventually, the boot process gets past this point, but it then hangs in other sections of the boot process for 2–3 minutes each. It takes approximately 25 minutes before the node boots and comes up to a login prompt.

The hang will take long enough that Cassatt Active Response will declare the node failed and will take appropriate measures with it based on tier or pool settings. However, when you look at the node, it will be sitting at a login prompt. If you reboot the node itself, in all likelihood it will reboot just fine.

Resolution

Disable the Smart Array RAID controller.


Power operation fails on properly configured device

Description

The power operation failed even though the device seems to be configured correctly.

Resolution

This is likely due to a failure in the power controller. Reset the power controller by shutting off its power, waiting a few seconds, and then turning on its power. If the problem persists, run the command line tool ccpower against the power controller, using the verifyConfiguration, verifyPartialOperation, or verifyFullOperation options in order to further diagnose the problem.
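
The three verification levels can be run one after another. The sketch below is a dry run that only prints the ccpower invocations, in the order the paragraph above lists them. The flag layout is copied from the ccpower example given elsewhere in this document; the function name and the controller type, address, and credentials are placeholder assumptions, and the exact casing of the option names accepted by ccpower is not confirmed here.

```shell
#!/bin/sh
# Dry-run sketch: print the ccpower verification commands in escalating order.
# ccpower_verify_plan is a hypothetical helper; all arguments are placeholders.
ccpower_verify_plan() {
    ctype=$1; addr=$2; user=$3; pass=$4
    for level in verifyConfiguration verifyPartialOperation verifyFullOperation; do
        echo "/opt/cassatt/bin/ccpower -t $ctype -a $addr -u $user -p $pass -v $level"
    done
}
```

Call it as `ccpower_verify_plan <power_controller_type> <IP_Address> <username> <password>` and review the printed commands before running any of them on the control node.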

The application freezes or fails to respond

Description

After running just fine and with no obvious changes to the application configuration, the application mysteriously freezes or fails to respond.

Resolution

One possible reason for this behavior is that the application uses /dev/random to seed encryption functions on which the application relies. On Red Hat Enterprise Linux 4, /dev/random leverages things like mouse and keyboard events to guarantee randomness, but applications running in a Cassatt Active Response environment are unlikely to have a mouse or keyboard attached.  When an application attempts to read /dev/random when it has run out of random numbers, /dev/random blocks the application. The symptom is that the application freezes or fails to respond.

To work around this problem, you can replace /dev/random with a device node that behaves like /dev/urandom, which avoids the problem because /dev/urandom does not block when the system runs low on entropy. To do this, use the following commands:

rm /dev/random
mknod /dev/random c 1 9 

(The c parameter indicates to create a character device and the 1 and 9 specify the major and minor device number for /dev/urandom.) After executing these commands, restart the application. (Note, however, that using /dev/urandom has the inherent problem that the numbers it generates may be less and less random over time.)
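
Before and after making the change, it can help to confirm which device numbers each node carries. The following is a minimal sketch, assuming GNU stat as shipped with Red Hat Enterprise Linux; the helper name dev_numbers is illustrative only.

```shell
#!/bin/sh
# Print a device node's major:minor numbers (GNU stat reports them in hex).
dev_numbers() {
    stat -c '%t:%T' "$1"
}

# /dev/urandom is expected to report 1:9. After the workaround,
# /dev/random should report the same pair.
for dev in /dev/random /dev/urandom; do
    echo "$dev $(dev_numbers "$dev")"
done
```

If the application is frozen at the time you investigate, a very low value in /proc/sys/kernel/random/entropy_avail is consistent with the diagnosis described above.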

Deleting IBM blades causes them to rediscover

Description

Blades take a long time to delete and ultimately get rediscovered.

Resolution

Because operations on the BCMM are serialized, blades can take a long time to delete. To ensure blades are not rediscovered, on the Delete confirmation screen, check the Quarantine checkbox at the bottom of the page.

Inventory fails after fixing switch problems on BCMM

Description

After fixing switch problems on the BCMM, some IBM blades do not complete inventory.

Resolution

Whenever you change switch settings on a BCMM, you are changing the node's NIC connectivity. You must delete the node and allow it to rediscover.

SPARC/Solaris application nodes fail with permission problems

Description

Using ssh to access SPARC/Solaris 8, 9, or 10 application nodes fails with permission problems. This occurs when the Cassatt Active Response control node is the NFS server for the Cassatt Active Response image matrix. There is a known Solaris bug (4171523) with the chown command that modifies permissions such that ssh access fails.

Resolution

To resolve this problem, use a NAS device to serve the Cassatt Active Response image matrix.


SPARC Hardware May Not Support VLAN Interfaces

Description

Some SPARC hardware does not support VLAN interfaces.

Resolution

Check your Sun documentation to verify whether your hardware is appropriate for use with Network Manager–actuated options.


Sun Booting Errors with Multiple NICs

Description

You have Sun booting errors and have multiple NICs installed.

Resolution

Cassatt recommends you run IPMP to control the NICs.


Sun - IPMP not supported for Solaris 8

Description

IP Multipathing (IPMP) is not supported for Solaris 8.

Resolution

Do not check the 'IP Multipathing' box when creating tiers in Cassatt Active Response.


Automatic discovery problems

Description

Application node hardware is not being discovered.

Resolution

If you encounter problems where Cassatt Active Response is not discovering application node hardware with supported power controllers, try the following paths to isolate, debug, and correct the problem:

  1. Power?
    Unplug the power from the node and then plug it back in (to reset the power controller).
  2. Network problems?
    Make sure that the network interface card (NIC) light is working properly on any node that is not being discovered by Cassatt Active Response. If not, diagnose and correct the network problem.
    Make sure that all control nodes, power controllers, and application nodes are on the same VLAN.
  3. Enough IP addresses?
    Make sure Cassatt Active Response has not already used all of the IP addresses that were specified during the install process. Cassatt Active Response is not able to discover more nodes than it has IP addresses to dole out. If Cassatt Active Response has already used all available IP addresses, you'll need to add more. See Network Addresses: Calculating Requirements.
  4. Node settings?
    Make sure that the boot order in the node's BIOS is set to boot from the network first.
    Make sure that the Ethernet interfaces are set to enable PXE boot.
    Make sure the power controller is set to DHCP.
    See the setup document for your hardware to check that all other settings are correct, especially those that relate to communication with the power controller:
    Dell: Nodes with Integrated Power Controllers
    Dell Blade Server: Nodes in a Blade Enclosure
    HP: Nodes with Integrated Power Controllers
    IBM Blade Center: Nodes in a Blade Enclosure
    Sun: Nodes with Integrated Power Controllers
  5. Power controller authentication?
    To discover power controllers and their attached nodes, Cassatt Active Response must know the power controller user names and passwords. Verify that the default power controller user names and passwords are set on each node and in Cassatt Active Response: Discovered Pool > Properties > Power Controller Authentication.
  6. I/O operations interrupting the automatic discovery process?
    Make sure there aren't any I/O intensive operations such as an image clone, image import (ccimport), or tier creation that might interrupt the discovery process. If so, let these operations complete and then restart the nodes that were not discovered.
  7. Firmware version?
    Contact support@cassatt.com to check that your firmware version is supported. If not, do one of the following:
    • Change to a supported version.
    • Field qualify your version by running the ccpower command from the control node:

      /opt/cassatt/bin/ccpower -t <power_controller_type> -a <IP_Address> -u <power_controller_username> -p <power_controller_password> -v verifyfulloperation

      If the command completes successfully, then your firmware version is added to your local supported firmware file, and your problem lies elsewhere.

      If the command fails, use the output in the next step.
  8. Exceptions in the ccpower output?
    From the top of the output file, search down for the first Java exception. The exceptions usually tell you a lot about the issue ccpower is encountering (network problem, timeout, firmware revision). When you've identified the exception, copy and paste the output and email it to support@cassatt.com.
  9. Errors in the logs?
    Try the following:
    • Look on the control node in /opt/cassatt/logs. Use the grep command to find any strings with the term ERROR in them. Email the results to support@cassatt.com.
    • Look on the control node in /var/log/messages for problems with the node Cassatt Active Response is not discovering. Use the grep command to find any strings with the power controller's MAC address. Email the results to support@cassatt.com.
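
The two log searches in step 9 can be sketched as follows. The paths are the ones named in the text; the function names, the output file names, and the MAC address in the commented example are made-up placeholders. The recursive search assumes GNU grep.

```shell
#!/bin/sh
# Sketch of the log searches from step 9, run on the control node.
LOG_DIR=${LOG_DIR:-/opt/cassatt/logs}
SYSLOG=${SYSLOG:-/var/log/messages}

# List every line containing ERROR under a log directory,
# prefixed with the file name and line number.
find_errors() {
    grep -rn 'ERROR' "$1"
}

# List lines in a log file that mention a MAC address, ignoring case.
find_mac() {
    grep -i -- "$1" "$2"
}

# Example usage (commented out so that sourcing this file has no side effects):
# find_errors "$LOG_DIR" > /tmp/cassatt-errors.txt
# find_mac 00:11:22:33:44:55 "$SYSLOG" > /tmp/cassatt-mac.txt
```

Saving the results to files makes them easier to attach to the email to support@cassatt.com.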


Automatic discovery problems with IBM BladeCenter hardware

Description

IBM BladeCenter hardware is not being discovered.

Resolution

If Cassatt Active Response fails to discover IBM BladeCenter hardware, try the following paths to isolate, debug, and correct the problem:

  1. Follow the guidelines in Automatic discovery problems.
  2. Validate that the BCMM is set up as described in IBM Blade Center: Nodes in a Blade Enclosure.
  3. Verify that the BCMM has not timed out and is still issuing DHCP requests. DHCP is required to trigger discovery. Reset the BCMM by one of these methods:
    1. Access the BCMM web administration interface:
      • Open a web browser and enter the BCMM's default IP address (the default BCMM IP address is 192.168.70.125) in the URL field.
      • Log into the BCMM web interface. The default IBM login and password are:

        Login: USERID
        PASSWORD: PASSW0RD

        Note that in "PASSW0RD," the 0 is a zero and not a capital O.
      • Restart the BCMM.
    2. Telnet into the BCMM using the "USERID" user name and the "PASSW0RD" password, and then issue the reset command:

      telnet 192.168.70.125
      reset -T system:mm[1]


      The command will not provide any feedback. Eventually, when the telnet connection fails you will see the "Connection closed by foreign host" message, indicating that the BCMM is restarting.

      If the BCMM does not respond to the telnet command, the BCMM is probably not set up to use its default IP address. Determine the BCMM's IP address by looking at the control node's /etc/dhcpd.conf file. Use the BCMM's MAC address to look up the IP address in the /etc/dhcpd.conf file. For example, assume the BCMM MAC address is 00:0D:60:F6:34:2A. The /etc/dhcpd.conf file will have an entry like the following:

      host 10.0.84.100 {
      hardware ethernet 00:0D:60:F6:34:2A;
      option routers 10.0.84.1;
      fixed-address 10.0.84.100;
      }


      For the above example, the BCMM can be reached by using the IP address of 10.0.84.100 instead of 192.168.70.125.
    3. Verify that the "USERID"/"PASSW0RD" account on the BCMM has sufficient privileges. Cassatt Active Response uses this username/password to communicate with the BCMM.

      To test if the BCMM's user account is working, use the ccpower command. If you see authorization or permission problems when issuing the ccpower command you may have to access the BCMM and give privileges to the "USERID" account.

      The following are examples of ccpower commands that test the BCMM:

      # cd /opt/cassatt/bin
      # ./ccpower -u USERID -p PASSW0RD -a 10.0.84.100 -t ibm_bcmm version
      Driver firmware: BRET82H.16.BRRG82H.16
      # ./ccpower -u USERID -p PASSW0RD -a 10.0.84.100 -t ibm_bcmm status
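
The /etc/dhcpd.conf lookup described under the telnet method above can be sketched as follows. This is a minimal example, assuming each host entry is laid out one statement per line with the hardware ethernet line before the fixed-address line, as in the sample entry; the function name bcmm_ip_for_mac is hypothetical.

```shell
#!/bin/sh
# Print the fixed-address of the dhcpd.conf host entry whose
# "hardware ethernet" line matches the given MAC (case-insensitive).
# Usage: bcmm_ip_for_mac <mac> [conf-file]
bcmm_ip_for_mac() {
    mac=$1
    conf=${2:-/etc/dhcpd.conf}
    awk -v mac="$mac" '
        $1 == "host" { inhost = 1; found = 0 }
        inhost && $1 == "hardware" && $2 == "ethernet" {
            addr = $3; sub(/;$/, "", addr)
            if (toupper(addr) == toupper(mac)) found = 1
        }
        inhost && found && $1 == "fixed-address" {
            ip = $2; sub(/;$/, "", ip); print ip; exit
        }
        index($0, "}") { inhost = 0 }
    ' "$conf"
}
```

For the sample entry shown earlier, `bcmm_ip_for_mac 00:0D:60:F6:34:2A` run on the control node would print the address to use in place of 192.168.70.125.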

SSH key propagation fails

Description

When adding power-only nodes, you get the error, "Error propogating the SSH key for <nodename>."

Resolution

Check the following: Is SSH installed? Is the SSH daemon running? Does the user exist? Do the passwords match? Are the SSH parameters properly configured for Cassatt Active Response?
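
The first three checks can be run quickly on the node being added. The sketch below uses only standard commands (command -v, ps, id); the function names are illustrative, and the user name you pass in is whichever account your Cassatt Active Response setup uses, which is an assumption here. Password and SSH parameter checks still need to be done by hand.

```shell
#!/bin/sh
# Sketch of the basic SSH checks; run on the node being added.
ssh_installed() { command -v ssh >/dev/null 2>&1; }
sshd_running()  { ps -e 2>/dev/null | grep -q '[s]shd'; }
user_exists()   { id "$1" >/dev/null 2>&1; }

# Print one status line per check for the given user name.
check_node() {
    if ssh_installed; then echo "ssh: found"; else echo "ssh: MISSING"; fi
    if sshd_running; then echo "sshd: running"; else echo "sshd: NOT running"; fi
    if user_exists "$1"; then echo "user $1: exists"; else echo "user $1: MISSING"; fi
}
```

For example, `check_node <username>` prints three lines, one per check.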
