Perform the following configuration procedures on all nodes in the cluster!
Oracle9i Release 1 (9.0.1) and Oracle9i Release 2 (9.2.0.1) used a userspace watchdog daemon called watchdogd to monitor the health of the cluster and to restart a RAC node in case of a failure. Starting with Oracle9i Release 2 (9.2.0.2) (and still available in Oracle 10g Release 2), the watchdog daemon has been replaced by a Linux kernel module named hangcheck-timer, which addresses availability and reliability problems much better. The hangcheck-timer module is loaded into the Linux kernel and checks whether the system hangs: it sets a timer and checks that timer after a certain amount of time. If the delay exceeds a configurable threshold, the module reboots the machine. Although the hangcheck-timer module is not required for Oracle Clusterware (Cluster Manager) operation, it is highly recommended by Oracle.
The hangcheck-timer.ko Module
The hangcheck-timer module uses a kernel-based timer that periodically checks the system task scheduler to catch delays in order to determine the health of the system. If the system hangs or pauses, the timer resets the node. The hangcheck-timer module uses the Time Stamp Counter (TSC) CPU register, which is incremented at each clock signal. The TSC offers much more accurate time measurements because this register is updated by the hardware automatically.
Much more information about the hangcheck-timer project can be found here.
Installing the hangcheck-timer.ko Module
The hangcheck-timer module was originally shipped only by Oracle; however, it is now included with Red Hat Linux in kernel versions 2.4.9-e.12 and higher. If you followed the steps in Section 8 ("Obtain & Install New Linux Kernel / FireWire Modules"), then the hangcheck-timer module is already included for you. Use the following to confirm:
# find /lib/modules -name "hangcheck-timer.ko"
/lib/modules/2.6.9-11.0.0.10.3.EL/kernel/drivers/char/hangcheck-timer.ko
/lib/modules/2.6.9-22.EL/kernel/drivers/char/hangcheck-timer.ko
In the above output, we care about the hangcheck timer object (hangcheck-timer.ko) in the /lib/modules/2.6.9-11.0.0.10.3.EL/kernel/drivers/char directory.
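Because the find command above lists the module for every installed kernel, it can help to restrict the check to a single kernel release. The following is a minimal sketch, not part of the original procedure: the function name has_hangcheck_module and its two optional arguments (modules root and kernel release) are hypothetical, and it assumes the module file is named hangcheck-timer.ko as shown above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if hangcheck-timer.ko exists for one kernel release.
# $1 = modules root (defaults to /lib/modules)
# $2 = kernel release (defaults to the running kernel, from uname -r)
has_hangcheck_module() {
    root=${1:-/lib/modules}
    release=${2:-$(uname -r)}
    # find prints the path if the file exists; empty output means "not found"
    [ -n "$(find "$root/$release" -name 'hangcheck-timer.ko' 2>/dev/null)" ]
}

if has_hangcheck_module; then
    echo "hangcheck-timer.ko found for kernel $(uname -r)"
else
    echo "hangcheck-timer.ko not found for kernel $(uname -r)"
fi
```

Calling the function with no arguments checks the running kernel; passing a release string such as 2.6.9-22.EL as the second argument checks another installed kernel.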
Configuring and Loading the hangcheck-timer Module
There are two key parameters to the hangcheck-timer module:
- hangcheck_tick: This parameter defines the period of time between checks of system health. The default value is 60 seconds; Oracle recommends setting it to 30 seconds.
- hangcheck_margin: This parameter defines the maximum hang delay that should be tolerated before hangcheck-timer resets the RAC node. It defines the margin of error in seconds. The default value is 180 seconds; Oracle recommends keeping it at 180 seconds.
NOTE: Together, the two hangcheck-timer module parameters determine how long a RAC node must hang before the module resets the system. A node reset will occur when the following is true:
system hang time > (hangcheck_tick + hangcheck_margin)
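As a worked example of the condition above, the following sketch adds the two parameters to get the hang time a node must exceed before it is reset. The function name reset_threshold is hypothetical and used only for illustration; the values 30 and 180 are the Oracle-recommended settings from this section.

```shell
#!/bin/sh
# Hypothetical helper: hang time (in seconds) that must be exceeded
# before hangcheck-timer resets the node.
# $1 = hangcheck_tick, $2 = hangcheck_margin
reset_threshold() {
    tick=$1
    margin=$2
    echo $((tick + margin))
}

# Recommended settings from this section: tick=30, margin=180
reset_threshold 30 180    # prints 210
```

With the recommended settings, a node has to hang for more than 210 seconds before the module resets it.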
Configuring Hangcheck Kernel Module Parameters
Each time the hangcheck-timer kernel module is loaded (manually or by Oracle), it needs to know what value to use for each of the two parameters we just discussed: hangcheck_tick and hangcheck_margin. These values need to be available after each reboot of the Linux server. To do that, add an entry with the correct values to the /etc/modprobe.conf file as follows:
# su -
# echo "options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180" >> /etc/modprobe.conf
Each time the hangcheck-timer kernel module gets loaded, it will use the values defined by the entry I made in the /etc/modprobe.conf file.
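Since the echo command appends with >>, running it a second time would leave two options lines in the file. One way to guard against that is sketched below; the function name add_hangcheck_options is hypothetical, and the example runs against a temporary file rather than the real /etc/modprobe.conf so that it can be tried safely.

```shell
#!/bin/sh
# Hypothetical helper: append the hangcheck-timer options line to a file
# only if no "options hangcheck-timer" line is already present.
# $1 = path to the configuration file
add_hangcheck_options() {
    conf=$1
    line="options hangcheck-timer hangcheck_tick=30 hangcheck_margin=180"
    if grep -q '^options hangcheck-timer ' "$conf" 2>/dev/null; then
        echo "entry already present in $conf"
    else
        echo "$line" >> "$conf"
    fi
}

# Demonstration on a scratch file; calling it twice still leaves one entry.
demo=$(mktemp)
add_hangcheck_options "$demo"
add_hangcheck_options "$demo"
grep -c '^options hangcheck-timer ' "$demo"    # counts matching lines
rm -f "$demo"
```

On a real node, the function would be called as root with /etc/modprobe.conf as its argument.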
Manually Loading the Hangcheck Kernel Module for Testing
Oracle is responsible for loading the hangcheck-timer kernel module when required. For that reason, it is not necessary to perform a modprobe or insmod of the hangcheck-timer kernel module in any of the startup files (e.g. /etc/rc.local).
It is only out of pure habit that I continue to include a modprobe of the hangcheck-timer kernel module in the /etc/rc.local file. Someday I will get over it, but realize that it does not hurt to include a modprobe of the hangcheck-timer kernel module during startup.
So to keep myself sane and able to sleep at night, I always configure the loading of the hangcheck-timer kernel module on each startup as follows:
# echo "/sbin/modprobe hangcheck-timer" >> /etc/rc.local
(Note: You don't have to manually load the hangcheck-timer kernel module using modprobe or insmod after each reboot. The hangcheck-timer module will be loaded by Oracle automatically when needed.)
Now, to test the hangcheck-timer kernel module to verify it is picking up the correct parameters we defined in the /etc/modprobe.conf file, use the modprobe command. Although you could load the hangcheck-timer kernel module by passing it the appropriate parameters (e.g. insmod hangcheck-timer hangcheck_tick=30 hangcheck_margin=180), we want to verify that it is picking up the options we set in the /etc/modprobe.conf file.
To manually load the hangcheck-timer kernel module and verify it is using the correct values defined in the /etc/modprobe.conf file, run the following command:
# su -
# modprobe hangcheck-timer
# grep Hangcheck /var/log/messages | tail -2
Sep 27 23:11:51 linux2 kernel: Hangcheck: starting hangcheck timer 0.5.0 (tick is 30 seconds, margin is 180 seconds)
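Rather than reading the log line by eye, the tick and margin values can be extracted and compared against the expected settings. This is a minimal sketch that assumes the message format shown in the output above; the function name check_hangcheck_line is hypothetical, and the sample line is copied from that output.

```shell
#!/bin/sh
# Hypothetical helper: compare the tick and margin reported in a Hangcheck
# log line against the expected values.
# $1 = log line, $2 = expected tick, $3 = expected margin
check_hangcheck_line() {
    line=$1
    want_tick=$2
    want_margin=$3
    # Pull the numbers out of "tick is N seconds" and "margin is N seconds"
    tick=$(printf '%s\n' "$line" | sed -n 's/.*tick is \([0-9][0-9]*\) seconds.*/\1/p')
    margin=$(printf '%s\n' "$line" | sed -n 's/.*margin is \([0-9][0-9]*\) seconds.*/\1/p')
    if [ "$tick" = "$want_tick" ] && [ "$margin" = "$want_margin" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Sample line taken from the output above
sample='Sep 27 23:11:51 linux2 kernel: Hangcheck: starting hangcheck timer 0.5.0 (tick is 30 seconds, margin is 180 seconds)'
check_hangcheck_line "$sample" 30 180    # prints OK
```

On a live node, the first argument could instead be the last matching line from the log, for example the result of grep Hangcheck /var/log/messages | tail -1.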