Sun Grid Engine Information Center

Configuring Shadow Master Hosts

About Shadow Master Hosts
Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host. When the shadow master daemon detects that the master daemon sge_qmaster has failed abnormally, it starts up a new sge_qmaster daemon on the host where the shadow master daemon is running.
The automatic failover start of an sge_qmaster on a shadow master host takes approximately one minute. Until the new sge_qmaster is running, any Grid Engine system command that you run returns an error message.

Shadow Master Host Requirements
To prepare a host as a shadow master, the following requirements must be met:
As soon as these requirements are met, the shadow-master-host facility is activated for this host. You do not have to restart the Grid Engine system daemons to activate the feature.

Shadow Master Host File
The shadow master host file, $SGE_ROOT/$SGE_CELL/common/shadow_masters, contains the following:
The format of the shadow master host file is as follows:
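As a rough illustration only, the following Python sketch reads a shadow master host file and reports the failover order. It assumes the file holds one host name per line, with the primary master host on the first line. The host names and the helper failover_order are hypothetical and are not part of the Grid Engine software.

```python
from pathlib import Path
import tempfile


def failover_order(path):
    """Return the host names in file order, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Hypothetical file content: primary master first, then the shadow masters.
sample = "master1\nshadow1\nshadow2\n"

# Write the sample to a temporary file so the sketch runs without a real cluster.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "shadow_masters"
    demo_file.write_text(sample)
    hosts = failover_order(demo_file)

primary, shadows = hosts[0], hosts[1:]
print("primary master:", primary)
print("takeover order:", shadows)
```

On a real installation, the path passed to failover_order would be $SGE_ROOT/$SGE_CELL/common/shadow_masters.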
The order of the shadow master hosts is significant. The first line of the file names the primary master host. If the primary master host fails, the shadow master defined in the second line takes over. If that shadow master also fails, the shadow master defined in the third line takes over, and so forth.

Starting Shadow Master Hosts
To start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated or that it will terminate without performing actions that interfere with the newly started shadow sge_qmaster.
In very rare circumstances, it might be impossible to determine that the old sge_qmaster has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts. See Chapter 10, Fine Tuning, Error Messages, and Troubleshooting for further information. In addition, every attempt to open a TCP connection to an sge_qmaster daemon fails permanently. If this occurs, make sure that no master daemon is running, and then restart sge_qmaster manually on any of the shadow master machines. See Restarting Daemons From the Command Line for further details.

Configuring Shadow Master Hosts Environment Variables
Three environment variables affect the takeover time for a shadow master:
These variables interact in the following ways:
A reasonable configuration might be to set SGE_CHECK_INTERVAL to 45 seconds and SGE_GET_ACTIVE_INTERVAL to 90 seconds. With these values, the takeover occurs after about 2 minutes. To check the operation of the shadow host after you have configured these environment variables, disconnect the master host's network cable to simulate a failure.
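The example values above can be sketched in Python as an environment that would be handed to whatever process starts sge_shadowd. This is a minimal sketch under stated assumptions: the helpers shadowd_environment and rough_takeover_seconds are hypothetical, and the estimate simply adds the two intervals, which is an assumption chosen because it is consistent with the "about 2 minutes" figure, not a documented formula.

```python
import os


def shadowd_environment(check_interval, get_active_interval, base_env=None):
    """Return a copy of the environment with the two interval variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable values must be strings.
    env["SGE_CHECK_INTERVAL"] = str(check_interval)
    env["SGE_GET_ACTIVE_INTERVAL"] = str(get_active_interval)
    return env


def rough_takeover_seconds(check_interval, get_active_interval):
    """Rough estimate only (assumption): the sum of the two intervals."""
    return check_interval + get_active_interval


# Values from the example configuration: 45 s and 90 s.
env = shadowd_environment(45, 90, base_env={})
estimate = rough_takeover_seconds(45, 90)

print(env)
print(f"estimated takeover after roughly {estimate} seconds")
```

Passing base_env={} keeps the sketch independent of the local shell. Omitting it copies the current process environment, which is closer to what a real start script would need.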