Configuring Shadow Master Hosts

About Shadow Master Hosts

Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host. When the shadow master daemon detects that the master daemon sge_qmaster has failed abnormally, it starts up a new sge_qmaster daemon on the host where the shadow master daemon is running.

Note
If the master daemon is shut down gracefully, the shadow master daemon does not take over. If you want the shadow master daemon to take over after a graceful shutdown of the master daemon, remove the lock file that is located in the sge_qmaster spool directory. The default location of this spool directory is $SGE_ROOT/$SGE_CELL/spool/qmaster.

The automatic failover start of sge_qmaster on a shadow master host takes approximately one minute. During that time, any Grid Engine system command that you run returns an error message.

Note
The file $SGE_ROOT/$SGE_CELL/common/act_qmaster contains the name of the host that is currently running the sge_qmaster daemon.
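
As an illustration, the current master can be looked up by reading that file. The following is a minimal sketch, not part of the Grid Engine software: the function name current_master is hypothetical, the cell name "default" is assumed as a fallback, and the file is assumed to hold a single host name.

```python
from pathlib import Path


def current_master(sge_root, sge_cell="default"):
    """Return the host name recorded in $SGE_ROOT/$SGE_CELL/common/act_qmaster.

    Hypothetical helper; assumes the file holds one host name.
    """
    path = Path(sge_root) / sge_cell / "common" / "act_qmaster"
    # Strip the trailing newline and any surrounding whitespace.
    return path.read_text().strip()
```

A caller would typically pass the values of the SGE_ROOT and SGE_CELL environment variables, for example current_master(os.environ["SGE_ROOT"], os.environ["SGE_CELL"]).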

Shadow Master Host Requirements

To prepare a host as a shadow master, the following requirements must be met:

  • The shadow master host must run sge_shadowd.
  • The shadow master host must share sge_qmaster status information, job configuration, and queue configuration that resides in a log file. In particular, a shadow master host needs read/write root access to the master host's spool directory and to the directory $SGE_ROOT/$SGE_CELL/common.
  • Either the Berkeley DB RPC server or classic Grid Engine system spooling must be used for sge_qmaster spooling. For more information, see Database Server and Spooling Host.
  • The shadow master host file must contain a line that defines the host as shadow master host.

As soon as these requirements are met, the shadow-master-host facility is activated for this host. You do not have to restart the Grid Engine system daemons to activate the feature.
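
Some of these requirements can be checked before sge_shadowd is started. The sketch below is a hypothetical preflight helper, not a Grid Engine tool. It assumes the default spool location, tests access for the user who runs the script rather than for root, and compares host names as plain strings, which might not match how the Grid Engine system resolves host names. It does not verify that sge_shadowd is running or which spooling method is in use.

```python
import os
from pathlib import Path


def shadow_prerequisites(sge_root, sge_cell, hostname, qmaster_spool=None):
    """Report which file-level shadow master requirements appear to be met.

    Hypothetical helper; all checks are made as the invoking user.
    """
    common = Path(sge_root) / sge_cell / "common"
    # Assumption: the spool directory is in its default location unless given.
    spool = Path(qmaster_spool) if qmaster_spool else Path(sge_root) / sge_cell / "spool" / "qmaster"

    shadow_file = common / "shadow_masters"
    listed = False
    if shadow_file.is_file():
        hosts = [line.strip() for line in shadow_file.read_text().splitlines() if line.strip()]
        listed = hostname in hosts

    return {
        "common_dir_rw": os.access(common, os.R_OK | os.W_OK),
        "spool_dir_rw": os.access(spool, os.R_OK | os.W_OK),
        "listed_in_shadow_masters": listed,
    }
```

A result in which every value is True suggests only that the shared directories and the host entry are in place; the remaining requirements still have to be confirmed by other means.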

Shadow Master Host File

The shadow master host file, $SGE_ROOT/$SGE_CELL/common/shadow_masters, contains the following:

  • The name of the primary master host, which is the machine where the master daemon sge_qmaster initially runs
  • The names of the shadow master hosts

The format of the shadow master host file is as follows:

  • The first line of the file defines the primary master host
  • The following lines define the shadow master hosts, one host per line

The order of the shadow master hosts is significant. The primary master host is defined in the first line of the file. If the primary master host fails, the shadow master defined in the second line takes over. If this shadow master also fails, the shadow master defined in the third line takes over, and so forth.
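
The file format and the takeover order can be illustrated with a short sketch. The host names and the function names below are hypothetical, and the code only models the ordering rule described above; it is not how sge_shadowd itself is implemented.

```python
# Hypothetical file contents: primary master first, then shadow masters in order.
EXAMPLE_SHADOW_MASTERS = """\
master1
shadow1
shadow2
"""


def parse_shadow_masters(text):
    """Split the file contents into (primary_host, [shadow_hosts])."""
    hosts = [line.strip() for line in text.splitlines() if line.strip()]
    if not hosts:
        raise ValueError("shadow_masters file is empty")
    return hosts[0], hosts[1:]


def next_candidate(text, failed):
    """Return the first host, in file order, that is not in the failed set."""
    primary, shadows = parse_shadow_masters(text)
    for host in [primary] + shadows:
        if host not in failed:
            return host
    return None  # every listed host has failed
```

With the example contents, next_candidate(EXAMPLE_SHADOW_MASTERS, {"master1"}) selects shadow1, and adding shadow1 to the failed set selects shadow2.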

Starting Shadow Master Hosts

To start a shadow sge_qmaster, the system must be sure either that the old sge_qmaster has terminated, or that it will terminate without performing actions that interfere with the newly started shadow sge_qmaster.

In very rare circumstances, you might not be able to determine that the old sge_qmaster has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts. See Chapter 10, Fine Tuning, Error Messages, and Troubleshooting for further information.

Also, any attempt to open a TCP connection to an sge_qmaster daemon fails permanently. If this occurs, make sure that no master daemon is running, and then restart sge_qmaster manually on any of the shadow master machines. See Restarting Daemons From the Command Line for further details.

Configuring Shadow Master Host Environment Variables

Three environment variables affect the takeover time for a shadow master:

  • SGE_DELAY_TIME: Controls how long sge_shadowd pauses after a takeover bid fails. This value is used only when multiple sge_shadowd instances are contending to be the master. The default is 600 seconds.
  • SGE_CHECK_INTERVAL: Controls how often sge_shadowd checks the heartbeat file. The default is 60 seconds.
  • SGE_GET_ACTIVE_INTERVAL: Controls how long the heartbeat file must remain unchanged before an sge_shadowd instance tries to take over.

These variables interact in the following ways:

  1. The master host updates the heartbeat file every 30 seconds.
  2. The sge_shadowd daemon checks the heartbeat file for changes at the interval, in seconds, that the SGE_CHECK_INTERVAL variable defines. This value must be greater than 30 seconds.
    • If the heartbeat file has been updated, the sge_shadowd daemon restarts the waiting clock.
    • If the heartbeat file has not been updated, the sge_shadowd daemon continues to wait until the number of seconds defined by the SGE_CHECK_INTERVAL variable expires. This action ensures that the sge_shadowd daemon is not too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.
  3. When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file is still not updated.
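
The interaction above can be sketched as a simplified model. This is an illustration under stated assumptions, not the actual sge_shadowd logic: the function name is hypothetical, the heartbeat file is represented by a value that changes whenever the master updates it, and a takeover is assumed to be triggered at the first check at which the value has stayed unchanged for at least SGE_GET_ACTIVE_INTERVAL seconds.

```python
MASTER_HEARTBEAT_PERIOD = 30  # seconds between heartbeat updates by the master


def seconds_until_takeover(observations, check_interval, get_active_interval):
    """Return the time of the check that triggers a takeover, or None.

    observations: heartbeat values read at successive checks, one per
    check_interval, starting at time 0 (simplified, hypothetical model).
    """
    if check_interval <= MASTER_HEARTBEAT_PERIOD:
        raise ValueError("SGE_CHECK_INTERVAL must be greater than 30 seconds")

    last_value = None
    unchanged_for = 0
    for i, value in enumerate(observations):
        now = i * check_interval
        if value != last_value:
            # Heartbeat was updated: restart the waiting clock.
            last_value = value
            unchanged_for = 0
        else:
            # Heartbeat is stale: keep waiting until the limit is reached.
            unchanged_for += check_interval
            if unchanged_for >= get_active_interval:
                return now
    return None
```

In this model, a master that stops updating the heartbeat is replaced only after the stale period reaches SGE_GET_ACTIVE_INTERVAL, while a master that keeps updating the file is never replaced.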

A reasonable configuration might be to set the SGE_CHECK_INTERVAL to 45 seconds and the SGE_GET_ACTIVE_INTERVAL to 90 seconds. With these values, the takeover occurs after about 2 minutes. To check the operation of the shadow host after you have configured these environment variables, disconnect the master host's network cable to simulate a failure.
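
One way to apply such a configuration is to prepare the environment in which sge_shadowd is started. The sketch below is a hypothetical helper that only builds and validates the environment; how sge_shadowd is actually launched depends on the installation and is not shown. The default arguments mirror the example values above, plus the documented 600-second default for SGE_DELAY_TIME.

```python
import os

MASTER_HEARTBEAT_PERIOD = 30  # seconds between heartbeat updates by the master


def shadowd_environment(check_interval=45, get_active_interval=90,
                        delay_time=600, base_env=None):
    """Return a copy of the environment with the three variables set.

    Hypothetical helper; rejects a check interval that is not greater
    than the 30-second heartbeat period.
    """
    if check_interval <= MASTER_HEARTBEAT_PERIOD:
        raise ValueError("SGE_CHECK_INTERVAL must be greater than 30 seconds")

    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "SGE_CHECK_INTERVAL": str(check_interval),
        "SGE_GET_ACTIVE_INTERVAL": str(get_active_interval),
        "SGE_DELAY_TIME": str(delay_time),
    })
    return env
```

The returned dictionary could be passed as the env argument of subprocess.run when the daemon is started from a script.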

Copyright 1994-2009 Sun Microsystems, Inc.