|
Upgrading From a Previous Version of Sun Grid Engine Software
 | Note
- The following instructions will work only on the Sun Grid Engine 6.2 RR release.
- The upgrade procedure is only able to upgrade your software from version 6.0 update 2 or higher. If you are running an older version of the Sun Grid Engine software, such as 5.3 or 6.0, you must upgrade to version 6.0 update 2 or higher and then upgrade again to version 6.2 as explained below. See Upgrading from 5.3 to 6.0.
|
About Upgrading the Software
 | Note
- The upgrade procedure is now partly destructive. See the constraints.
- The LD_LIBRARY_PATH variable is not set in Grid Engine 6.2 software. Remove the existing LD_LIBRARY_PATH settings from 6.0 before you start a 6.2 installation.
- Before you begin the upgrade process, make sure that you source the existing $SGE_ROOT/$SGE_CELL/common/settings.sh or $SGE_ROOT/$SGE_CELL/common/settings.csh file.
|
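The two environment notes above can be checked before starting. The following is a minimal sketch, not part of the Sun Grid Engine distribution: `check_upgrade_env` is a hypothetical helper, and it assumes that the sourced settings file exports `SGE_ROOT` and `SGE_CELL` and that leftover 6.0 library entries point below `$SGE_ROOT/lib`.

```shell
#!/bin/sh
# Illustrative pre-upgrade check; check_upgrade_env is a hypothetical helper,
# not part of the Sun Grid Engine distribution.
check_upgrade_env() {
    # Assumption: the sourced settings file exports SGE_ROOT and SGE_CELL.
    if [ -z "${SGE_ROOT:-}" ] || [ -z "${SGE_CELL:-}" ]; then
        echo "SGE_ROOT or SGE_CELL is not set; source the settings file first"
        return 1
    fi
    # Assumption: leftover 6.0 entries point below $SGE_ROOT/lib.
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${SGE_ROOT}/lib"*)
            echo "LD_LIBRARY_PATH still references ${SGE_ROOT}/lib"
            return 1
            ;;
    esac
    return 0
}
```

Calling `check_upgrade_env` after sourcing settings.sh prints a message and returns non-zero if either condition is not met.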
The upgrade procedure uses the cluster configuration information from the older version of the software to install the Grid Engine 6.2 software on the master host. Beginning with the Sun Grid Engine 6.2 release, you can install 6.2 to a different $SGE_ROOT or $SGE_CELL and transfer the old configuration to this cluster. This method is called cloned cluster configuration. You might want to use this method to accomplish the following:
- To test the upgrade before making the real upgrade.
- To keep the old cluster running.
Before You Upgrade
Choose one of the following methods to upgrade to 6.2:
- New 6.2 installation (different $SGE_ROOT or $SGE_CELL) using the same configuration as was used for the old cluster (cloned cluster configuration).
If you use the cloned cluster configuration, you do not have to stop or in any way affect the original cluster. You simply install a new qmaster and transfer the configuration from the old cluster to the new one. Then, you manually restart the new execution daemons on all the original execution hosts.
The disadvantage of the cloned configuration method is that you have to install the new qmaster and might lose some of the configuration information during the upgrade (see the constraints). Another disadvantage is that the original execution host will now have twice as many slots - one set for the old cluster and one for the new one.
- Real upgrade of the existing cluster (same $SGE_ROOT and $SGE_CELL).
Constraints
The following constraints apply to both upgrade methods:
- Dynamic and static load values will be lost (only static values will be recreated).
- The sharetree usage will be lost.
- Neither jobs nor advanced reservations (ARs) will be replicated.
- There might be running or pending jobs in the cluster when the configuration is saved. If you decide to install the new Sun Grid Engine version in the same $SGE_ROOT and $SGE_CELL, then you must remove all jobs from the old cluster before the old cluster is shut down and the new software is installed.
- The previous state of a disabled queue will be lost if the initial_state attribute of the queue configuration is set to default.
Additional Constraints for the New 6.2 Installation with Cloned Configuration
For the cloned cluster configuration, you must also define several new variables and directories that must be different from the original settings:
- $SGE_ROOT
- $SGE_CELL
- $SGE_CLUSTER_NAME
- $SGE_QMASTER_PORT
- $SGE_EXECD_PORT
- Master daemon spooling directory (qmaster_spool_dir)
- Execution daemon spooling directory (execd_spool_dir)
- Group ID range for the jobs (gid_range)
 | Caution Only one SGE_Helper_Service.exe can run on an execution host. You cannot use the same Windows execution host for a 6.0 or 6.1 cluster and a 6.2 cluster. |
 | Note
- Because there have been significant changes in the Grid Engine 6.2 software, loading the configuration adds and removes some configuration attributes. Adding and removing configuration attributes might affect the operation of the cluster.
- To ensure stability, you should always follow this process:
- Upgrade to the new $SGE_ROOT or $SGE_CELL (cloned cluster configuration).
- Test that the original cluster configuration did not change and that the functionality of the cluster remains intact.
- Perform the real upgrade of the original cluster, if desired.
|
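The requirement that the cloned cluster use different settings can be sanity-checked before running the installer. The sketch below is illustrative only: `check_clone_settings` is a hypothetical helper, it covers only the location and the two ports, and it treats the location as changed when either `$SGE_ROOT` or `$SGE_CELL` differs. The check that the two new ports differ from each other is an extra assumption, not something the text above states.

```shell
#!/bin/sh
# Illustrative check; check_clone_settings is a hypothetical helper.
# Arguments: OLD_ROOT OLD_CELL OLD_QPORT OLD_EPORT NEW_ROOT NEW_CELL NEW_QPORT NEW_EPORT
check_clone_settings() {
    rc=0
    # The new cluster needs its own $SGE_ROOT/$SGE_CELL location.
    if [ "$1/$2" = "$5/$6" ]; then
        echo "SGE_ROOT and SGE_CELL are both unchanged"
        rc=1
    fi
    # Neither new port may reuse a port of the original cluster.
    for new_port in "$7" "$8"; do
        for old_port in "$3" "$4"; do
            if [ "$new_port" = "$old_port" ]; then
                echo "port $new_port is already used by the original cluster"
                rc=1
            fi
        done
    done
    # Extra assumption: the new qmaster and execd ports differ from each other.
    if [ "$7" = "$8" ]; then
        echo "SGE_QMASTER_PORT and SGE_EXECD_PORT are identical"
        rc=1
    fi
    return "$rc"
}
```

The old values can be read from the sge_root, sge_cell, and ports files that the backup creates.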
Back Up the Configuration of the Old Cluster
You can create this backup at any time before you start the upgrade procedure. The backup is the same for both types of upgrade procedure. To create the backup, at least the qmaster daemon must be running.
What the Backup Contains
The backup saves the following files:
- arseqnum
- jobseqnum
- act_qmaster
- bootstrap
- cluster_name
- host_aliases
- qtask
- sge_aliases
- sge_ar_request
- sge_request
- sge_qstat
- sge_qquota
- shadow_masters
- accounting
- dbwriter.conf
- jmx directory
 | Caution
- During the upgrade procedure, you can select the next job ID. Do not select a job ID that is less than the last job ID in the accounting file in the backup. If you do, the accounting file will contain some job IDs twice. This leads to unexpected behaviors.
- To avoid the problem, accept the suggested default for the next job ID. The upgrade procedure calculates a safe value for the default.
|
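If you prefer to verify a manually chosen job ID rather than accept the default, a check along the following lines may help. It is a sketch under stated assumptions: the helper names are hypothetical, the accounting file is assumed to hold colon-separated records with the job number in field 6 and comment lines starting with `#`, and the highest recorded job number is used as a conservative stand-in for the last job ID.

```shell
#!/bin/sh
# Illustrative helpers; last_job_id and next_id_is_safe are hypothetical names.

# last_job_id FILE: print the highest job number found in an accounting file.
# Assumption: colon-separated records, job number in field 6, '#' comments.
last_job_id() {
    awk -F: '!/^#/ && $6 + 0 > max { max = $6 + 0 } END { print max + 0 }' "$1"
}

# next_id_is_safe NEXT_ID FILE: succeed only if NEXT_ID is greater than
# every job number recorded in FILE.
next_id_is_safe() {
    [ "$1" -gt "$(last_job_id "$2")" ]
}
```

For example, `next_id_is_safe 6000 /backups/sge_6.1_June10_2008/accounting` returns non-zero if the backed-up accounting file already contains job 6000 or a higher job number.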
The backup process creates the following files:
- sge_root - old $SGE_ROOT
- sge_cell - old $SGE_CELL
- ports - old $SGE_QMASTER_PORT and $SGE_EXECD_PORT
- win_hosts - A list of registered Windows execution hosts at the time of the backup
The standard qconf client is used to save the complete cluster configuration.
How to Back Up the Cluster
- Either download the backup script or get the backup script from the Sun Grid Engine 6.2 common package (util/upgrade_modules/save_sge_config.sh).
- (Optional) Verify that the script is executable.
- Source the $SGE_ROOT/$SGE_CELL/common/settings.sh (or .csh) file of the original cluster.
- Run the backup script.
The backup script has one argument, which is the path to the directory in which to store the backup. The directory must not already exist, but the user must have permission to create it.
 | Note You must run the backup script on an admin host (qconf -sh) as a manager or operator user (typically sgeadmin). |
# ./save_sge_config.sh /backups/sge_6.1_June10_2008
The backup process displays a message confirming that the backup succeeded.
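Because the backup directory must not exist beforehand, it can be convenient to test the target path first. The following sketch uses a hypothetical helper, `backup_target_ok`, and additionally assumes that the parent directory already exists; the original text does not say whether the backup script creates missing parent directories.

```shell
#!/bin/sh
# Illustrative helper; backup_target_ok is a hypothetical name.
# backup_target_ok DIR: succeed if DIR does not exist yet and its parent
# directory exists and is writable by the current user.
backup_target_ok() {
    if [ -e "$1" ]; then
        echo "backup directory already exists: $1"
        return 1
    fi
    parent=$(dirname "$1")
    if [ ! -d "$parent" ] || [ ! -w "$parent" ]; then
        echo "cannot create a directory under: $parent"
        return 1
    fi
    return 0
}

# Example use, with the path from the text:
#   backup_target_ok /backups/sge_6.1_June10_2008 &&
#       ./save_sge_config.sh /backups/sge_6.1_June10_2008
```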
How to Install the 6.2 Software Using the Cloned Cluster Configuration Method
 | Caution Do not make both the new cluster and the old cluster available to your users. If you do, execution hosts would offer the original number of slots for both clusters and might become overloaded. |
- Back up the original cluster settings as described in How to Back Up the Cluster.
- (Optional) ARCo Upgrade Prerequisites
If you use ARCo and want the data from the old and new clusters in the same ARCo database, do not install the dbwriter on the new cluster with the old dbwriter's database parameters until the dbwriter on the old cluster has been stopped and all data from the old cluster has been inserted into the database. After you install the dbwriter on the new cluster with the same database parameters, do not start the dbwriter on the old cluster again. Otherwise, your database will be compromised.
- Wait to install ARCo on the new cluster until all the jobs are drained from the old cluster, the cluster is stopped and the old reporting file is processed completely.
There should be no reporting or reporting.processing file in the $SGE_ROOT/$SGE_CELL/common directory of the old cluster.
 | Note Jobs can be submitted and the reporting file generated on the new cluster, as long as there is no dbwriter installed on the new cluster. |
 | Caution
- There cannot be more than one dbwriter process writing into the same ARCo database and schema.
- If you create a new ARCo database for the new cluster, you cannot later merge it with the old ARCo database, due to the primary key constraints.
|
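The condition that no reporting or reporting.processing file remains can be tested with a short script. This is a minimal sketch; `reporting_drained` is a hypothetical helper that only checks for the two file names mentioned above.

```shell
#!/bin/sh
# Illustrative helper; reporting_drained is a hypothetical name.
# reporting_drained COMMON_DIR: succeed when neither the reporting file nor
# the reporting.processing file exists in COMMON_DIR.
reporting_drained() {
    for name in reporting reporting.processing; do
        if [ -e "$1/$name" ]; then
            echo "still present: $1/$name"
            return 1
        fi
    done
    return 0
}

# Example use, with the old cluster settings sourced:
#   reporting_drained "$SGE_ROOT/$SGE_CELL/common"
```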
Once the reporting file on the old cluster is processed, on the dbwriter host:
- Source the cluster settings.sh (or .csh) file.
- Stop the dbwriter:
# $SGE_ROOT/$SGE_CELL/common/sgedbwriter stop
- Extract the new 6.2 binaries and common files to the new $SGE_ROOT directory.
- Start the new upgrade installation of the qmaster from the new $SGE_ROOT directory.
This starts the upgrade procedure. See the Example Upgrade for Cloned Cluster Configuration.
 | Tip To enable or disable additional features such as JMX or CSP, or to use the old IJS, you must provide additional flags to the upgrade script in the same way that you would for a qmaster installation. For example, to upgrade a cluster and enable the JMX thread in qmaster and use CSP mode, run the following command:
./inst_sge -upd -jmx -csp |
- Accept the displayed license.
- Enter the complete path to the backup directory.
For example, /backups/sge_6.1_June10_2008. See Step 6 in the example.
- Enter the new $SGE_ROOT directory.
The default is the current directory. For more information, see SGE_ROOT. See Step 7 in the example.
- Select a new $SGE_CELL directory.
The default is the $SGE_CELL directory from the backup. For more information, see SGE_CELL. See Step 8 in the example.
- Select a new SGE_QMASTER_PORT number.
The default is the $SGE_QMASTER_PORT number from the backup + 2. See Step 9 in the example.
- Select a new SGE_EXECD_PORT number.
The default is the $SGE_EXECD_PORT number from the backup + 2. See Step 10 in the example.
- Select a new qmaster spooling directory.
The default is $SGE_ROOT/$SGE_CELL/spool/qmaster. See Step 11 in the example.
- Select a new $SGE_CLUSTER_NAME.
The default is p$SGE_QMASTER_PORT. For more information, see SGE_CLUSTER_NAME. See Step 12 in the example.
- (Optional) Choose the JMX configuration.
For more information about JMX, see JMX guide.
If you started the upgrade using the -jmx option, one of the following choices appears:
- Choose if you want to use JMX settings from the backup or use new settings.
This question appears when JMX exists in the backup.
- Choose a JMX port.
This question appears when JMX does not exist in the backup.
- Select a spooling method.
For more information on choosing a spooling mechanism, see Choosing Between Classic Spooling and Database Spooling. See Step 14 in the example.
- Choose if you want to use interactive jobs support (IJS) settings from the backup or use the new defaults for 6.2.
In most cases, you should use the new defaults which enable the new interactive jobs support. Step 15 in the example shows the new defaults.
 | Caution If you changed QLOGIN_DAEMON, QLOGIN_COMMAND, RLOGIN_DAEMON, RLOGIN_COMMAND, RSH_DAEMON, or RSH_COMMAND configuration attributes, you should verify that the new IJS will not break your site-specific settings. |
- Choose the group ID range.
The default is the last group ID from the backup + 100 and the same range. See Step 16 in the example.
- Select the next job ID.
The default is old jobseqnum + 1000, rounded up to the nearest 1000. See Step 17 in the example.
- (Optional) Select the next AR ID.
This question appears only if arseqnum is in the backup. The default is old arseqnum + 1000, rounded up to the nearest 1000. See Step 18 in the example.
- Select automatic startup options.
See Step 19 in the example.
One of the following choices appears:
- Choose whether to run qmaster as an SMF service.
This question appears only on systems that run at least version 10 of the Solaris OS.
- Choose whether to use RC scripts for qmaster.
This question appears on platforms that are not running at least version 10 of the Solaris OS or if you started the upgrade using the -nosmf option.
- Load the old configuration.
See Step 20 in the example.
If this step fails with a critical error:
- Check the log file /tmp/sge_backup_date.log.
- Try to reload the configuration through the $SGE_ROOT/util/upgrade_modules/load_sge_config.sh script and the arguments displayed in the previous step.
- If the preceding steps do not resolve the problem, stop the upgrade process.
- (Optional) Upgrade ARCo.
If you use ARCo, you need to upgrade it. If you want to use the same ARCo database, copy the $SGE_ROOT/$SGE_CELL/common/dbwriter.conf file from the old cluster into the same directory on the new cluster. The file is sourced during the installation of dbwriter, and you are prompted only for any missing information. See Upgrading ARCo step 6.
- Run the post-upgrade procedures.
 | Info The post-upgrade procedures are easier when you have root access to all machines through ssh or rsh without having to enter a password. To use rsh instead of the default ssh, run the ./inst_sge command with the -rsh argument. Example:
# ./inst_sge -upd-execd -rsh |
- Initialize the local execd spool directories
This step creates the local execd spool directories on the execd hosts with the correct permissions. Run the following command as root from the master host in $SGE_ROOT directory:
- (Optional) Create new RC scripts for the whole cluster.
 | Caution This command removes old RC scripts. To keep the old RC scripts, do not run this command. |
To start the services automatically after a reboot, run the following command as root from the master host in $SGE_ROOT directory:
- (Optional) Install or update the Windows helper service.
Perform this step to use the Windows execution hosts with the 6.2 cluster. When connecting to each Windows execution host, you are prompted for an administrator user to connect to the Windows host. If all your Windows hosts share the same administrative user, set the environment variable SGE_WIN_ADMIN to that user to access all Windows hosts without additional user intervention. Example:
(sh, bash)# export SGE_WIN_ADMIN=Administrator
(csh,tcsh)# setenv SGE_WIN_ADMIN Administrator
To install or update the Windows helper service, run the following command as root from the master host in $SGE_ROOT directory:
 | Caution Only one SGE_Helper_Service.exe can run on an execution host. You cannot use the same Windows execution host for a 6.0 or 6.1 cluster and a 6.2 cluster. |
- Start the new execution daemons.
Optionally, if you can log in without typing a password, you can start the whole cluster as the root user from the $SGE_ROOT directory with a single command:
This command starts the master daemon, shadow daemons, and all execution daemons.
Upgrade is complete.
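The default values that the procedure above describes for the new ports and for the next job or AR ID are simple arithmetic. The sketch below is one reading of the stated rules, not the installer's own code: the function names are hypothetical, and it assumes that a value already on a multiple of 1000 after adding 1000 is left unchanged by the rounding.

```shell
#!/bin/sh
# Illustrative helpers; the function names are hypothetical.

# default_port OLD_PORT: the port number from the backup + 2.
default_port() {
    echo $(( $1 + 2 ))
}

# default_next_id OLD_SEQNUM: old value + 1000, rounded up to the nearest 1000.
# Assumption: an exact multiple of 1000 is not rounded further.
default_next_id() {
    echo $(( ( ($1 + 1000 + 999) / 1000 ) * 1000 ))
}
```

With an old qmaster port of 6444 and an old jobseqnum of 4321, `default_port 6444` prints 6446 and `default_next_id 4321` prints 6000.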
How to Upgrade the Original Cluster to 6.2 Software (Real Upgrade)
- (Optional) Test the cloned cluster, if you used the cloned cluster configuration method to transfer the configuration to a new 6.2 cluster.
- Back up the original cluster settings as described in How to Back Up the Cluster.
- Stop the scheduler:
- Verify that no jobs are running on the cluster.
- Stop the old cluster:
# qconf -ke all
# $SGE_ROOT/$SGE_CELL/common/sgemaster stop
- (Optional) Stop the Berkeley DB server, if your cluster uses Berkeley DB server spooling.
On the BDB server host:
- Source the cluster settings.sh (or .csh) file.
- Type the following command:
# $SGE_ROOT/$SGE_CELL/common/sgebdb stop
- (Optional) If you use ARCo, ensure that the reporting file has been completely processed by the dbwriter.
There should be no reporting or reporting.processing file in the $SGE_ROOT/$SGE_CELL/common directory.
Once the reporting file is processed, on the dbwriter host:
- Source the cluster settings.sh (or .csh) file.
- Stop the dbwriter:
# $SGE_ROOT/$SGE_CELL/common/sgedbwriter stop
 | Warning If you use ARCo, you must completely process the reporting file and stop the dbwriter before you continue. |
- Extract the new 6.2 binaries and common files to the $SGE_ROOT directory.
 | Caution Do not remove any of the $SGE_ROOT directory contents, except for the case where the new Sun Grid Engine 6.2 binaries differ from the existing installation. For example, you might have used your custom lx26-amd64 binaries, but Sun Grid Engine 6.2 uses lx24-amd64 even for 2.6 kernels. In that case you must remove the old binaries manually!
You must ensure that all binaries for all used architectures were updated and no architecture with the old version remains in the $SGE_ROOT directory. |
- Start the new upgrade on the original qmaster host from the $SGE_ROOT directory.
 | Tip To enable or disable some additional features like JMX, CSP, or to use the old IJS, you must provide additional flags to the upgrade script in the same way that you would for qmaster installation. For example, to upgrade a cluster and enable the JMX thread in qmaster and use CSP mode, run the following command: ./inst_sge -upd -jmx -csp |
- Accept the displayed license.
- Enter the complete path to the backup directory.
For example, /backups/sge_6.1_June10_2008.
 | Caution If you do not specify the original $SGE_ROOT and $SGE_CELL in the next two steps, the upgrade will not be a real upgrade. Instead, the cloned cluster configuration method will be used. |
- Enter the $SGE_ROOT directory.
The default is the current directory. For more information, see SGE_ROOT.
- Enter the $SGE_CELL directory.
The default is default. For more information, see SGE_CELL.
- Select a new $SGE_CLUSTER_NAME.
The default value is one of the following, depending on which is found first:
- The existing SGE_CLUSTER_NAME ($SGE_ROOT/$SGE_CELL/common/cluster_name)
- The SGE_CLUSTER_NAME from the backup
- p$SGE_QMASTER_PORT
For more information, see SGE_CLUSTER_NAME.
- (Optional) Select the JMX configuration.
For more information about JMX, see JMX guide.
If you started the upgrade using the -jmx option, one of the following choices appears:
- Choose if you want to use JMX settings from the backup or use new settings.
This question appears when JMX exists in the backup.
- Choose a JMX port.
This question appears when JMX does not exist in the backup.
- Choose if you want to keep the spooling method from the backup.
- (Optional) Select a spooling method.
This question appears if you chose not to keep the spooling method from the backup in the previous step. See the example. For more information on choosing a spooling mechanism, see Choosing Between Classic Spooling and Database Spooling.
- Choose if you want to use interactive jobs support (IJS) settings from the backup or use the new defaults for 6.2.
In most cases, you should use the new defaults which enable the new interactive jobs support.
 | Caution If you changed QLOGIN_DAEMON, QLOGIN_COMMAND, RLOGIN_DAEMON, RLOGIN_COMMAND, RSH_DAEMON, or RSH_COMMAND configuration attributes, you should verify that the new IJS will not break your site-specific settings. |
- Select the next job ID.
The default is old jobseqnum + 1000, rounded up to the nearest 1000.
- (Optional) Select the next AR ID.
This question appears only if arseqnum is in the backup. The default is old arseqnum + 1000, rounded up to the nearest 1000.
- Choose automatic startup options.
One of the following choices appears:
- Choose whether to run qmaster as an SMF service.
This question appears only on systems that run at least version 10 of the Solaris OS.
- Choose whether to use RC scripts for qmaster.
This question appears on platforms that are not running at least version 10 of the Solaris OS or if you started the upgrade using the -nosmf option.
- Load the old configuration.
If this step fails with a critical error:
- Check the log file /tmp/sge_backup_date.log.
- Try to reload the configuration through the $SGE_ROOT/util/upgrade_modules/load_sge_config.sh script and the arguments displayed in the previous step.
- If the preceding steps do not resolve the problem, stop the upgrade process.
- (Optional) Copy the binaries and the common directory to all the hosts in the cluster, if they are not on a shared file system.
If you use local binaries or a local common directory for each host, you must copy all the new binaries and the common directory locally to each host. Ensure that all binaries are updated and no architecture with the old version remains in the $SGE_ROOT directory.
 | Note If you do not perform this operation the qmaster host will have Sun Grid Engine 6.2 binaries, while the rest of the cluster will still have the old version and will not work as desired! |
- (Optional) Upgrade ARCo.
If you use ARCo, you need to upgrade it. See Upgrading ARCo step 6.
- Run the post-upgrade procedures.
 | Info The post-upgrade procedures are easier when you have root access to all machines through ssh or rsh without having to enter a password. To use rsh instead of the default ssh, run the ./inst_sge command with the -rsh argument. Example:
# ./inst_sge -upd-execd -rsh |
- Initialize the local execd spool directories
This step creates the local execd spool directories on the execd hosts with the correct permissions. Run the following command as root from the master host in $SGE_ROOT directory:
- (Optional) Create new RC scripts for the whole cluster.
 | Caution This command removes old RC scripts. To keep the old RC scripts, do not run this command. |
To start the services automatically after a reboot, run the following command as root from the master host in $SGE_ROOT directory:
- (Optional) Install or update the Windows helper service.
Perform this step to use the Windows execution hosts with the 6.2 cluster. When connecting to each Windows execution host, you are prompted for an administrator user to connect to the Windows host. If all your Windows hosts share the same administrative user, set the environment variable SGE_WIN_ADMIN to that user to access all Windows hosts without additional user intervention. Example:
(sh, bash)# export SGE_WIN_ADMIN=Administrator
(csh,tcsh)# setenv SGE_WIN_ADMIN Administrator
To install or update the Windows helper service, run the following command as root from the master host in $SGE_ROOT directory:
 | Caution Only one SGE_Helper_Service.exe can run on an execution host. You cannot use the same Windows execution host for a 6.0 or 6.1 cluster and a 6.2 cluster. |
- Start the new execution daemons.
Optionally, if you can log in without typing a password, you can start the whole cluster as the root user from the $SGE_ROOT directory with a single command:
This command starts the master daemon, shadow daemons, and all execution daemons.
Upgrade is complete.
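For the step in this procedure that verifies that no jobs remain before the old cluster is stopped, the job listing can be counted with a small filter. This is a sketch under stated assumptions: `count_listed_jobs` is a hypothetical helper, and it assumes that the qstat listing begins with two header lines (column titles and a dashed rule) and is empty when there are no jobs.

```shell
#!/bin/sh
# Illustrative helper; count_listed_jobs is a hypothetical name.
# Reads qstat output on standard input and prints the number of job lines.
# Assumption: the listing starts with two header lines (column titles and a
# dashed rule) and is empty when there are no jobs.
count_listed_jobs() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Example use on a host with the old cluster settings sourced:
#   qstat -u '*' | count_listed_jobs
```

A result of 0 suggests that no running or pending jobs are listed for any user. Jobs that occupy several lines in the listing are counted once per line.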
|
Comments (2)
Aug 20, 2008
Dean_Stanton says:
Real Upgrade step 8 says
Extract the new 6.2 binaries and common files to the $SGE_ROOT directory.
Caution
Do not remove any of the $SGE_ROOT directory contents. The files will be overwritten as required.
This wording seems to forbid using pkgrm prior to pkgadd (on a Solaris host),
but I think this is acceptable (and I think it should be). I suggest you mean
Do not remove configured files from under $SGE_ROOT (but you may remove the packages before adding new ones).
In any case, installation will overwrite older files as required.
Oct 13, 2008
surajp says:
Thanks for the comment. I will make the required changes to the doc.