Grid Engine Documentation (Printable)

This page is designed to enable you to easily export all of the Grid Engine documentation to a PDF or Word document. It includes all of the Grid Engine documentation in the same order it is presented on the wiki.

To print a version of this document, log in to wikis.sun.com, click Tools, then select Export to PDF or Export to Word.

Grid Engine Documentation


Getting Started

Getting Started Guide (Printable)

Introducing the Sun Grid Engine System

A grid is a collection of computing resources that perform tasks. In its simplest form, a grid appears to users as a large system that provides a single point of access to powerful distributed resources. In its more complex form, a grid can provide many access points to users.

The Sun Grid Engine software enables you to apply resource management strategies to distribute jobs across a grid. Users can submit millions of jobs at a time without being concerned about where the jobs run. The system supports clusters with up to 63,000 cores.

Sites configure the system to maximize usage and throughput, while the system supports varying levels of timeliness and importance. Job priority and user share are instances of importance.

The Sun Grid Engine software provides advanced resource management and policy administration for UNIX and Windows environments that are composed of multiple shared resources. For more on Sun Grid Engine's features, see the product site.

To familiarize yourself more with the product and the wiki, explore the resources below:

Topic Description
How the System Operates Familiarize yourself with the Sun Grid Engine System components.
How Resources Are Matched to Requests Learn how the Grid Engine system matches resources to requests, what jobs and queues are, and how usage policies assist in managing workload.
Choosing a User Interface Learn about Sun Grid Engine's graphical user interface (QMON), the command-line interface, and the Distributed Resource Management Application API.
Users and User Categories Learn about Sun Grid Engine's user categories: managers, operators, users, and owners.

Using the Wiki

Topic Description
Using the Wiki Learn how to monitor documentation changes, how to print the wiki, and where to find the Wiki FAQ.

To print this section, see the Getting Started Guide (Printable).


How the System Operates

The Grid Engine system does the following:

  • Accepts jobs. Jobs are users' requests for computer resources. Each job includes a description of what to do and a set of property definitions that describe how the job should be run. Users can submit jobs via the command line interface or Grid Engine's graphical user interface, QMON. Users can also use the optional Distributed Resource Management Application API (DRMAA) to automate Grid Engine functions by writing scripts to submit and control jobs.
  • Holds jobs. The Sun Grid Engine master daemon holds jobs until the needed compute resources become available.
  • Sends jobs. When the compute resources become available, the master daemon sends the job to the appropriate execution host. The execution daemon on that host then executes the job.
  • Manages running jobs. The master daemon manages running jobs. At a fixed interval, the master daemon receives reports from each execution daemon.
  • Logs the record of job execution when the jobs are finished. The master daemon stores raw data. Users can also use the Accounting and Reporting Console (ARCo) to gather live reporting data from the Grid Engine system and to store the data for historical analysis in the reporting database, which is a standard SQL database.
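
The steps above can be pictured as a small state machine. The following Python sketch is only an illustration of that sequence: the state names and the `advance` and `lifecycle` helpers are hypothetical and are not part of any Grid Engine interface.

```python
from enum import Enum
from typing import List


class JobState(Enum):
    """Hypothetical lifecycle states, one per step in the list above."""
    ACCEPTED = "accepted"   # job submitted by a user
    HELD = "held"           # waiting at the master daemon for resources
    SENT = "sent"           # dispatched to an execution host
    RUNNING = "running"     # executed by the execution daemon
    LOGGED = "logged"       # finished; execution record written


# Allowed transitions, in the order described in the text.
NEXT_STATE = {
    JobState.ACCEPTED: JobState.HELD,
    JobState.HELD: JobState.SENT,
    JobState.SENT: JobState.RUNNING,
    JobState.RUNNING: JobState.LOGGED,
}


def advance(state: JobState) -> JobState:
    """Return the state that follows `state`; a logged job stays logged."""
    return NEXT_STATE.get(state, state)


def lifecycle() -> List[JobState]:
    """Walk a job through every state from submission to logging."""
    states = [JobState.ACCEPTED]
    while states[-1] is not JobState.LOGGED:
        states.append(advance(states[-1]))
    return states
```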

Component Description More Info
Cluster A collection of machines, called hosts, on which Grid Engine system functions occur. See Configuring Clusters.
Master Host The master host is central to cluster activity. The master host runs the master daemon and usually also runs the scheduler. The master host requires no further configuration other than that performed by the installation procedure. By default, the master host is also an administration host and a submit host. For information about how to initially set up the master host, see How to Install the Master Host. For information about how to configure dynamic changes to the master host, see Configuring Hosts.
Master Daemon The master daemon does the following:
  • Accepts incoming jobs from users.
  • Maintains tables about hosts, queues, jobs, system load, and user permissions.
  • Performs scheduling functions and requests actions from execution daemons on the appropriate execution hosts.
  • Decides which jobs are dispatched to which queues and how to reorder and reprioritize jobs to maintain share, priority, or deadline
See Configuring Hosts.
Execution Host Systems that have permission to run Grid Engine system jobs. These systems host queue instances and run the execution daemon. An execution host is initially set up by the installation procedure, as described in How to Install Execution Hosts. For installation planning guidance, see Host System Requirements. See Configuring Hosts for more information on managing your cluster.
Execution Daemon The execution daemon receives jobs from the master daemon and executes them locally on its host. An execution daemon is responsible for the queue instances on its host and for the running of jobs in these queue instances. Periodically, the execution daemon forwards information such as job status or load on its host to the master daemon. See Configuring Hosts.
Scheduler The scheduler is responsible for prioritizing pending jobs and deciding which jobs to schedule to which resources. For more information on the scheduler, see Managing the Scheduler.
Administration Host Administration hosts are hosts that have permission to carry out any kind of administrative activity for the Grid Engine system. See Configuring Hosts.
Submit Host Submit hosts enable users to submit and control batch jobs only. In particular, a user who is logged in to a submit host can submit jobs with the qsub command, can monitor the job status with the qstat command, and can use the Grid Engine system OSF/1 Motif graphical user interface QMON, which is described in QMON, the Grid Engine System's Graphical User Interface. See Configuring Hosts.
Shadow Master Host Shadow master hosts reduce unplanned cluster downtime. One or more shadow master hosts may be running on additional nodes in a cluster. If the master daemon or the host on which it is running fails, one of the shadow masters promotes its own host to the new master host by locally starting a new master daemon. See How to Configure Shadow Master Hosts.
DRMAA The optional Distributed Resource Management Application API (DRMAA) enables you to automate Sun Grid Engine functions from your own programs, as an alternative to writing scripts that run Sun Grid Engine commands and parse the results. See Automating Grid Engine Functions Through DRMAA.
ARCo The optional Accounting and Reporting Console (ARCo) enables you to gather live reporting data from the Grid Engine system and to store the data for historical analysis in the reporting database, which is a standard SQL database. For more information, see Accounting and Reporting Console.
SDM The optional Service Domain Manager (SDM) module distributes resources between different services according to configurable Service Level Agreements (SLAs). The SLAs are based on Service Level Objectives (SLOs). SDM functionality enables you to manage resources for all kinds of scalable services. See Service Domain Manager for more information.

How Resources Are Matched to Requests

A Banking Analogy

As an analogy, imagine a large "money-center" bank in one of the world's capital cities. In the bank's lobby are dozens of customers waiting to be served. Each customer has different requirements. One customer wants to withdraw a small amount of money from his account. Arriving just after him is another customer, who has an appointment with one of the bank's investment specialists. She wants advice before she undertakes a complicated venture. Another customer in front of the first two customers wants to apply for a large loan, as do the eight customers in front of her.

Different customers with different needs require different types of service and different levels of service from the bank. Perhaps the bank on this particular day has many employees who can handle the one customer's simple withdrawal of money from his account. But at the same time the bank has only one or two loan officers available to help the many loan applicants. On another day, the situation might be reversed.

The effect is that customers must wait for service unnecessarily. Many of the customers could receive faster service if only their needs were immediately recognized and then matched to available resources.

If the Grid Engine system were the bank manager, the service would be organized differently:

  • On entering the bank lobby, customers would be asked to declare their name, their affiliations, and their service needs.
  • Each customer's time of arrival would be recorded.
  • Based on the information that the customers provided in the lobby, the bank might serve the following customers in the following order:
    1. Customers whose needs match suitable and immediately available resources
    2. Customers whose requirements have the highest priority
    3. Customers who were waiting in the lobby for the longest time
  • In a "Grid Engine system bank," one bank employee might be able to help several customers at the same time. The Grid Engine software would try to assign new customers to the least-loaded and most-suitable bank employee.
  • As bank manager, the Grid Engine system would allow the bank to define service policies. Typical service policies might be the following:
    • To provide preferential service to commercial customers because those customers generate more profit
    • To make sure a certain customer group is served well, because those customers have received bad service in the past
    • To ensure that customers with an appointment get a timely response
    • To provide preferential treatment to certain customers because those customers were identified by a bank executive as high priority customers
  • These policies would be implemented, monitored, and adjusted automatically by a Grid Engine system manager. Customers that have preferential access would be served sooner. Such customers would receive more attention from employees. The Grid Engine manager would recognize if the customers do not make progress. The manager would immediately respond by adjusting service levels to comply with the bank's service policies.

Jobs and Queues

In a Grid Engine system, jobs correspond to bank customers. Jobs wait in a computer holding area instead of a lobby. Queues, which provide services for jobs, correspond to bank employees. As in the case of bank customers, the requirements of each job, such as available memory, execution speed, available software licenses, and similar needs, can be very different. Only certain queues might be able to provide the corresponding service.

To continue the analogy, the Grid Engine software arbitrates available resources and job requirements in the following way:

  • A user who submits a job through the Grid Engine software declares a requirement profile for the job. In addition, the software retrieves the identity of the user. The software also retrieves the user's affiliation with projects or user groups. The time that the user submitted the job is also stored.
  • The moment that a queue is available to run a new job, the Grid Engine software determines which jobs are suitable for the queue. The software immediately dispatches the job that has either the highest priority or the longest waiting time.
  • Queues allow concurrent execution of many jobs. The Grid Engine software tries to start new jobs in the least loaded and most suitable queue.
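
The matching described above can be sketched in a few lines of Python. This is a simplified model, not the scheduler's actual algorithm: the `Job` and `Queue` classes, the set-based notion of "suitable", and the rule that priority is compared first with waiting time as the tie-breaker are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    priority: int          # assumption: a larger value means more important
    submit_time: float     # assumption: a smaller value means a longer wait
    needs: frozenset = field(default_factory=frozenset)


@dataclass
class Queue:
    name: str
    load: float            # assumption: a single number summarizes the load
    provides: frozenset = field(default_factory=frozenset)


def suitable(job: Job, queue: Queue) -> bool:
    """A queue suits a job when it provides everything the job requests."""
    return job.needs <= queue.provides


def pick_job(queue: Queue, pending: List[Job]) -> Optional[Job]:
    """Choose the next job for a free queue among the suitable pending jobs."""
    candidates = [j for j in pending if suitable(j, queue)]
    # Highest priority first; ties go to the job that has waited longest.
    return min(candidates, key=lambda j: (-j.priority, j.submit_time), default=None)


def pick_queue(job: Job, queues: List[Queue]) -> Optional[Queue]:
    """Choose the least loaded queue among those that suit the job."""
    candidates = [q for q in queues if suitable(job, q)]
    return min(candidates, key=lambda q: q.load, default=None)
```

A job that needs a software license, for example, is only ever matched to queues whose `provides` set contains that license, regardless of how lightly loaded the other queues are.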

Usage Policies

The administrator of a cluster can define high-level usage policies that are customized according to the site. Four usage policies are available:

  • Urgency – Using this policy, each job's priority is based on an urgency value. The urgency value is derived from the job's resource requirements, the job's deadline specification, and how long the job waits before it is run.
  • Functional – Using this policy, an administrator can provide special treatment because of a user's or a job's affiliation with a certain user group, project, and so forth.
  • Share-based – Under this policy, the level of service depends on an assigned share entitlement, the corresponding shares of other users and user groups, the past usage of resources by all users, and the current presence of users within the system.
  • Override – This policy requires manual intervention by the cluster administrator, who modifies the automated policy implementation.

Policy management automatically controls the use of shared resources in the cluster to best achieve the goals of the administration. High priority jobs are dispatched preferentially. Such jobs receive higher CPU entitlements if the jobs compete for resources with other jobs. The Grid Engine software monitors the progress of all jobs and adjusts their relative priorities correspondingly and with respect to the goals defined in the policies.

Using Tickets to Administer Policies

The functional, share-based, and override policies are defined through a Grid Engine concept that is called tickets. You might compare tickets to shares of a public company's stock. The more shares of stock that you own, the more important you are to the company. If shareholder A owns twice as many shares as shareholder B, A also has twice the votes of B. Therefore shareholder A is twice as important to the company. Similarly, the more tickets that a job has, the more important the job is. If job A has twice the tickets of job B, job A is entitled to twice the resource usage of job B.

Jobs can retrieve tickets from the functional, share-based, and override policies. The total number of tickets, as well as the number retrieved from each ticket policy, often changes over time.

The administrator controls the number of tickets that are allocated to each ticket policy in total. Just as ticket allocation does for jobs, this allocation determines the relative importance of the ticket policies among each other. Through the ticket pool that is assigned to particular ticket policies, the Grid Engine software can run in different ways. For example, the software can run in a share-based mode only. Or the software can run in a combination of modes, for example, 90% share-based and 10% functional.
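
The proportional reading of tickets can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the real system recomputes ticket counts over time rather than treating them as fixed inputs.

```python
from typing import Dict


def entitlements(tickets: Dict[str, int]) -> Dict[str, float]:
    """Map each job to its fraction of all tickets currently held."""
    total = sum(tickets.values())
    if total == 0:
        return {job: 0.0 for job in tickets}
    return {job: count / total for job, count in tickets.items()}


def split_pool(total_tickets: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Divide a ticket pool among ticket policies by fractional weight."""
    return {policy: round(total_tickets * w) for policy, w in weights.items()}


# Job A holds twice the tickets of job B, so its entitlement is twice as large.
shares = entitlements({"A": 200, "B": 100})

# A pool of 10000 tickets split 90% share-based and 10% functional.
pool = split_pool(10000, {"share-based": 0.9, "functional": 0.1})
```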

Using the Urgency Policy to Assign Job Priority

The urgency policy can be used in combination with two other job priority specifications:

  • The number of tickets assigned by the functional, share-based, and override policies
  • A priority value specified by the qsub -p command

A job can be assigned an urgency value, which is derived from three sources:

  • The job's resource requirements
  • The length of time that a job must wait before the job runs
  • The time at which a job must finish running

The administrator can separately weight the importance of each of these sources to arrive at a job's overall urgency value. For more information, see Managing Policies.
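
As a rough illustration of separate weighting, the sketch below combines the three sources as a weighted sum. The linear form, the parameter names, and the default weights are assumptions; the text above states only that each source can be weighted separately, so see Managing Policies for the actual calculation.

```python
def urgency(resource: float, waiting: float, deadline: float,
            w_resource: float = 1.0, w_waiting: float = 1.0,
            w_deadline: float = 1.0) -> float:
    """Combine the three urgency sources using administrator-chosen weights.

    Assumption: the contributions are simply added after weighting.
    """
    return w_resource * resource + w_waiting * waiting + w_deadline * deadline
```

Setting a weight to zero removes that source from the result, which is one way to picture an administrator who wants, for example, waiting time to have no influence on urgency.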

The following figure shows the correlation among policies in a Grid Engine system.

[Figure: correlation among policies in a Grid Engine system]


Choosing a User Interface

To meet the needs of your environment, the following interface tools are available:

QMON - The Graphical User Interface

If you prefer using a graphical user interface, you can use QMON to accomplish most Grid Engine system tasks. The QMON Main Control window, which is shown below, is often the starting point for user and administrator functions.
[Figure: QMON Main Control window with callouts]

For more information on QMON if you are an administrator, see Interacting With Sun Grid Engine as an Administrator.
For more information on QMON if you are a user, see Interacting With Sun Grid Engine as a User.

The Command Line Interface

If you prefer using the command line, the command line user interface includes a flexible set of ancillary programs (commands) that enable you to interact with the Sun Grid Engine system.

For more information on the command line if you are an administrator, see Interacting With Sun Grid Engine as an Administrator.
For more information on the command line if you are a user, see Interacting With Sun Grid Engine as a User.
For information on the ancillary programs that Sun Grid Engine provides and which users have access to these commands, see Command Line Interface Ancillary Programs.

The Distributed Resource Management Application API (DRMAA)

You can automate Sun Grid Engine functions by writing scripts that run Sun Grid Engine commands and parse the results. However, for more consistent and efficient results, you can use the Distributed Resource Management Application API (DRMAA). For more information about the DRMAA concept and how to use it with the C and Java languages, see Automating Grid Engine Functions Through the Distributed Resource Management Application API.


Command Line Interface Ancillary Programs

For more information on available ancillary programs, see the Grid Engine Man Pages.

List of Ancillary Programs

The Grid Engine system provides the following set of ancillary programs:

Program Description
qacct Extracts arbitrary accounting information from the cluster log file. For more information, see Generating Accounting Statistics.
qalter Changes the attributes of submitted but pending jobs.
qconf Provides the user interface for cluster configuration and queue configuration. For more information, see Using qconf.
qdel Enables a user to delete one or more jobs. A manager or operator can delete jobs belonging to any user, while regular users can only delete their own jobs. For more information, see How to Control Jobs.
qhold Holds back submitted jobs from execution.
qhost Displays status information about execution hosts.
qlogin Initiates a login session with automatic selection of a low-loaded, suitable host.
qmake A replacement for the standard UNIX make facility. qmake extends make by its ability to distribute independent make steps across a cluster of suitable machines. For more information, see Parallel Makefile Processing With qmake.
qmod Enables the owner to suspend or enable a queue. All currently active processes that are associated with this queue are also signaled. For more information, see How to Monitor and Control Queues and How to Control Jobs.
qmon Provides an X Windows Motif command interface and monitoring facility.
qping Checks application status of Sun Grid Engine daemons.
qquota Shows current usage of Sun Grid Engine resource quotas. For more information, see How to Monitor Resource Quota Utilization From the Command Line.
qrdel Deletes Sun Grid Engine advance reservations. For more information, see How to Configure Advance Reservations From the Command Line.
qresub Creates new jobs by copying jobs that are running or pending.
qrls Releases jobs from holds that were previously assigned to them, for example, through qhold.
qrsh Can be used for various purposes, such as the following:
  • To provide remote execution of interactive applications through the Grid Engine system. qrsh is comparable to the standard UNIX facility rsh. For more information, see Remote Execution With qrsh.
  • To allow for the submission of batch jobs that, upon execution, support terminal I/O and terminal control. Terminal I/O includes standard output, standard error, and standard input.
  • To provide a submission client that remains active until the batch job finishes.
  • To allow for the Grid Engine software-controlled remote execution of the tasks of parallel jobs.
qrstat Shows the status of Sun Grid Engine advance reservations. For more information, see How to Configure Advance Reservations From the Command Line.
qrsub Submits an advance reservation to Sun Grid Engine. For more information, see How to Configure Advance Reservations From the Command Line.
qselect Prints a list of queue names corresponding to specified selection criteria. The output of qselect is usually sent to other Grid Engine system commands to apply actions on a selected set of queues.
qsh Opens an interactive shell in an xterm on a lightly loaded host. Any kind of interactive jobs can be run in this shell. For more information, see How to Submit Interactive Jobs From the Command Line.
qstat Provides a status listing of all jobs and queues associated with the cluster. For more information, see How to Monitor Jobs From the Command Line.
qsub The user interface for submitting batch jobs to the Grid Engine system.
qtcsh A fully compatible replacement for the widely known and used UNIX C shell (csh) derivative, tcsh. qtcsh provides a command shell with the extension of transparently distributing execution of designated applications to suitable and lightly loaded hosts through Grid Engine software. For more information, see Transparent Job Distribution With qtcsh.
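
To show how two of the programs listed above are typically invoked, the following Python sketch assembles `qsub` and `qstat` command lines. It only builds the argument lists and does not run anything. The helper names and the script name `job.sh` are hypothetical; `-p` (priority), `-N` (job name), and `-u` (user filter) are the only options used, and the man pages remain the authority on their exact behavior.

```python
import shlex
from typing import List, Optional


def qsub_command(script: str, priority: int = 0,
                 name: Optional[str] = None) -> List[str]:
    """Build an argument list that submits `script` as a batch job."""
    argv = ["qsub"]
    if priority:
        argv += ["-p", str(priority)]   # priority value, as in "qsub -p"
    if name:
        argv += ["-N", name]            # job name
    argv.append(script)
    return argv


def qstat_command(user: Optional[str] = None) -> List[str]:
    """Build an argument list that lists jobs, optionally for one user."""
    argv = ["qstat"]
    if user:
        argv += ["-u", user]
    return argv


# prints: qsub -p -10 -N demo job.sh
print(shlex.join(qsub_command("job.sh", priority=-10, name="demo")))
```

On a submit host, a list built this way could be passed to `subprocess.run` to execute the command.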

User Access to the Ancillary Programs

The following table shows the command capabilities that are available to the different user categories:

Command  Manager  Operator                       Owner                                            User
qacct    Full     Full                           Own jobs only                                    Own jobs only
qalter   Full     Full                           Own jobs only                                    Own jobs only
qconf    Full     No system setup modifications  Show only configurations and access permissions  Show only configurations and access permissions
qdel     Full     Full                           Own jobs only                                    Own jobs only
qhold    Full     Full                           Own jobs only                                    Own jobs only
qhost    Full     Full                           Full                                             Full
qlogin   Full     Full                           Full                                             Full
qmod     Full     Full                           Own jobs and owned queues only                   Own jobs only
qmon     Full     No system setup modifications  No configuration changes                         No configuration changes
qrexec   Full     Full                           Full                                             Full
qselect  Full     Full                           Full                                             Full
qsh      Full     Full                           Full                                             Full
qstat    Full     Full                           Full                                             Full
qsub     Full     Full                           Full                                             Full
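
The table above can be read as a lookup from a command and a user category to a capability. The Python sketch below transcribes a few of its rows to show that reading; the `ACCESS` dictionary and the `capability` function are hypothetical names, and the sketch does not reflect how the software itself stores or checks permissions.

```python
# A subset of rows transcribed from the table above, for brevity.
ACCESS = {
    "qconf": {
        "manager": "Full",
        "operator": "No system setup modifications",
        "owner": "Show only configurations and access permissions",
        "user": "Show only configurations and access permissions",
    },
    "qdel": {
        "manager": "Full",
        "operator": "Full",
        "owner": "Own jobs only",
        "user": "Own jobs only",
    },
    "qmod": {
        "manager": "Full",
        "operator": "Full",
        "owner": "Own jobs and owned queues only",
        "user": "Own jobs only",
    },
    "qstat": {
        "manager": "Full",
        "operator": "Full",
        "owner": "Full",
        "user": "Full",
    },
}


def capability(command: str, category: str) -> str:
    """Return the table entry for a command and a user category."""
    try:
        return ACCESS[command][category.lower()]
    except KeyError:
        raise ValueError(f"no entry for {command!r} / {category!r}") from None
```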


Users and User Categories

There are four categories of users that each have access to their own set of Grid Engine system commands:

  • Managers – Managers have full capabilities to manipulate the Grid Engine system. By default, the superusers of all administration hosts have manager privileges.
  • Operators – Users who can run the same commands as managers, except that operators cannot change the configuration. Operators are expected to keep the system in operation.
  • Users – People who can submit jobs to the grid and run them if they have a valid login ID on at least one submit host and one execution host. Users have no cluster management or queue management capabilities.
  • Owners – Users who can suspend or resume and disable or enable the queues they own. Queue owners can be managers, operators, or users. Users are commonly declared to be owners of the queue instances that reside on their desktop workstations. See How to Configure Owners Parameters With QMON for more information.

For information on which command capabilities are available to the different user categories, see Command Line Interface Ancillary Programs.


Using the Wiki

Why a Wiki?

Since the release of Sun Grid Engine 6.2, all documentation for the product can be found on wikis.sun.com. This was done for the following reasons:

  • To keep the documentation as up-to-date as possible. The product team and select community members can edit the information in real time, which ensures that the information will stay as current as possible.
  • To encourage community input. Anyone can comment on the documentation. Let us know what you think.
  • To build a library of community contributions. The simple wiki markup language makes it easy for anyone in the Grid Engine community to add their own input to the wiki. See the Expert Advice section to add your expertise or to survey current contributions.

Wiki Tasks

Topic Description
How to Monitor Documentation Changes Learn how to monitor documentation changes using email notifications or RSS feeds.
How to Add a Comment Learn how to add a comment to any page in the Sun Grid Engine wiki space.
How to Add a Page Learn how to add pages to the Expert Advice section.
How to Print Learn how to print what you need from the Sun Grid Engine wiki.

Questions?

Do you have any questions about how to use the wiki? Please comment on this page with your question or send us an email.


How to Add a Comment

  1. Log in to wikis.sun.com.

  2. At the bottom of the page on which you would like to comment, click Add Comment.
    The comment box is displayed.

  3. Enter your comment.

  4. Click Post to save and publish your comment.
    Click Cancel to close the comment box without saving your comment.


How to Monitor Documentation Changes

If you would like to watch a parent page and all of its child pages, you must set watches individually on the parent page and each of its child pages.
Function Procedure
Setting up email notifications for the Grid Engine space
  1. On the top right-hand side of the page, click Space and then click Advanced.

  2. On the left-hand side of the page, click "Start watching this space."

    The link text immediately changes to "Stop watching this space."
Removing email notifications for the Grid Engine space Follow the above directions, but instead of clicking "Start watching this space," click "Stop watching this space."
Setting up email notifications for a page
  1. Click Tools, on the top right hand side of the page, and then click Watch.

    The envelope icon will change colors to indicate the watch has been set.
Removing email notifications for a page Click Tools and then click Watch. The envelope icon will change back to white to indicate the watch has been removed.
Review your email notifications
  1. Click on your name on the top of the page and then click Preferences on the drop-down menu.
  2. Click the Watches tab.
Setting up RSS notifications of documentation changes
  1. On the top right-hand side of the page, click Space and then click Advanced.
  2. On the left-hand side of the page, click RSS Feeds.
    You can then choose to receive RSS notification for the following:
    • Pages
    • News
    • Mail (This feature is currently disabled.)
    • Comments
    • Attachments
    • All Content
  3. Click on Pages and then select your favorite aggregator.

How to Print

You can print a specific page, a section, or a PDF copy of the entire wiki. See below for guidance on how to print what you need.

How to Print a Page

If you would like to print a specific page, do the following:

  1. Click on Tools in the upper left hand corner of the page that you would like to print.

  2. Select either 'Export to PDF' or 'Export to Microsoft Word.'


  3. Print from the application that you selected.

How to Print a Section of the Documentation

If you would like to print one of the major sections, select one of the following printables and then print directly from your browser:

How to Print the Entire Documentation Set

For a PDF copy of the entire documentation set, click here.

This PDF snapshot was taken on March 3, 2009, the day that Grid Engine 6.2u2 was released.

Planning the Installation

Planning the Installation Guide (Printable)

Before you install the Grid Engine software, you must plan how to achieve the results that fit your environment. This section helps you make the decisions that affect the rest of the procedure.

System Requirements

Topic Description
System Requirements Review Grid Engine's disk space requirements and supported operating platforms.

Installation Considerations

Topic Description
Planning Checklist Document your installation plan.
Cluster Design Determine whether to set up your Grid Engine system as a single cluster or a collection of loosely coupled clusters.
Queue Structure Create a default cluster queue structure.
Host System Requirements Learn about the system requirements for the different host types.
User Account Considerations Determine whether you have the required permissions to submit jobs on the desired execution hosts.
Spooling Options Determine whether you want to use classic spooling or Berkeley DB spooling during installation.
Scheduler Profiles Learn about the three scheduler profiles that are available during installation for Grid Engine tuning.
Network Services Determine whether to define network services as an NIS file or as local to each workstation in /etc/services.
Installation Methods Learn about the different methods available for installing Grid Engine.
Directory Organization Determine the directory organization for Grid Engine.
Other Installation Issues If you are installing Grid Engine on a Linux system or on a system with IPMP, see Other Grid Engine Installation Issues for important information.

Getting the Software

Topic Description
Getting the Software Learn how to get your copy of the Grid Engine software.

To print this section, see the Planning the Installation Guide (Printable).


System Requirements

To verify that the systems on which you intend to install Sun Grid Engine conform to required hardware and software specifications, review the system requirements listed below.

Disk Space Requirements

The Grid Engine software directory tree has the following fixed disk space requirements:

  • 50 Mbytes for the installation files without any binaries
  • Between 60 and 100 Mbytes for each set of binaries

The ideal disk space for Grid Engine system spool directories is as follows:

  • 50-200 Mbytes for the master host spool directories
  • 50-200 Mbytes for the Berkeley DB spool directories

The spool directories of the master host and of the execution hosts are configurable and need not reside under the default location, sge-root.
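
The figures above translate into a simple range estimate. The following Python sketch adds them up for a master host; the function names are hypothetical, and the spool figures are the "ideal" sizes quoted above rather than hard limits.

```python
from typing import Tuple


def software_tree_mb(binary_sets: int) -> Tuple[int, int]:
    """Return (minimum, maximum) Mbytes for the software directory tree."""
    base = 50                               # installation files without binaries
    return base + 60 * binary_sets, base + 100 * binary_sets


def master_host_mb(binary_sets: int, berkeley_db: bool = True) -> Tuple[int, int]:
    """Add the master spool range, plus the Berkeley DB range if it is used."""
    low, high = software_tree_mb(binary_sets)
    low, high = low + 50, high + 200        # master host spool directories
    if berkeley_db:
        low, high = low + 50, high + 200    # Berkeley DB spool directories
    return low, high
```

For example, a master host that serves binaries for two platforms and uses Berkeley DB spooling would need roughly 270 to 650 Mbytes by this estimate.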

Note
You must satisfy several Windows platform-specific prerequisites before you can install Grid Engine on hosts that are running the Windows operating system. You might need to install additional software on your computer which might require additional disk space. See Microsoft Services For UNIX and Microsoft Subsystem for UNIX-based Applications.

Supported Operating Platforms

The Sun Grid Engine 6.2 software supports the following operating systems and platforms:

  • Solaris 10, 9, and 8 Operating Systems (SPARC Platform Edition)
  • Solaris 10 and 9 Operating Systems (x86 Platform Edition)
  • Solaris 10 Operating System (x64 Platform Edition)
  • Apple Mac OS X 10.4 (Tiger), PPC platform
  • Apple Mac OS X 10.4 (Tiger), x86 platform
  • Apple Mac OS X 10.5 (Leopard), x86 platform
  • Hewlett Packard HP-UX 11.00 or higher, 32 bit
  • Hewlett Packard HP-UX 11.00 or higher, 64 bit (including HP-UX on IA64)
  • IBM AIX 5.1, 5.3
  • Linux x86, kernel 2.4, 2.6, glibc >= 2.3.2
  • Linux x64, kernel 2.4, 2.6, glibc >= 2.3.2
  • Linux IA64, kernel 2.4, 2.6, glibc >= 2.3.2
  • Microsoft Windows Server 2003, Windows XP Professional with at least Service Pack 1, Windows 2000 Server with at least Service Pack 3, or Windows 2000 Professional with at least Service Pack 3
  • Microsoft Windows Server 2003 Release 2, Windows Server 2008, Windows Vista Enterprise, Windows Vista Ultimate

Planning Checklist

Before you install the Grid Engine software, you must plan how to achieve the results that fit your environment. This section helps you make the decisions that affect the rest of the procedure. Write down your installation plan in a table similar to the following example.

Parameter                  Value                 
$SGE_ROOT directory  
Cell name  
Administrative user  
sge_qmaster port number (6444 is recommended)  
sge_execd port number (6445 is recommended)  
Master host  
Shadow master hosts  
Execution hosts  
Spooling for each execution host (global or local)  
Windows execution hosts (yes or no)  
Administration hosts  
Submit hosts  
Group ID range for jobs  
Spooling mechanism (Berkeley DB or Classic spooling)  
Berkeley DB server host (the master or another host)  
Berkeley DB spooling directory on the database server  
Scheduler tuning profile (Normal, High, Max)  
Installation method (interactive, secure, automated, or upgrade)  

If you are going to install Grid Engine 6.2 on Microsoft Windows Server 2003, Windows XP Professional with at least Service Pack 1, Windows 2000 Server with at least Service Pack 3, or Windows 2000 Professional with at least Service Pack 3, acquire and install Microsoft Services For UNIX. See Microsoft Services for UNIX for more information.

If you are going to install Grid Engine 6.2 on Microsoft Windows Server 2003 Release 2, Windows Server 2008, Windows Vista Enterprise or Windows Vista Ultimate, acquire and install Microsoft Subsystem for UNIX-based Applications. See Microsoft Subsystem for UNIX-based Applications for more information.

If you are going to install Grid Engine 6.2 on a Windows system, create the required Certificate Security Protocol (CSP) certificates before installing Grid Engine. See How to Install a CSP-Secured System for information about CSP certificates.

Check Other Grid Engine Installation Issues for applicability.


Cluster Design

Cells

You can set up the Grid Engine system as a single cluster or as a collection of loosely coupled clusters called cells. The $SGE_CELL environment variable indicates the cluster being referenced. When the Grid Engine system is installed as a single cluster, $SGE_CELL is not set, and the value default is assumed for the cell value.
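
The fallback to the default cell can be sketched in shell. This is a minimal illustration and not part of the Grid Engine distribution: the function names effective_cell and cell_common_dir are hypothetical, and /opt/sge6-2 is only an assumed installation directory.

```shell
#!/bin/sh
# Hypothetical helper: print the cell name that applies to this shell.
# When SGE_CELL is unset or empty, the value "default" is assumed.
effective_cell() {
    printf '%s\n' "${SGE_CELL:-default}"
}

# Hypothetical helper: print the common directory of the current cell.
# SGE_ROOT must already point at the installation directory.
cell_common_dir() {
    printf '%s/%s/common\n' "$SGE_ROOT" "$(effective_cell)"
}

# Example usage with an assumed installation directory.
SGE_ROOT=/opt/sge6-2
unset SGE_CELL
cell_common_dir    # prints /opt/sge6-2/default/common
```

Resolving the cell name in one place avoids scattering the default value throughout site scripts.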

Cluster Name

The $SGE_CLUSTER_NAME environment variable supports unique naming of the cluster. Unlike the $SGE_CELL variable, there are restrictions on $SGE_CLUSTER_NAME. If you decide to use Grid Engine SMF services on Solaris 10 or later hosts, you must select a new $SGE_CLUSTER_NAME. This name becomes part of the name of the Sun Grid Engine SMF services. The $SGE_CLUSTER_NAME is also used to distinguish multiple rc files for different clusters.

Note
If your $SGE_CELL name already reflects the desired cluster name and also satisfies the $SGE_CLUSTER_NAME restrictions, set the cluster name to the $SGE_CELL value. Otherwise, the proposed default value is p$SGE_QMASTER_PORT, which uniquely identifies the running cluster by the port on which its qmaster daemon is running. See SMF administration for more information.
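
As a rough sketch, the proposed default cluster name could be derived as follows. The sketch assumes that the default consists of the letter p followed by the qmaster port number (for example, p6444). The function name default_cluster_name is hypothetical, and the additional $SGE_CLUSTER_NAME restrictions are not checked here.

```shell
#!/bin/sh
# Hypothetical helper: print the proposed default cluster name.
# $1 = qmaster port; falls back to $SGE_QMASTER_PORT when omitted.
# Assumption: the default name is "p" followed by the port number.
default_cluster_name() {
    port=${1:-$SGE_QMASTER_PORT}
    case "$port" in
        ''|*[!0-9]*)
            echo "not a port number: '$port'" >&2
            return 1
            ;;
    esac
    printf 'p%s\n' "$port"
}

# Example usage with the recommended qmaster port.
default_cluster_name 6444    # prints p6444
```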

Queue Structure

The installation procedure creates a default cluster queue structure, which is suitable for getting acquainted with the system. The default queue can be removed after installation.

Note
No matter what directory is used for the installation of the software, the administrator can change most settings that were created by the installation procedure. This change can be made while the system is running.

Consider the following when determining a queue structure:

  • Whether you need cluster queues for sequential, interactive, parallel, and other job types
  • Which queue instances to put on which execution hosts
  • How many job slots are needed in each queue

For more detailed information on administering cluster queues, see Configuring Queues.


Host System Requirements

Master Host

The master host controls the Grid Engine system. This host runs the master daemon sge_qmaster.

The master host must comply with the following requirements:

  • The host must be a stable platform.
  • The host must not be excessively busy with other processing.
  • At least 60 to 120 Mbytes of unused main memory must be available to run the Grid Engine system daemons. For very large clusters that include many hundreds or thousands of hosts and tens of thousands of jobs in the system at any time, 1 GByte or more of unused main memory might be required and 2 CPUs might be beneficial.
  • The master host must be installed before shadow master, execution, administration, or submit hosts.
  • (Optional) The Grid Engine software directory, $SGE_ROOT, should be installed locally to cut down on network traffic.
    Note
    Windows hosts cannot act as master hosts.

For more information, see How to Install the Master Host.

Shadow Master Hosts

These hosts back up the functionality of sge_qmaster in case the master host or the master daemon fails. To be a shadow master host, a machine must have the following characteristics:

  • It must run sge_shadowd.
  • It must share sge_qmaster status, job information, and queue configuration information that is logged to disk. In particular, the shadow master hosts need read/write root or administration user access to the sge_qmaster spool directory and to the $SGE_ROOT/$SGE_CELL/common directory.
  • The $SGE_ROOT/$SGE_CELL/common/shadow_masters file must contain a line defining the host as a shadow master host.
    Note
    If no cell name is specified during installation, the value of $SGE_CELL is default.

The shadow master host facility is activated for a host as soon as these conditions are met. You do not need to restart the Grid Engine system daemons to make a host into a shadow master host.
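
The third condition can be checked with a short script. The following sketch assumes that the shadow_masters file lists one host name per line; the function name is_shadow_master and the fallback directory /opt/sge6-2 are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if host $1 is listed in shadow_masters file $2.
# Assumption: the file contains one host name per line.
is_shadow_master() {
    [ -r "$2" ] && grep -Fqx -- "$1" "$2"
}

# Example usage: check the local host against the current cell.
file="${SGE_ROOT:-/opt/sge6-2}/${SGE_CELL:-default}/common/shadow_masters"
if is_shadow_master "$(uname -n)" "$file"; then
    echo "$(uname -n) is listed as a shadow master host"
else
    echo "$(uname -n) is not listed in $file"
fi
```

The match is exact, so a host that is listed with its fully qualified name is not found under its short name.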

Note
Windows hosts cannot act as shadow master hosts.

For more information, see Configuring a Shadow Master Host From the Command Line.

Execution Hosts

Execution hosts run the jobs that users submit to the Grid Engine system. An execution host must first be set up as an administration host. You run an installation script on each execution host. For more information, see How to Install Execution Hosts.

Administration Hosts

Operators and managers of the Grid Engine system use administration hosts to perform administrative tasks such as reconfiguring queues or adding Grid Engine users.

The master host installation script automatically makes the master host an administration host. During the master host installation process, you can add other administration hosts. You can also manually add administration hosts on the master host at any time after installation.

Submit Hosts

Jobs can be submitted and controlled from submit hosts. The master host installation script automatically makes the master host a submit host.


Spooling Options

During the installation, you are given the option to choose between classic spooling and Berkeley DB spooling. If you choose Berkeley DB spooling, you are then given the option to spool to a local directory or to a separate host, known as a Berkeley DB spooling server.

Using a Berkeley DB spooling server might provide better performance than classic spooling. Part of this performance increase is because the master host can make non-blocking writes to the database, but has to make blocking writes to the text file used by classic spooling. Also consider file format and data integrity. Writing to the Berkeley DB provides a greater level of data integrity than writing to a text file. However, a text file stores data in a format that you can read and edit. Normally, you do not need to read these files, but the spooling directory contains the messages from the system daemons, which can be useful for debugging.

Database Server and Spooling Host

The master host can store its configuration and state to a Berkeley DB spooling database. The spooling database can be installed on the master server or on a separate host. When the Berkeley DB spools into a local directory on the master host, the performance is better. If you want to set up a shadow master host, you need to use a separate Berkeley DB spooling server (host). In this case, you have to choose a host with a configured RPC service. The master host connects through RPC to the Berkeley DB.

Note
This configuration does not provide a High-Availability (HA) solution. For example, scripts of pending jobs are not spooled through BDB spool server and thus are not available for a shadow master.

With the introduction of NFS4 software available with the Solaris™ 10 operating system, you can use Berkeley DB spooling on a network file system. You could not use Berkeley DB spooling on previous NFS versions. This capability allows a shadow host installation that is spooled on Berkeley DB without setting up an additional Berkeley DB spooling server.

Caution
Although using a shadow master host is more reliable, using a separate Berkeley DB spooling host results in a potential security hole. RPC communication as used by the Berkeley DB can be easily compromised. Only use this alternative if your site is secure and if users can be trusted to access the Berkeley DB spooling host by means of TCP/IP communication.

If you choose to use Berkeley DB spooling without a shadow master, you do not need to set up a separate spooling server. Likewise, if you choose not to use Berkeley DB spooling, you can set up a shadow master host without setting up a separate spooling server.
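
The decision described in this section can be summarized as a small helper. This is only a sketch of the rules stated above, not an installer feature; the function name needs_spooling_server and its argument values are hypothetical.

```shell
#!/bin/sh
# Hypothetical decision helper: succeed if a separate Berkeley DB spooling
# server is needed.
# $1 = spooling method: "classic" or "berkeleydb"
# $2 = "yes" if a shadow master host is planned, otherwise "no"
# $3 = "yes" if the spooling directory is on NFS4 (optional, default "no")
needs_spooling_server() {
    method=$1
    shadow=$2
    nfs4=${3:-no}
    # Only Berkeley DB spooling combined with a shadow master needs the
    # server, and NFS4 removes that need.
    if [ "$method" = berkeleydb ] && [ "$shadow" = yes ] && [ "$nfs4" != yes ]; then
        return 0
    fi
    return 1
}

# Example usage: Berkeley DB spooling with a shadow master, no NFS4.
if needs_spooling_server berkeleydb yes no; then
    echo "Plan a separate Berkeley DB spooling server."
else
    echo "No separate spooling server is needed."
fi
```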

Once you determine whether you need a separate spooling server, you will also need to determine the location for the spooling directory. The spooling directory must be local to the spooling server. A default value for the location of the spooling directory is recommended during installation, but this default value is not suitable when the file server is different from the master host.

The requirements for the Berkeley DB spooling host are similar to the requirements for the master host:

  • The host must be a stable platform.
  • The host must not be excessively busy with other processing.
  • At least 60 to 120 Mbytes of unused main memory must be available to run the Grid Engine system daemons. For very large clusters that include many hundreds or thousands of hosts and tens of thousands of jobs in the system at any time, one GByte or more of unused main memory might be required and two CPUs might be beneficial.
  • If you use a separate spooling host, it must be installed before the master host.
  • (Optional) The $SGE_ROOT directory should be installed locally, to cut down on network traffic.

Scheduler Profiles

You can choose from three scheduler profiles during the installation process: normal, high, and max. You can use these predefined profiles as a starting point for Grid Engine tuning.

Using these profiles, you can optimize the scheduler for one or more of the following:

  • The amount of information that is tracked about a scheduling run
  • The load adjustment during a scheduling run
  • Interval scheduling (the default) or immediate scheduling

The three profiles are as follows:

  • normal – This profile uses load adaptation and interval scheduling, and reports all the information that the scheduler gathers during the dispatch cycle. This profile is the starting point for most grids. Use this profile if your highest priority is gathering and reporting information about a scheduling run.
  • high – This profile is more appropriate for a large cluster, where throughput is more important than gathering and reporting all the information from the scheduler. This profile also uses interval scheduling. Use this profile if you want to get better performance at the cost of getting less information about your scheduling runs.
  • max – This profile disables all information gathering and reporting, enables immediate scheduling, and disables load adaptation. Immediate scheduling is very useful for sites with high throughput and very short running jobs. The advantage of immediate scheduling decreases as runtime of the jobs increases. This profile can be used in clusters of any size where only throughput is important and everything else is a lower priority.

For more information on how to configure scheduling, see Administering the Scheduler.


Network Services

Determine whether your site's network services are defined in an NIS database or in an /etc/services file that is local to each workstation. If your site uses NIS, determine the host name of your NIS server so that you can add entries to the NIS services map.

The Grid Engine system services are sge_execd and sge_qmaster. To add the services to your NIS map, choose reserved, unused port numbers. The following examples show sge_qmaster and sge_execd entries.

sge_qmaster 6444/tcp
sge_execd 6445/tcp
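
To check whether the entries are already present in a local services file, a sketch such as the following might help. The function name has_service is hypothetical, and the port numbers are the recommended values from the planning checklist. For an NIS services map, the same test would have to be run against the output of the NIS lookup instead of /etc/services.

```shell
#!/bin/sh
# Hypothetical helper: succeed if file $1 defines service $2 on port $3/tcp.
has_service() {
    [ -r "$1" ] || return 1
    awk -v name="$2" -v port="$3/tcp" '
        { sub(/#.*/, "") }                      # drop comments
        $1 == name && $2 == port { found = 1 }  # exact name and port match
        END { exit !found }
    ' "$1"
}

# Example usage: report on both Grid Engine services.
for service in sge_qmaster:6444 sge_execd:6445; do
    name=${service%%:*}
    port=${service##*:}
    if has_service /etc/services "$name" "$port"; then
        echo "$name $port/tcp is defined"
    else
        echo "$name $port/tcp is missing"
    fi
done
```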

Installation Methods

Several methods are available for installing the Grid Engine software:

  • Interactive
  • Interactive, with increased security
  • Automated, using the inst_sge script and a configuration file
  • Upgrade

To decide which installation method you should use, consider the following factors.


User Account Considerations

User Names

For the Grid Engine system to verify that users submitting jobs have permission to submit them on the desired execution hosts, users' names must be identical on the submit and execution hosts. You might therefore have to change user names on some machines, because Grid Engine user names map directly to system user accounts.

Note
User names on the master host are not relevant for permission checking. These user names do not have to match or even exist.

Installation Accounts

You can install the Grid Engine software either as the root user or as an unprivileged user, for example, your own user account. However, if you install the software when you are logged in as an unprivileged user, the installation allows only that user to run Grid Engine jobs. Access is denied to all other accounts. Installing the software when you are logged in as root resolves this restriction. However, root permission is required for the complete installation procedure. Also, if you install as an unprivileged user, you are not allowed to use the qrsh, qtcsh, or qmake commands, nor can you run tightly integrated parallel jobs.

Note
To use SMF on Solaris 10 or later hosts and run the Grid Engine software as an unprivileged user, perform the following additional steps as root user (or user with appropriate permissions):
For a local user:
  1. Create the new role sgeadmin:
    roleadd -c "Grid Engine SMF Administrator" -g <group> -d <home_dir> -u <UID> -s <profile_shell> -P "solaris.smf.manage.sge" "sgeadmin"
    
  2. Assign the just-created role sgeadmin to the user:
    usermod -R "sgeadmin" <login>
    

For a distributed name service, such as NIS, NIS+, or LDAP:

  • Create the new role sgeadmin and assign it to the user:
    /usr/sadm/bin/smrole add -D <domain_name> -- -n "sgeadmin" -a "normal_user" -d <home_dir> -c "Grid Engine SMF Administrator" -p "solaris.smf.manage.sge"
    

File Access Permissions

If you install the software logged in as root, you might have a problem configuring root read/write access for all hosts on a shared file system. Therefore, you might have problems putting the $SGE_ROOT files onto a network-wide file system.

You can force Grid Engine software to run all Grid Engine system components through a non-root administrative user account, for example sgeadmin. With this setup, this particular user needs only read/write access to the shared $SGE_ROOT file system.

The installation procedure asks whether files should be created and owned by an administrative user account. If you answer "Yes" and provide a valid user name, files are created by this user. Otherwise, the user name under which you run the installation procedure is used. Create an administrative user, and answer "Yes" to this question.

Make sure in all cases that the account used for file handling on all hosts has read/write access to the $SGE_ROOT directory. Also, the installation procedure assumes that the host from which you access the Grid Engine software distribution media can write to the $SGE_ROOT directory.
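
A quick way to confirm this access is to run a check such as the following on each host, logged in as the account that handles the files. The sketch is illustrative only; the function name check_sge_root_access is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify that the current user can read and write
# the directory $1 (defaults to $SGE_ROOT).
check_sge_root_access() {
    dir=${1:-$SGE_ROOT}
    if [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ]; then
        echo "OK: $(id -un) can read and write $dir"
    else
        echo "FAIL: $(id -un) lacks read/write access to $dir" >&2
        return 1
    fi
}

# Example usage: check the current directory.
check_sge_root_access .
```

Because root passes most permission tests locally, run the check as the administrative user rather than as root to get a meaningful result on a shared file system.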

Note
  • The name of the root user on Windows hosts depends on the system language of the Windows operating system. You can even change the name of the root user. For many languages, the default name is Administrator.
  • If your Windows host is a member of a Windows domain, only the local Administrator is the root user. Neither the members of the Administrators group, nor the domain Administrator, nor a member of the Domain Admins group is the root user. See User Management on Windows Hosts for more information about users on Windows hosts.

Directory Organization

When determining the directory organization, you must decide the following:

  • The directory organization, for example, whether you will install a complete software tree on each workstation, cross-mounted directories, or a partial directory tree on some workstations.
  • Where to locate each $SGE_ROOT root directory.
Note
Because changing the installation directory or the spool directories requires a new installation of the system, use extra care to select a suitable installation directory. Note that all important information from a previous installation can be preserved.

By default, the installation procedure installs the Grid Engine software, man pages, spool areas, and the configuration files in a directory hierarchy under the installation directory as shown in the following figure. If you accept this default behavior, you should install or select a directory with the access permissions that are described in File Access Permissions.

Figure – Sample Directory Hierarchy

Browser window. Displays directory hierarchy of sge-root installation directory.

You can choose to put the spool areas in other locations during the primary installation. See Configuring Queues for more detailed instructions.

Spool Directories under the Root Directory

During the installation of the master host, you must specify the location of a spooling directory. This directory is used to spool jobs from execution hosts that do not have a local spooling directory.

Note
If you are using a Windows execution host, you must use the local spooling directory.
  • On the master host, spool directories are maintained under qmaster-spool-dir. The location of qmaster-spool-dir is defined during the master host installation process. The default value of qmaster-spool-dir is $SGE_ROOT/$SGE_CELL/spool/qmaster.
  • On each execution host, a spool directory called execd-spool-dir is defined during the execution host installation processes. The default value of execd-spool-dir is $SGE_ROOT/$SGE_CELL/spool/exec-host. You will get better performance from execution hosts with local spooling directories than from execution hosts that have NFS mounted the master host's spooling directory.
    Note
    If no cell name is specified during installation, the value of $SGE_CELL is default.

You do not need to export these directories to other machines. However, exporting the entire $SGE_ROOT tree and making it write-accessible for the master host and all execution hosts makes administration easier.
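
The default spool locations can be printed with a short sketch like the following. The function names are hypothetical, and the sketch assumes that the exec-host component of the execution host path stands for the host name of the execution host.

```shell
#!/bin/sh
# Hypothetical helper: print the default qmaster spool directory.
qmaster_spool_default() {
    printf '%s/%s/spool/qmaster\n' "$SGE_ROOT" "${SGE_CELL:-default}"
}

# Hypothetical helper: print the default spool directory of execution host $1.
# Assumption: "exec-host" in the default path is the host name.
execd_spool_default() {
    printf '%s/%s/spool/%s\n' "$SGE_ROOT" "${SGE_CELL:-default}" "$1"
}

# Example usage with an assumed installation directory and host name.
SGE_ROOT=/opt/sge6-2
unset SGE_CELL
qmaster_spool_default        # prints /opt/sge6-2/default/spool/qmaster
execd_spool_default node01   # prints /opt/sge6-2/default/spool/node01
```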

Note
If you use a Lustre fileshare as the spool directory, you should disable file striping for these directories. Refer to the Lustre Operations Manual for information on how to disable file striping.

$SGE_ROOT Directory

You must create a directory into which to load the contents of the distribution media. This directory is called the root directory, or $SGE_ROOT. When the Grid Engine system is running, this directory stores the current cluster configuration and all other data that must be spooled to disk.

Note
For efficient spooling, place the spooling directories somewhere other than within $SGE_ROOT.

Use a valid path name for the directory that is network-accessible on all hosts. For example, if the file system is mounted using automounter, set $SGE_ROOT to /usr/SGE6, not to /tmp_mnt/usr/SGE6.
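
If a script obtains the installation path from the current working directory, the automounter prefix can be removed before the value is assigned to $SGE_ROOT. The following sketch assumes that the temporary mount prefix is /tmp_mnt, as in the example above; the function name strip_automount_prefix is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove a leading /tmp_mnt from an automounted path.
# Assumption: the automounter uses /tmp_mnt as its temporary mount prefix.
strip_automount_prefix() {
    case "$1" in
        /tmp_mnt/*) printf '%s\n' "${1#/tmp_mnt}" ;;
        *)          printf '%s\n' "$1" ;;
    esac
}

# Example usage.
strip_automount_prefix /tmp_mnt/usr/SGE6   # prints /usr/SGE6
strip_automount_prefix /usr/SGE6           # prints /usr/SGE6
```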

Note
Throughout this information space, the $SGE_ROOT environment variable is used to refer to the directory into which the Sun Grid Engine software is installed.

The $SGE_ROOT directory is the top level of the Grid Engine software directory tree. On startup, each Grid Engine software component in a cell needs read access to the $SGE_ROOT/$SGE_CELL/common directory. When Grid Engine software is installed as a single cluster, the value of $SGE_CELL is default.

For ease of installation and administration, this directory should be readable on all hosts on which you intend to run the Grid Engine software installation procedure. For example, you can select a directory that is available across a network file system, such as NFS. If you choose to select file systems that are local to the hosts, you must copy the installation directory to each host before you start the installation procedure for the particular machine. See File Access Permissions for a description of required permissions.


Getting the Software

The software is distributed through electronic download and on CD-ROM.

Electronic Download

Sun Grid Engine

To electronically download a copy of the Sun Grid Engine software, visit sun.com. The product distribution is in pkgadd format for the Solaris Operating System (Solaris OS).

Grid Engine Open Source

If you would like to download a copy of the open source Grid Engine software, visit the download center.

CD-ROM Distribution

For information on how to access CD-ROMs, ask your system administrator or refer to your local system documentation. For instructions, see Loading the Distribution Files on a Workstation.


Installing Sun Grid Engine

Installing Guide (Printable)

To effectively install Sun Grid Engine, perform the following tasks in the order that they are listed:

Topic                                            Description
Release Notes                                    Learn about product enhancements that have been added since the release.
Planning the Installation                        Strategically plan your installation to achieve results that fit your environment.
Loading the Distribution Files on a Workstation  Unpack and load the distribution files onto a workstation.
Installing the Software With the GUI Installer   Learn how to run the new GUI installer and install a whole cluster.
Installing the Software From the Command Line    Learn how to run an installation script on the master host and on every execution host in the Grid Engine system and to register information about administration hosts and submit hosts.
Installing Security Features                     Set up your system more securely.
Installing the Accounting and Reporting Console  Install the Accounting and Reporting Console, an optional feature that enables you to gather live reporting data from the Grid Engine system.
SDM Installation Overview                        Install the Service Domain Manager module, an optional feature that distributes resources between different services according to configurable Service Level Agreements (SLAs).
Verifying the Installation                       Verify that the daemons are running on the master host and on the execution hosts, and learn how to run simple commands and submit test jobs.

In addition, you might need to perform one or more related tasks:

Topic                                                   Description
Automating the Installation Process                     Learn how to automate the Grid Engine installation process.
Installing SMF Services                                 Learn how to install the Service Management Facility (SMF) services.
Installing a JMX-Enabled System                         Learn how to install a JMX-enabled system.
Removing the Software                                   Learn how to remove the Sun Grid Engine software.
Additional Software for the Microsoft Operating System  Learn how to install Sun Grid Engine on the Microsoft Windows operating system.
User Management on Windows Hosts                        Learn how to manage user accounts on Windows hosts.
Other Installation Issues                               Learn how to identify additional considerations for installing Sun Grid Engine software.

To print this section, see the Installing Guide (Printable).


Loading the Distribution Files on a Workstation

The Sun Grid Engine 6.2 software is distributed on CD-ROM and through electronic download. The CD-ROM distribution contains a directory named Sun_Grid_Engine_6_2. The product distribution is in this directory, in both tar.gz format and the pkgadd format. The pkgadd format is provided for the Solaris Operating System (Solaris OS). For all supported operating systems, the software is distributed in tar.gz format. For more on how to obtain the distribution files, see Getting the Software.

How to Load the Distribution Files on a Workstation

Ensure that the file systems and directories that are to contain the Grid Engine software distribution and the spool and configuration files are set up properly by setting the access permissions as defined in File Access Permissions.

Steps
  1. Provide access to the distribution media.
    If you downloaded the software, rather than getting it on CD-ROM, just unzip the files into a directory. This directory must be located on a file system that has at least 350 MBytes free space.

  2. Log in to a system.
    Log in preferably on a system that has a direct connection to a file server.

  3. Create the installation directory.
    Create an installation directory as described in $SGE_ROOT Installation Directory.
    # mkdir /opt/sge6-2
    

    In these instructions, the installation directory is abbreviated as sge-root.

  4. Install the binaries for all binary architectures that are to be used by any of your master, execution, and submit hosts in your Grid Engine system cluster.
    You can use either the pkgadd method or the tar method.
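
Step 1 requires at least 350 MBytes of free space on the target file system. A check along the following lines can confirm this before you unpack anything. The sketch relies only on the POSIX output format of df; the function names free_mbytes and has_free_space are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the free space, in MBytes, of the file system
# that holds directory $1. df -P prints one line per file system in POSIX
# format, and -k reports sizes in 1024-byte blocks.
free_mbytes() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Hypothetical helper: succeed if directory $1 has at least $2 MBytes free
# (default 350, the minimum named in step 1).
has_free_space() {
    [ "$(free_mbytes "$1")" -ge "${2:-350}" ]
}

# Example usage: check the current directory.
if has_free_space . 350; then
    echo "Enough free space for the distribution files."
else
    echo "Less than 350 MBytes free; choose another file system."
fi
```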

pkgadd Method

The pkgadd format is provided for the Solaris Operating System. To facilitate remote installation, the pkgadd directories are also provided in zip files.

You can install the following packages:

Package Description
SUNWsgeec Architecture independent files
SUNWsgeex Solaris (SPARC platform) 64-bit binaries for Solaris 8, Solaris 9, and Solaris 10 Operating Systems
SUNWsgeei Solaris (x86 platform) binaries for Solaris 8, Solaris 9, and Solaris 10 Operating Systems
SUNWsgeeax Solaris (x64 platform) binaries for Solaris 10 Operating System
SUNWsgeea Accounting and Reporting Console (ARCo) packages for the Solaris and Linux Operating systems.

As you type the following commands, you must be prepared to respond to script questions about your base directory, sge-root, and the administrative user. The script requests the choices that you made during the planning steps of this installation. See Decisions That You Must Make for further details.

At the command prompt, type the following commands, responding to the script questions.

# cd cdrom_mount_point/Sun_Grid_Engine_6_2
# pkgadd -d ./Common/Packages SUNWsgeec

Depending on the Solaris binary that you need, type one of the following commands:

# pkgadd -d ./Solaris_sparc/Packages SUNWsgee 
# pkgadd -d ./Solaris_sparc/Packages SUNWsgeex 
# pkgadd -d ./Solaris_x86/Packages SUNWsgeei 
# pkgadd -d ./Solaris_x64/Packages SUNWsgeeax 

tar Method

For all supported operating systems, the software is distributed in tar.gz format.

The following table contains files that you need to install, regardless of platform.

File Description
Common/tar/sge-6_2-common.tar.gz Architecture independent files

The tar files that contain platform-specific binaries use the naming convention of sge-6_2-bin-architecture.tar.gz.

The following table lists the platform-specific binaries. Install the file for each platform that you need to support. Note that each platform has its own directory under Sun_Grid_Engine_6_2.

Platform-Specific File Platform
Solaris_sparc/tar/sge-6_2-bin-solaris-sparcv9.tar.gz Solaris (SPARC platform) 64-bit binaries for Solaris 8, Solaris 9, and Solaris 10 Operating Systems
Solaris_x86/tar/sge-6_2-bin-solaris-i586.tar.gz Solaris (x86 platform) binaries for Solaris 8, Solaris 9, and Solaris 10 Operating Systems
Solaris_x64/tar/sge-6_2-bin-solaris-x64.tar.gz Solaris (x64 platform) 64-bit binaries for Solaris 10
Windows/tar/sge-6_2-bin-windows-x86.tar.gz Microsoft Windows (x86 platform) 32-bit binaries for Windows 2000, XP and Windows Server 2003
Linux24_i586/tar/sge-6_2-bin-linux24-i586.tar.gz Linux (x86 platform) binaries for the 2.4 and 2.6 kernel
Linux24_amd64/tar/sge-6_2-bin-linux24-ia64.tar.gz Linux (Itanium platform) binaries for the 2.4 and 2.6 kernel
Linux24_amd64/tar/sge-6_2-bin-linux24-x64.tar.gz Linux binaries for the 2.4 and 2.6 kernel
MacOSX/tar/sge-6_2-bin-darwin-ppc.tar.gz Apple Mac OS/X (PowerPC platform)
MacOSX/tar/sge-6_2-bin-darwin-x64.tar.gz Apple Mac OS/X (Intel-based platform)
HPUX11/tar/sge-6_2-bin-hp11.tar.gz Hewlett-Packard HP-UX 11 or higher
HPUX11/tar/sge-6_2-bin-hp11-64.tar.gz 64-bit binaries for Hewlett-Packard HP-UX 11 or higher
Aix43/tar/n1ge-6_1-bin-aix51.tar.gz IBM AIX 5.1 and 5.3

Type the following commands at the command prompt. In the example, <basedir> is the abbreviation for the full directory, cdrom-mount-point/Sun_Grid_Engine_6_2.

% su 
# cd <sge-root>
# gzip -dc <basedir>/Common/tar/sge-6_2-common.tar.gz | tar xvpf -
# gzip -dc <basedir>/Solaris_sparc/tar/sge-6_2-bin-solaris-sparcv9.tar.gz | tar xvpf -
# SGE_ROOT=<sge-root>; export SGE_ROOT
# util/setfileperm.sh $SGE_ROOT
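
When several binary architectures are needed, the extraction commands can be wrapped in a loop. The following is a minimal sketch under that assumption; the function name unpack_archives is hypothetical, and the archive paths in the usage comment must be replaced with the files that apply to your cluster.

```shell
#!/bin/sh
# Hypothetical helper: unpack each tar.gz archive into directory $1.
# $1 = destination directory (sge-root); remaining arguments = archives.
unpack_archives() {
    dest=$1
    shift
    for archive in "$@"; do
        if [ ! -r "$archive" ]; then
            echo "missing archive: $archive" >&2
            return 1
        fi
        # Same pipeline as the manual commands, run inside the destination.
        gzip -dc "$archive" | (cd "$dest" && tar xpf -) || return 1
    done
}

# Example usage (commented out because the paths depend on your media):
# unpack_archives "$SGE_ROOT" \
#     "$basedir/Common/tar/sge-6_2-common.tar.gz" \
#     "$basedir/Solaris_x64/tar/sge-6_2-bin-solaris-x64.tar.gz"
```

The function stops at the first archive that is missing or fails to unpack, so a partial installation is easier to notice.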

Installing the Software With the GUI Installer

Sun Grid Engine 6.2u2 comes with a new GUI installer to simplify the installation process. The GUI installer enables you to easily install a whole cluster interactively. To install a cluster, you need to set up the environment in a similar way to an automatic installation.

Requirements

  • The GUI installer requires at least Version 5 of the Java™ platform.
  • Screen resolution of 1024x768 or larger.
  • (Optional) Password-less ssh or rsh access as root user to all remote hosts that you want to install. If this requirement is not met, you can install Grid Engine components only on the local host, although you can still use the GUI installer by starting it locally on each remote host. For more information, see How to Configure Password-less Access for the root User.

Recommendations
  1. Start the installer as root user.
  2. Ensure that you start the installation from the qmaster host when password-less root access is available.

For information on installation modes supported by the Sun Grid Engine 6.2u2 GUI installer, see these topics:

Topic                 Description
Express Installation  Enables first-time users to install the software easily. Provides a significantly reduced set of parameters that need to be configured. Requires password-less ssh access as root user to all remote hosts that you want to install.
Custom Installation   Enables you to configure almost all existing options that are available during the command-line installation. Offers more advanced features for the cluster host selection. Requires password-less ssh or rsh access as root user to all remote hosts that you want to install.

For additional reference information, see these topics:

Topic                                                     Description
How to Configure Password-less Access for the root User  Procedure for configuring password-less ssh or rsh access for the root user to install a whole Sun Grid Engine (SGE) cluster by using the GUI installer.
Understanding Host and Installation States                Describes the different installation states that you might encounter while using the GUI installer.
Tweaking start_gui_installer                              Describes the command-line options of the start_gui_installer command and how to use them to fine-tune the performance of the installer.
Troubleshooting the GUI Installer                         Contains known issues and their workarounds.

Express Installation

The express installation mode is targeted at first-time users and provides a significantly reduced set of parameters to configure. This mode also provides reasonable default values for most of the parameters. You must have password-less ssh or scp access if you plan to install Sun Grid Engine on remote hosts. The following steps describe a complete cluster installation and assume that the password-less access is configured.

Using the Express Installation Mode

The express installation steps are as follows.

Steps

  1. Start the GUI installer. On the welcome screen, click Next.
    Note

    Ensure that you start the GUI installer on the qmaster host.

    As root, run the start_gui_installer command in your sge-root directory. For example:

    qmaster:/sge# ./start_gui_installer
    Starting Installer ...
    



  2. Agree to the terms of the license. Click Next.


  3. Choose components to install. Click Next.

    See the following table for a brief explanation of options displayed on this screen.
    Host type Description
    Qmaster host Main component in Sun Grid Engine software. You must install exactly one qmaster component per Grid Engine installation.
    Execution host(s) Hosts that execute the tasks (jobs).
    Shadow host(s) Hosts that provide a high availability feature to the cluster. In case the qmaster fails (for example, due to a crash or network issue), one of the shadow hosts takes over the qmaster responsibility.
    Berkeley db host Selecting this host type implies the Berkeley db spooling server option. Grid Engine then spools data to a remote server. Not recommended as the default option.

    If you are not sure what you want to install, keep the components selected by default.

  4. Modify the main configuration details. Click Next.
    Option Description
    Admin user Grid Engine processes will be executed under this user name, and certain directories will be owned by this user.
    Qmaster host Host that will run qmaster daemon (main component). It can be changed later in the host selection.
    Grid Engine root directory Directory where you unpacked the Grid Engine tar.gz archive or installed a package (for example, rpm, pkg). It must not contain an automounter prefix.
    Cell name Name of this Grid Engine cell, a value that identifies an instance of a Grid Engine when several instances run simultaneously.
    Cluster name Name of this Grid Engine instance used by SMF on Solaris machines. In express installation mode, this field is hidden and has a value of p6444. The following naming restrictions apply to this field: The cluster name must start with a letter ([A-Za-z]), followed by letters, digits ([0-9]), dashes ("-"), or underscores ("_").
    Qmaster port Port that will be used by the qmaster daemon. Default value is 6444.
    Execd port Port that will be used by the execution daemon. Default value is 6445.
    Administrator mail Email address used by Grid Engine to report issues to the grid administrator. Default value is none (no emails will be sent).
    Automatically start service(s) at machine boot Component (service) will be automatically started at machine boot. By default, this is selected.

    Typically, you provide a valid administrator email address and click Next.

  5. Select hosts to be installed and fix reported problems. Click Install to start the installation on the reachable hosts.

    This screen allows you to select the hosts and components that you would like to install. Express installation mode has a slightly simplified selection model. Custom installation mode enables you to change the components that will be selected once new hosts are added. The qmaster host is added based on the qmaster host value from the main configuration screen by default. You can select the hosts in one of two different ways:
    1. By a host name, host name pattern, or by an IP address or IP address pattern
    2. From a file that you create using the installer's save action
    The patterns do not support regular expressions. The supported expressions are lists and numeric ranges. For more information, see the following table:
    Description Input Resolved Value
    Host name grid00 grid00
    IP address 192.168.0.1 192.168.0.1
    List of hosts grid00 grid01 grid05 grid00 grid01 grid05
    List of IP addresses 192.168.0.1 192.168.0.2 192.168.0.5 192.168.0.1 192.168.0.2 192.168.0.5
    Host ranges grid[00-10] grid00, grid01, ..., grid10
    Range of IP addresses 192.[168-169].0.[50-60] 192.168.0.50 ... 192.168.0.60, 192.169.0.50 ... 192.169.0.60


    In the following screen sequence, hosts grid01 to grid07 are added as execution and submit hosts. However, three hosts have errors. See Understanding Host and Installation States for a complete list of errors and possible solutions. Note that each state has a tooltip that displays a more detailed error message. Once the errors are resolved on the problematic hosts, select the hosts that you want to verify and right-click. A pop-up menu enables you to refresh the selected hosts. Optionally, invalid hosts can be removed. Once the states have been refreshed, a different error state or the Reachable state is displayed.



  6. (Optional) Fix problems reported during pre-install validation, then click Install.
    When you click the Install button as described in Step 5, the installation does not start immediately. First, the installer executes a series of advanced checks for each host to verify that there is no misconfiguration. If the validation fails, host states are updated and you are presented with an option to return to the host selection or to continue with the installation.
    Note

    Continuing the installation after the installer reports errors will likely result in a failed installation. Before restarting the installation, you should return to the host selection and either resolve the reported problems or remove the hosts that have configuration errors.

    In the following screen sequence, three hosts have configuration errors. See Understanding Host and Installation States for a complete list of errors and possible solutions. Notice that each state includes a tooltip that displays an error message.



  7. Monitor the progress of the installation, then click Next.


    If there were any failures during the installation, the Failed tab is selected. See Understanding Host and Installation States for a complete list of installation states. Click the Log button for each failed installation for more information.

    This error was produced because the cluster name p6444 already exists on this host (installation was not attempted).



  8. Review the overview information, then click Done.

    Optionally, print or save the information about the Sun Grid Engine configuration for future reference. The page is also automatically saved to the $SGE_ROOT/$SGE_CELL/Readme_TIMESTAMP.html file. If the page cannot be saved there because root is mapped to nobody on an NFS shared file system, it is saved to /tmp/Readme_TIMESTAMP.html.
    To verify the installation, go to Verifying the Installation.
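
The host patterns accepted in the host selection step (lists and numeric ranges, not regular expressions) can be previewed outside the installer. The following is a minimal shell sketch, not the installer's own code. The function names expand_pattern and expand_host_list are made up for this illustration, and the sketch assumes that zero padding follows the width of the lower bound of a range, so that grid[00-10] yields grid00 rather than grid0.

```shell
# Expand one pattern such as grid[00-10] or 192.[168-169].0.[50-60].
# Prints one resolved name per line. The function body runs in a
# subshell "( ... )" so that its variables stay local during recursion.
expand_pattern() (
    pattern=$1
    case $pattern in
        *\[*-*\]*)
            prefix=${pattern%%\[*}   # text before the first '['
            rest=${pattern#*\[}      # text after the first '['
            range=${rest%%\]*}       # "A-B" inside the brackets
            suffix=${rest#*\]}       # text after the first ']'
            first=${range%%-*}
            last=${range#*-}
            # Assumption: pad numbers to the width of the lower bound.
            awk -v s="$first" -v e="$last" -v w="${#first}" '
                BEGIN {
                    fmt = "%0" w "d\n"
                    for (i = s + 0; i <= e + 0; i++) printf fmt, i
                }' |
            while read -r value; do
                # Recurse to expand any further ranges in the suffix.
                expand_pattern "$prefix$value$suffix"
            done
            ;;
        *)
            # No range left: print the name unchanged.
            printf '%s\n' "$pattern"
            ;;
    esac
)

# Expand a space-separated list of names and patterns.
expand_host_list() {
    for item in "$@"; do
        expand_pattern "$item"
    done
}
```

For example, expand_host_list 'grid[00-02]' 192.168.0.1 prints grid00, grid01, grid02, and 192.168.0.1, one per line. Quote patterns that contain brackets so that the shell does not try to interpret them as file name globs.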

Custom Installation

The custom installation mode is targeted at experienced users. It offers more advanced customization of the Sun Grid Engine installation than the express installation and provides default values for most of the parameters. You must have password-less ssh or rsh access if you plan to install Sun Grid Engine on remote hosts. The following steps assume that the password-less access is configured and describe a cluster installation consisting of:

  • Qmaster host with JMX feature enabled
  • Eight execution hosts on various architectures
  • Three shadow hosts
  • Three administrative hosts
  • Nine submit hosts

Using the Custom Installation Mode

The custom installation steps are as follows.

Steps

  1. Start the GUI Installer. On the welcome screen, click Next.
    Note

    Ensure that you start the GUI installer on the qmaster host.

    As root, run the start_gui_installer command in your sge-root directory. For example:

    qmaster:/sge# ./start_gui_installer
    Starting Installer ...
    



  2. Agree to the terms of the license. Click Next.


  3. Choose components to install, including a shadow host and the custom installation option, and click Next.

    See the following table for a brief explanation of options displayed on this screen.
    Host type Description
    Qmaster host Main component in Sun Grid Engine software. Exactly one qmaster component must be installed per Grid Engine installation.
    Execution host(s) Hosts that execute the tasks (jobs).
    Shadow host(s) Hosts that provide a high availability feature to the cluster. In case the qmaster fails (for example, due to a crash or network issue), one of the shadow hosts takes over the qmaster responsibility.
    Berkeley db host Selecting this host type implies the Berkeley db spooling server option. Grid Engine then spools data to a remote server. Not recommended as the default option.



  4. Modify the main configuration details. Click Next.
    Option Description
    Admin user Grid Engine processes will be executed under this user name, and certain directories will be owned by this user.
    Qmaster host Host that will run qmaster daemon (main component). It can be changed later in the host selection.
    Grid Engine root directory Directory where you unpacked the Grid Engine tar.gz archive or installed a package (for example, rpm, pkg). It must not contain an automounter prefix.
    Cell name Name of this Grid Engine cell, a value that identifies an instance of a Grid Engine when several instances run simultaneously.
    Cluster name Name of this Grid Engine instance used by SMF on Solaris machines. The following naming restrictions apply to this field: The cluster name must start with a letter ([A-Za-z]), followed by letters, digits ([0-9]), dashes ("-"), or underscores ("_").
    Qmaster port Port that will be used by the qmaster daemon. Default value is 6444.
    Execd port Port that will be used by the execution daemon. Default value is 6445.
    Group id range Range of additional group IDs. The group IDs in this range must not be used anywhere else. The size of the range determines how many concurrent jobs can run in Sun Grid Engine. Choose a large value.
    Shell name Shell to be used while connecting to remote hosts (with ssh or rsh syntax). Expected values for this field are ssh or rsh.
    Copy command Command to be used while copying files to remote hosts (with scp or rcp syntax). Expected values for this field are scp or rcp.
    Administrator mail Email address used by the Sun Grid Engine to report issues to the grid administrator. Default value is none (no emails will be sent).
    Automatically start service(s) at machine boot Component (service) will be automatically started at machine boot. By default, this is selected.
    Use JMX Triggers installation of a JVM thread in qmaster. Currently only needed when you plan to install Service Domain Manager. By default, this is selected.
    Ignore domain names Sun Grid Engine will ignore domain names when comparing host names. By default, this is selected.
    Use CSP product mode Sun Grid Engine will be installed with certificate security protocol (CSP). Communication between Sun Grid Engine daemons will be protected by an SSL certificate. Has impact on cluster throughput. By default, this is not selected.

    Typically, one would customize the default values and click Next.

  5. Modify the JMX configuration details. Click Next.
    Option Description
    JMX port Port number to be used by JVM thread in qmaster process.
    Enable SSL server authentication Once enabled, SSL certificate configuration will be presented later. The server certificate will be used for authentication and encryption.
    Enable SSL client authentication Client authentication will be used.
    Path to the keystore Path to Java keystore file that will be created during the qmaster installation.
    Keystore password Keystore password. Default value is changeit.
    Retype password Password to retype. Default value is changeit.
    JVM library path Path to the JVM library on the qmaster host. Be sure to enter a correct path. If you are on a 64-bit system, you might have executed the GUI installer in default Java 32-bit mode. In that case, the detected path to the library will be for 32-bit Java, and the thread will fail to start later in 64-bit mode.
    Additional JVM args Additional arguments to be used when starting the JVM in qmaster. Default is -Xmx256m.



  6. Modify the spooling configuration. Click Next.
    Option Description
    Qmaster spool directory Directory for qmaster spooling data.
    Global execd spool directory Execution daemon spool directory used by default for all execution hosts. Unless overridden in the host selection screen, each execution host creates a subdirectory in the global execd spool directory.
    Classic spooling method Spooling is done in human readable format.
    Berkeley db spooling method Spooling is done to a local Berkeley db.
    Berkeley db spooling server spooling method Spooling is done to a Berkeley db server.
    Berkeley db host Host for Berkeley db server, enabled only when Berkeley db spooling server method is selected.
    Db directory Berkeley db spooling directory, either on local host or Berkeley db host, if Berkeley db spooling server method is selected.



  7. (Optional) Provide SSL certificate information. Click Next.

    This screen is displayed only when you have previously selected the JMX or CSP features. An SSL certificate will be generated as part of qmaster installation. This certificate will then be used throughout the Grid Engine.
    Option Description
    Country code Two-character country code. Default value is DE.
    State State. Default value is GERMANY.
    Location Location. Default value is Building.
    Organization Organization. Default value is Organisation.
    Organization unit Organization unit. Default value is Organisation_unit.
    Email address Email address. Default value is name@yourdomain.com.



  8. Select hosts to be installed and fix reported problems. Click Install to start the installation on the reachable hosts.

    This screen allows you to select the hosts and components that you would like to install. The qmaster host is added based on the qmaster host value from the main configuration screen by default. You can select the hosts in one of two different ways:
    1. By a host name, host name pattern, or by an IP address or IP address pattern
    2. From a file that you create using the installer's save action
    The patterns do not support regular expressions. The supported expressions are lists and numeric ranges. For more information, see the following table:
    Description Input Resolved Value
    Host name grid00 grid00
    IP address 192.168.0.1 192.168.0.1
    List of hosts grid00 grid01 grid05 grid00 grid01 grid05
    List of IP addresses 192.168.0.1 192.168.0.2 192.168.0.5 192.168.0.1 192.168.0.2 192.168.0.5
    Host ranges grid[00-10] grid00, grid01, ..., grid10
    Range of IP addresses 192.[168-169].0.[50-60] 192.168.0.50 ... 192.168.0.60, 192.169.0.50 ... 192.169.0.60


    In the following screen sequence, eight execution and submit hosts are added from a file. One host has an error; it is unreachable. See Understanding Host and Installation States for a complete list of errors and possible solutions. Note that each state has a tooltip that displays a more detailed error message. Hosts can be refreshed or removed using a context menu. In addition, two hosts are added as shadow and administrative hosts. Before you add these hosts by clicking the Add button, you must change the default component selection from execution and submit host to shadow and admin host.



  9. (Optional) Fix problems reported during pre-install validation. Click Install.

    When you click the Install button as described in Step 8, the installation does not start immediately. First, the installer executes a series of advanced checks for each host to verify that there is no misconfiguration. If the validation fails, host states are updated and you may return to the host selection or continue with the installation.
    Note

    Continuing the installation after the installer reports errors will likely result in a failed installation. Before restarting the installation, you should return to the host selection and either resolve the reported problems or remove the hosts that have configuration errors.

    See Understanding Host and Installation States for a complete list of errors and possible solutions. An example of a pre-install validation with errors can be found in Express Installation.

  10. Monitor the progress of the installation, then click Next.


    If there were any failures during the installation, see Understanding Host and Installation States for a complete list of installation states. Click the Log button for each failed installation for more information.

  11. Review the overview information, then click Done.

    Optionally, print or save the information about the Sun Grid Engine configuration for future reference. The page is also automatically saved to the $SGE_ROOT/$SGE_CELL/Readme_TIMESTAMP.html file. If the page cannot be saved there because root is mapped to nobody on an NFS shared file system, it is saved to /tmp/Readme_TIMESTAMP.html.
    To verify the installation, go to Verifying the Installation.
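
Two of the main configuration values in the custom installation procedure lend themselves to a quick sanity check before you start the installer: the cluster name restrictions and the size of the group ID range. The following shell sketch is illustrative only. The function names are invented here, and the range 20000-20100 mentioned in the usage note is an arbitrary example, not a documented default.

```shell
# Succeeds if the name starts with a letter and continues with
# letters, digits, dashes, or underscores only.
valid_cluster_name() {
    case $1 in
        [A-Za-z]*) ;;          # first character is a letter
        *) return 1 ;;         # empty or starts with something else
    esac
    case $1 in
        *[!A-Za-z0-9_-]*) return 1 ;;   # contains a forbidden character
    esac
    return 0
}

# Prints the number of group IDs in a range written as LOW-HIGH.
# Both bounds are inclusive.
gid_range_size() {
    low=${1%-*}
    high=${1#*-}
    echo $((high - low + 1))
}
```

For example, valid_cluster_name p6444 succeeds and valid_cluster_name 6444p fails, while gid_range_size 20000-20100 prints 101, the number of additional group IDs available in that range.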

How to Configure Password-less Access for the root User

This section describes how to set up password-less ssh or rsh access for the root user so that you can install a whole Sun Grid Engine cluster at once by using the GUI installer. The Sun Grid Engine installation must be started on the qmaster host, so you first need to decide which host is going to be the qmaster host. The following instructions use qmaster as the qmaster host name. You must replace qmaster with your qmaster host name.

Recommendation
You can skip this procedure if you plan to install Sun Grid Engine only on a local host.
Warning
Enabling root login without a password can be a security risk!
The commands and configuration files used in the following procedure are applicable only to the Solaris 10 operating system. You can substitute these with commands and configuration files that are appropriate for your operating system.
Note
Installing Grid Engine cluster with CSP option may additionally require password-less access to the localhost (qmaster host to the qmaster host).

Configuring Password-less ssh Access for the root User

  1. Enable root login.
    For security reasons, using ssh as root is disabled on many platforms by default. Perform the following for each host on which you will log in using password-less ssh as the root user:
    1. As root, open the /etc/ssh/sshd_config file.
    2. Modify PermitRootLogin no to PermitRootLogin without-password.
  2. Restart the ssh service on all remote hosts.
    As root, type the following command on each remote host.
    svcadm disable -st ssh ; svcadm enable ssh


  3. Generate an RSA key pair on the qmaster host.
    As root, type the following command to generate the RSA key on the qmaster host. You should leave the passphrase empty.
    # ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Created directory '/root/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    ec:fa:48:55:c4:3d:59:40:a6:27:10:a2:90:11:de:dc root@qmaster
    


  4. Copy the public key to all remote hosts.
    Copy the generated public key contained in the id_rsa.pub file to every remote host that should accept root login without a password from this host.
    The following example enables root access to host grid05 from host qmaster.
    qmaster# cat /root/.ssh/id_rsa.pub
    ssh-rsa ACCCB3NzaC1yc2EBBBBBIwAAAIEA1xfRiZMV7xt8EMDollLQH5RTAVz3lIXkr/FTfcbwjuMa0t/PdO9gBnJY03e1mIIpjDPiqT2IWfdrzHZB4xvl0MBNhMTWf8Gd3WDO4T7/zw7VhlqT6wUl0ncrhzE5BTIMB0i0X/amgidEzFbL+hE3RvPuowapNZUv+JC1IjDVmmE= root@qmaster
    qmaster# ssh grid05
    The authenticity of host 'grid05 (192.168.1.5)' can't be established.
    RSA key fingerprint is ec:fa:48:55:c4:3d:59:40:a6:27:10:a2:90:11:de:dc.
    Are you sure you want to continue connecting (yes/no)? yes
    Password:
    grid05# mkdir -p ~/.ssh
    grid05# echo "ssh-rsa ACCCB3NzaC1yc2EBBBBBIwAAAIEA1xfRiZMV7xt8EMDollLQH5RTAVz3lIXkr/FTfcbwjuMa0t/PdO9gBnJY03e1mIIpjDPiqT2IWfdrzHZB4xvl0MBNhMTWf8Gd3WDO4T7/zw7VhlqT6wUl0ncrhzE5BTIMB0i0X/amgidEzFbL+hE3RvPuowapNZUv+JC1IjDVmmE= root@qmaster" >> ~/.ssh/authorized_keys
    


  5. Verify that you can connect to the hosts as root without a password.
    As root, type the following command.
    ssh <remote_password-less_host>

    If you are able to connect to the hosts without being prompted, password-less access to the hosts has been set up. Now, you can invoke the GUI installer using the start_gui_installer command from your sge-root directory.
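
When many hosts are involved, the verification in the last step can be scripted. The following is a minimal sketch under stated assumptions: the function name check_passwordless and the SSH_CMD override variable are hypothetical, and the sketch assumes that your ssh client supports the BatchMode option (as OpenSSH does), which makes ssh fail instead of prompting for a password.

```shell
# Try a password-less login to each host and report the result.
# SSH_CMD is a hypothetical override hook; by default the sketch
# uses ssh in batch mode so that it never waits for a password.
check_passwordless() {
    failed=0
    for host in "$@"; do
        # Run the harmless remote command "true"; stdin is detached
        # so that no prompt can block the loop.
        if ${SSH_CMD:-ssh -o BatchMode=yes} "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host: OK"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Run as root on the qmaster host, for example check_passwordless grid01 grid02 grid05. Any host reported as FAILED still needs the key setup described in the preceding steps, or is not reachable at all.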

Configuring Password-less rsh Access for the root User

  1. Enable root login.
    Normally, the root user can only log in to the console /dev/console. You can remove this restriction by performing the following.
    1. Open the /etc/default/login file.
    2. Comment out the CONSOLE=/dev/console line by inserting a # character at the beginning of the line.
      You need to perform this for each remote host you would like to log in to.
  2. Set up access without a password.
    1. Create a .rhosts file.
    2. Add a single line that contains the qmaster's host name preceded by a + sign.
      For example, if foo is the qmaster's host name, add the line +foo to the .rhosts file.
    3. Copy this file to the root user's home directory on each of the remote hosts where you wish to install Sun Grid Engine.
      This will allow root to log in from the qmaster host without a password to any machine that will be part of the cluster.
  3. Restart the rlogin service on all remote hosts.
    As root, type the following command.
    svcadm disable -st rlogin ; svcadm enable rlogin


  4. Verify that you can connect to the hosts as root without a password.
    As root, type the following command.
    rlogin <remote_password-less_host>

    If you are able to connect to the hosts without being prompted, password-less access to the hosts has been set up. Now, you can invoke the GUI installer using the start_gui_installer command from your sge-root directory. Choose the custom installation mode and replace ssh with rsh and scp with rcp in the Main configuration panel.


Understanding Host and Installation States

This section lists the different installation states that you might encounter while using the GUI installer. The installation states can be divided into the following three categories.

Host Resolving

When a new host is added in the Select hosts screen, the host's State field is immediately set to New unknown host and the host name resolving process is initiated. The host is marked as Reachable only if the architecture of the host can be retrieved. All other final states indicate an error, and the GUI installer cannot perform any installation on such a host. The following table lists all possible states.

State Description
New unknown host Initial state. When the host name is added, the GUI installer immediately starts resolving the host name or IP address of the host, if there are available threads in the resolve pool.
Resolving Temporary state. The host is being resolved based on the host name or IP address by using the default name service.
Unknown host Final state. The host cannot be resolved by the name service.
Resolvable Temporary state. After the host is resolved, the GUI installer immediately tries to retrieve the host's architecture, if there are available threads in the resolve pool.
Contacting Temporary state. The host has been resolved and the host's architecture is being retrieved.
Missing remote file Final state. Missing file '$SGE_ROOT/util/arch' on remote host. Is the sge-root path the same for the remote host and the local host? If not, fix the path or refer to Using Path Aliasing.
Reachable Final state. The host architecture has been retrieved. Password-less ssh or rsh access to remote hosts is working properly.
Unreachable Final state. The host architecture cannot be retrieved. Password-less ssh or rsh access to remote hosts is not working properly. See How to Configure Password-less Access for the root User for more information.
Canceled Final state. The user has canceled the host resolving process.
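
When a host ends up in an error state, it can help to reproduce the checks by hand. The following shell sketch is modelled on the state descriptions above. It is not the installer's actual implementation. The function name probe_host and the RESOLVE_CMD and REMOTE_CMD override variables are hypothetical, the fallback path /sge is only a placeholder, and the sketch assumes ssh (not rsh) because it relies on ssh returning exit status 255 when the connection itself fails.

```shell
# Print a state name for one host: Unknown host, Unreachable,
# Missing remote file, or Reachable.
probe_host() {
    host=$1
    sge_root=${SGE_ROOT:-/sge}   # placeholder default, adjust as needed

    # Step 1: can the name service resolve the host?
    if ! ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
        echo "Unknown host"
        return 1
    fi

    # Step 2: can we log in without a password and find util/arch?
    ${REMOTE_CMD:-ssh -o BatchMode=yes} "$host" "test -x $sge_root/util/arch" \
        </dev/null >/dev/null 2>&1
    case $? in
        0)   echo "Reachable" ;;
        255) echo "Unreachable"; return 1 ;;          # ssh itself failed
        *)   echo "Missing remote file"; return 1 ;;  # remote test failed
    esac
}
```

For example, probe_host grid05 run as root on the qmaster host prints one of the four state names. A result of Missing remote file suggests comparing the sge-root path on both hosts, and Unreachable suggests rechecking the password-less access.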

Host Validating

After the hosts have been resolved and their architecture has been retrieved, they are moved to the Reachable tab in the Select hosts screen. You can install Sun Grid Engine on a host that is in the Reachable state. When you click the Install button, the GUI installer first invokes additional remote host validation. If the installer discovers any configuration errors (see RED and ORANGE states in the list below), the installation is not initiated and the appropriate error message is displayed. You can then return to the Select hosts screen or proceed with the installation anyway.

State | Description | Problem Resolution
Copy timeout | Timeout occurred when copying the check_host or install_component files. See the tooltip for the exact file name. | Try again (click the Install button one more time). If the timeout recurs, save your host list to a file, stop the installer, and restart it with increased timeout values. See Tweaking start_gui_installer.
Copy failed | Copying the check_host or install_component files to the remote host failed. See the tooltip for the exact file name. | Try again (click the Install button one more time). If the problem recurs, try to copy any file with scp or rcp to verify that these commands work properly. If they do not, fix them before the next installation attempt.
Permission denied | One of the Berkeley DB, qmaster, or execution daemon spool directories, or the JMX keystore file, is not writable. See the tooltip for the exact message. Installation will most likely fail if you proceed anyway. | Did you start the installation as root? What are the permissions of the first existing directory? Are you on an NFS file system with root mapped to nobody? Is the UID for the admin user the same on the local and remote machine?
Admin user missing | The admin user entered in the main configuration screen does not exist on the remote machine. | Set up the host so that the name service provides the user to the remote machine (or create the user locally).
Directory exists | The Berkeley DB spool directory already exists. | Check the remote host for existing Berkeley DB installations. Remove the existing directory.
Wrong FS type | The specified Berkeley DB spool directory is not on a local file system. | Go back to the spooling configuration screen and choose a proper local directory.
Unknown error | An unknown error has occurred. | Try again (click the Install button one more time). If the error recurs, ignore it and try to install anyway.
Reachable | Validation did not discover any issues for this remote host. |
Canceled | User canceled further host validation. |

Installation States

When the installation is started, the host list with the chosen components is transformed into a task list, which is better suited to handling dependencies. You may encounter the following states during the installation.

State Description
Waiting Task is waiting to be executed.
Processing Temporary state. Task is being processed.
Timeout Task did not finish before timeout value has been reached.
Success Task finished successfully.
Failed Task finished unsuccessfully. Click the Log button to get more information.
Failed due to dependency Task was not started, because it depended on a task that failed. Click the Log button to get more information.
Component already exists Task was not started. The installation detected a previous conflicting component installation. Click the Log button to get more information. Remove any remains of the old installation, before trying again.
Canceled User canceled the installation process.

Tweaking start_gui_installer

The start_gui_installer command starts the Java™ GUI installer. This section describes the command-line options of start_gui_installer that you can use to tune the performance of the installer in your environment or as a workaround for as yet unknown issues.

To display the help text, use the -help option.

master:/sge62u2 # ./start_gui_installer -help
Usage: start_gui_installer [-help] [-resolve_pool=<num>] [-resolve_timeout=<sec>]
       [-install_pool=<num>] [-install_timeout=<sec>] [-connect_user=<usr>]
       [-connect_mode=windows]

   <num> ... decimal number greater than zero
   <sec> ... number of seconds, must be greater then zero
   <usr> ... user id

If no parameter is specified, the start_gui_installer command is started as if the following command was called:

master:/sge62u2 # ./start_gui_installer -resolve_pool=12 -resolve_timeout=20 -install_pool=8 -install_timeout=120

Every installation generates installation logs in the sge_root/sge_cell/install_logs directory. In addition, a GUI log file is created in a $TEMP directory (usually /var/tmp or /tmp) named SGE_Gui-Installer_Log_<date>.txt.
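
Because the GUI log file name contains a date, locating the most recent log usually means searching the candidate directories. The following is a small sketch. The function name latest_gui_log is invented for this example, and the directories to search are passed as arguments because the actual $TEMP location depends on your system.

```shell
# Print the most recently modified GUI installer log found in the
# first of the given directories that contains one.
latest_gui_log() {
    for dir in "$@"; do
        # ls -t sorts by modification time, newest first.
        newest=$(ls -t "$dir"/SGE_Gui-Installer_Log_*.txt 2>/dev/null | head -n 1)
        if [ -n "$newest" ]; then
            printf '%s\n' "$newest"
            return 0
        fi
    done
    return 1   # no log file found in any directory
}
```

For example, latest_gui_log /var/tmp /tmp prints the path of the newest log in /var/tmp, or in /tmp if /var/tmp contains none.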

Description of start_gui_installer Options

Option Description
-help Displays help for start_gui_installer.
-resolve_pool=<num> Defaults to 12. Defines how many hosts can be resolved in parallel when adding new hosts, refreshing their states, or validating hosts. The higher the value, the higher the load generated when resolving hosts, refreshing host states, copying an installation script to remote hosts, or validating hosts.
-resolve_timeout=<sec> Defaults to 20 seconds. A timeout value for any operation in the resolve pool (resolving hosts, refreshing host states, copying an installation script to a remote host). Host validation has a timeout that is always equal to 2 * resolve_timeout. Increase the default value if you see hosts in the Unreachable state and you are sure that password-less access is working correctly for the connect_user.
-install_pool=<num> Defaults to 8. Defines how many execution daemons can be installed in parallel. The higher the value, the higher the load generated when performing installation tasks.
-install_timeout=<sec> Defaults to 120 seconds. A timeout value for any installation task. Increase the default value if you see installation tasks failing with a Timeout state.
-connect_user=<usr> Defaults to the current user. User name that is used when connecting to remote hosts.
-connect_mode=windows When set, each connect_user is prefixed by a host domain (see examples below). This is useful when installing multiple Windows execution hosts that require a different connect_user.
-debug Starts the installer in a debug mode. Prints a lot of output to the terminal. Intended for developer purposes, but may provide additional information when unexpected circumstances occur.
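
The relationship between the timeout options can be made concrete with a little arithmetic. The following sketch only computes values and prints a command line for review; it does not start the installer. The function names validation_timeout and installer_cmd are hypothetical helpers, and the values 60 and 300 in the usage note are arbitrary examples.

```shell
# Host validation timeout is twice the resolve timeout
# (default resolve timeout: 20 seconds).
validation_timeout() {
    echo $(( ${1:-20} * 2 ))
}

# Print a start_gui_installer command line with the given
# resolve and install timeouts (defaults: 20 and 120 seconds).
installer_cmd() {
    resolve_timeout=${1:-20}
    install_timeout=${2:-120}
    echo "./start_gui_installer -resolve_timeout=$resolve_timeout -install_timeout=$install_timeout"
}
```

For example, validation_timeout 60 prints 120, and installer_cmd 60 300 prints ./start_gui_installer -resolve_timeout=60 -install_timeout=300, which you can then run from your sge-root directory.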

Using start_gui_installer Options

Installing as a different connect_user
Suppose that you cannot log in as the root user, but can log in as another privileged user with uid=0, called admin. In this case, a remote connection is attempted as the current user, but because of uid=0 the connection is made as root if root is the primary user with uid=0 on the remote host. The users admin and root have different home directories, and we assume that password-less access was set up only for the user admin, so a password-less connection as the current user would fail. The following command forces every remote connection to be established as the admin user.

master:/sge62u2 # ./start_gui_installer -connect_user=admin


Installing a single Windows execution host
Suppose you want to use the installer to add a single Windows execution host to an existing cluster. The host is called win-01 and belongs to the WIN-01 domain. The privileged user in this case is Admin (a member of the Administrators group).

Windows hosts can only be installed remotely from a UNIX or Linux system, where you cannot become the Admin user. Instead, use -connect_user=WIN-01+Admin to connect as the correct user directly.

master:/sge62u2 # ./start_gui_installer -connect_user=WIN-01+Admin


Installing multiple Windows execution hosts
Suppose you have two additional hosts: win-02, which belongs to the WIN-02 domain, and win_vista-01, which belongs to the WIN_VISTA-01 domain. Each host has a privileged Administrator user. In this case, the following command starts the GUI installer so that you can install all three Windows execution hosts simultaneously.

master:/sge62u2 # ./start_gui_installer -connect_user=Administrator -connect_mode=windows

Every remote connection to host win-01 is made as the WIN-01+Administrator user.
Every remote connection to host win-02 is made as the WIN-02+Administrator user.
Every remote connection to host win_vista-01 is made as the WIN_VISTA-01+Administrator user.
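
The mapping from host name to connect user can be sketched as follows. This is only an illustration: windows_connect_user is a hypothetical helper, and it assumes that the domain is the uppercased host name, which matches the three example hosts but is not stated as a general rule. How the installer actually determines the host domain is not described here.

```shell
#!/bin/sh
# Hypothetical helper: prints the connect user in the DOMAIN+user form.
# Assumption: the domain is the host name converted to uppercase.
windows_connect_user() {
    host=$1
    user=$2
    domain=$(printf '%s' "$host" | tr '[:lower:]' '[:upper:]')
    printf '%s+%s\n' "$domain" "$user"
}

# Print the connect user for each of the three example hosts.
for h in win-01 win-02 win_vista-01; do
    windows_connect_user "$h" Administrator
done
```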


Troubleshooting the GUI Installer

This section lists known issues and their workarounds, along with answers to some frequently asked questions.

FAQs

  1. I cannot start the installer. It throws an exception!
    This is most likely a general problem with running GUI applications in your current environment. You are probably starting the installer on a remote host and either did not export the DISPLAY variable properly or did not allow remote GUI applications to be displayed on the target system (where the GUI should appear).
    1. DISPLAY variable is not set.
      If your DISPLAY variable is not set and you are not working locally on the system, you will see a message similar to the following:
      hostA# ./start_gui_installer 
      Starting Installer ...
      java.awt.HeadlessException: 
      No X11 DISPLAY variable was set, but this program performed an operation which requires it.
              at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:159)
              at java.awt.Window.<init>(Window.java:317)
              at java.awt.Frame.<init>(Frame.java:419)
              at java.awt.Frame.<init>(Frame.java:384)
              at javax.swing.JFrame.<init>(JFrame.java:150)
              at com.izforge.izpack.installer.GUIInstaller.loadLangPack(Unknown Source)
              at com.izforge.izpack.installer.GUIInstaller.access$000(Unknown Source)
              at com.izforge.izpack.installer.GUIInstaller$1.run(Unknown Source)
              at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:199)
              at java.awt.EventQueue.dispatchEvent(EventQueue.java:461)
              at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:242)
              at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:163)
              at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:157)
              at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:149)
              at java.awt.EventDispatchThread.run(EventDispatchThread.java:110)
      java.lang.NullPointerException
              at com.izforge.izpack.installer.GUIInstaller.loadGUI(Unknown Source)
              at com.izforge.izpack.installer.GUIInstaller.access$100(Unknown Source)
              at com.izforge.izpack.installer.GUIInstaller$2.run(Unknown Source)
              at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:209)
              at java.awt.EventQueue.dispatchEvent(EventQueue.java:461)
              at java.awt.EventDispatchThread.pumpOneEventForHierarchy(EventDispatchThread.java:242)
              at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:163)
              at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:157)
              at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:149)
              at java.awt.EventDispatchThread.run(EventDispatchThread.java:110)
      

      If you start the installer on hostA but want to display it on hostB, you need to set the DISPLAY variable accordingly. If your graphical session on hostB uses display number 22, type the following command as the user who will start the installer:

      hostA# DISPLAY=hostB:22 ; export DISPLAY
      

      See the next item to finish the setup.

    2. Remote host does not allow remote GUI applications.
      In this case, you will see a message similar to the following:
      hostA# ./start_gui_installer 
      Starting Installer ...
      Xlib: connection to "hostB:22" refused by server
      Xlib: No protocol specified
      
      Exception in thread "main" java.lang.InternalError: Can't connect to X11 window server using 'hostB:22' as the value of the DISPLAY variable.
              at sun.awt.X11GraphicsEnvironment.initDisplay(Native Method)
              at sun.awt.X11GraphicsEnvironment.access$000(X11GraphicsEnvironment.java:53)
              at sun.awt.X11GraphicsEnvironment$1.run(X11GraphicsEnvironment.java:142)
              at java.security.AccessController.doPrivileged(Native Method)
              at sun.awt.X11GraphicsEnvironment.<clinit>(X11GraphicsEnvironment.java:131)
              at java.lang.Class.forName0(Native Method)
              at java.lang.Class.forName(Class.java:164)
              at java.awt.GraphicsEnvironment.getLocalGraphicsEnvironment(GraphicsEnvironment.java:68)
              at sun.awt.motif.MToolkit.<clinit>(MToolkit.java:93)
              at java.lang.Class.forName0(Native Method)
              at java.lang.Class.forName(Class.java:164)
              at java.awt.Toolkit$2.run(Toolkit.java:821)
              at java.security.AccessController.doPrivileged(Native Method)
              at java.awt.Toolkit.getDefaultToolkit(Toolkit.java:804)
              at javax.swing.UIManager.initialize(UIManager.java:1262)
              at javax.swing.UIManager.maybeInitialize(UIManager.java:1245)
              at javax.swing.UIManager.getDefaults(UIManager.java:556)
              at javax.swing.UIManager.put(UIManager.java:841)
              at com.izforge.izpack.installer.GUIInstaller.loadLookAndFeel(Unknown Source)
              at com.izforge.izpack.installer.GUIInstaller.<init>(Unknown Source)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:494)
              at java.lang.Class.newInstance0(Class.java:350)
              at java.lang.Class.newInstance(Class.java:303)
              at com.izforge.izpack.installer.Installer.main(Unknown Source)
      

      You have to explicitly allow remote GUI connections from hostA. Type the following command as the user running the graphical session on hostB:

      hostB# xhost +hostA
      

      Now you can run start_gui_installer, and the Welcome screen should be displayed on the remote host.

  2. How can I remove a host from the host selection that I previously added?
    Right-click the host and select the Remove selected action from the pop-up menu.

  3. Can I save hosts that I selected in the host selection to a file?
    Yes. Select multiple hosts using CTRL + left-click, then right-click. A pop-up menu appears that lets you save all hosts in the current tab or only the selected hosts.
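
For the first question above (the installer throws an exception), a small pre-flight check can catch a missing DISPLAY variable before the installer is started. The following is a minimal sketch. check_display is a hypothetical helper, not part of the Grid Engine distribution, and it only tests that the variable is set. It cannot tell whether the X server on the target host accepts the connection, so the xhost step described above still applies.

```shell
#!/bin/sh
# Hypothetical pre-flight check: reports whether DISPLAY is set
# before ./start_gui_installer is started.
check_display() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set. Set it first, for example: DISPLAY=hostB:22 ; export DISPLAY" >&2
        return 1
    fi
    echo "DISPLAY is set to ${DISPLAY}"
    return 0
}

# Typical use: start the installer only when the check passes.
# check_display && ./start_gui_installer
```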

Known issues and workarounds

  1. Installing BDB server always fails with a timeout state.
    Unfortunately, you cannot use the SGE 6.2u2 GUI installer to install a BDB server on any platform other than the Solaris OS. As a workaround, use the CLI installation (inst_sge -db) to install the BDB server locally. You can then use the GUI installer to install the qmaster and any number of shadow and execution hosts, provided that password-less access is configured. See issue 2941 for more information.
  2. Installing BDB server fails. Log messages suggest that you start the installation on a local host.
    BDB server installation is not possible when ignore hostname resolving is unchecked in the custom installation mode. See issue 2944 for more information.
  3. Install button is disabled, but I already have some reachable hosts.
    This problem might happen occasionally. Right-click any single host in the host selection table and select the refresh selected action. The progress bar should appear, and the Install button should then be enabled. See issue 2945 for more information.

Installing the Software From the Command Line

Note
  • The instructions in this section assume that you are installing the software on a computer running the Solaris Operating System. Differences in functionality on other operating system architectures that the Grid Engine software runs on are documented in files whose names start with the string arc_depend in the $SGE_ROOT/doc directory. The remainder of the file name indicates the operating system architectures to which the comments in the file apply, as in the arc_depend_irix.asc file.
  • Also note that there are several prerequisites that you must satisfy for Windows systems before you can install Grid Engine. See Microsoft Services For UNIX and Microsoft Subsystem for UNIX-based Applications for further details.
  • This section does not cover the upgrade process or the installation of the Accounting and Reporting Module, ARCo. For information about upgrading, see Upgrading From a Previous Release of the Software. For information about installing ARCo, see Installing the Accounting and Reporting Console.

Installation Overview

Note
The instructions in this section are for a new Grid Engine system only. For instructions on how to install a new system with additional security protection, see Installing the Increased Security Features. For instructions on how to upgrade an existing installation of an earlier version of the Grid Engine software, see Upgrading From a Previous Release of the Software.

Full installation includes the following tasks:

  • Running an installation script on the master host and on every execution host in the Grid Engine system
  • Registering information about administration hosts and submit hosts

Performing an Installation

The following sections describe how to install all the components of the Grid Engine system, including the master, execution, administration, and submit hosts. If you need to install the system with enhanced security, see Installing the Increased Security Features before you continue the installation. If you want to install the Grid Engine SMF services, see Installing the SMF Services before you start the installation.

Topic Description
How to Install the Master Host (Example Master Host Installation) Procedure for installing the master host.
How to Install Execution Hosts (Example Execution Host Installation) Procedure for installing the execution host.
How to Register Administration Hosts Procedure for registering an administration host.
How to Register Submit Hosts Procedure for registering a submit host.
How to Install the Berkeley DB Spooling Server (Example Berkeley DB Spooling Server Installation) Procedure for installing the necessary software for Berkeley DB spooling.

How to Install the Master Host

The master host installation procedure creates the appropriate directory hierarchy that the master daemon requires and starts the Grid Engine master daemon sge_qmaster on the master host. The master host is also registered as a host with administrative and submit permission. The installation procedure creates a default configuration for the system on which it is run. The installation script queries the system for the type of operating system. The script then makes meaningful settings based on this information.

If, at any time during the installation, you think something went wrong, you can quit the installation procedure and restart it.

Before You Begin
Steps
  1. Log in to the master host as root.

  2. If the $SGE_ROOT environment variable is not set, set it by typing the following command. The installation directory must contain all SGE files, such as the SGE binaries.
    # SGE_ROOT=<path_to_installation_directory>; export SGE_ROOT
    

    To confirm that you have set the $SGE_ROOT environment variable, type:

    # echo $SGE_ROOT
    
  3. Go to the installation directory.
    • If the directory where the installation files reside is visible from the master host, change directory (cd) to the installation directory sge-root, and then proceed to the next step.
    • If the directory is not visible and cannot be made visible, do the following:
      • Create a local installation directory, sge-root, on the master host.
      • Copy the installation files to the local installation directory sge-root across the network, for example, by using ftp or rcp.
      • Change directory (cd) to the local sge-root directory.

  4. Type the install_qmaster command, adding the -csp flag if you are installing using the Certificate Security Protocol method described in Installing the Increased Security Features.
    This command starts the master host installation procedure. You are asked several questions, and you might be required to run some administrative actions.
    For a complete installation example, see Example Master Host Installation.
    # ./install_qmaster
    Welcome to the Grid Engine installation
    ---------------------------------------
    
    Grid Engine qmaster host installation
    -------------------------------------
    
    .
    .
    .
    
    The qmaster installation procedure will take approximately 5-10 minutes.
    
    Hit <RETURN> to continue >> 
    
  5. Choose an administrative account owner.
    See Step 5 in the Example Master Host Installation.

  6. Verify the $SGE_ROOT directory setting.
    In the example shown in Step 6 of the Example Master Host Installation, the value of $SGE_ROOT is /opt/sge62.

  7. Set up the TCP/IP services for the Grid Engine software.
    See Step 7 in the Example Master Host Installation.
    If TCP/IP services have not been configured, you will be notified. To configure TCP/IP services:
    1. Start a new terminal session or window to add the information to the /etc/services file or your NIS maps.
    2. Add the correct ports to the /etc/services file or your NIS services map, as described in Network Services.
      The following example adds entries for both sge_qmaster and sge_execd to your /etc/services file.
      ...
      sge_qmaster     6444/tcp
      sge_execd       6445/tcp
      
    3. Save your changes and return to the window where the installation script is running.

  8. Type the name of your cell or accept the default cell name.
    See Step 8 in the Example Master Host Installation.
    The use of Grid Engine system cells is described in Cells.
    • If you have decided to use cells, type the cell name now.
    • If you have decided not to use cells, press the Return key.

  9. Set up a unique cluster name.
    See Step 9 in the Example Master Host Installation.
    For more information, see Cluster Name.
    • To accept the default cluster name, press the Return key.
    • To enter a new cluster name, type the cluster name and press the Return key.

  10. Specify a spool directory.
    See Step 10 in the Example Master Host Installation.
    For guidelines on disk space requirements for the spool directory, see Disk Space Requirements. For information on where the spool directory is installed, see Spool Directories Under the Root Directory.
    • To accept the default spool directory, press the Return key.
    • If you want to use a different spool directory, then answer y to the prompt and provide a complete path name to the directory.

  11. Specify whether you plan to use Windows-based execution hosts.
    See Step 11 in the Example Master Host Installation.
    • If you do not plan to use Windows support, answer No.
    • If you want Windows support, answer Yes. You will be asked some Windows-specific questions later in the installation process. These questions will be marked as WINDOWS-ONLY.

  12. Verify or set the correct file permissions.
    See Step 12 in the Example Master Host Installation.
    • If you used pkgadd or you know that the file permissions are correct, answer y to accept the current permissions.
    • Answer n if you need to verify or change the file permissions.
    • WINDOWS-ONLY – If you specified that you want Windows execution host support in the previous step, let the script set the file permissions for you.

  13. Specify whether all Grid Engine hosts for this cluster are located in a single DNS domain.
    See Step 13 in the Example Master Host Installation.
    • If all of your Grid Engine system hosts are located in a single DNS domain, answer y. Grid Engine then ignores the domain name when comparing host names, so hostA and hostA.foo.com are treated as equal.
    • If your Grid Engine system hosts are not all located in a single DNS domain, answer n. You are then asked to configure a default domain, which is used when a host is specified without domain information.

  14. Watch while Grid Engine creates directories according to the information that you provided so far.
    See Step 14 in the Example Master Host Installation.

  15. Specify whether you want to use classic spooling or Berkeley DB.
    See Step 15 in the Example Master Host Installation. By default, Grid Engine uses Berkeley Database spooling.
    For more information on how to determine the type of spooling mechanism you want, please see Choosing Between Classic Spooling and Database Spooling.
    • If you choose Berkeley DB spooling, you are asked to choose whether to use a local directory or a Berkeley DB Spooling Server.
      Tip
      To use a shadow master host for increased availability of the database, use the Berkeley DB Spooling Server.
      • To use a Berkeley DB spooling server, enter y. To install the Berkeley DB Spooling Server:
        1. Start a new terminal session or window and install the software, as described in How to Install the Berkeley DB Spooling Server.
        2. After you have installed the software on the spooling server, return to the master installation window, and press the Return key.
        3. Type the name of the spooling server.
          In Step 15 of the Example Master Host Installation, vector is the host name of the spooling server.
        4. Type the name of the spooling directory.
          In Step 15 of the Example Master Host Installation, /opt/sge62/default/spool/spooldb is the spooling directory.
      • If you do not want to use a Berkeley DB spooling server, type n. You are asked to provide the complete path to the database directory. If the directory does not exist, it is created.
    • To specify classic spooling, type classic.

  16. Type a range of IDs that will be assigned dynamically for jobs.
    See Step 16 in the Example Master Host Installation.
    For more information, see Group IDs.

  17. Verify the spooling directory for the execution daemon.
    See Step 17 in the Example Master Host Installation.
    The Grid Engine administrator must have access to create and write into this directory. For information on spooling, see Spool Directories Under the Root Directory.

  18. Type the email address of the user who should receive problem reports.
    See Step 18 in the Example Master Host Installation. In the example, the user who will receive problem reports is me@my.domain.

  19. Verify the configuration parameters.
    See Step 19 in the Example Master Host Installation.
    • If configuration parameters are correct, Grid Engine proceeds to create the local configuration.
    • If configuration parameters are not correct, type y to change them.

  20. Specify whether you want the daemons to start when the system is booted.
    See Step 20 in the Example Master Host Installation.

  21. WINDOWS-ONLY – If you specified that you want Windows support, you are asked to create Certificate Security Protocol (CSP) certificates.
    Even if the system is not running in CSP mode, it is necessary to create certain CSP certificates for Windows support. These certificates are automatically generated during the master host installation. For instructions on how to transfer these certificates to the Windows execution hosts, see Step 6 of How to Install a CSP-Secured System.

  22. WINDOWS-ONLY – Add the Windows Administrator name to the Grid Engine manager list.

  23. Identify the hosts that you will later install as execution hosts.
    See Step 23 in the Example Master Host Installation.
    Tip
    You can list hosts individually, separated by a blank space, or you can supply a file that contains host names.
    Note
    You can use the master host for executing jobs. To do so, you must carry out the execution host installation for the master machine. However, if you use a very slow machine as the master host, or if your cluster is very large, do not use the master host as an execution host.
  24. Select a scheduler profile.
    See Step 24 in the Example Master Host Installation.
    For information on how to determine which profile you should use, see Scheduler Profiles.
    Once you answer this question, the installation process is complete. Several screens of information will be displayed before the script exits.

  25. WINDOWS-ONLY – Copy the certificate files to the Windows execution hosts.
    You can use a script to perform this function.
    Tip
    To use this functionality without being asked for a password, the root user must be able to access the execution hosts through rsh or ssh without entering a password.
  26. Create the environment variables ($SGE_ROOT and $SGE_CELL) for use with the Grid Engine software.
    See Step 26 in the Example Master Host Installation.
    Note
    If no cell name was specified during installation, the value of $SGE_CELL is default.
    • If you are using a C shell, type the following command:
      % source $SGE_ROOT/$SGE_CELL/common/settings.csh
      
    • If you are using a Bourne shell or Korn shell, type the following command:
      $ . $SGE_ROOT/$SGE_CELL/common/settings.sh
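
Two of the steps above, setting $SGE_ROOT (step 2) and adding the service entries (step 7), can be checked before you run install_qmaster. The following is a minimal sketch under stated assumptions: check_sge_root and check_services are hypothetical helpers, not part of the Grid Engine distribution. The first only tests that the directory contains the install_qmaster script, and the second only inspects a local services file, so entries that are provided through NIS maps are not detected.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of the Grid Engine distribution.

# Succeeds if the given directory exists and contains install_qmaster.
check_sge_root() {
    dir=$1
    [ -n "$dir" ] && [ -d "$dir" ] && [ -f "$dir/install_qmaster" ]
}

# Succeeds if the given services file has tcp entries for both daemons.
check_services() {
    file=$1
    for svc in sge_qmaster sge_execd; do
        grep -Eq "^${svc}[[:space:]]+[0-9]+/tcp" "$file" || return 1
    done
    return 0
}

# Example use on the master host:
# check_sge_root "$SGE_ROOT"   || echo "SGE_ROOT is unset or incomplete"
# check_services /etc/services || echo "add sge_qmaster and sge_execd entries"
```

If the communication ports are set through the shell environment instead of a network service, the second check does not apply.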
      
See Also
For details about how you can verify that the master host has been set up correctly, see How to Verify That the Daemons Are Running on the Master Host.

Example Master Host Installation

The following example shows a complete Sun Grid Engine master host installation. Remember that this is only one step in the entire Sun Grid Engine installation process. The steps in this example correspond to the steps in How to Install the Master Host.


Steps 1-4

001   % su -
002   # cd sge-install-dir
003   # ./install_qmaster
004   Grid Engine License is displayed. 
005
006   Do you agree with that license? (y/n) [n] >>
007
008   Welcome to the Grid Engine installation
009   ---------------------------------------
010   
011   Grid Engine qmaster host installation
012   -------------------------------------
013   
014   Before you continue with the installation please read these hints:
015   
016      - Your terminal window should have a size of at least
017        80x24 characters
018   
019      - The INTR character is often bound to the key Ctrl-C.
020        The term >Ctrl-C< is used during the installation if you
021        have the possibility to abort the installation
022   
023   The qmaster installation procedure will take approximately 5-10 minutes.
024   
025   Hit <RETURN> to continue >> 
026   

Step 5

027   Grid Engine admin user account
028   ------------------------------
029
030   The current directory
031
032   /opt/sge62
033
034   is owned by user
035
036   myusername
037
038   If user >root< does not have write permissions in this directory on *all*
039   of the machines where Grid Engine will be installed (NFS partitions not
040   exported for user >root< with read/write permissions) it is recommended to
041   install Grid Engine that all spool files will be created under the user id
042   of user >myusername<.
043
044   IMPORTANT NOTE: The daemons still have to be started by user >root<.
045
046   Do you want to install Grid Engine as admin user >myusername< (y/n) [y] >> 
047
048   Installing Grid Engine as admin user >myusername<
049
050   Hit <RETURN> to continue >> 
051   Choosing Grid Engine admin user account
052   ---------------------------------------
053   
054   You may install Grid Engine that all files are created with the user id of an
055   unprivileged user.
056   
057   This will make it possible to install and run Grid Engine in directories
058   where user >root< has no permissions to create and write files and directories.
059   
060      - Grid Engine still has to be started by user >root<
061   
062      - This directory should be owned by the Grid Engine administrator
063   
064   Do you want to install Grid Engine
065   under an user id other than >root< (y/n) [y] >> y
066   
067   Choosing a Grid Engine admin user name
068   --------------------------------------
069   
070   Please enter a valid user name >> sgeadmin
071   
072   Installing Grid Engine as admin user >sgeadmin<
073   
074   Hit <RETURN> to continue >>
075

Step 6

  
076   Checking $SGE_ROOT directory
077   ----------------------------
078   
079   The Grid Engine root directory is:
080   
081      $SGE_ROOT = /opt/sge62
082   
083   If this directory is not correct (e.g. it may contain an automounter
084   prefix) enter the correct path to this directory or hit <RETURN>
085   to use default [/opt/sge62] >> 
086   
087   Your $SGE_ROOT directory: /opt/sge62
088   
089   Hit <RETURN> to continue >> 
090  

Step 7
Two actions – one for qmaster, one for execd

091   Grid Engine TCP/IP communication service
092   ----------------------------------------
093
094   The port for sge_qmaster is currently set by the shell environment.
095
096      SGE_QMASTER_PORT = 10500
097
098   Now you have the possibility to set/change the communication ports by using the
099   >shell environment< or you may configure it via a network service, configured
100   in local >/etc/services<, >NIS< or >NIS+<, adding an entry in the form
101
102      sge_qmaster <port_number>/tcp
103
104   to your services database and make sure to use an unused port number.
105
106   How do you want to configure the Grid Engine communication ports?
107
108    Using the >shell environment<:                           [1]
109
110    Using a network service like >/etc/services<, >NIS/NIS+<: [2]
111
112   (default: 1) >> 1 
113
114    Using the environment variable
115
116      $SGE_QMASTER_PORT=10500
117
118   as port for communication.
119
120   Hit <RETURN> to continue >> 
121
122   Grid Engine TCP/IP communication service
123   ----------------------------------------
124
125   The port for sge_execd is currently set by the shell environment.
126
127      SGE_EXECD_PORT = 10501
128
129   Now you have the possibility to set/change the communication ports by using the
130   >shell environment< or you may configure it via a network service, configured
131   in local >/etc/services<, >NIS< or >NIS+<, adding an entry in the form
132
133      sge_execd <port_number>/tcp
134
135   to your services database and make sure to use an unused port number.
136
137   How do you want to configure the Grid Engine communication ports?
138
139   Using the >shell environment<:                           [1]
140
141   Using a network service like >/etc/services<, >NIS/NIS+<: [2]
142
143   (default: 1) >> 1
144
145   Using the environment variable
146
147      $SGE_EXECD_PORT=10501
148
149   as port for communication.
150
151   Hit <RETURN> to continue >> 
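
In the transcript above, both ports are taken from the shell environment. A minimal sketch of how the environment might be prepared before install_qmaster is started follows. The port numbers are the ones from the example, ports_ok is a hypothetical helper, and the requirement that the two ports differ is an assumption based on the two daemons being separate services.

```shell
#!/bin/sh
# Port numbers taken from the example transcript; choose unused ports
# on your own systems.
SGE_QMASTER_PORT=10500
SGE_EXECD_PORT=10501
export SGE_QMASTER_PORT SGE_EXECD_PORT

# Hypothetical sanity check: both values are numeric and they differ.
ports_ok() {
    case "$1$2" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" -ne "$2" ]
}

ports_ok "$SGE_QMASTER_PORT" "$SGE_EXECD_PORT" && echo "ports look usable"
```

The check does not verify that the ports are actually unused on the host.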

Step 8

 
152   Grid Engine cells
153   -----------------
154   
155   Grid Engine supports multiple cells.
156  
157   If you are not planning to run multiple Grid Engine clusters or if you don't
158   know yet what is a Grid Engine cell it is safe to keep the default cell name
159   
160      default
161   
162   If you want to install multiple cells you can enter a cell name now.
163
164   The environment variable
165   
166      $SGE_CELL=<your_cell_name>
167   
168   will be set for all further Grid Engine commands.
169   
170   Enter cell name [default] >> 
171   
172   Using cell >default<. 
173   Hit <RETURN> to continue >> 
174  

Step 9

175   Unique cluster name
176   -------------------
177
178   The cluster name uniquely identifies a specific Sun Grid Engine cluster.
179   The cluster name must be unique throughout your organization. The name 
180   is not related to the SGE cell.
181 
182   The cluster name must start with a letter ([A-Za-z]), followed by letters, 
183   digits ([0-9]), dashes (-) or underscores (_).
184
185   Enter new cluster name or hit <RETURN>
186   to use default [p10500] >> 
187
188   Your $SGE_CLUSTER_NAME: p10500
189
190   Hit <RETURN> to continue >> 
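
The naming rule quoted in the transcript (a leading letter, followed by letters, digits, dashes, or underscores) can be checked in advance. The following is a minimal sketch; valid_cluster_name is a hypothetical helper, and the installer may apply additional checks that are not shown here.

```shell
#!/bin/sh
# Hypothetical validator for the cluster naming rule.
valid_cluster_name() {
    # The name must start with a letter.
    case "$1" in
        [A-Za-z]*) ;;
        *) return 1 ;;
    esac
    # Only letters, digits, dashes and underscores are allowed.
    case "$1" in
        *[!A-Za-z0-9_-]*) return 1 ;;
    esac
    return 0
}

for name in p10500 my_cluster-2 2fast "my cluster"; do
    if valid_cluster_name "$name"; then
        echo "valid:   $name"
    else
        echo "invalid: $name"
    fi
done
```

The default name in the example, p10500, passes the check. It appears to be derived from the qmaster port 10500 used in Step 7, although the transcript does not state this.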

Step 10

 
191   Grid Engine qmaster spool directory
192   -----------------------------------
193   
194   The qmaster spool directory is the place where the qmaster daemon stores
195   the configuration and the state of the queuing system.
196   
197   The admin user >myusername< must have read/write access
198   to the qmaster spool directory.
199   
200   If you will install shadow master hosts or if you want to be able to start
201   the qmaster daemon on other hosts (see the corresponding section in the
202   Grid Engine Installation and Administration Manual for details) the account
203   on the shadow master hosts also needs read/write access to this directory.
204   
205   The following directory
206   
207   [/opt/sge62/default/spool/qmaster]
208   
209   will be used as qmaster spool directory by default!
210   
211   Do you want to select another qmaster spool directory (y/n) [n] >> 
212  

Step 11

 
213   Windows Execution Host Support
214   ------------------------------
215                                                                                   
216   Are you going to install Windows Execution Hosts? (y/n) [n]
217  

Step 12

 
218   Verifying and setting file permissions
219   --------------------------------------
220   
221   Did you install this version with >pkgadd< or did you already
222   verify and set the file permissions of your distribution (y/n) [y] >> 
223
224   Verifying and setting file permissions
225   --------------------------------------
226
227   We may now verify and set the file permissions of your Grid Engine
228   distribution.
229
230   This may be useful since due to unpacking and copying of your distribution
231   your files may be unaccessible to other users.
232
233   We will set the permissions of directories and binaries to
234
235      755 - that means executable are accessible for the world
236
237   and for ordinary files to
238
239      644 - that means readable for the world
240
241   Do you want to verify and set your file permissions (y/n) [y] >>
242
243   Verifying and setting file permissions and owner in >3rd_party<
244   Verifying and setting file permissions and owner in >bin<
245   Verifying and setting file permissions and owner in >ckpt<
246   Verifying and setting file permissions and owner in >examples<
247   Verifying and setting file permissions and owner in >inst_sge<
248   Verifying and setting file permissions and owner in >install_execd<
249   Verifying and setting file permissions and owner in >install_qmaster<
250   Verifying and setting file permissions and owner in >lib<
251   Verifying and setting file permissions and owner in >mpi<
252   Verifying and setting file permissions and owner in >pvm<
253   Verifying and setting file permissions and owner in >qmon<
254   Verifying and setting file permissions and owner in >util<
255   Verifying and setting file permissions and owner in >utilbin<
256   Verifying and setting file permissions and owner in >catman<
257   Verifying and setting file permissions and owner in >doc<
258   Verifying and setting file permissions and owner in >include<
259   Verifying and setting file permissions and owner in >man<
260
261   Your file permissions were set
262
263   Hit <RETURN> to continue >>
264   
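
The permission scheme described in this step can be pictured with a short shell sketch. It is not the installer's own code, only an illustration under one stated assumption: a regular file that is already executable by its owner is treated as a binary and gets 755, and every other regular file gets 644. The function name set_perms is hypothetical, and the demonstration runs on a scratch directory rather than on a real distribution.

```shell
#!/bin/sh
# Illustration only: apply the 755/644 scheme to a directory tree.
# This is not the installer's implementation.

# set_perms: directories and owner-executable files get 755,
# all other regular files get 644.
set_perms() {
    find "$1" -type d -exec chmod 755 {} +
    find "$1" -type f -perm -100 -exec chmod 755 {} +
    find "$1" -type f ! -perm -100 -exec chmod 644 {} +
}

# Demonstrate on a scratch directory.
demo=$(mktemp -d)
printf '#!/bin/sh\n' > "$demo/tool";  chmod 700 "$demo/tool"
printf 'notes\n'     > "$demo/README"; chmod 600 "$demo/README"
set_perms "$demo"
ls -l "$demo"
rm -rf "$demo"
```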

Step 13

265   Select default Grid Engine hostname resolving method
266   ----------------------------------------------------
267   
268   Are all hosts of your cluster in one DNS domain? If this is
269   the case the hostnames
270   
271      >hostA< and >hostA.foo.com<
272   
273   would be treated as equal, because the DNS domain name >foo.com<
274   is ignored when comparing hostnames.
275   
276   Are all hosts of your cluster in a single DNS domain (y/n) [y] >>   
277   
278   Ignoring domainname when comparing hostnames.
279   
280   Hit <RETURN> to continue >> 
281   
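
The effect of answering y at this prompt can be sketched in shell. This is only an illustration of comparing two hostnames after dropping everything from the first dot onward; it is not how Grid Engine itself resolves names, and the function names short_name and same_host are hypothetical.

```shell
#!/bin/sh
# Illustration only: compare hostnames while ignoring the DNS domain part.

# short_name: print the hostname up to (not including) the first dot.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# same_host: succeed if both names match once the domain is dropped.
same_host() {
    [ "$(short_name "$1")" = "$(short_name "$2")" ]
}

if same_host hostA hostA.foo.com; then
    echo "hostA and hostA.foo.com are treated as equal"
fi
```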

Step 14

 
282   Making directories
283   ------------------
284   
285   creating directory: /opt/sge62/default/spool/qmaster
286   creating directory: /opt/sge62/default/spool/qmaster/job_scripts
287   Hit <RETURN> to continue >> 
288  

Step 15

 
289   Setup spooling
290   --------------
291   Your SGE binaries are compiled to link the spooling libraries
292   during runtime (dynamically). So you can choose between Berkeley DB 
293   spooling and Classic spooling method. 
294   Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> 
295   
296   The Berkeley DB spooling method provides two configurations!
297   
298   1) Local spooling:
299   The Berkeley DB spools into a local directory on this host (qmaster host)
300   This setup is faster, but you can't setup a shadow master host
301   
302   2) Berkeley DB Spooling Server:
303   If you want to setup a shadow master host, you need to use
304   Berkeley DB Spooling Server!
305   In this case you have to choose a host with a configured RPC service.
306   The qmaster host connects via RPC to the Berkeley DB. This setup is more
307   failsafe, but results in a clear potential security hole. RPC communication
308   (as used by Berkeley DB) can be easily compromised. Please only use this
309   alternative if your site is secure or if you are not concerned about
310   security. Check the installation guide for further advice on how to achieve
311   failsafety without compromising security.
312   
313   Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> y
314   
315   Berkeley DB Setup
316   
317   -----------------
318   Please, log in to your Berkeley DB spooling host and execute "inst_sge -db"
319   Please do not continue, before the Berkeley DB installation with
320   "inst_sge -db" is completed, continue with <RETURN>
321   
322   Berkeley Database spooling parameters
323   -------------------------------------
324   
325   Please enter the name of your Berkeley DB Spooling Server! >> vector
326   
327
328   Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> 
329
330   Hit <RETURN> to continue >> 
331
332   Berkeley Database spooling parameters
333   -------------------------------------
334
335   Please enter the Database Directory now, even if you want to spool locally,
336   it is necessary to enter this Database Directory. 
337
338   Default: [/opt/sge62/default/spool/spooldb] >> /tmp/dom/spooldb
339
340   Dumping bootstrapping information
341   Initializing spooling database
342
343   Hit <RETURN> to continue >> 

Step 16

 
344   Grid Engine group id range
345   --------------------------
346   
347   When jobs are started under the control of Grid Engine an additional group id
348   is set on platforms which do not support jobs. This is done to provide maximum
349   control for Grid Engine jobs.
350   
351   This additional UNIX group id range must be unused group id's in your system.
352   Each job will be assigned a unique id during the time it is running.
353   Therefore you need to provide a range of id's which will be assigned
354   dynamically for jobs.
355   
356   The range must be big enough to provide enough numbers for the maximum number
357   of Grid Engine jobs running at a single moment on a single host. E.g. a range
358   like >20000-20100< means, that Grid Engine will use the group ids from
359   20000-20100 and provides a range for 100 Grid Engine jobs at the same time
360   on a single host.
361   
362   You can change at any time the group id range in your cluster configuration.
363   
364   Please enter a range >> 20000-20100
365   
366   Using >20000-20100< as gid range. Hit <RETURN> to continue >> 
367  
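
Before typing a range at this prompt, it can help to check that the range is well formed and that none of its ids is already taken. The following is a minimal sketch, assuming the group database is available in /etc/group format; the function names are hypothetical. Note that gid_range_size counts both endpoints, so it reports 101 for the example range.

```shell
#!/bin/sh
# Illustration only: sanity-check a group id range such as 20000-20100.

# valid_gid_range: succeed if the argument looks like LOW-HIGH with LOW <= HIGH.
valid_gid_range() {
    case "$1" in
        *[!0-9-]*|-*|*-|*-*-*) return 1 ;;   # stray characters or misplaced dashes
        *-*) ;;                               # exactly one dash between digits
        *) return 1 ;;                        # no dash at all
    esac
    low=${1%-*}
    high=${1#*-}
    [ "$low" -le "$high" ]
}

# gid_range_size: print how many group ids the inclusive range contains.
gid_range_size() {
    low=${1%-*}
    high=${1#*-}
    echo $(( high - low + 1 ))
}

# gids_in_use: read /etc/group style lines on standard input and print
# the group ids that fall inside the range.
gids_in_use() {
    low=${1%-*}
    high=${1#*-}
    awk -F: -v lo="$low" -v hi="$high" '$3 >= lo && $3 <= hi { print $3 }'
}

range=20000-20100
if valid_gid_range "$range"; then
    echo "$range holds $(gid_range_size "$range") group ids"
fi
# On a real host, something like this lists clashes:
#   gids_in_use "$range" < /etc/group
```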

Step 17

 
368   Grid Engine cluster configuration
369   ---------------------------------
370   
371   Please give the basic configuration parameters of your Grid Engine
372   installation:
373   
374      <execd_spool_dir>
375   
376   The pathname of the spool directory of the execution hosts. User >myusername<
377   must have the right to create this directory and to write into it.
378   
379   Default: [/opt/sge62/default/spool] >>  
380  

Step 18

 
381   Grid Engine cluster configuration (continued)
382   ---------------------------------------------
383   <administator_mail>
384   
385   The email address of the administrator to whom problem reports are sent.
386   
387   It is recommended to configure this parameter. You may use >none<
388   if you do not wish to receive administrator mail.
389   
390   Please enter an email address in the form >user@foo.com<.
391   
392   Default: [none] >> me@my.domain
393  

Step 19

 
394   The following parameters for the cluster configuration were configured:
395   
396      execd_spool_dir        /opt/sge62/default/spool
397      administrator_mail     me@my.domain
398   
399   Do you want to change the configuration parameters (y/n) [n] >> n
400   
401   Creating local configuration
402   ----------------------------
403   Creating >act_qmaster< file
404   Adding default complex attributes
405   Adding SGE default usersets
406   Adding >sge_aliases< path aliases file
407   Adding >qtask< qtcsh sample default request file
408   Adding >sge_request< default submit options file
409   Creating >sgemaster< script
410   Creating >sgeexecd< script
411   Creating settings files for >.profile/.cshrc<
412   
413   Hit <RETURN> to continue >> 
414  

Step 20

415   qmaster startup script
416   ----------------------
417
418   Do you want to start qmaster automatically at machine boot?
419   NOTE: If you select "n" SMF will be not used at all! (y/n) [y] >> 
420
421
422   Hit <RETURN> to continue >> 
423 
424   Grid Engine qmaster startup
425   ---------------------------
426
427   Starting qmaster daemon. Please wait ...
428   Hit <RETURN> to continue >> 

Step 23

 
429   Adding Grid Engine hosts
430   ------------------------
431   
432   Please now add the list of hosts, where you will later install your execution
433   daemons. These hosts will be also added as valid submit hosts.
434   
435   Please enter a blank separated list of your execution hosts. You may
436   press <RETURN> if the line is getting too long. Once you are finished
437   simply press <RETURN> without entering a name.
438
439   You also may prepare a file with the hostnames of the machines where you plan
440   to install Grid Engine. This may be convenient if you are installing Grid
441   Engine on many hosts.
442   
443   Do you want to use a file which contains the list of hosts (y/n) [n] >> n
444   
445   Adding admin and submit hosts
446   -----------------------------
447   
448   Please enter a blank seperated list of hosts.
449   
450   Stop by entering <RETURN>. You may repeat this step until you are
451   entering an empty list. You will see messages from Grid Engine
452   when the hosts are added.
453   
454   Host(s): host1 host2 host3 host4
455   
456   host1 added to administrative host list
457   host1 added to submit host list
458   host2 added to administrative host list
459   host2 added to submit host list
460   host3 added to administrative host list
461   host3 added to submit host list
462   host4 added to administrative host list
463   host4 added to submit host list
464   Hit <RETURN> to continue >> 
465
466   Adding admin and submit hosts
467   -----------------------------
468
469   Please enter a blank seperated list of hosts.
470
471   Stop by entering <RETURN>. You may repeat this step until you are
472   entering an empty list. You will see messages from Grid Engine
473   when the hosts are added.
474
475   Host(s): 
476   Finished adding hosts. Hit <RETURN> to continue >> 
477
478   If you want to use a shadow host, it is recommended to add this host
479   to the list of administrative hosts.
480
481   If you are not sure, it is also possible to add or remove hosts after the
482   installation with <qconf -ah hostname> for adding and <qconf -dh hostname>
483   for removing this host
484
485   Attention: This is not the shadow host installation
486   procedure.
487   You still have to install the shadow host separately
488
489   Do you want to add your shadow host(s) now? (y/n) [y] >> 
490
491   Adding Grid Engine shadow hosts
492   -------------------------------
493
494   Please now add the list of hosts, where you will later install your shadow
495   daemon.
496
497   Please enter a blank separated list of your execution hosts. You may
498   press <RETURN> if the line is getting too long. Once you are finished
499   simply press <RETURN> without entering a name.
500
501   You also may prepare a file with the hostnames of the machines where you plan
502   to install Grid Engine. This may be convenient if you are installing Grid
503   Engine on many hosts.
504 
505   Do you want to use a file which contains the list of hosts (y/n) [n] >> 
506
507   Adding admin hosts
508   ------------------
509
510   Please enter a blank seperated list of hosts.
511
512   Stop by entering <RETURN>. You may repeat this step until you are
513   entering an empty list. You will see messages from Grid Engine
514   when the hosts are added.
515
516   Host(s): es-ergb01-01
517   adminhost "es-ergb01-01" already exists
518   Hit <RETURN> to continue >> 
519
520   Please enter a blank seperated list of hosts.
521
522   Stop by entering <RETURN>. You may repeat this step until you are
523   entering an empty list. You will see messages from Grid Engine
524   when the hosts are added.
525
526   Host(s): 
527   Finished adding hosts. Hit <RETURN> to continue >> 
528  
529   Creating the default <all.q> queue and <allhosts> hostgroup
530   -----------------------------------------------------------
531   
532   root@myhost added "@allhosts" to host group list
533   root@myhost added "all.q" to cluster queue list
534   
535   Hit <RETURN> to continue >> 
536  
537   No CSP system installed!
538   No CSP system installed!

Step 24

 
539   Scheduler Tuning
540   ----------------
541   The details on the different options are described in the manual. 
542
543   Configurations
544   --------------
545   1) Normal
546       Fixed interval scheduling, report scheduling information,
547       actual + assumed load
548
549   2) High
550       Fixed interval scheduling, report limited scheduling information,
551       actual load
552
553   3) Max
554       Immediate Scheduling, report no scheduling information,
555       actual load
556
557   Enter the number of your preferred configuration and hit <RETURN>! 
558   Default configuration is [1] >> 
559
560
561   We're configuring the scheduler with >Normal< settings!
562   Do you agree? (y/n) [y] >> 
563
564   changed scheduler configuration

Step 26

565   Using Grid Engine
566   -----------------
567
568   You should now enter the command:
569
570      source /scratch2/myusername/sge62/default/common/settings.csh
571
572   if you are a csh/tcsh user or
573
574      # . /scratch2/myusername/sge62/default/common/settings.sh
575
576   if you are a sh/ksh user.
577
578   This will set or expand the following environment variables:
579
580      - $SGE_ROOT         (always necessary)
581      - $SGE_CELL         (if you are using a cell other than >default<)
582      - $SGE_CLUSTER_NAME (always necessary)
583      - $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<)
584      - $SGE_EXECD_PORT   (if you haven't added the service >sge_execd<)
585      - $PATH/$path       (to find the Grid Engine binaries)
586      - $MANPATH          (to access the manual pages)
587
588   Hit <RETURN> to see where Grid Engine logs messages >> 
589
590   Grid Engine messages
591   --------------------
592
593   Grid Engine messages can be found at:
594
595      Startup messages can be found in SMF service log files.
596      You can get the name of the log file by calling svcs -l <SERVICE_NAME> 
597      E.g.: svcs -l svc:/application/sge/qmaster:p10500
598
599   After startup the daemons log their messages in their spool directories.
600
601      Qmaster:     /scratch2/myusername/sge62/default/spool/qmaster/messages
602      Exec daemon: <execd_spool_dir>/<hostname>/messages
603
604
605   Grid Engine startup scripts
606   ---------------------------
607
608   Grid Engine startup scripts can be found at:
609
610      /scratch2/myusername/sge62/default/common/sgemaster (qmaster)
611      /scratch2/myusername/sge62/default/common/sgeexecd (execd)
612
613   Do you want to see previous screen about using Grid Engine again (y/n) [n] >> 
614
615   Your Grid Engine qmaster installation is now completed
616   ------------------------------------------------------
617
618   Please now login to all hosts where you want to run an execution daemon
619   and start the execution host installation procedure.
620
621   If you want to run an execution daemon on this host, please do not forget
622   to make the execution host installation in this host as well.
623
624   All execution hosts must be administrative hosts during the installation.
625   All hosts which you added to the list of administrative hosts during this
626   installation procedure can now be installed.
627
628   You may verify your administrative hosts with the command
629
630      # qconf -sh
631
632   and you may add new administrative hosts with the command
633
634      # qconf -ah <hostname>
635
636   Please hit <RETURN> >> 
637
638   sge_qmaster successfully installed!

How to Install Execution Hosts

The execution host installation procedure creates the directory hierarchy required by sge_execd and starts the sge_execd daemon on the execution host. This section describes how to install execution hosts interactively from the command line. You can automate the installation of multiple execution hosts by using the procedure described in Automating the Installation Process.

Before You Begin

Before installing an execution host, you must install the master host as described in How to Install the Master Host, and share the common directory.

Windows-Only

You must satisfy several prerequisites before you can install Grid Engine execution hosts with Windows operating systems.

Steps
  1. Log in to the execution host as root.

  2. As you did for the master installation, either copy the installation files to a local installation directory sge-root or use a network installation directory.

  3. If the $SGE_ROOT environment variable is not set, set it by typing:
    # SGE_ROOT=<path_to_install/unpacked_directory>; export SGE_ROOT
    

    To confirm that you have set the $SGE_ROOT environment variable, type:

    # echo $SGE_ROOT
    


  4. Change directory (cd) to the installation directory, sge-root.

  5. Verify that the execution host has been declared as an administration host.
    • If you do not see the name of this execution host in the output of the qconf -sh command, you will need to declare it as an administration host.
      • Start a new terminal session or window.
      • In that window, log into the master host.
      • Declare the execution host as an administration host, using the qconf command.
        # qconf -ah quark
        quark added to administrative host list
        
      • Log back out of the master host, and continue with the installation of the execution host.

  6. Type the install_execd command, adding the -csp flag if you are installing using the Certificate Security Protocol method described in Installing the Increased Security Features.
    This command starts the execution host installation procedure.
    For a complete installation example, see Example Execution Host Installation.
    # ./install_execd
    Welcome to the Grid Engine execution host installation
    ------------------------------------------------------
    
    .
    .
    .
    
    The execution host installation will take approximately 5 minutes.
    
    Hit <RETURN> to continue >> 
    


  7. Verify the $SGE_ROOT directory setting.
    In the example shown in lines 027 through 041 of the Example Execution Host Installation, the value of $SGE_ROOT is /scratch2/myusername/sge62.

  8. Type the name of your cell or accept the default cell name.
    See lines 042 through 076 of the Example Execution Host Installation.
    The use of Grid Engine system cells is described in Cells.
    • If you have decided to use cells, then type the cell names now.
    • If you have decided not to use cells, then press the Return key.

  9. The install script checks to see what ports have been defined for the execution daemon.
    See lines 077 through 085 of the Example Execution Host Installation.
    If no ports have been defined, you will be asked to define them.

  10. The install script checks to see whether the admin user already exists.
    If the admin user already exists, the script continues uninterrupted. If the admin user does not exist, the script shows the following screen where you must supply a password for the admin user. After the admin user is created, press the Return key.
    Local Admin User
    ----------------
    
    The local admin user sgeadmin, does not exist!
    The script tries to create the admin user.
    Please enter a password for your admin user >>
    
    Creating admin user sgeadmin, now ...
    
    Admin user created, hit <ENTER> to continue!
    


  11. Verify the execution host has been declared as an administration host.
    See lines 086 through 092 of the Example Execution Host Installation.

  12. Specify whether you want to use a local spool directory.
    See lines 093 through 122 of the Example Execution Host Installation.
    For information on spooling, see Spool Directories Under the Root Directory.
    • If you do not want a local spool directory, answer n.
    • If you do want a local spool directory, answer y.
      In the example, /tmp/dom/execs is used as the local spool directory on domain.com. Choose any directory that meets the disk space requirements described in Disk Space Requirements.

  13. Specify whether you want execd to start automatically at boot time.
    See lines 123 through 131 of the Example Execution Host Installation.
    You might not want to install the startup script if you are installing a test cluster or you would rather start the daemon manually on reboot.

  14. WINDOWS ONLY – Choose whether to display the GUI for Windows jobs.
    See lines 132 through 163 of the Example Execution Host Installation.
    A Grid Engine Helper Service is included with the Sun Grid Engine distribution. This service enables Windows jobs to display a GUI on the visible desktop of the execution host. The visible desktop is either the desktop of the user currently logged in on the execution host or the desktop of the next user who will log in. It is not the login screen.
    The Helper Service is an independent component that is loosely coupled with the execution daemon. The Helper Service startup is integrated into the Services dialog box in the Windows Control Panel. You can install only one Helper Service per host, and only one execution daemon per Helper Service.
    During the installation of an execution host, the installation script asks whether you want to display the GUI of Windows jobs.

  15. Specify a queue for this host.
    See lines 164 through 183 of the Example Execution Host Installation.
    Once you answer this question, the installation process is complete. Several screens of information will be displayed before the script exits.

  16. Set the environment variables ($SGE_ROOT and $SGE_CELL) for use with the Grid Engine software.
    See lines 184 through 234 of the Example Execution Host Installation.
    Note
    If no cell name was specified during installation, the value of cell is default.
    • If you are using a C shell, type the following command:
      % source $SGE_ROOT/$SGE_CELL/common/settings.csh
      
    • If you are using a Bourne shell or Korn shell, type the following command:
      $ . $SGE_ROOT/$SGE_CELL/common/settings.sh
      
See Also
For details about how you can verify that the execution host has been set up correctly, see How to Verify That the Daemons Are Running on the Execution Hosts.
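
The check in step 5 of the procedure above can be scripted. The following is a minimal sketch, assuming that qconf -sh prints one host name per line; the function name is_admin_host is hypothetical, and a canned list stands in for the qconf output so that the sketch runs without a cluster. Matching is exact, so a short name does not match a fully qualified name.

```shell
#!/bin/sh
# Illustration only: decide whether a host appears in a list of
# administration hosts, such as the output of "qconf -sh".

# is_admin_host: read a host list (one name per line) on standard input
# and succeed if the given name is on it.
is_admin_host() {
    grep -Fqx -- "$1"
}

# On a configured cluster this would be:
#   qconf -sh | is_admin_host "$(hostname)"
if printf 'master\nquark\n' | is_admin_host quark; then
    echo "quark is an administration host"
else
    echo "declare it on the master host with: qconf -ah quark"
fi
```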

Example Execution Host Installation

The following example shows a complete Sun Grid Engine execution host installation. Before you install the execution host, you need to first install the master server as described in How to Install the Master Host. The line numbers in this example are referred to from the execution host installation description at How to Install Execution Hosts.

Steps 1-6

001   % su -
002   # qstat -f
003   # ./install_execd
004
005   Welcome to the Grid Engine execution host installation
006   ------------------------------------------------------
007
008   If you haven't installed the Grid Engine qmaster host yet, you must execute
009   this step (with >install_qmaster<) prior the execution host installation.
010
011   For a sucessful installation you need a running Grid Engine qmaster. It is
012   also necessary that this host is an administrative host.
013
014   You can verify your current list of administrative hosts with
015   the command:
016
017      # qconf -sh
018
019   You can add an administrative host with the command:
020
021      # qconf -ah <hostname>
022
023   The execution host installation will take approximately 5 minutes.
024
025   Hit <RETURN> to continue >> 
026

Step 7

027   Checking $SGE_ROOT directory
028   ----------------------------
029
030   The Grid Engine root directory is:
031
032      $SGE_ROOT = /scratch2/myusername/sge62
033
034   If this directory is not correct (e.g. it may contain an automounter
035   prefix) enter the correct path to this directory or hit <RETURN>
036   to use default [/scratch2/myusername/sge62] >> 
037
038   Your $SGE_ROOT directory: /scratch2/myusername/sge62
039
040   Hit <RETURN> to continue >> 
041

Step 8

042   Grid Engine cells
043   -----------------
044
045   Please enter cell name which you used for the qmaster
046   installation or press <RETURN> to use [default] >> 
047
048   Using cell: >default<
049
050   Hit <RETURN> to continue >> 
051
052   ... set owner of /var/sgeCA/port10500 to bofur+myusername
053
054   ... copy /var/sgeCA/port10500/default/userkeys/root to 
055   /var/sgeCA/port10500/default/userkeys/bofur+Administrator
056   cp: /var/sgeCA/port10500/default/userkeys/root: No such file or directory
057
058   ... copy /var/sgeCA/port10500/default/userkeys/root to 
059   /var/sgeCA/port10500/default/userkeys/Administrator
060   cp: /var/sgeCA/port10500/default/userkeys/root: No such file or directory
061
062   ... copy /var/sgeCA/port10500/default/userkeys/myusername to 
063   /var/sgeCA/port10500/default/userkeys/bofur+myusername
064
065   ... set owner of /var/sgeCA/port10500/default/userkeys/Administrator to Administrator
066
067   ... set owner of /var/sgeCA/port10500/default/userkeys/bofur+Administrator to bofur+Administrator
068
069   ... set owner of /var/sgeCA/port10500/default/userkeys/myusername to myusername
070
071   ... set owner of /var/sgeCA/port10500/default/userkeys/bofur+myusername to bofur+myusername
072
073   ... remove old /var/sgeCA/port10500/default/userkeys/root certificates
074
075   WINDOWS certificates are copied and permissions are set!
076

Step 9

077   Grid Engine TCP/IP communication service
078   ----------------------------------------
079
080   The port for sge_execd is currently set BOTH as service and by the
081   shell environment
082
083      SGE_EXECD_PORT = 10501
084      sge_execd service set to port 725
085
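
The two port sources that the installer reports here can also be inspected by hand. The following is a minimal sketch, assuming the usual services file layout of a name followed by port/protocol; the function name service_port is hypothetical, and the sample entry reuses the value from the screen above in place of a real /etc/services file.

```shell
#!/bin/sh
# Illustration only: show the service entry and the environment setting
# for the sge_execd port.

# service_port: read services-file lines ("name port/proto ...") on
# standard input and print the port of the first entry with that name.
service_port() {
    awk -v name="$1" '$1 == name { split($2, a, "/"); print a[1]; exit }'
}

# Sample entry standing in for /etc/services.
sample='sge_execd 725/tcp'

echo "service:     $(printf '%s\n' "$sample" | service_port sge_execd)"
echo "environment: ${SGE_EXECD_PORT:-not set}"
```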

Step 10
If the admin user already exists, the script automatically skips this step. See How to Install Execution Hosts for more information.

Step 11

086   Checking hostname resolving
087   ---------------------------
088
089   This hostname is known at qmaster as an administrative host.
090
091   Hit <RETURN> to continue >>
092

Step 12

093   Local execd spool directory configuration
094   -----------------------------------------
095
096   During the qmaster installation you've already entered a global
097   execd spool directory. This is used, if no local spool directory is configured.
098
099   Now you can configure a local spool directory for this host.
100   ATTENTION: The local spool directory doesn't have to be located on a local
101   drive. It is specific to the <local> host and can be located on network drives,
102   too. But for performance reasons, spooling to a local drive is recommended.
103
104   FOR WINDOWS USER: On Windows systems the local spool directory MUST be set
105   to a local harddisk directory.
106   Installing an execd without local spool directory makes the host unuseable.
107   Local spooling on local harddisk is mandatory for Windows systems.
108
109   Do you want to configure a local spool directory
110   for this host (y/n) [n] >> y
111
112   Please enter the local spool directory now! >> /tmp/dom/execs
113   Using local execd spool directory [/tmp/dom/execs]
114   Hit <RETURN> to continue >> 
115
116   Creating local configuration
117   ----------------------------
118   myusername@domain.com modified "domain.com" in configuration list
119   Local configuration for host >domain.com< created.
120
121   Hit <RETURN> to continue >> 
122

Step 13

123   execd startup script
124   --------------------
125
126   We can install the startup script that will
127   start execd at machine boot (y/n) [y] >> n
128
129
130   Hit <RETURN> to continue >> 
131

Step 14

132   Windows Helper Service Installation
133   ---------------------------------------
134
135   If you're going to run Windows job's using GUI support, you have
136   to install the Windows Helper Service
137   Do you want to install the Windows Helper Service? (y/n) [n] >> y
138
139   Testing, if a service is already installed!
140
141      ... a service is already installed!
142      ... stopping service!
143      ... uninstalling old service!
144   Service successfully uninstalled.
145
146
147      ... moving new service binary!
148      ... installing new service!
149   Service successfully installed.
150
151
152      ... starting new service!
153
154   Hit <RETURN> to continue >> 
155
156   Grid Engine execution daemon startup
157   ------------------------------------
158
159   Starting execution daemon. Please wait ...
160      starting sge_execd
161
162   Hit <RETURN> to continue >> 
163

Step 15

164   Adding a queue for this host
165   ----------------------------
166
167   We can now add a queue instance for this host:
168
169      - it is added to the >allhosts< hostgroup
170      - the queue provides 1 slot(s) for jobs in all queues
171        referencing the >allhosts< hostgroup
172
173   You do not need to add this host now, but before running jobs on this host
174   it must be added to at least one queue.
175
176   Do you want to add a default queue instance for this host (y/n) [y] >> 
177
178   No modification because "bofur" already exists in "hostlist" of "hostgroup"
179   root@domain.com modified "@allhosts" in host group list
180   root@domain.com modified "all.q" in cluster queue list
181
182   Hit <RETURN> to continue >> 
183

Step 16

184   Using Grid Engine
185   -----------------
186
187   You should now enter the command:
188
189      source /scratch2/myusername/sge62/default/common/settings.csh
190
191   if you are a csh/tcsh user or
192
193      # . /scratch2/myusername/sge62/default/common/settings.sh
194
195   if you are a sh/ksh user.
196
197   This will set or expand the following environment variables:
198
199      - $SGE_ROOT         (always necessary)
200      - $SGE_CELL         (if you are using a cell other than >default<)
201      - $SGE_CLUSTER_NAME (always necessary)
202      - $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<)
203      - $SGE_EXECD_PORT   (if you haven't added the service >sge_execd<)
204      - $PATH/$path       (to find the Grid Engine binaries)
205      - $MANPATH          (to access the manual pages)
206
207   Hit <RETURN> to see where Grid Engine logs messages >> 
208
209   Grid Engine messages
210   --------------------
211
212   Grid Engine messages can be found at:
213
214      /tmp/qmaster_messages (during qmaster startup)
215      /tmp/execd_messages   (during execution daemon startup)
216
217   After startup the daemons log their messages in their spool directories.
218 
219      Qmaster:     /scratch2/myusername/sge62/default/spool/qmaster/messages
220      Exec daemon: <execd_spool_dir>/<hostname>/messages
221
222
223   Grid Engine startup scripts
224   ---------------------------
225
226   Grid Engine startup scripts can be found at:
227 
228      /scratch2/myusername/sge62/default/common/sgemaster (qmaster)
229      /scratch2/myusername/sge62/default/common/sgeexecd (execd)
230
231   Do you want to see previous screen about using Grid Engine again (y/n) [n] >> 
232
233   Your execution daemon installation is now completed.
234
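
The choice between the two settings files shown in this step can be sketched as a small helper. This is an illustration only; the function name settings_file is hypothetical, and the paths follow the $SGE_ROOT/$SGE_CELL/common pattern used in this guide, with the cell name falling back to default.

```shell
#!/bin/sh
# Illustration only: pick the settings file that matches a login shell.

# settings_file: print the settings file path for the given shell name.
# Arguments: shell name, Grid Engine root, cell name (defaults to "default").
settings_file() {
    cell=${3:-default}
    case "$1" in
        csh|tcsh) echo "$2/$cell/common/settings.csh" ;;
        *)        echo "$2/$cell/common/settings.sh" ;;
    esac
}

settings_file tcsh /scratch2/myusername/sge62
settings_file ksh  /scratch2/myusername/sge62
```

A csh or tcsh user would then source the first path, and a sh or ksh user would read the second path with the dot command.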

How to Register Administration Hosts

The master host is implicitly allowed to run administrative tasks and to submit, monitor, and delete jobs. The master host does not require any additional installation or configuration to perform administration functions. By contrast, pure administration hosts do require registration.

Note
You can also register administration hosts by using the QMON graphical user interface. See How to Configure Administration Hosts With QMON.

To register an administration host from the command line:

  1. On the master host, log in to the Grid Engine system administrative account, for example, the sgeadmin account.

  2. Type the following command:
    % qconf -ah <admin-host-name>[,...]
    

How to Register Submit Hosts

Note
You can also register submit hosts by using the QMON graphical user interface. See How to Configure Submit Hosts With QMON.

To register a submit host from the command line:

  1. On the master host, log in to the Grid Engine system administrative account, for example, the sgeadmin account.

  2. Type the following command:
    % qconf -as <submit-host-name>[,...]
    

Refer to About Hosts and Daemons for more details and other means to configure the different host types.
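
When many hosts need to be registered, the comma-separated argument for qconf -ah and qconf -as can be built from a file. The following is a minimal sketch; the file layout (whitespace-separated names with # comments) is an assumption and not a documented format, and the function name host_list is hypothetical. The sketch only prints the commands; on the master host you would remove the echo to run them.

```shell
#!/bin/sh
# Illustration only: turn a list of host names into the comma-separated
# form accepted by "qconf -ah" and "qconf -as".

# host_list: read host names on standard input, drop "#" comments and
# blank lines, and print the names joined by commas.
host_list() {
    awk '{
        sub(/#.*/, "")
        for (i = 1; i <= NF; i++) { printf "%s%s", sep, $i; sep = "," }
    }
    END { if (sep != "") print "" }'
}

# Inline sample; a real run would use: host_list < hosts.txt
hosts=$(printf 'host1 host2\n# spare node\nhost3\n' | host_list)
echo "qconf -ah $hosts"
echo "qconf -as $hosts"
```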


How to Install the Berkeley DB Spooling Server

The installation procedure installs the Grid Engine software necessary for Berkeley DB spooling.

  1. Load the Grid Engine software onto a local file system.
    For details on how to extract the files, see How to Load the Distribution Files On a Workstation.

  2. Log in to the spooling server host as root.

  3. If the $SGE_ROOT environment variable is not set, set it by typing:
    # SGE_ROOT=sge-root; export SGE_ROOT
    

    To confirm that you have set the $SGE_ROOT environment variable, type:

    # echo $SGE_ROOT
    
  4. Change to the installation directory.
    # cd $SGE_ROOT
    
  5. Type the inst_sge command with the -db option.
    # sge-root/inst_sge -db
    

    This command starts the spooling server installation procedure. You are asked several questions. If you think something went wrong, you can quit the installation procedure and restart it at any time.

  6. Choose an administrative account owner.
    Choosing Grid Engine admin user account
    ---------------------------------------
    
    You may install Grid Engine that all files are created with the user id of an 
    unprivileged user.
    
    This will make it possible to install and run Grid Engine in directories
    where user >root< has no permissions to create and write files and directories.
    
       - Grid Engine still has to be started by user >root<
    
       - this directory should be owned by the Grid Engine administrator
    
    Do you want to install Grid Engine
    under an user id other than >root< (y/n) [y] >> y
    
    Choosing a Grid Engine admin user name
    --------------------------------------
    
    Please enter a valid user name >> sgeadmin
    Installing Grid Engine as admin user >sgeadmin<
    
    Hit <RETURN> to continue >>
    
  7. Verify the $SGE_ROOT directory setting.
    In the following example, the value of $SGE_ROOT is /opt/sge62.
    Checking $SGE_ROOT directory
    ----------------------------
    
    The Grid Engine root directory is:
    
       $SGE_ROOT = /opt/sge62
    
    If this directory is not correct (e.g. it may contain an automounter
    prefix) enter the correct path to this directory or hit <RETURN>
    to use default [/opt/n1ge6] >> 
    
    Your $SGE_ROOT directory: /opt/sge62
    
    Hit <RETURN> to continue >> 
    
  8. Type the name of your cell.
    The use of Grid Engine system cells is described in Cells.
    Grid Engine cells
    -----------------
    
    Grid Engine supports multiple cells.
    
    If you are not planning to run multiple Grid Engine clusters or if you don't
    know yet what is a Grid Engine cell it is safe to keep the default cell name
    
       default
    
    If you want to install multiple cells you can enter a cell name now.
    
    The environment variable
    
       $SGE_CELL=<your_cell_name>
    
    will be set for all further Grid Engine commands.
    
    Enter cell name [default] >> 
    
  9. Select Berkeley DB spooling.
    Setup spooling
    --------------
    Your SGE binaries are compiled to link the spooling libraries
    during runtime (dynamically). So you can choose between Berkeley DB 
    spooling and Classic spooling method.
    Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> 
    
  10. Verify your host name.
    In this example, the installation script is being run on host2.
    Berkeley Database spooling parameters
    -------------------------------------
    
    You are going to install an RPC Client/Server mechanism!
    In this case, qmaster will
    contact an RPC server running on a separate server machine.
    If you want to use the SGE shadowd, you have to use the 
    RPC Client/Server mechanism.
    
    Enter database server name or 
    hit <RETURN> to use default [host2] >> 
    
  11. Type the directory path of your spooling directory.
    You might need to change this path if this directory is NFS mounted, or if you do not have write permissions to this directory.
    Enter the database directory
    or hit <RETURN> to use default [/opt/sge62/default//spooldb] >> 
    
    creating directory: /opt/sge62/default//spooldb
    
  12. Start the RPC server.
    Now we have to startup the rc script
    >/opt/sge62/default/common/sgebdb< 
    on the RPC server machine
    
    If you already have a configured Berkeley DB Spooling Server,
    you have to restart the Database with the rc script now and continue with >NO<
    
    Shall the installation script try to start the RPC server? (y/n) [y] >> y
    Starting rpc server on host host2!
    The Berkeley DB has been started with these parameters:
    
    Spooling Server Name: host2
    DB Spooling Directory: /opt/sge62/default//spooldb
    
    Please remember these values, during Qmaster installation
    you will be asked for them! Hit <RETURN> to continue!
    
  13. Specify whether you want Berkeley DB service to start automatically at boot time.
    Berkeley DB startup script
    --------------------------
    
    We can install the startup script that
    Grid Engine is started at machine boot (y/n) [y] >> y
    

    Once you answer this question, the installation process is complete.

  14. Create the environment variables for use with the Grid Engine software.
    Note
    If no cell name was specified during installation, the value of $SGE_CELL is default.
    • If you are using a C shell, type the following command:
      % source $SGE_ROOT/$SGE_CELL/common/settings.csh
      
    • If you are using a Bourne shell or Korn shell, type the following command:
      $ . $SGE_ROOT/$SGE_CELL/common/settings.sh
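
As a rough sketch, a Bourne shell profile could source the settings file defensively, falling back to the cell name default when $SGE_CELL is not set. The helper name settings_file is an assumption made for this example and is not part of the Grid Engine software.

```shell
#!/bin/sh
# settings_file: print the path of the Bourne shell settings file,
# falling back to the cell name "default" when SGE_CELL is unset or empty.
settings_file() {
    printf '%s/%s/common/settings.sh\n' "$SGE_ROOT" "${SGE_CELL:-default}"
}

# Source the file only if SGE_ROOT is set and the file is readable.
if [ -n "${SGE_ROOT:-}" ] && [ -r "$(settings_file)" ]; then
    . "$(settings_file)"
fi
```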
      

Installing the Increased Security Features

Use the instructions in this section to set up your system more securely. These instructions will help you set up your system with Certificate Security Protocol (CSP)-based encryption.

Why Install the Increased Security Features?

Instead of transferring messages in clear text, the messages in this secure system are encrypted with a secret key. The secret key is exchanged using a public/private key protocol. Users present their certificates through the Grid Engine system to prove their identity. In turn, users receive the system's certificate so that they can verify that they are communicating with the correct system. After this initial announcement phase, communication continues transparently in encrypted form. The session is valid only for a certain period, after which the session must be re-announced.

Additional Setup Required

The steps required to set up the Certificate Security Protocol enhanced version of the Grid Engine system are similar to the standard setup. You generally follow the instructions in Planning the Installation, Loading the Distribution Files on a Workstation, How to Install the Master Host, How to Install Execution Hosts and How to Register Administration Hosts.

However, the following additional tasks are required:

  • Generating the Certificate Authority (CA) system keys and certificates on the master host by calling the installation script with the -csp flag
  • Distributing the system keys and certificates to the execution and submit hosts using a secure method such as ssh
  • Generating user keys and certificates automatically, after master installation
  • Adding new users
Topic Description
How to Install a CSP-Secured System Procedure for installing a CSP-secured system.
How to Generate Certificates and Private Keys for Users Procedure for generating user-specific certificates and private keys.
How to Renew Certificates Procedure for renewing user-specific certificates.
How to Check Certificates Procedure for checking user-specific certificates.

How to Install a CSP-Secured System

Install the Grid Engine software as outlined in Performing an Installation, with the following exception: use the additional flag -csp when invoking the various installation scripts. To install a CSP-secured system do the following:

  1. Change the master host installation procedure.
    Type the following command and respond to the prompts from the installation script.
    # ./install_qmaster -csp
    


  2. Supply the following information to generate the CSP certificates and keys:
    • Two-letter country code, for example, US for the United States
    • State
    • Location, such as a city
    • Organization
    • Organizational unit
    • CA email address

      As the installation proceeds, the Certificate Authority is created. A CA specific to the Grid Engine system is created on the master host. The directories that contain information relevant to security are as follows:
    • The publicly accessible CA and daemon certificate are stored in
      $SGE_ROOT/$SGE_CELL/common/sgeCA
      
    • The corresponding private keys are stored in
      /var/sgeCA/{sge_service| portSGE_QMASTER_PORT}/cell/private
      
    • User keys and certificates are stored in
      /var/sgeCA/{sge_service| portSGE_QMASTER_PORT}/cell/userkeys/$USER
      


  3. The script prompts you for site information.

  4. Confirm whether the information you supplied is correct.

  5. Continue the installation.
    After the security-related setup of the master host sge_qmaster is finished, the script prompts you to continue with the rest of the installation procedure, as in the following example:
    SGE startup script
    --------------------
    
    Your system wide SGE startup script is installed as:
    
         "/scratch2/eddy/sge_sec/default/common/sgemaster"
    
    Hit Return to continue >>
    


  6. Transfer the directory that contains the private key and the random file to each execution host.
    1. As root on the master host, type the following commands to prepare to copy the private keys to the machines you set up as execution hosts:
      # umask 077
      # cd /
      # tar cvpf /var/sgeCA/port536.tar /var/sgeCA/port536/default
      
    2. As root on each execution host, use the following commands to securely copy the files:
      # umask 077
      # cd /
      # scp masterhost:/var/sgeCA/port536.tar . 
      # umask 022
      # tar xvpf /port536.tar
      # rm /port536.tar
      
      Note
      On a Windows execution host, the tar utility cannot restore the ownerships and permissions. In this case, the Administrator must set the ownerships and permissions manually.
    3. Type the following command to verify the file permissions:
      # ls -lR /var/sgeCA/port536/
      


      The output should look like the following example:

      /var/sgeCA/port536/:
      total 2
      drwxr-xr-x   4 eddy     other        512 Mar  6 10:52 default
      /var/sgeCA/port536/default:
      total 4
      drwx------   2 eddy     staff        512 Mar  6 10:53 private
      drwxr-xr-x   4 eddy     staff        512 Mar  6 10:54 userkeys
      /var/sgeCA/port536/default/private:
      total 8
      -rw-------   1 eddy     staff        887 Mar  6 10:53 cakey.pem
      -rw-------   1 eddy     staff        887 Mar  6 10:53 key.pem
      -rw-------   1 eddy     staff       1024 Mar  6 10:54 rand.seed
      -rw-------   1 eddy     staff        761 Mar  6 10:53 req.pem
      /var/sgeCA/port536/default/userkeys:
      total 4
      dr-x------   2 eddy     staff        512 Mar  6 10:54 eddy
      dr-x------   2 root     staff        512 Mar  6 10:54 root
      /var/sgeCA/port536/default/userkeys/eddy:
      total 16
      -r--------   1 eddy     staff       3811 Mar  6 10:54 cert.pem
      -r--------   1 eddy     staff        887 Mar  6 10:54 key.pem
      -r--------   1 eddy     staff       2048 Mar  6 10:54 rand.seed
      -r--------   1 eddy     staff        769 Mar  6 10:54 req.pem
      /var/sgeCA/port536/default/userkeys/root:
      total 16
      -r--------   1 root     staff       3805 Mar  6 10:54 cert.pem
      -r--------   1 root     staff        887 Mar  6 10:54 key.pem
      -r--------   1 root     staff       2048 Mar  6 10:53 rand.seed
      -r--------   1 root     staff        769 Mar  6 10:54 req.pem
      


  7. Install the Grid Engine software on each execution host.
    # cd $SGE_ROOT
    # ./install_execd -csp
    


  8. Respond to the prompts from the installation script.
    The execution host installation procedure creates the appropriate directory hierarchy required by sge_execd, and starts the sge_execd daemon on the execution host.
    If the root user does not have write permissions in the $SGE_ROOT directory on all of the machines where Grid Engine software will be installed, you are asked whether to install the software as the user to whom the directory belongs. If you answer yes, you must install the security-related files into that user's $HOME/.sge directory, as shown in the following example.
    % su - sgeadmin
    % source $SGE_ROOT/default/common/settings.csh
    % $SGE_ROOT/util/sgeCA/sge_ca -copy
    % logout
    

    In the above example, sgeadmin is the name of the user who owns the installation directory.

  9. After completing all remaining installation steps, refer to the instructions below in How to Generate Certificates and Private Keys for Users.
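
The permission check in step 6 can also be scripted. The following Bourne shell sketch tests that the private key directory is accessible to its owner only, as in the example listing. The path uses the example values port536 and default from this section, and the helper is_private_dir is hypothetical.

```shell
#!/bin/sh
# is_private_dir: succeed only if the directory exists and its mode string
# is exactly drwx------ (owner-only access), as in the example listing.
is_private_dir() {
    [ -d "$1" ] || return 1
    # The first 10 characters of ls -ld output are the file type and mode.
    mode=$(ls -ld "$1" | cut -c1-10)
    [ "$mode" = "drwx------" ]
}

# Example path from this section; port536 and default vary per installation.
dir=/var/sgeCA/port536/default/private
if is_private_dir "$dir"; then
    echo "$dir: permissions look correct"
else
    echo "$dir: missing or not restricted to its owner"
fi
```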

How to Generate Certificates and Private Keys for Users

To use the CSP-secured system, the user must have access to a user-specific certificate and private key. The most convenient method of gaining access is to create a text file identifying the users.

  1. On the master host, create and save a text file that identifies users.
    Use the format of the file myusers.txt shown in the following example. The fields of the file are UNIX_username:Gecos_field:email_address.
    eddy:Eddy Smith:eddy@my.org
    sarah:Sarah Miller:sarah@my.org
    leo:Leo Lion:leo@my.org
    


  2. As root on the master host, type the following command:
    # $SGE_ROOT/util/sgeCA/sge_ca -usercert myusers.txt
    


  3. Confirm by typing the following command:
    # ls -l /var/sgeCA/port536/default/userkeys
    

    This directory listing produces output similar to the following example.

    dr-x------  2 eddy  staff        512 Mar  5 16:13 eddy
    dr-x------  2 sarah staff        512 Mar  5 16:13 sarah
    dr-x------  2 leo   staff        512 Mar  5 16:13 leo
    


  4. Tell each user to install security-related files in their directories.
    Tell each user listed in the file (myusers.txt in the example) to install the security-related files in their $HOME/.sge directories by typing the following commands.
    % source $SGE_ROOT/default/common/settings.csh
    % $SGE_ROOT/util/sgeCA/sge_ca -copy
    

    Users should see the following confirmation (user eddy in the example).

    Certificate and private key for user
    eddy have been installed
    

    For every Grid Engine software installation, a subdirectory for the corresponding SGE_QMASTER_PORT number is installed. The following example, for user eddy from the myusers.txt file, shows the ls command and its output.

    % ls -lR $HOME/.sge
    
    /home/eddy/.sge:
    total 2
    drwxr-xr-x  3 eddy staff        512 Mar  5 16:20 port536
    
    /home/eddy/.sge/port536:
    total 2
    drwxr-xr-x  4 eddy staff        512 Mar  5 16:20 default
    
    /home/eddy/.sge/port536/default:
    total 4
    drwxr-xr-x  2 eddy staff        512 Mar  5 16:20 certs
    drwx------  2 eddy staff        512 Mar  5 16:20 private
    
    /home/eddy/.sge/port536/default/certs:
    total 8
    -r--r--r--  1 eddy staff       3859 Mar  5 16:20 cert.pem
    
    /home/eddy/.sge/port536/default/private:
    total 6
    -r--------  1 eddy staff        887 Mar  5 16:20 key.pem
    -r--------  1 eddy staff       2048 Mar  5 16:20 rand.seed
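
Before passing the user file from step 1 to sge_ca, it can help to check its format. The following Bourne shell sketch reports every line that does not have exactly three colon-separated fields with a non-empty user name. The helper check_users_file is made up for this example, and the check covers only the field layout described in step 1, not whether the users exist.

```shell
#!/bin/sh
# check_users_file: report every line that does not have exactly three
# colon-separated fields (UNIX_username:Gecos_field:email_address) or
# that has an empty user name.
# Returns 0 if all lines are well formed, 1 otherwise.
check_users_file() {
    awk -F: '
        NF != 3 || $1 == "" {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# myusers.txt is the example file name used in this section.
if [ -r myusers.txt ]; then
    check_users_file myusers.txt && echo "myusers.txt: format looks valid"
fi
```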
    

How to Renew Certificates

  1. Become root on the master host and change to $SGE_ROOT.
    # tcsh
    # source $SGE_ROOT/default/common/settings.csh
    # cd $SGE_ROOT
    
    Note
    This assumes that the value of $SGE_CELL is default.


  2. Edit $SGE_ROOT/util/sgeCA/renew_all_certs.csh, and change the number of days that the certificates are valid:
    # extend the validity of the CA certificate by
    set CADAYS = 365
    # extend the validity of the daemon certificate by
    set DAEMONDAYS = 365
    # extend the validity of the user certificate by
    set USERDAYS = 365
    


  3. Run the changed script.
    # util/sgeCA/renew_all_certs.csh
    
    Note
    The default for all extension times is 365 days from the day the script is run.


  4. Replace the old certificates with the new ones on all hosts that installed them locally.
    That is, replace the files under /var/sgeCA/... as described for the execution daemon installation.

  5. If users have copied certificates and keys to $HOME/.sge, they must run $SGE_ROOT/util/sgeCA/sge_ca -copy again to gain access to the renewed certificates.
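
To decide when a renewal is due, a user could check how long a certificate remains valid. The following Bourne shell sketch assumes that the cert.pem files are PEM-encoded X.509 certificates and that the openssl command-line tool is installed; neither assumption is confirmed by this guide. The helper cert_valid_for, the 30-day threshold, and the path under port536 are illustrative.

```shell
#!/bin/sh
# cert_valid_for CERT DAYS
# Returns 0 if the certificate is still valid DAYS days from now,
# 1 if it expires before then or cannot be parsed,
# 2 if the file cannot be read.
cert_valid_for() {
    [ -r "$1" ] || return 2
    # -checkend takes seconds and exits non-zero if the certificate
    # expires within that time.
    if openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}

# Example location from this section; port536 and default vary per site.
cert="$HOME/.sge/port536/default/certs/cert.pem"
rc=0
cert_valid_for "$cert" 30 || rc=$?
case $rc in
    0) echo "certificate is valid for at least 30 more days" ;;
    1) echo "certificate expires within 30 days or cannot be parsed" ;;
    2) echo "cannot read $cert" ;;
esac
```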


Verifying the Installation

The verification phase includes the following tasks:

  • Ensuring that the master daemon is running on the master host
  • Ensuring that the daemons are running on all execution hosts
  • Ensuring that you can run simple commands
  • Submitting test jobs

To ensure that the Grid Engine system daemons are running, look for the sge_qmaster daemon on the master host and the sge_execd daemon on the execution hosts. Once you have verified that the daemons are running, you can try to use commands and prepare to submit jobs.

Note
If no cell name was specified during installation, the value of $SGE_CELL is default.
Topic Description
How to Verify That the Daemon Is Running on the Master Host Procedure for verifying that the daemon is running on the master host.
How to Verify That the Daemons Are Running on the Execution Hosts Procedure for verifying that the daemons are running on the execution hosts.
How to Run Simple Commands Procedure for verifying that the Sun Grid Engine software is operational by running some trial commands.
How to Submit Test Jobs Procedure for submitting test jobs.

How to Verify That the Daemon Is Running on the Master Host

  1. Log in to the master host.
    Look in the file $SGE_ROOT/$SGE_CELL/common/act_qmaster to see if you really are on the master host.

  2. Verify that the master daemon is running.
    • On BSD-based UNIX systems, type the following command:
      % ps -ax | grep sge
      

      You should see output similar to the following example.

      14676 p1 S <  4:47 /gridware/sge/bin/solaris/sge_qmaster
      
    • On systems running a UNIX System V-based operating system (such as the Solaris Operating System), type the following command:
      % ps -ef | grep sge
      

      You should see output similar to the following example.

      root 439 1 0 Jun 2 ? 3:37 /gridware/sge/bin/solaris/sge_qmaster
      


  3. If you do not see the appropriate string, restart the daemon.
    To start the master host daemon, sge_qmaster:
    # $SGE_ROOT/$SGE_CELL/common/sgemaster  start
    


  4. Continue the verification process.
    After you have verified that the master host and the execution host daemons are running, continue the verification process. See How to Run Simple Commands.

How to Verify That the Daemons Are Running on the Execution Hosts

  1. Log in to the execution hosts on which you ran the execution host installation procedure.

  2. Verify that the daemons are running.
    • On BSD-based UNIX systems, type the following command:
      % ps -ax | grep sge
      

      You should see output similar to the following example.

      14688 p1 S <    4:27  /gridware/sge/bin/solaris/sge_execd
      
    • On systems running a UNIX System V-based operating system (such as the Solaris Operating System), type the following command:
      % ps -ef | grep sge
      

      You should see output similar to the following example.

      root 171 1 0 Jun 22 ? 7:11 /gridware/sge/bin/solaris/sge_execd
      


  3. If you do not see similar output, restart the daemon.
    # $SGE_ROOT/$SGE_CELL/common/sgeexecd  start
    


  4. Continue the verification process.
    After you have verified that the master host and the execution host daemons are running, continue the verification process. See How to Run Simple Commands below for details.
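
The daemon checks on the master and execution hosts can be wrapped in a small script. The following Bourne shell sketch reads ps output and looks for a daemon name as a whole word or final path component. The helper has_daemon is hypothetical, and the sketch uses the System V form ps -ef; on BSD-based systems, substitute ps -ax.

```shell
#!/bin/sh
# has_daemon NAME: read ps output on standard input and succeed if some
# line contains NAME as a whole word or as the last component of a path
# (for example /gridware/sge/bin/solaris/sge_qmaster).
has_daemon() {
    grep -E "(^|[/[:space:]])$1([[:space:]]|\$)" >/dev/null
}

# Check for the execution daemon; use sge_qmaster on the master host.
if ps -ef | has_daemon sge_execd; then
    echo "sge_execd is running"
else
    echo 'sge_execd is not running; start it with $SGE_ROOT/$SGE_CELL/common/sgeexecd start'
fi
```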

How to Run Simple Commands

If both the necessary daemons are running on the master and execution hosts, the Grid Engine software should be operational. Check by issuing a trial command.

  1. Log in to either the master host or another administrative host.
    In your standard search path, make sure to include $SGE_ROOT/bin.

  2. From the command line, type the following command:
    % qconf -sconf
    

    This qconf command displays the current global cluster configuration. See Configuring Clusters.
    If this command fails, your $SGE_ROOT environment variable or the communication port settings are probably not set correctly.

    1. Check whether the environment variables SGE_EXECD_PORT and SGE_QMASTER_PORT are set in the script files, $SGE_ROOT/$SGE_CELL/common/settings.csh or $SGE_ROOT/$SGE_CELL/common/settings.sh.
      Note
      If no cell name was specified during installation, the value of $SGE_CELL is default.
      • If so, make sure that the environment variables SGE_EXECD_PORT and SGE_QMASTER_PORT are set to the correct value before you try the command again.
      • If not, verify whether your NIS services map contains entries for sge_qmaster and sge_execd.
        If the SGE_EXECD_PORT and SGE_QMASTER_PORT variables are not used in these files, then the services database (/etc/services or the NIS services map for example) on the machine from which you run the command must provide entries for both sge_qmaster and sge_execd. If these entries do not exist, add an entry to the machine's services database, giving it the same value as is configured on the master host.
    2. Retry the qconf command.

  3. Try to submit test jobs.
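
When the port variables are not used, the services database lookup described in step 2 can be sketched as follows for a local /etc/services file. The helper services_has is made up for this example, and it does not query an NIS services map.

```shell
#!/bin/sh
# services_has FILE NAME: succeed if FILE is readable and has an
# uncommented entry whose first field is NAME, in /etc/services
# format (name port/protocol).
services_has() {
    [ -r "$1" ] || return 1
    awk -v name="$2" '$1 == name { found = 1 } END { exit (found ? 0 : 1) }' "$1"
}

# Both daemons need an entry when the port variables are not set.
for svc in sge_qmaster sge_execd; do
    if services_has /etc/services "$svc"; then
        echo "$svc: found in /etc/services"
    else
        echo "$svc: not in /etc/services; check the port variables or the NIS services map"
    fi
done
```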

How to Submit Test Jobs

Before you start submitting batch scripts to the Grid Engine system, check to see whether your site's standard shell resource files (.cshrc, .profile, or .kshrc) as well as your personal resource files contain commands such as stty. Batch jobs do not have a terminal connection by default, and therefore calls to stty result in an error.

  1. Log in to the master host.

  2. Type the following command.
    % rsh <exec-host-name> date
    

    The exec-host-name refers to one of the already installed execution hosts. You should try this test on all execution hosts if your login or home directories differ from host to host. The rsh command should give you output similar to the date command run locally on the master host. If any additional lines contain error messages, you must fix the cause of the errors before you can run a batch job successfully.

    For all command interpreters, you can check on an actual terminal connection before you run a command such as stty.
    The following is an example of a Bourne shell script to test the terminal connection.

    tty -s 
    if [ $? = 0 ]; then
       stty erase ^H
    fi
    


    The following example shows C shell syntax.

    tty -s
    if ( $status == 0 ) then
       stty erase ^H
    endif
    


  3. Submit one of the sample scripts contained in the $SGE_ROOT/examples/jobs directory.
    % qsub $SGE_ROOT/examples/jobs/simple.sh
    


  4. Use the qstat command to monitor the job's behavior.
    For more information about submitting and monitoring batch jobs, see Submitting Batch Jobs.

  5. After the job finishes executing, check your home directory for the redirected stdout/stderr files script-name.ejob-id and script-name.ojob-id.
    The job-id is a consecutive unique integer number assigned to each job.
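
As a small sketch of the naming pattern in step 5, the following Bourne shell function derives the two output file names from a script path and a job ID. The helper job_output_files and the job ID 42 are made up for illustration; use the ID that qsub reports for your job.

```shell
#!/bin/sh
# job_output_files SCRIPT JOB_ID: print the names of the stdout and stderr
# files, following the script-name.ojob-id / script-name.ejob-id pattern.
job_output_files() {
    name=$(basename "$1")
    printf '%s.o%s\n%s.e%s\n' "$name" "$2" "$name" "$2"
}

# 42 is a made-up job ID, for illustration only.
for f in $(job_output_files simple.sh 42); do
    if [ -f "$HOME/$f" ]; then
        echo "found $HOME/$f"
    else
        echo "not yet present: $HOME/$f"
    fi
done
```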

In case of problems, see Fine-Tuning Your Environment and Using DTrace for Performance Tuning.


Automating the Installation Process

This section describes how to automate the software installation process. You might automate the installation for the following reasons:

  • To install the Grid Engine software on many hosts
  • To install the Grid Engine software without user interaction

You can use the $SGE_ROOT/inst_sge utility to install and uninstall Sun Grid Engine master hosts, execution hosts, shadow hosts, and Berkeley DB spooling server hosts. You can also use this utility to back up the Sun Grid Engine configuration and accounting data automatically.

Note
Using the Berkeley DB Spooling Server host does not provide high availability, and it has no authentication mechanism. It should only be used on a closed network with fully trusted users.

You can use the inst_sge utility in interactive mode to supplant any of the commands that were described in Installing the Software From the Command Line.

To simplify automatic installation and backup processes, use the configuration templates that are located in the $SGE_ROOT/util/install_modules directory.

The automatic installation requires no user interaction. No messages are displayed on the terminal during the installation.

When the installation finishes, a message indicates where the installation log file resides. The log file name has the format install_hostname_timestamp.log. Normally, this file contains information about any errors that occurred during installation. If a serious error occurs, however, the installation script might not be able to move the log file into the spool directory. In this situation, the log file is placed in the /tmp directory.

Topic Description
Automatic Installation Perform an automatic installation by setting up a configuration file.
Automatic Uninstallation Learn how to uninstall hosts automatically.
How to Start the Automatic Backup Procedure for backing up configuration and accounting data by using the interactive backup procedure.
Troubleshooting Automatic Installation and Uninstallation Troubleshoot errors that might be encountered during automatic installation.

Automatic Installation

Special Considerations

The first step in performing an automatic installation is to set up a configuration file. You can find configuration file templates in the $SGE_ROOT/util/install_modules directory. Consider the following as you plan your automatic installation:

  • To use automatic installation on remote hosts, the root user must be able to access those hosts through rsh or ssh without supplying a password.
  • For local spooling, that is, spooling on the master host, no special configuration is needed. However, the directory where the spooling occurs must not be on an NFS version 3 volume. You may use an NFS version 4 volume for local spooling.
  • To run the Berkeley DB spooling server on a host other than the master host, you must install and configure RPC services on this separate host.

To perform this step manually before you start the automatic installation, use the following command:

./inst_sge -db

You can also use the following command to install the Berkeley DB spooling server automatically:

% ./inst_sge -db -m -x -auto <full-path-to-configuration-file>

This command checks the SPOOLING_SERVER entry within the configuration file and starts the Berkeley DB installation on the server host.

Note
If you start the automatic installation on the master host, the entire cluster can be installed with one command. The automatic installation script accesses the remote hosts through rsh or ssh and starts the installation remotely. This process requires a well-configured configuration file, which each host must be able to read. That file should be installed on each host or shared through NFS.
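
Because the automatic installation depends on password-free remote access for root, it can be worth testing that access first. The following Bourne shell sketch assumes an OpenSSH client, whose BatchMode option suppresses password prompts; for rsh, a different check is needed. The helper can_login and the REMOTE_SH variable are made up for this example, and the host names are taken from the script arguments.

```shell
#!/bin/sh
# Command used to reach remote hosts. Assumption: an OpenSSH client,
# whose BatchMode option makes the login fail instead of prompting.
: "${REMOTE_SH:=ssh}"

# can_login HOST: succeed if a trivial command runs on HOST without a
# password prompt.
can_login() {
    "$REMOTE_SH" -o BatchMode=yes -o ConnectTimeout=5 "$1" true \
        </dev/null >/dev/null 2>&1
}

# Pass the hosts named in your configuration file as script arguments.
for host in "$@"; do
    if can_login "$host"; then
        echo "$host: password-free login works"
    else
        echo "$host: password-free login failed"
    fi
done
```

Run the script as root on the master host, because that is the account the installation script uses for remote access.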

Using the inst_sge Utility and a Configuration Template

To automate system installation, use the inst_sge utility in combination with a configuration file. See How to Automate Other Installations Through a Configuration File.

Note
You cannot use the automatic installation procedure to install a Windows execution host remotely. You must run the automatic installation procedure directly on the Windows execution host.
Topic Description
How to Automate the Master Host Installation Procedure for automating the master host installation.
How to Automate Other Installations Through a Configuration File Procedure for performing a variety of other automatic installations using the configuration file.
How to Automate Installation With Increased Security (CSP) Procedure for automating installation with Certificate Security Protocol (CSP) mode.

How to Automate Installation With Increased Security (CSP)

The automatic installation also supports the Certificate Security Protocol (CSP) mode described in Installing the Increased Security Features. To use the CSP security mode, you must fill out the CSP parameters of the template files. The parameters are as follows:

# This section is used for csp installation mode.
# CSP_RECREATE recreates the certs on each installation, if true.
# In case of false, the certs will be created, if not existing.
# Existing certs won't be overwritten. (mandatory for csp install)
CSP_RECREATE="true"

# The created certs won't be copied, if this option is set to false
# If true, the script tries to copy the generated certs. This
# requires passwordless ssh/rsh access for user root to the
# execution hosts
CSP_COPY_CERTS="false"

# csp information, your country code (only 2 characters)
# (mandatory for csp install)
CSP_COUNTRY_CODE="DE"

# your state (mandatory for csp install)
CSP_STATE="Germany"

# your location, eg. the building (mandatory for csp install)
CSP_LOCATION="Building"

# your organisation (mandatory for csp install)
CSP_ORGA="Organisation"

# your organisation unit (mandatory for csp install)
CSP_ORGA_UNIT="Organisation_unit"

# your email (mandatory for csp install)
CSP_MAIL_ADDRESS="name@yourdomain.com"
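
Before starting a CSP installation, it can help to confirm that the parameters marked as mandatory are filled in. The following is a minimal sketch and not part of the Sun Grid Engine distribution: the function name check_csp_config is hypothetical, and the sketch assumes that the configuration file consists of plain shell variable assignments, as in the excerpt above, so that it can be sourced.

```shell
#!/bin/sh
# Sketch: sanity-check the CSP parameters of a configuration file.
# Assumption: the file can be sourced as shell variable assignments.
check_csp_config() {
    conf=$1
    # Work in a subshell so the caller's environment is not changed.
    (
        . "$conf" || exit 2
        status=0
        for name in CSP_RECREATE CSP_COUNTRY_CODE CSP_STATE CSP_LOCATION \
                    CSP_ORGA CSP_ORGA_UNIT CSP_MAIL_ADDRESS; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "missing: $name"
                status=1
            fi
        done
        # The country code must be exactly two characters.
        if [ "${#CSP_COUNTRY_CODE}" -ne 2 ]; then
            echo "CSP_COUNTRY_CODE must have 2 characters"
            status=1
        fi
        exit "$status"
    )
}

# Example call (the path is a placeholder):
# check_csp_config /path/to/my_configuration.conf
```

The function returns a non-zero status and prints one line per problem, so it can be placed in front of the inst_sge call in a wrapper script. Pass a path that contains a slash, because the dot command otherwise searches PATH.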

To start the installation, type the following command:

inst_sge -m -csp -auto template-file-name
Note
Certificates are created during the installation process and must be copied to each host of the installed cluster. The installation process can do this for you if the following conditions are met:
  1. rsh/rcp or ssh/scp is available on each host.
  2. The root user can access each host over ssh or rsh without entering a password.

How to Automate Other Installations Through a Configuration File

In addition to installing the master host, you can perform a variety of other automatic installations using a similar process. The actual form of the inst_sge command differs slightly, and different sections of the configuration file apply. This section provides some examples.

  • To install a shadow host, use the following form of the command:
    inst_sge -sm -auto <full-path-to-configuration-file>
    
    Tip
    To install more than one shadow host, enter the host names in the <SHADOW_HOST> parameter section within the configuration file.
  • You can perform a separate execution host installation if the master host was installed without identified compute hosts or if you need to add compute hosts. For the execution host installation, you also need a configuration file.

    To install all configured execution hosts, use the following form of the command:
    inst_sge -x -auto <full-path-to-configuration-file>
    
  • To install the Berkeley database server, use the following form of the command:
    inst_sge -db -auto <full-path-to-configuration-file>
    

See Configuration File Templates.


How to Automate the Master Host Installation

Before You Begin

You need to complete the planning process as outlined in Planning the Installation.

In addition, you need to be able to connect to each of the remote hosts using the rsh or ssh commands, without supplying a password. If this type of access is not allowed on your network, you cannot use this method of installation.
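
One way to test the passwordless access requirement for ssh before you start is sketched below. This is an illustration only: the function name check_passwordless and the SSH_BIN override are hypothetical, and the host names in the example call are placeholders. The sketch covers ssh only, not rsh.

```shell
#!/bin/sh
# Sketch: report which hosts accept a root login over ssh without a password.
# SSH_BIN is a hypothetical override so the function can be tried with a stub.
check_passwordless() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
               "root@$host" true >/dev/null 2>&1; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example call (host names are placeholders):
# check_passwordless host1 host2 host3
```

A non-zero return status means that at least one host would prompt for a password or could not be reached, in which case the remote part of the automatic installation is not expected to work for that host.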

Steps
  1. Create a copy of the configuration template, $SGE_ROOT/util/install_modules/inst_template.conf.
    # cd $SGE_ROOT/util/install_modules
    # cp inst_template.conf my_configuration.conf
    


  2. Edit your configuration template, using the values from the worksheet you completed in Planning the Installation.
    The configuration file template includes liberal comments to help you decide where appropriate information belongs. See Configuration File Templates.

  3. Log in as root on the system that you want to be the Sun Grid Engine master host.

  4. Create the $SGE_ROOT directory.
    The $SGE_ROOT directory is the root directory of the Sun Grid Engine software hierarchy, for example /opt/sge62.

  5. Go to the $SGE_ROOT directory and start the installation.
    # cd $SGE_ROOT
    # ./inst_sge -m -auto <full-path-to-configuration-file>
    

The -m option starts the master host installation and installs the master daemon on the local machine. In addition, the -auto option sets up any remote hosts, as specified in the configuration file.

Note
You cannot install a master host remotely. You must always install a master host locally.

To prevent data loss or destroying an already installed cluster, the automatic installation terminates if the configured $SGE_CELL directory or the configured Berkeley DB spooling directory already exists. If the installation terminates, the script displays the reason for the termination on the screen.
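
Because the installation terminates when either directory already exists, a quick check beforehand can save a failed run. The sketch below is an illustration, not part of the product: preflight_check is a hypothetical name, and you pass it the cell directory and the Berkeley DB spooling directory that your configuration file specifies.

```shell
#!/bin/sh
# Sketch: detect directories that would make the automatic installation stop.
preflight_check() {
    cell_dir=$1   # for example "$SGE_ROOT/$SGE_CELL"
    db_dir=$2     # Berkeley DB spooling directory from your configuration file
    rc=0
    for dir in "$cell_dir" "$db_dir"; do
        if [ -e "$dir" ]; then
            echo "exists: $dir"
            rc=1
        fi
    done
    return "$rc"
}

# Example call (the second path is a placeholder):
# preflight_check "$SGE_ROOT/$SGE_CELL" /path/to/bdb/spooldir
```

The function only reports; removing or renaming a reported directory is left to you, because it may belong to a cluster that is still in use.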

A log file of the master installation is created in the $SGE_ROOT/default/spool/qmaster directory. The file name is created using the format install_hostname_date_time.log.

Tip
You can also combine options if you want to perform multiple installations with one command. For example, the following command installs the master daemon on the local machine and installs all execution hosts that are configured in the configuration file:
./inst_sge -m -x -auto <full-path-to-configuration-file>

a. Wait for notification that the installation has completed.

b. When the automatic installation exits successfully, it displays a message similar to the following:

The Install log can be found in the
/opt/sge62/spool/install_myhost_30mar2007_090152.log file.

The installation log file includes any script or error messages that were generated during installation. If the qmaster_spooling_dir directory exists, the log files will be in that directory. If the directory does not exist, the log files will be in the /tmp directory.
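
Since the log can end up in either location, a small helper that looks in several directories and prints the newest matching file may be convenient. This is a sketch under assumptions: latest_install_log is a hypothetical name, and the only thing taken from the text above is the install_hostname_date_time.log naming pattern.

```shell
#!/bin/sh
# Sketch: print the newest install_*.log found in the given directories,
# checking the directories in the order in which they are listed.
latest_install_log() {
    for dir in "$@"; do
        # ls -t sorts newest first; errors for missing matches are discarded.
        log=$(ls -t "$dir"/install_*.log 2>/dev/null | head -n 1)
        if [ -n "$log" ]; then
            echo "$log"
            return 0
        fi
    done
    return 1
}

# Example call:
# latest_install_log "$SGE_ROOT/default/spool/qmaster" /tmp
```

The first directory that contains a matching file wins, so list the qmaster spooling directory before /tmp.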

Troubleshooting
If you do not want your execution hosts to spool locally, be sure to set EXECD_SPOOL_DIR_LOCAL="", with no space between the double quotes ("").


Automatic Uninstallation

You can also uninstall hosts automatically.

Note
Uninstall all compute hosts before you uninstall the master host. If you uninstall the master host first, you have to uninstall all execution hosts manually.

To ensure that you have a clean environment, always source the $SGE_ROOT/$SGE_CELL/common/settings.csh file before proceeding.

Topic Description
How to Uninstall the Master Host Automatically Procedure for uninstalling the master host automatically.
How to Uninstall Execution Hosts Automatically Procedure for uninstalling the execution hosts automatically.
How to Uninstall the Shadow Master Host Procedure for uninstalling the shadow host.

How to Uninstall the Master Host Automatically

The master host uninstallation removes all of the Sun Grid Engine configuration files. After the uninstallation procedure completes, only the binary files remain. If you think that you will need the configuration information after the uninstallation, perform a backup of the master host. The master host uninstallation supports both interactive and automatic mode.

To start the automatic uninstallation of the master host, type the following command:

% ./inst_sge -um -auto <full-path-to-configuration-file>

This command performs the same procedure as in interactive mode, except the user is not prompted for confirmation of any steps and all terminal output is suppressed. Once the uninstall process is started, it cannot be stopped.


How to Uninstall Execution Hosts Automatically

During the execution host uninstallation, all configuration information for the targeted hosts is deleted. The uninstallation attempts to stop the exec hosts in a graceful manner.

First, the queue instances associated with the target host are disabled so that no new jobs are started. Then, for each running job, the following actions are attempted in sequence: checkpoint the job, reschedule the job, and force rescheduling of the job.

At this point, the queue instances are empty. The execution daemon is shut down, and then the configuration and the global or local spool directory are removed.

The configuration file template has a section for identifying hosts that can be uninstalled automatically. Look for this section:

# Remove this execution hosts in automatic mode 
EXEC_HOST_LIST_RM="host1 host2 host3 host4"

Every host in the EXEC_HOST_LIST_RM list will be automatically removed from the cluster.

To start the automatic uninstallation of execution hosts, type the following command:

% ./inst_sge -ux -auto <full-path-to-configuration-file>

How to Uninstall the Shadow Master Host

To start the automatic uninstallation of the shadow host, type the following command:

% ./inst_sge -usm -auto <full-path-to-configuration-file>
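
The ordering rule for uninstallation (execution hosts first, master host last) can be captured in a small wrapper. The following is only a sketch: uninstall_cluster and the INST_SGE override are hypothetical names, the shadow host step is left out because its position in the sequence is not specified here, and the function assumes that it is run from the $SGE_ROOT directory.

```shell
#!/bin/sh
# Sketch: uninstall the execution hosts first, then the master host.
uninstall_cluster() {
    conf=$1
    # INST_SGE is a hypothetical override, useful for a dry run with "echo".
    inst=${INST_SGE:-./inst_sge}
    # Stop if the execution host step fails, so the master host stays intact.
    "$inst" -ux -auto "$conf" || return 1
    "$inst" -um -auto "$conf"
}

# Dry run that only prints the arguments of both steps (path is a placeholder):
# INST_SGE=echo uninstall_cluster /path/to/my_configuration.conf
```

Stopping after a failed execution host step keeps the master host available, so the remaining execution hosts can still be uninstalled automatically.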


Troubleshooting Automatic Installation and Uninstallation

The following errors might be encountered during auto-installation:

Problem: If the $SGE_CELL directory exists, the installation terminates to avoid overwriting a previous installation.
Solution: Remove or rename the directory.

Problem: If the Berkeley database spooling directory exists, the installation terminates to avoid overwriting a previous installation.
Solution: This directory must be removed or renamed in order to proceed. Make sure that the ADMINUSER has permissions to write into the location where the Berkeley database spooling directory is located. The ADMINUSER will be the owner of the Berkeley database spooling directory.

Problem: The execution host installation appears to succeed, but the execution daemon is not started, or no load values are shown.
Solution: Verify that user root is allowed to rsh or ssh to the other host, without entering a password.

If your network does not allow user root to have permissions to connect to other hosts through rsh or ssh without asking for a password, the automatic installation will not work remotely. In this case, log in to the host and use the following command to start the automatic installation locally on each host:

% ./inst_sge -x -noremote -auto /tmp/install_config_file.conf

Installing SMF Services

The Service Management Facility (SMF) was introduced in Solaris 10. It provides a unified model for controlling services, replaces RC scripts, handles service dependencies, provides better service availability, and speeds up the boot process. If you do not use at least Version 10 of the Solaris OS in your cluster, or you do not plan to use SMF, continue with Installing the Software From the Command Line.

Note
SMF is now the default for the hosts that run at least Version 10 of the Solaris OS. If you want to use the old behavior (RC files) for the Solaris hosts, you need to start the installation with the -nosmf option. Use the following command: ./inst_sge -x -nosmf

Installing SMF services includes the following topics:

Why Install SMF Services?

SMF provides a unified administrative model of the persistent services. It solves many challenges of the previous approaches. All services have a common place for log files. Persistent services are automatically restarted on failure. For more information, see SMF documentation.

Additional Setup Required

If you want unprivileged users to use SMF services, you should create a role sge_admin. Assign this role to the users who should be able to manipulate the Grid Engine SMF services as described here.

Then, you can simply answer y when prompted to use SMF during the installation.

How Do SMF Services Compare to the Normal Services?

The biggest difference between SMF and normal services is that SMF does not consider kill -9 to be a correct service shutdown. SMF responds to kill -9 by restarting the service.

Within the SMF framework, a service is uniquely identified by its fault management resource identifier (FMRI).

qmaster Daemon

Service name (FMRI) is svc:/application/sge/qmaster:$SGE_CLUSTER_NAME.

SGE version   sgemaster stop   qconf -km   kill -15   kill -9   reboot
6.1           stop             stop        stop       stop      restart (1)
6.2           stop             stop        stop       restart   restart

(1) Restart the daemon if RC scripts were installed

shadowd Daemon

Service name (FMRI) is svc:/application/sge/shadowd:$SGE_CLUSTER_NAME.

SGE version   sgemaster -shadow stop   kill -15   kill -9   reboot
6.1           stop                     stop       stop      restart (1)
6.2           stop                     stop       restart   restart

(1) Restart the daemon if RC scripts were installed

execd Daemon

Service name (FMRI) is svc:/application/sge/execd:$SGE_CLUSTER_NAME.

SGE version   sgeexecd stop   qconf -ke   kill -15   kill -9   reboot
6.1           stop            stop        stop       stop      restart (1)
6.2           stop            stop        stop       restart   restart

(1) Restart the daemon if RC scripts were installed

Berkeley RPC Server

Service name (FMRI) is svc:/application/sge/bdb:$SGE_CLUSTER_NAME.

SGE version   berkeley_svc stop   kill -15   kill -9   reboot
6.1           stop                stop       stop      restart (1)
6.2           stop                restart    restart   restart

(1) Restart the server if RC scripts were installed

dbwriter Software

Service name (FMRI) is svc:/application/sge/dbwriter:$SGE_CLUSTER_NAME.

SGE version   sgedbwriter stop   kill -15   kill -9   reboot
6.1           stop               stop       stop      restart (1)
6.2           stop               restart    restart   restart

(1) Restart the dbwriter if RC scripts were installed

Additionally, you can use the SMF interfaces to interact with the services. For more information, see the svcadm(1M) man page.
New actions:
Action                                        Command
Start service temporarily                     svcadm enable -t FMRI
Start service permanently (across reboots)    svcadm enable FMRI
Stop service temporarily                      svcadm disable -t FMRI
Stop service permanently (across reboots)     svcadm disable FMRI
Restart service                               svcadm restart FMRI
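
Because every Grid Engine FMRI follows the same pattern, a helper that builds the FMRI from the service name and the cluster name can reduce typing errors. This is a sketch only: sge_fmri is a hypothetical name, the cluster name mycluster in the usage comments is a placeholder, and the list of service names is taken from the sections above.

```shell
#!/bin/sh
# Sketch: build the FMRI of a Grid Engine SMF service.
sge_fmri() {
    service=$1                        # qmaster, shadowd, execd, bdb or dbwriter
    cluster=${2:-$SGE_CLUSTER_NAME}   # defaults to the environment variable
    case $service in
        qmaster|shadowd|execd|bdb|dbwriter)
            echo "svc:/application/sge/$service:$cluster" ;;
        *)
            echo "unknown service: $service" >&2
            return 1 ;;
    esac
}

# Usage on a Solaris 10 host:
#   svcs -l "$(sge_fmri qmaster mycluster)"
#   svcadm restart "$(sge_fmri execd mycluster)"
```

The helper itself does not call svcadm, so it can be tried on any host before it is used on a Solaris system.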

Installing a JMX-Enabled System

The JMX agent functionality enables access to a subset of sge_qmaster functionality via the JMX protocol. For Sun Grid Engine 6.2, the main purpose of the JMX agent is to provide an interface between the SDM Grid Engine adapter and the Sun Grid Engine system.

Additional Setup Required

The steps required to set up the JMX agent feature of Grid Engine are similar to the standard setup. You generally follow the instructions in Planning the Installation, Loading the Distribution Files on a Workstation, How to Install the Master Host, How to Install Execution Hosts and How to Register Administration Hosts.
However, you have to perform a few additional tasks:

  • Generate the necessary configuration files on the master host by calling the installation script with the -jmx flag. Depending on the JMX-specific installation settings, certificates, keys, and keystore files are optionally generated as well.
  • Optionally distribute the security-relevant files to the execution and submit hosts using a secure method such as ssh.
  • Generate user certificates, keys, and keystores automatically after the master installation.
  • Add new users.
  • Adjust the JMX-specific files.
Topic Description
How to Install a JMX Agent-Enabled System Procedure for installing Sun Grid Engine using the jmx flag when invoking the qmaster installation scripts.
How to Generate Certificates, Private Keys and Keystores for Users Procedure for generating user-specific certificates, private keys, and keystores.
How to Check Certificates, Private Keys and Keystores for Users Procedure for checking certificates, private keys, and keystores.
JMX Configuration Files Describes the JMX configuration files in detail.
Testing and Troubleshooting Testing and troubleshooting a JMX-enabled system.

How to Install a JMX Agent-Enabled System

Install the Grid Engine software as outlined in Installing the Software From the Command Line, with the following exception: use the additional flag -jmx when invoking the qmaster installation scripts.

To install a JMX agent-enabled system, do the following:

  1. Change the master host installation procedure.
    Type the following command and respond to the prompts from the installation script.
    # ./install_qmaster -jmx [-csp]
    


  2. Supply the following information to generate necessary configuration files and optionally the certificates, keys and keystores:
    • JMX agent options
      • JAVA_HOME (mandatory)
      • Additional JVM arguments (optional)
      • JMX MBean server port (mandatory)
      • JMX ssl authentication (default: true)
      • JMX ssl client authentication (default: true)
      • JMX ssl server keystore path
        (/var/sgeCA/{sge_qmaster| port$SGE_QMASTER_PORT}/$SGE_CELL/private/keystore)
      • JMX ssl server keystore password
    • Optional certificate specific options, if there is no CA available
      • Two-letter country code, for example, US for the United States
      • State
      • Location, such as a city
      • Organization
      • Organizational unit
      • CA email address
        As the installation proceeds, several JMX-specific configuration files are installed:
        • jvm_threads is set to 1 instead of 0 in $SGE_ROOT/$SGE_CELL/common/bootstrap if JMX is enabled:
        ...
        jvm_threads             1
        ...
        
    • Several JMX agent specific configuration files are generated as:
      $SGE_ROOT/$SGE_CELL/common/jmx/jaas.config
      $SGE_ROOT/$SGE_CELL/common/jmx/java.policy
      $SGE_ROOT/$SGE_CELL/common/jmx/jmxremote.access
      $SGE_ROOT/$SGE_CELL/common/jmx/jmxremote.password
      $SGE_ROOT/$SGE_CELL/common/jmx/logging.properties
      $SGE_ROOT/$SGE_CELL/common/jmx/management.properties
      

      For a detailed description, see the comments in the files and the description below.

      Optionally the Certificate Authority is created. The directories that contain information relevant to security are as follows:

      • The publicly accessible CA and daemon certificate are stored in $SGE_ROOT/$SGE_CELL/common/sgeCA
      • The publicly accessible user certificates are stored in $SGE_ROOT/$SGE_CELL/common/sgeCA/usercerts
      • The corresponding private keys and keystore are stored in /var/sgeCA/{sge_qmaster| port$SGE_QMASTER_PORT}/$SGE_CELL/private
      • User keys, certificates and keystore are stored in /var/sgeCA/{sge_qmaster| port$SGE_QMASTER_PORT}/$SGE_CELL/userkeys/$USER
  3. The script prompts you for site information.

  4. Confirm whether the information you supplied is correct.

  5. Continue the installation.
    After the security-related setup of the master host is finished, the script prompts you to continue with the rest of the installation procedure, as in the following example:
    SGE startup script
    --------------------
    
    Your system wide SGE startup script is installed as:
    
         "/scratch2/eddy/sge_sec/$SGE_CELL/common/sgemaster"
    
    Hit Return to continue >>
    


  6. Proceed to the next task.
    For more information, see How to Generate Certificates and Private Keys for Users.

How to Generate Certificates, Private Keys and Keystores for Users

To use the CSP-secured system, each user must have access to a user-specific certificate, private key, and keystore. Usually, you first perform the steps outlined in How to Generate Certificates and Private Keys for Users. You can then use the following procedure to generate the corresponding keystore files for the users.

  1. As root on the master host run the following command:
    # $SGE_ROOT/util/sgeCA/sge_ca -userks [-kspwf <kspwf-file>]
    


  2. Confirm that the creation has been successful.
    # ls -lR /var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/userkeys
    /var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/userkeys/:
    total 8
    drwx------   2 eddy     staff        512 Mar 13 11:33 eddy
    drwx------   2 sarah    staff        512 Mar 13 11:33 sarah
    drwx------   2 leo      staff        512 Mar 13 11:33 leo
    
    /var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/userkeys/eddy:
    total 16
    -rw-------   1 eddy staff       1586 Mar 13 11:32 cert.pem
    -rw-------   1 eddy staff        891 Mar 13 11:32 key.pem
    -rw-------   1 eddy staff       3031 Mar 13 11:33 keystore
    -rw-------   1 eddy staff       1024 Mar 13 11:32 rand.seed
    -rw-------   1 eddy staff        818 Mar 13 11:32 req.pem
    ...
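
The visual check of the listing above can also be scripted. The following sketch is an illustration: check_user_keystores is a hypothetical name, the file names cert.pem, key.pem, and keystore are taken from the sample listing, and the function has to be run by a user who may read the per-user directories (typically root).

```shell
#!/bin/sh
# Sketch: verify that each user directory holds a certificate, key and keystore.
check_user_keystores() {
    userkeys_dir=$1
    rc=0
    for user_dir in "$userkeys_dir"/*/; do
        [ -d "$user_dir" ] || continue
        for file in cert.pem key.pem keystore; do
            # -s is true only for files that exist and are not empty.
            if [ ! -s "$user_dir$file" ]; then
                echo "missing or empty: $user_dir$file"
                rc=1
            fi
        done
    done
    return "$rc"
}

# Example call:
# check_user_keystores "/var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/userkeys"
```

The check only confirms that the files exist and are not empty. It does not validate the certificates themselves.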
    


JMX Configuration Files

The following configuration files are installed into $SGE_ROOT/$SGE_CELL/common/jmx and are explained in detail here. Manual modification is usually not necessary and the preinstalled configurations should be sufficient.

jaas.config

Before using the JMX interface, you must authenticate against sge_qmaster. This process adds the correct principal, which gives you the role that is needed to access the JMX interfaces in read-only or read-write mode. The responsible section in the jaas.config file is named GridwareConfig or TestConfig (for testing only).
In general, the jaas.config file defines which login modules are used for which application case. The choice of the login module is defined either in a configuration file like management.properties or programmatically.
The jaas.config file contains different sections and allows the replacement of the authentication mechanism, for example authentication via Unix PAM or via LDAP (see the GridwareConfig section and TestConfig section below). The different modules can be stacked. For a general overview of JAAS, see http://java.sun.com/developer/technicalArticles/Security/jaasv2/. Here the procedure consists of two steps in the GridwareConfig section:

  • Authenticate the user (for example with keystore or Unix login or with LDAP).
  • In the JGDILoginModule, add the JMXPrincipal that gives the defined role to the user. This role is used later in the jmxremote.access file to check if the user has read-only or read-write access.
/*
 * Default login configuration for qmaster's jmx server
 */
GridwareConfig {

    /**
     *  Accepts all clients which have a certificate which is signed with
     *  the CA certificate.
     */
    com.sun.grid.security.login.GECATrustManagerLoginModule requisite
         caTop="${com.sun.grid.jgdi.caTop}";

    /*
     *  Accepts all clients which has a valid username/password.
     *
     *  The username/password validation is done with the authuser binary (included
     *  in the Grid Engine distribution, $SGE_ROOT/utilbin/$ARCH/authuser).
     *
     *  ATTENTION: The authuser binary needs the suid bit. It does not work if grid
     *  engine is installed on a nosuid file system.
     *
     */
    com.sun.grid.security.login.UnixLoginModule requisite
        sge_root="${com.sun.grid.jgdi.sgeRoot}"
        auth_method="system";

    /*
     * Username password authentication against LDAP.
     *
     * Alternative username/password authentication if
     * com.sun.grid.security.login.UnixLoginModule is not working.
     *
     * The LDAP specific parameters have to be adjusted to the local requirements
     * For details please have a look at the LdapLoginModule javadocs.
     *
     * ATTENTION: The LdapLoginModule is only available in java 6. The
     * parameter libjvm_path must point to a java 6 jvm
     * (qconf -sconf | grep libjvm_path)
     */
    /*
    com.sun.security.auth.module.LdapLoginModule requisite
        userProvider="ldap://sun-ds/ou=people,dc=sun,dc=com"
        userFilter="(&(uid={USERNAME})(objectClass=inetOrgPerson))"
        useSSL=false;
    */

    /*
     *  The JGDILoginModule adds a JGDIPrincipal to the subject. The username of
     *  the JGDIPrincipal is the name of the first trusted principal. This name
     *  treated as username for gdi communication.
     *  For each login a new jgdi session id is created.
     *
     *  In the jmxremote.access file users who can access the system are defined
     *  Any principal matching these entries is given the corresponding role.
     *  Usually a jmxPrincipal is defined to give a user access to the system.
     *  (e.g. com.sun.grid.security.login.UserPrincipal = xyz &
     *        jmxPrincipal="controlRole" gives user xyz access under controlRole
     *  )
     */
    com.sun.grid.jgdi.security.JGDILoginModule optional
        trustedPrincipal="com.sun.grid.security.login.UserPrincipal"
        trustedPrincipal1="com.sun.security.auth.UserPrincipal"
        jmxPrincipal="controlRole";
};

/*
 *  TestConfig accepts any user. Only for testing
 */
TestConfig {

    com.sun.grid.security.login.TestLoginModule requisite;

    com.sun.grid.jgdi.security.JGDILoginModule optional
        trustedPrincipal="com.sun.grid.security.login.UserPrincipal"
        jmxPrincipal="controlRole";
};

/*
 *  Mandatory if native jgdi is used with a csp system
 *  (e.g. jgdish in csp mode)
 */
jgdi {
   com.sun.security.auth.module.KeyStoreLoginModule required
                                                    keyStoreURL="file:./keystore"
                                                    debug=false;
};

java.policy

The java.policy file that is used by the JGDIAgent restricts what code running in sge_qmaster's JVM is permitted to do.

Changes are usually necessary only to restrict access to a subset of the overall functionality. To tailor the policy settings to your needs, it is useful to run the JMX server with security debugging enabled and to consult the generated log files (qconf -mconf, additional_jvm_args = -Djavax.net.debug=ssl -Djava.security.debug=access,failure).

/*
**
** with LdapLoginModule
** grant principal com.sun.security.auth.UserPrincipal "controlRole"
**
** with jmxremote.password
** grant principal javax.management.remote.JMXPrincipal "controlRole"
**
*/
grant codeBase "file:${com.sun.grid.jgdi.sgeRoot}/lib/jgdi.jar"  {
   permission java.net.SocketPermission   "*:1024-", "accept,connect";
   permission java.net.SocketPermission   "localhost:1024-", "listen,resolve";
   permission java.lang.RuntimePermission "loadLibrary.jgdi";
   permission java.lang.RuntimePermission "shutdownHooks";
   permission java.lang.RuntimePermission "setContextClassLoader";
   permission javax.security.auth.AuthPermission "createLoginContext.jgdi";
   permission javax.security.auth.AuthPermission "doAs";
   permission javax.security.auth.AuthPermission "getSubject";
   permission java.util.PropertyPermission "*", "read";
   permission java.util.logging.LoggingPermission "control";

   permission java.io.FilePermission "${com.sun.grid.jgdi.sgeRoot}/${com.sun.grid.jgdi.sgeCell}/common/jmx/-", "read";
   permission java.io.FilePermission "${com.sun.grid.jgdi.sgeRoot}/util/-", "execute";
   permission java.io.FilePermission "${com.sun.grid.jgdi.sgeRoot}/utilbin/-", "execute";
   permission javax.management.MBeanServerPermission "createMBeanServer";
   permission javax.management.MBeanPermission "*", "*";
   permission javax.management.MBeanTrustPermission "register";
   permission java.lang.management.ManagementPermission "monitor";
   permission java.lang.management.ManagementPermission "control";

   permission java.lang.RuntimePermission "setIO";
   permission java.io.FilePermission      "jgdi.stdout", "write";
   permission java.io.FilePermission      "jgdi.stderr", "write";
   permission java.io.FilePermission      "jgdi0.log.lck", "delete";
   permission java.io.FilePermission      "${com.sun.grid.jgdi.sgeRoot}/${com.sun.grid.jgdi.sgeCell}/common/jmx/*", "read";
   permission java.io.FilePermission      "${com.sun.grid.jgdi.sgeRoot}/lib/-", "read";
   permission java.lang.RuntimePermission "accessClassInPackage.sun.management.jmxremote";
   permission java.lang.RuntimePermission "accessClassInPackage.sun.management.resources";
   permission java.lang.RuntimePermission "accessClassInPackage.sun.management";
   permission java.lang.RuntimePermission "accessClassInPackage.sun.rmi.server";
   permission java.lang.RuntimePermission "accessClassInPackage.sun.management.snmp.util";
   permission java.lang.RuntimePermission "accessClassInPackage.sun.rmi.registry";

   permission java.util.PropertyPermission "java.rmi.server.randomIDs", "write";

   permission javax.security.auth.AuthPermission "modifyPrincipals";
   permission javax.security.auth.AuthPermission "createLoginContext.*";
   permission javax.security.auth.AuthPermission "createLoginContext.JMXPluggableAuthenticator";
   permission java.security.SecurityPermission "createAccessControlContext";

   permission javax.management.remote.SubjectDelegationPermission "javax.management.remote.JMXPrincipal.controlRole";
};

grant principal javax.management.remote.JMXPrincipal "controlRole" {
   permission javax.management.MBeanPermission "com.sun.grid.jgdi.management.mbeans.JGDIJMX#*", "*";
   permission javax.management.MBeanPermission "sun.management.*#*", "*";
   permission javax.security.auth.AuthPermission "createLoginContext.jgdi";
   permission javax.security.auth.AuthPermission "doAs";
   permission javax.security.auth.AuthPermission "getSubject";
   permission java.util.PropertyPermission "*", "read";
   permission java.util.PropertyPermission "user.timezone", "read,write";
   permission java.util.logging.LoggingPermission "control";
   permission java.io.FilePermission      "${com.sun.grid.jgdi.sgeRoot}/lib/-", "read";
   permission java.lang.management.ManagementPermission "monitor";
   permission java.net.SocketPermission "*", "resolve";

   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#-[java.lang:type=OperatingSystem]", "isInstanceOf";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#-[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#ProcessCpuTime[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#Name[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#Version[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#Arch[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#AvailableProcessors[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#CommittedVirtualMemorySize[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#TotalPhysicalMemorySize[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#FreePhysicalMemorySize[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#TotalSwapSpaceSize[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#FreeSwapSpaceSize[java.lang:type=OperatingSystem]", "getAttribute";
   permission javax.management.MBeanPermission "javax.management.MBeanServerDelegate#-[JMImplementation:type=MBeanServerDelegate]", "addNotificationListener";
   permission javax.management.MBeanPermission "javax.management.MBeanServerDelegate#-[JMImplementation:type=MBeanServerDelegate]", "isInstanceOf";
   permission javax.management.MBeanPermission "javax.management.MBeanServerDelegate#-[JMImplementation:type=MBeanServerDelegate]", "getMBeanInfo";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#-[java.lang:type=OperatingSystem]", "queryNames";
   permission javax.management.MBeanPermission "java.util.logging.Logging#-[java.util.logging:type=Logging]", "queryNames";
   permission javax.management.MBeanPermission "javax.management.MBeanServerDelegate#-[JMImplementation:type=MBeanServerDelegate]", "queryNames";
   permission javax.management.MBeanPermission "java.util.logging.Logging#-[java.util.logging:type=Logging]", "isInstanceOf";
   permission javax.management.MBeanPermission "java.util.logging.Logging#-[java.util.logging:type=Logging]", "getMBeanInfo";
   permission javax.management.MBeanPermission "com.sun.management.UnixOperatingSystem#-[java.lang:type=OperatingSystem]", "getMBeanInfo";

};

grant {
   permission java.util.logging.LoggingPermission "control";
   permission java.util.PropertyPermission "*", "read";
   permission java.util.PropertyPermission "user.timezone", "write";
   permission java.lang.RuntimePermission "setIO";
   permission java.lang.RuntimePermission "loadLibrary.jgdi";
   permission java.io.FilePermission      "jgdi.stdout", "write";
   permission java.io.FilePermission      "jgdi.stderr", "write";
   permission java.io.FilePermission      "${com.sun.grid.jgdi.sgeRoot}/lib/-", "read";
   permission java.io.FilePermission      "${com.sun.grid.jgdi.sgeRoot}/util/arch", "execute";
   permission java.io.FilePermission      "${com.sun.grid.jgdi.sgeRoot}/utilbin/-", "execute";
   permission javax.security.auth.AuthPermission "modifyPrincipals";
   permission java.io.FilePermission "${com.sun.grid.jgdi.caTop}", "read";
   permission java.io.FilePermission "${com.sun.grid.jgdi.caTop}/cacert.pem", "read";
   permission java.io.FilePermission "${com.sun.grid.jgdi.caTop}/ca-crl.pem", "read";
   permission java.io.FilePermission "${com.sun.grid.jgdi.caTop}/usercerts/-", "read";
   permission java.io.FilePermission "${com.sun.grid.jgdi.serverKeystore}", "read";
};

/*
grant {
   permission java.security.AllPermission;
};
*/

management.properties

This file describes the general JMX server configuration. The default template looks similar to the example below. During installation, the template is usually adapted automatically by replacing the @@SGE_*@@ variables with concrete values.
The @@SGE_*@@ variables have the following meanings:

  • @@SGE_JMX_PORT@@ is the configured JMX port.
  • @@SGE_JMX_SSL@@ is true or false, depending on whether SSL is enabled for JMX.
  • @@SGE_JMX_SSL_CLIENT@@ is true or false, depending on whether client authentication is required.
  • @@SGE_JMX_SSL_KEYSTORE@@ is the keystore that is used when SSL is enabled.
  • @@SGE_JMX_SSL_KEYSTORE_PW@@ is the corresponding keystore password.
  • @@SGE_ROOT@@ is the $SGE_ROOT root directory.
  • @@SGE_CELL@@ is the $SGE_CELL name, usually default.

#####################################################################
#  Default Configuration File for JGDI JMX
#####################################################################
#
# The Management Configuration file (in java.util.Properties format)
# will be read if one of the following system properties is set:
#    -Dcom.sun.grid.jgdi.management.jmxremote.port=<port-number>
# or -Dcom.sun.grid.jgdi.management.config.file=<this-file>
#
# The default Management Configuration file is:
#
#       $SGE_ROOT/{$SGE_CELL|default}/common/jmx/management.properties
#
# ################ Management Agent Port #########################
#
# For setting the JMX RMI agent port use the following line
# com.sun.grid.jgdi.management.jmxremote.port=<port-number>
com.sun.grid.jgdi.management.jmxremote.port=@@SGE_JMX_PORT@@

#####################################################################
#        RMI Management Properties
#####################################################################
#
# If system property -Dcom.sun.grid.jgdi.management.jmxremote.port=<port-number>
# is set then
#     - A MBean server is started
#     - JRE Platform MBeans are registered in the MBean server
#     - RMI connector is published  in a private readonly registry at
#       specified port using a well known name, "jmxrmi"
#     - the following properties are read for JMX remote management.
#
# The configuration can be specified only at startup time.
# Later changes to above system property (e.g. via setProperty method),
# this config file, the password file, or the access file have no effect to the
# running MBean server, the connector, or the registry.
#

#
# ###################### RMI SSL #############################
#
# com.sun.grid.jgdi.management.jmxremote.ssl=true|false
#      Default for this property is true. (Case for true/false ignored)
#      If this property is specified as false then SSL is not used.
#

#For RMI monitoring without SSL use the following line
# com.sun.grid.jgdi.management.jmxremote.ssl=false
com.sun.grid.jgdi.management.jmxremote.ssl=@@SGE_JMX_SSL@@

# com.sun.grid.jgdi.management.jmxremote.ssl.enabled.cipher.suites=<cipher-suites>
#      The value of this property is a string that is a comma-separated list
#      of SSL/TLS cipher suites to enable. This property can be specified in
#      conjunction with the previous property "com.sun.grid.jgdi.management.jmxremote.ssl"
#      in order to control which particular SSL/TLS cipher suites are enabled
#      for use by accepted connections. If this property is not specified then
#      the SSL RMI Server Socket Factory uses the SSL/TLS cipher suites that
#      are enabled by default.
#

# com.sun.grid.jgdi.management.jmxremote.ssl.enabled.protocols=<protocol-versions>
#      The value of this property is a string that is a comma-separated list
#      of SSL/TLS protocol versions to enable. This property can be specified in
#      conjunction with the previous property "com.sun.grid.jgdi.management.jmxremote.ssl"
#      in order to control which particular SSL/TLS protocol versions are
#      enabled for use by accepted connections. If this property is not
#      specified then the SSL RMI Server Socket Factory uses the SSL/TLS
#      protocol versions that are enabled by default.
#

# com.sun.grid.jgdi.management.jmxremote.ssl.need.client.auth=true|false
#      Default for this property is false. (Case for true/false ignored)
#      If this property is specified as true in conjunction with the previous
#      property "com.sun.grid.jgdi.management.jmxremote.ssl" then the SSL RMI Server
#      Socket Factory will require client authentication.
#

#For RMI monitoring with SSL client authentication use the following line
#com.sun.grid.jgdi.management.jmxremote.ssl.need.client.auth=true
com.sun.grid.jgdi.management.jmxremote.ssl.need.client.auth=@@SGE_JMX_SSL_CLIENT@@


#
# ################ RMI User authentication ################
#
# com.sun.grid.jgdi.management.jmxremote.authenticate=true|false
#      Default for this property is true. (Case for true/false ignored)
#      If this property is specified as false then no authentication is
#      performed and all users are allowed all access.
#

# For RMI monitoring without any checking use the following line
# com.sun.grid.jgdi.management.jmxremote.authenticate=false
com.sun.grid.jgdi.management.jmxremote.authenticate=true

#
# ################ RMI Login configuration ###################
#
# com.sun.grid.jgdi.management.jmxremote.login.config=<config-name>
#      Specifies the name of a JAAS login configuration entry to use when
#      authenticating users of RMI monitoring.
#
#      Setting this property is optional - the default login configuration
#      specifies a file-based authentication that uses the password file.
#
#      When using this property to override the default login configuration
#      then the named configuration entry must be in a file that gets loaded
#      by JAAS. In addition, the login module(s) specified in the configuration
#      should use the name and/or password callbacks to acquire the user's
#      credentials. See the NameCallback and PasswordCallback classes in the
#      javax.security.auth.callback package for more details.
#
#      If the property "com.sun.grid.jgdi.management.jmxremote.authenticate" is set to
#      false, then this property and the password & access files are ignored.
#

# For a non-default login configuration use the following line
# com.sun.grid.jgdi.management.jmxremote.login.config=<config-name>
com.sun.grid.jgdi.management.jmxremote.login.config=GridwareConfig

#
# ################ RMI Password file location ##################
#
# com.sun.grid.jgdi.management.jmxremote.password.file=filepath
#      Specifies location for password file
#      This is optional - default location is
#      $JRE/lib/management/jmxremote.password
#
#      If the property "com.sun.grid.jgdi.management.jmxremote.authenticate" is set to
#      false, then this property and the password & access files are ignored.

# For a non-default password file location use the following line
# com.sun.grid.jgdi.management.jmxremote.password.file=filepath
com.sun.grid.jgdi.management.jmxremote.password.file=@@SGE_ROOT@@/@@SGE_CELL@@/common/jmx/jmxremote.password

#
# ################ RMI Access file location #####################
#
# com.sun.grid.jgdi.management.jmxremote.access.file=filepath
#      Specifies location for access  file
#      This is optional - default location is
#      $JRE/lib/management/jmxremote.access
#
#      If the property "com.sun.grid.jgdi.management.jmxremote.authenticate" is set to
#      false, then this property and the password & access files are ignored.
#      Otherwise, the access file must exist and be in the valid format.
#      If the access file is empty or non-existent then no access is allowed.
#

# For a non-default access file location use the following line
# com.sun.grid.jgdi.management.jmxremote.access.file=filepath
com.sun.grid.jgdi.management.jmxremote.access.file=@@SGE_ROOT@@/@@SGE_CELL@@/common/jmx/jmxremote.access


# For the JGDI keystore module use this settings for the server keystore and keystore password
com.sun.grid.jgdi.management.jmxremote.ssl.serverKeystore=@@SGE_JMX_SSL_KEYSTORE@@
com.sun.grid.jgdi.management.jmxremote.ssl.serverKeystorePassword=@@SGE_JMX_SSL_KEYSTORE_PW@@
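
The installation normally performs the @@SGE_*@@ substitution for you. As a rough sketch of what that substitution amounts to, the following shell function fills a template from its arguments by using sed. The function name fill_jmx_template and the sample values (port 6446, /opt/sge) are hypothetical, the keystore variables are left out for brevity, and this is not the installer's actual code.

```shell
# Hypothetical helper: replace @@SGE_*@@ placeholders in a template read
# from stdin with the values passed as arguments. Not the installer's code.
# Values must not contain the "|" character, which is used as sed delimiter.
# Usage: fill_jmx_template <port> <ssl> <client-auth> <sge_root> <sge_cell>
fill_jmx_template() {
    sed -e "s|@@SGE_JMX_PORT@@|$1|g" \
        -e "s|@@SGE_JMX_SSL@@|$2|g" \
        -e "s|@@SGE_JMX_SSL_CLIENT@@|$3|g" \
        -e "s|@@SGE_ROOT@@|$4|g" \
        -e "s|@@SGE_CELL@@|$5|g"
}

# Example with made-up values; prints the three lines with values filled in.
printf '%s\n' \
    'com.sun.grid.jgdi.management.jmxremote.port=@@SGE_JMX_PORT@@' \
    'com.sun.grid.jgdi.management.jmxremote.ssl=@@SGE_JMX_SSL@@' \
    'com.sun.grid.jgdi.management.jmxremote.access.file=@@SGE_ROOT@@/@@SGE_CELL@@/common/jmx/jmxremote.access' |
    fill_jmx_template 6446 true false /opt/sge default
```

Because each pattern includes the closing @@, the @@SGE_JMX_SSL@@ rule does not touch @@SGE_JMX_SSL_CLIENT@@.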

jmx.access

The JMX access file, referenced in management.properties as jmxremote.access, defines the access level that is granted to each role.

######################################################################
#     Default Access Control File for Remote JMX(TM) Monitoring
######################################################################
#
# Access control file for Remote JMX API access to monitoring.
# This file defines the allowed access for different roles.  The
# password file (jmxremote.password by default) defines the roles and their
# passwords.  To be functional, a role must have an entry in
# both the password and the access files.
#
# Default location of this file is $JRE/lib/management/jmxremote.access
# You can specify an alternate location by specifying a property in
# the management config file $JRE/lib/management/management.properties
# (See that file for details)
#
# The file format for password and access files is syntactically the same
# as the Properties file format.  The syntax is described in the Javadoc
# for java.util.Properties.load.
# Typical access file has multiple  lines, where each line is blank,
# a comment (like this one), or an access control entry.
#
# An access control entry consists of a role name, and an
# associated access level.  The role name is any string that does not
# itself contain spaces or tabs.  It corresponds to an entry in the
# password file (jmxremote.password).  The access level is one of the
# following:
#       "readonly" grants access to read attributes of MBeans.
#                   For monitoring, this means that a remote client in this
#                   role can read measurements but cannot perform any action
#                   that changes the environment of the running program.
#       "readwrite" grants access to read and write attributes of MBeans,
#                   to invoke operations on them, and to create or remove them.
#                   This access should be granted to only trusted clients,
#                   since they can potentially interfere with the smooth
#                   operation of a running program
#
# A given role should have at most one entry in this file.  If a role
# has no entry, it has no access.
# If multiple entries are found for the same role name, then the last
# access entry is used.
#
#
# Default access control entries:
# o The "monitorRole" role has readonly access.
# o The "controlRole" role has readwrite access.

monitorRole   readonly
controlRole   readwrite
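
The lookup rules stated in the comments above (a role without an entry has no access, and the last entry is used when a role is listed more than once) can be illustrated with a small awk filter. The function name access_level is hypothetical, and the parsing is deliberately simplified: it understands only whitespace-separated entries, not the full java.util.Properties syntax.

```shell
# Hypothetical helper: print the access level of a role from an access file
# read on stdin. Prints "none" if the role has no entry.
# Simplification: only "role<whitespace>level" lines are understood.
access_level() {
    awk -v role="$1" '
        /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
        $1 == role     { level = $2 }    # later entries override earlier ones
        END            { print (level == "" ? "none" : level) }
    '
}

# Example on the two default entries; prints the level of controlRole.
printf '%s\n' '# sample entries' 'monitorRole   readonly' 'controlRole   readwrite' |
    access_level controlRole
```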

jmx.password

The password file provides a simple authentication mechanism, but its use is not recommended. The JAAS login module is usually preferred because it is much more flexible. In the password file, you can specify a password for each role. If a simple login mechanism is required, change management.properties to use the TestConfig login configuration instead of GridwareConfig. TestConfig allows any valid UNIX user to connect to the JGDI JMX server without a password.

logging.properties

To enable JGDI and JMX logging, adjust the delivered logging.properties file and restart sge_qmaster, or at least the JMX server. By default, the log files jgdi0.log, jgdi.stderr, and jgdi.stdout are written to the master spooling directory. You can also influence logging through the additional_jvm_args configuration parameter, for example to enable additional debugging messages.

#
#  Java Logging Configuration for JMX MBean server
#

# Specify the handlers to create in the root logger
# (all loggers are children of the root logger)
# The following creates two handlers

# Per default we log to the console
#handlers = java.util.logging.ConsoleHandler

# Use FileHandler
handlers = java.util.logging.FileHandler

# ------------------------------------------------------------------------------
#  Definition of log levels
# ------------------------------------------------------------------------------
# Set the default logging level for the root logger
.level = INFO
#com.sun.grid.jgdi.JGDI.level = FINE
#com.sun.grid.jgdi.rmi.level = FINE
#com.sun.grid.jgdi.configuration.xml.XMLUtil.level = FINE
#com.sun.grid.jgdi.configuration.ClusterQueueTestCase.level = FINE
#com.sun.grid.jgdi.management.level = FINER
#com.sun.grid.jgdi.event.level = FINER
# For authuser login module debugging
#com.sun.grid.security.login.level = FINER
#com.sun.grid.util.expect.level = FINER

# ------------------------------------------------------------------------------
#  Settings for ConsoleHandler
# ------------------------------------------------------------------------------
# Set the default logging level for new ConsoleHandler instances
java.util.logging.ConsoleHandler.level = INFO

# Set the default formatter for new ConsoleHandler instances
java.util.logging.ConsoleHandler.formatter = com.sun.grid.jgdi.util.SGEFormatter

# ------------------------------------------------------------------------------
#  Settings for FileHandler
# ------------------------------------------------------------------------------
# Set the default logging level for new FileHandler instances
java.util.logging.FileHandler.level = ALL
# qmaster runs in qmaster spool dir, so the file is created there
java.util.logging.FileHandler.pattern=jgdi%u.log
java.util.logging.FileHandler.formatter=com.sun.grid.jgdi.util.SGEFormatter

#
# Possible columns:
#
#   time    timestamp of the log message
#   host    hostname of the log message
#   name    name of the logger
#   thread  id of the thread
#   level   log level (short form)
#   source  class and method name
#   level_long log_level long form
#
com.sun.grid.jgdi.util.SGEFormatter.columns = time thread source level message

#
#  Print the stacktrace of the log record
#
com.sun.grid.jgdi.util.SGEFormatter.withStacktrace=true

#
#  Delimiter between columns
#
com.sun.grid.jgdi.util.SGEFormatter.delimiter = |
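
Adjusting the file usually means removing the comment character from one of the logger level lines. A minimal sketch of that edit as a sed filter follows; the function name enable_logger is hypothetical, and the filter assumes that the line to enable has the form #<logger-name>.level = <LEVEL>, as in the listing above.

```shell
# Hypothetical helper: uncomment the level line of one logger. Reads the
# properties file on stdin and writes the modified text to stdout.
# Usage: enable_logger <logger-name>
enable_logger() {
    sed "s|^#\($1\.level[ ]*=.*\)|\1|"
}

# Example: enable the management logger in a two-line sample.
printf '%s\n' \
    '#com.sun.grid.jgdi.management.level = FINER' \
    '#com.sun.grid.jgdi.event.level = FINER' |
    enable_logger com.sun.grid.jgdi.management
```

The filter leaves all other lines untouched. As noted above, the change takes effect only after sge_qmaster or the JMX server has been restarted.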

Testing and Troubleshooting

You can use jconsole to test the connection to the JMX server. The administrator is responsible for allowing or disallowing access to the system through JMX. To also require client authentication from jconsole, configure the management.properties file with the following settings:

  • com.sun.grid.jgdi.management.jmxremote.ssl=true
  • com.sun.grid.jgdi.management.jmxremote.ssl.need.client.auth=true

Then start jconsole as follows:

% jconsole -J-Djava.security.manager=java.rmi.RMISecurityManager \
 -J-Djava.security.policy=$SGE_ROOT/util/rmiconsole.policy \
 -J-Djavax.net.ssl.trustStore=<server truststore> \
 [-J-Djavax.net.ssl.keyStore=/<safe>/mykeystore \
  -J-Djavax.net.ssl.keyStorePassword=<mykeystore_pw> \
  -J-Djavax.net.ssl.keyPassword=<mykeystore_pw> ] \
 [-J-Djavax.net.debug=ssl]

where <server truststore> is usually one of the following:

  • /var/sgeCA/port5322/$SGE_CELL/private/keystore
    (only the server certificate is accessible without a password)
  • A special truststore that the administrator makes available, created for example as follows:

keytool -export -alias "root" \
        -keystore /var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/private/keystore \
        -rfc -file /tmp/jmxserver.cer

keytool -import -file /tmp/jmxserver.cer -keystore /tmp/truststore
Enter keystore password:  <pwd>
...
Trust this certificate? [no]:  yes
Certificate was added to keystore

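The arguments in square brackets in the jconsole command are optional. The keystore arguments are required if client authentication is set to true, and -Djavax.net.debug=ssl is useful for debugging.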

You can use the following simple example to connect through JMX and monitor events:

% java [-Dcom.sun.grid.jgdi.keyStore=/var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/private/keystore \
-Dcom.sun.grid.jgdi.caTop="$SGE_ROOT/$SGE_CELL/common/sgeCA" \
-Djava.util.logging.config.file=util/shell_logging.properties ] \
-cp $SGE_ROOT/lib/juti.jar:$SGE_ROOT/lib/jgdi.jar \
com.sun.grid.jgdi.examples.jmxeventmonitor.Main

The optional arguments can be omitted. They serve only to preset the login dialog with useful values. After a connection has been established once, a preferences file is written and reused for later connections.
To set the environment variables correctly, source the SGE settings.sh or settings.csh file first. In the example above, the command must be run by the admin user to get access to the keystore.

For troubleshooting, the following settings and files might provide additional insight:

  • The messages file in the master spool directory, if the JMX server cannot be started
  • $SGE_ROOT/$SGE_CELL/common/bootstrap, to check whether jvm_threads is enabled at all
  • The jgdi* log files in the master spool directory, which are the main source for failure analysis
  • $SGE_ROOT/$SGE_CELL/common/jmx/logging.properties, to enable more detailed logging
  • qconf -mconf with an additional_jvm_args parameter
    For example, add the two arguments -Djava.security.debug=all -Djavax.net.debug=ssl to trace permission and authentication problems.
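
The bootstrap check in the list above can be scripted. The following sketch assumes that the bootstrap file consists of "name value" lines and that a jvm_threads value greater than 0 means the JVM thread is enabled; both points are assumptions to verify against your own installation. The function name jvm_threads_enabled is hypothetical.

```shell
# Hypothetical helper: report whether jvm_threads is enabled in a bootstrap
# file. Assumes "name value" lines and that a value above 0 means enabled.
# Usage: jvm_threads_enabled <bootstrap-file>
jvm_threads_enabled() {
    awk '$1 == "jvm_threads" { n = $2 }
         END { if (n + 0 > 0) print "enabled"; else print "disabled" }' "$1"
}

# Demonstration on a temporary sample file rather than a live cluster file.
sample=$(mktemp)
printf '%s\n' 'admin_user sgeadmin' 'jvm_threads 1' > "$sample"
jvm_threads_enabled "$sample"
rm -f "$sample"
```

On a cluster with the settings file sourced, the call would be jvm_threads_enabled "$SGE_ROOT/$SGE_CELL/common/bootstrap".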

Removing the Software

This section consists of the following topics.

Topic | Description
How to Remove the Software Interactively | Procedure for removing the Sun Grid Engine software interactively.
How to Remove the Software Using the inst_sge Utility and a Configuration Template | Procedure for removing the Sun Grid Engine software using the inst_sge utility and a configuration template.

How to Remove the Software Interactively

To remove the software interactively, follow the steps below.

Note
Remove the software from the execution hosts before removing it from the master host. If you remove the software from the master host first, you cannot automate the removal of the software from the execution hosts.
  1. Ensure that your environment variables are set up properly.
    Note
    If no cell name was specified during installation, the value of $SGE_CELL is default.
    • If you are using a C shell, type the following command:
      # source $SGE_ROOT/$SGE_CELL/common/settings.csh
      
    • If you are using a Bourne or Korn shell, type the following command:
      # . $SGE_ROOT/$SGE_CELL/common/settings.sh
      


  2. On the master host, issue the $SGE_ROOT/inst_sge -ux command.
    This example uninstalls the execution hosts: host1, host2 and host3.
    # $SGE_ROOT/inst_sge -ux -host "host1 host2 host3"
    
    Note
    You are not prompted for any information during this process. However, the output from this process will be displayed to the terminal window where you run the command.


  3. (Optional) If you have any shadow master hosts, uninstall them:
    # $SGE_ROOT/inst_sge -usm -host "host4"
    


  4. Uninstall the master host.
    # $SGE_ROOT/inst_sge -um
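
    The order of the steps above matters: execution hosts first, then shadow master hosts, then the master host. As a sketch, the following function prints the commands in that order for a given set of hosts. The function name plan_removal is hypothetical, and the function only prints the commands; it does not run them.

```shell
# Hypothetical helper: print the uninstall commands in the required order
# (execution hosts, then shadow master hosts, then the master host).
# The commands are printed only, never executed.
# Usage: plan_removal "<exec hosts>" "<shadow master hosts>"
plan_removal() {
    exec_hosts=$1
    shadow_hosts=$2
    if [ -n "$exec_hosts" ]; then
        printf '%s\n' "\$SGE_ROOT/inst_sge -ux -host \"$exec_hosts\""
    fi
    if [ -n "$shadow_hosts" ]; then
        printf '%s\n' "\$SGE_ROOT/inst_sge -usm -host \"$shadow_hosts\""
    fi
    printf '%s\n' "\$SGE_ROOT/inst_sge -um"
}

# Example with the host names used in this procedure.
plan_removal "host1 host2 host3" "host4"
```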
    

How to Remove the Software Using the inst_sge Utility and a Configuration Template

Unlike the interactive uninstallation method, the automated uninstallation method suppresses output during the process. Also, the automated method requires a properly formatted configuration file.

To remove the software using the inst_sge utility and a configuration template, follow these steps:

Note
Remove the software from the execution hosts before removing it from the master host. If you remove the software from the master host first, you cannot automate the removal of the software from the execution hosts.
  1. Ensure that your environment variables are set up properly.
    Note
    If no cell name was specified during installation, the value of $SGE_CELL is default.
    • If you are using a C shell, type the following command:
      # source $SGE_ROOT/$SGE_CELL/common/settings.csh
      
    • If you are using a Bourne or Korn shell, type the following command:
      # . $SGE_ROOT/$SGE_CELL/common/settings.sh
      


  2. Create a copy of the configuration template, $SGE_ROOT/util/inst_sge_modules/inst_sge_template.conf.
    # cd $SGE_ROOT/util/inst_sge_modules/
    # cp inst_sge_template.conf  my_configuration.conf
    


  3. Edit your configuration template.
    Every host that is in the EXEC_HOST_LIST_RM list will be removed.
    # Remove these execution hosts in automatic mode
    EXEC_HOST_LIST_RM="host1 host2 host3 host4"
    


  4. On the master host, type the $SGE_ROOT/inst_sge -ux -auto command.
    This example uninstalls every execution host that is listed in EXEC_HOST_LIST_RM in your configuration file.
    Type the following command as one string, with a space between the -auto and the $SGE_ROOT/util/inst_sge_modules/my_configuration.conf components.
    # $SGE_ROOT/inst_sge -ux -auto $SGE_ROOT/util/inst_sge_modules/my_configuration.conf
    
    Note
    You are not prompted for any information during this process. However, the output from this process will be displayed to the terminal window where you run the command.


  5. (Optional) If you have any shadow master hosts, uninstall them.
    Type the following command as one string, with a space between the -auto and the $SGE_ROOT/util/inst_sge_modules/my_configuration.conf components.
    # $SGE_ROOT/inst_sge -usm -auto $SGE_ROOT/util/inst_sge_modules/my_configuration.conf
    


  6. Uninstall the master host.
    Type the following command as one string, with a space between the -auto and the $SGE_ROOT/util/inst_sge_modules/my_configuration.conf components.
    # $SGE_ROOT/inst_sge -um -auto $SGE_ROOT/util/inst_sge_modules/my_configuration.conf
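
    Steps 2 and 3 (copying the template and setting EXEC_HOST_LIST_RM) can be combined in a small helper. This is only a sketch: the function name make_removal_conf is hypothetical, and it assumes that the template already contains a line that starts with EXEC_HOST_LIST_RM=.

```shell
# Hypothetical helper: copy a configuration template and set
# EXEC_HOST_LIST_RM in the copy. Assumes the template already contains a
# line that starts with EXEC_HOST_LIST_RM=.
# Usage: make_removal_conf <template> <output> "<hosts>"
make_removal_conf() {
    sed "s|^EXEC_HOST_LIST_RM=.*|EXEC_HOST_LIST_RM=\"$3\"|" "$1" > "$2"
}

# Demonstration on a temporary stand-in for inst_sge_template.conf.
tmpl=$(mktemp)
conf=$(mktemp)
printf '%s\n' 'SGE_CELL="default"' 'EXEC_HOST_LIST_RM=""' > "$tmpl"
make_removal_conf "$tmpl" "$conf" "host1 host2 host3"
grep '^EXEC_HOST_LIST_RM=' "$conf"
rm -f "$tmpl" "$conf"
```

    Review the resulting file before you pass it to inst_sge with the -auto option, because every host in the list will be removed.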
    

Additional Software for the Microsoft Operating System

Microsoft Windows Services for UNIX (SFU) and Microsoft Subsystem for UNIX-based Applications (SUA) make it possible to integrate some Windows operating systems into existing UNIX environments. SFU and SUA provide components that simplify network administration and user management across the UNIX and Windows platforms.

Additional Software

The following sections describe the Microsoft Windows Services for UNIX (SFU) and Microsoft Subsystem for UNIX-based Applications (SUA) in detail.

Topic | Description
Microsoft Services for UNIX | Learn how Microsoft Windows Services for UNIX (SFU) makes it possible to integrate some Windows operating systems into existing UNIX environments.
Microsoft Subsystem for UNIX-based Applications | Learn how Microsoft Subsystem for UNIX-based Applications (SUA) makes it possible to integrate some Windows operating systems into existing UNIX environments.
Changing Default Behavior to Case Sensitivity | Choose between default behavior and case sensitivity for object names.
Disabling DEP | Learn how to disable DEP for different Windows platforms.
Enabling suid Behavior for Interix Programs | Learn how to enable suid behavior for Interix programs.

Microsoft Services for UNIX

Microsoft Windows Services for UNIX (SFU) makes it possible to integrate some Windows operating systems into existing UNIX environments. SFU provides components that simplify network administration and user management across the UNIX and Windows platforms. You can use SFU to do the following:

  • Integrate Windows hosts into Grid Engine clusters. This means that the execution and client environment of Grid Engine can be used on Microsoft Windows hosts. You must use Grid Engine in combination with SFU for this to occur.
  • Access the network file system (NFS). This makes it possible for you to share files between the UNIX and Windows environments.
  • Optionally access account and password services on UNIX and Windows systems (PCNFS, NIS) by using the user mapping service.
  • Synchronize passwords and map authentication credentials between the UNIX and Windows operating systems. You can use the "single sign-on" capability for Windows and UNIX environments.
  • Run UNIX shell scripts and applications on Windows-based computers in a full-featured UNIX environment.

Interix, SFU's UNIX environment subsystem, offers the following features:

  • A complete, high-performance UNIX environment. You can use the csh shell or the ksh shell.
  • Several hundred tools and utilities.
  • A complete set of development tools and libraries that make it possible to port your UNIX-based applications to the Interix subsystem.

SFU is a prerequisite for installing Grid Engine on Microsoft Windows Server 2003, Windows XP Professional with at least Service Pack 1, Windows 2000 Server with at least Service Pack 3, or Windows 2000 Professional with at least Service Pack 3.
For Microsoft Windows Server 2003 Release 2, Windows Server 2008, Windows Vista Enterprise, or Windows Vista Ultimate, see Microsoft Subsystem for UNIX-based Applications.

Unsupported Grid Engine Functionality

The following Grid Engine components are not supported in a Microsoft Windows environment and cannot be used on Windows Hosts even though they are standard to a Grid Engine installation:

  • Master and Scheduler (sge_qmaster and sge_shadowd)
  • Graphical User Interface (qmon)
  • DRMAA
  • qsh client command

Topic | Description
How to Install Services for Unix | Learn how to install Microsoft Services for Unix.
Troubleshooting SFU | Learn how to troubleshoot Microsoft Services for Unix.
Configuring User Name Mapping | Learn how to configure user name mapping.

How to Install Microsoft Services for Unix

System Requirements

The following system requirements apply to the SFU installation:

  • You must install Internet Explorer Version 5.0 or later before running the SFU setup.
  • You cannot install SFU on a system running Microsoft Services for Network File System. For example, Microsoft Services for NFS is a component of Windows Storage Server 2003.
  • You must install the latest Windows service pack before installing SFU and Grid Engine. Then, you can install additional Windows service packs as they become available.
  • The hard disk requirements for an SFU installation depend on which components you need to install. The following installation parameters apply:
    • The minimum disk space required is 20 MB.
    • The maximum disk space requirement is 360 MB.
    • SFU must be installed on a partition that is formatted with the NTFS file system.
  • You must disable Data Execution Prevention (DEP). DEP is not compatible with some parts of SFU and might cause segmentation faults. See http://support.microsoft.com/kb/875352 for more information about DEP. To disable DEP, see Disabling DEP.

You can find more details concerning SFU requirements at http://www.microsoft.com/windows/sfu/.

Services for UNIX Installation

Microsoft's SFU is required to install Grid Engine successfully. You can download SFU from Microsoft. Search the site for "Windows Services for Unix" to find the current download information.

  1. Get the SFU distribution media.

  2. Execute the application to unzip the files into a directory.
    This directory must be located on a file system that has at least 480 MB of free space.

  3. Log in to the Windows system with the Administrator account.

  4. Start the setup.exe application that you unpacked previously.
    Figure showing the welcome window for the Windows Services for UNIX setup wizard.

  5. Enter your User name and Organization.
    Figure showing Windows dialog box asking for your name and organization.

  6. Accept the license agreement for SFU.
    Figure showing Windows dialog box asking you to accept the End-User License Agreement.

  7. Choose the standard installation (recommended) or the custom installation.
    Figure showing Windows dialog box that offers the choice between standard and custom installation.
    If disk space is limited, you might want to choose the custom installation. Make sure that you install at least the following components:
    • Utilities -> Base Utilities
    • Interix GNU components -> Interix GNU utilities
    • Remote connectivity components -> Telnet Server and Windows Remote Shell
    • If you intend to use NFS shared file systems, you also need Authentication tools for NFS -> User Mapping and Server for NFS Authentication.
  8. Depending on the Windows operating system, you might be presented with two options concerning SFU security settings, as shown in the dialog box below:
    Figure showing Windows dialog box with the SFU security settings options.
    If you need further information, consult Microsoft's SFU documentation.

  9. Configure User Name Mapping.
    Note
    User Name Mapping is part of SFU and not part of Sun Grid Engine. Consult Microsoft documentation and support to set up user mapping correctly.

    Your selection in the dialog box, shown below, depends on the hosts and services that are currently provided in your Windows and UNIX environments. If there is no Remote User Mapping server in your environment, then you should select Local User Name Mapping Server.
    Figure explained below.

    Note
    Install SFU and enable the User Name Mapping service on the host that acts as the Domain Controller for your Windows environment. All other hosts should contact that Remote User Name Mapping Server. If you choose Local User Name Mapping Server, you can either select Network Information Services (NIS) to access your passwd and group NIS maps, or select the password and group files option if you can provide the files yourself.

    See Configuring User Name Mapping for further details.

  10. Depending on your previous selections, you can either enter the NIS Domain name and NIS Server name or the path of the passwd and group files.
    Figure showing Windows dialog box that asks you to configure local user name mapping.
    Below is an example of the files that have the standard UNIX format. This means that you can also use your /etc/passwd and /etc/group files from your UNIX environment.
    C:\Unix\etc\passwd 
    root:x:0:0:UNIX root user:/home/root:/bin/tcsh
    user1:x:1002:100:Full name of user1:/home/user1:/bin/tcsh
    C:\Unix\etc\group
    root::0:
    
    Note
    Some NIS maps do not contain an entry for the root user. If this is the case, follow these steps to map Administrator to root:
    1. First create a password file containing the root entry.
    2. After the SFU installation is finished, start the Services for UNIX Administration application and create the mapping: Administrator <-> root.
    3. Switch to NIS mapping.
    4. Use simple mapping or add manual mappings.
      At this point the installation starts installing components. Wait until all components are installed.
  11. When the installation process finishes, you might need to reboot the machine, depending on the version of Windows that you are using.

  12. Make sure that the Interix Subsystem Startup starts during boot time.
    If you intend to use NFS shares and user mapping, then also start Client for NFS and User Name Mapping.
    Depending on the installation options and your version of the Windows operating system, one or more of these services are disabled by default.
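
The passwd and group files supplied for local User Name Mapping in step 10 use the standard colon-separated UNIX layout. As a minimal sketch, the helper below reports lines of a passwd-style file that do not have seven fields or whose UID or GID is not numeric. The name check_passwd_format is hypothetical and is not part of SFU or Grid Engine.

```shell
# Hypothetical helper: verify that every non-empty line of a passwd-style
# file has 7 colon-separated fields and numeric UID/GID (fields 3 and 4).
# Returns 0 if all lines look well formed, 1 otherwise.
check_passwd_format() {
    awk -F: '
        BEGIN { bad = 0 }
        NF == 0 { next }    # skip empty lines
        NF != 7 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, check_passwd_format /dev/fs/C/Unix/etc/passwd could be run from an Interix shell before you point the setup wizard at the file, assuming the file is stored in C:\Unix\etc as in the example above.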

Post SFU Installation Tasks

There are several steps you should follow after you install the SFU software.

  1. Before you start using SFU and install Grid Engine, check that the user mapping is working correctly by following these steps:
    1. Open an Interix shell locally on the Interix host.
    2. Use the login command to switch to a known user that is not the Administrator.
    3. Verify the access permissions for NFS shares that should be accessible to that user.
    4. Try to access these network resources. If a user cannot access a Network drive, most likely the User Name Mapping is not working correctly.

  2. Check users' home directories.
    To enable the automounting of the users' home directories, use the following series of menus:
    Control Panel -> Administrative Tools -> Computer Management -> Users -> Properties -> Profile
    

    Click Connect to, select a drive letter, and enter the path of the user's home directory in UNC notation: \\<server>\<share>\<user home>.

    Within the Interix subsystem, you can access all NFS shares through the special directory: /net/server/share.

    You can also create links to these directories to access the shares directly, for example, ln -s /net/myserver/export/share00/home /home.

  3. Enable Administrator names on your machines.
    Make sure that the administrator accounts on all machines that are enabled as execution hosts for Grid Engine use the same account name, such as Administrator.

    Also make sure that this user has manager privileges in your Sun Grid Engine cluster. If this is not the case, add the privileges using qconf -am administrator before the installation of the execution daemon.

  4. Set the editor for the CLI commands.
    Some CLI commands start an editor. Make sure to set the EDITOR environment variable to vi, or your preferred UNIX editor, within the Interix subsystem before you start using UNIX commands.

  5. Mount NFS shares.
    There are two ways to mount NFS shares to the Interix host:
    • The recommended way is to use the auto mount functionality of Interix. All network shares that the Computer Browser service of Windows can find are automatically mounted to the /net directory of Interix. Although only some of these shares might be listed with ls /net, all shares are accessible. The syntax of the auto mount is /net/server/share, such as /net/myserver/home. A link to an auto-mounted share can be created to make it accessible under exactly the same name as on a UNIX host. For example, ln -s /net/myserver/home /home makes the users' UNIX home directories accessible through /home/username. Automounted shares are available starting at boot time. They are available for all users who have the permissions to access these shares. The shares cannot get lost by misconfiguration.
    • Network shares can also be mapped to drive letters by using the command nfsmount. The syntax is /usr/sbin/nfsmount -u: \\<computername>\<sharename> <devicename>. For example:
      /usr/sbin/nfsmount -u: \\\\myserver\\home Z:
      

      This drive is now accessible through /dev/fs/Z. A link can be created to this drive to have the same path as on a UNIX host.

      Note
      As shown in the example above, all backslashes must be written twice because the shell interprets a single backslash as an escape character.



  6. Configure the users' home directories.
    If the users' home directories are located on an NFS server, follow these steps to configure the users' home directories in Windows:
    1. In the Profile tab of the users properties dialog box, select Connect.
    2. Select a free drive letter.
    3. Enter the path to the user's home directory in the UNC notation \\<server>\<share>\<directory>, for example, \\myserver\home\Peter.
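
The post-installation steps above use the same share in three notations: the Windows UNC path, the UNC path with doubled backslashes for use on an Interix command line, and the automount path under /net. The following sketch converts between them. The function names escape_unc and unc_to_net are hypothetical helpers introduced for illustration only.

```shell
# Hypothetical helpers for working with UNC paths in an Interix shell.

# Double every backslash so that the shell passes the UNC path through
# unchanged when it is written unquoted on a command line.
escape_unc() {
    printf '%s\n' "$1" | sed 's/\\/\\\\/g'
}

# Translate \\server\share[\dir...] into the automount form
# /net/server/share[/dir...].
unc_to_net() {
    printf '%s\n' "$1" | sed -e 's|^\\\\|/net/|' -e 's|\\|/|g'
}

escape_unc '\\myserver\home'    # prints \\\\myserver\\home
unc_to_net '\\myserver\home'    # prints /net/myserver/home
```

Single quotes are used around the arguments so that the shell does not interpret the backslashes before the helpers see them.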

Troubleshooting SFU

The following section describes some common problems that users may encounter when installing and using Grid Engine in a Services for UNIX environment on a Windows system.

  • Impossible to connect to the Interix subsystem through telnet or rsh.
    Make sure that the correct services are started. The corresponding Windows services must be disabled. The Interix versions of telnetd and rshd must be started. You can do this task by removing the pound sign (#) from the following lines in /etc/inetd.conf:
    #telnet stream tcp nowait NULL /usr/sbin/in.telnetd in.telnetd -i
    #shell stream tcp nowait NULL /usr/sbin/in.rshd in.rshd -a
    

    If you still cannot connect to the machine, check your firewall configuration. Do not block connections to corresponding ports:

    Service  |  Ports
    ---------+-----------
    ftp      |  20, 21
    ssh      |  22
    telnet   |  23
    rsh      |  514
    
  • The wrong default login shell is started. Why?
    Both the .rhosts and hosts.equiv authentications fail if new user accounts are created or if the passwords of existing users are changed. In this case, the command regpwd needs to be called. After that, follow the steps to register passwords correctly.
  • Why is the access to NFS mounted home directories slow?
    User Name Mapping might be the cause. For a large number of user maps, installing User Name Mapping on a Domain Controller improves performance by reducing network traffic. You can create a User Name Mapping server pool. This method means that you use DNS round-robin to create a pool of computers running User Name Mapping. This provides improved performance on wide area networks and provides failover when one of the servers is no longer available.
  • How can I map user root if it does not exist in the NIS maps?
    First create a passwd file which contains an entry for the user root. Then, explicitly map the root account (no basic mapping) using the created passwd file. Finally, change the mapping to use the NIS maps. Note that the previous root mapping will persist.
  • NIS Server cannot be contacted during the SFU installation.
    Interrupt the SFU installation and make sure that there is no other service or application running which already configures or uses the NIS server. If this is the case, then disable this service for the duration of the SFU installation.
  • The Interix Subsystem of SFU or the User Mapping is not enabled after reboot.
    Make sure that Interix Subsystem Startup and User Name Mapping are automatically started after machine reboot. If you use NFS-mounted directories, also set the Client for NFS service to start automatically.
  • Queues stick in unknown state for a very long time.
    After the installation or restart of an execution host, the corresponding queues remain in the unknown (u) state for a very long time. This is normal behavior for Windows machines. After a full load report interval, the u state should be gone. If this is not the case, then check that sge_execd has been started on the corresponding machine.
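
For the first troubleshooting item above, removing the pound sign from the telnet and shell lines of /etc/inetd.conf can also be scripted. The following is a minimal sketch, assuming the two lines start with #telnet and #shell as shown. The function name enable_interix_remote_services is hypothetical, and the function only prints the edited content so that you can review it before replacing the file.

```shell
# Hypothetical helper: print an inetd.conf-style file with the telnet and
# shell (rsh) service lines uncommented. The input file is not modified,
# and all other lines, including other commented services, pass through.
enable_interix_remote_services() {
    sed -e 's/^#\(telnet[[:space:]]\)/\1/' \
        -e 's/^#\(shell[[:space:]]\)/\1/' "$1"
}
```

A possible use is enable_interix_remote_services /etc/inetd.conf > /tmp/inetd.conf.new, followed by a manual comparison of the two files.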

Configuring User Name Mapping

User Name Mapping acts as a single clearinghouse that provides centralized user mapping services for the NFS client of Interix. User Name Mapping provides a map between the Windows users and groups on the NFS client, and the corresponding UNIX users and groups on the NFS server. In principle, these user and group names might not be identical. However, for users who intend to use Sun Grid Engine, these names must be identical.

User Name Mapping lets you maintain a single mapping database for the entire enterprise. This feature makes it easy to configure authentication for multiple computers running Windows Services for UNIX.

User Name Mapping also permits one-to-many mapping. This lets you associate multiple Windows accounts with a single UNIX account. To do this, you can use simple maps, which map Windows and UNIX accounts with identical names. You can also create advanced maps to associate Windows and UNIX accounts with different names, which you can use with simple maps. This feature can be useful, for example, when you do not need to maintain separate UNIX accounts for individuals and would rather use a few accounts to provide different classes of access permission.

Note
For information about simple and advanced maps, see "Simple and Advanced Maps" in Help for Services for UNIX. After the installation has finished, you can find Help for Services for UNIX in Start -> Programs -> Services for UNIX -> Help for Services for UNIX.

Microsoft Subsystem for UNIX-based Applications

Microsoft Subsystem for UNIX-based Applications allows you to integrate Windows operating systems with the existing UNIX environments. This subsystem provides components that simplify network administration and user management across UNIX and Windows platforms. You can use this subsystem to perform the following:

  • Integrate Windows hosts with Sun Grid Engine clusters - Enables you to use the execution and client environment of Sun Grid Engine on Microsoft Windows hosts. You must use Sun Grid Engine in combination with Microsoft Subsystem for UNIX-based Applications for this to happen.
  • Access the network file system (NFS) - Enables you to share files between the UNIX and Windows environments.
  • Synchronize passwords and map authentication credentials between the UNIX and Windows operating systems - Enables you to use the 'single sign-on' capability for Windows and UNIX environments.
  • Execute UNIX shell scripts and applications - Enables you to run shell scripts and applications on Windows platform-based computers in full-featured UNIX environments.

Interix, the UNIX environment subsystem of Microsoft Subsystem for UNIX-based Applications, offers the following features:

  • A complete, high-performance UNIX environment. You can use the csh or ksh shell.
  • Several hundred tools and utilities.
  • A complete set of development tools and libraries that make it possible to port your UNIX-based applications to the Interix sub-system.

Microsoft Subsystem for UNIX-based Applications is an essential prerequisite to install Sun Grid Engine on Microsoft Windows Server 2003 Release 2, Windows Server 2008, Windows Vista Enterprise, and Windows Vista Ultimate. For Microsoft Windows Server 2003, Windows XP Professional with at least Service Pack 1, Windows 2000 Server with at least Service Pack 3, or Windows 2000 Professional with at least Service Pack 3, see Microsoft Services for UNIX.

Unsupported Sun Grid Engine Functionality

The following Grid Engine components are not supported in the Microsoft Windows environment and cannot be used on Windows hosts even though they are standard to a Sun Grid Engine installation:

  • Master and Scheduler (sge_qmaster and sge_shadowd)
  • Graphical user interface (qmon)
  • DRMAA
  • qsh client command
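
Topic                                                            |  Description
-----------------------------------------------------------------+---------------------------------------------------------------------
How to Install a Subsystem for UNIX-based Applications           |  Learn how to install a subsystem for UNIX-based applications.
Troubleshooting Microsoft Subsystem for UNIX-based Applications  |  Learn how to troubleshoot a subsystem for UNIX-based applications.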

How to Install a Microsoft Subsystem for UNIX-based Applications

This section describes how to install a Microsoft Subsystem for UNIX-based Applications.

System Requirements

The system requirements for a Subsystem for UNIX-based Applications installation are:

  • Microsoft Windows Server 2003 Release 2, Windows Server 2008, Windows Vista Enterprise, or Windows Vista Ultimate. Windows Vista Business and all lower Vista versions are not supported by this subsystem.
  • You must install the latest Windows service pack before installing Subsystem for UNIX-based Applications and Sun Grid Engine. You can install additional Windows service packs as they become available.
  • The hard disk requirement for a Subsystem for UNIX-based Applications installation depends on the components that you are planning to install. The following installation parameters apply.
    • The minimum disk space required is 182 MBytes.
    • The maximum disk space required is approximately 350 MBytes.
    • Subsystem for UNIX-based Applications must be installed on a partition that is formatted with the NTFS file system.
  • You must disable the Data Execution Prevention (DEP) feature. The DEP feature is not compatible with some parts of Subsystem for UNIX-based Applications and might cause segmentation faults. For more information about DEP, see http://support.microsoft.com/kb/875352. For information on how to disable DEP, see Disabling DEP.

You can find additional information about Subsystem for UNIX-based Applications requirements at http://technet.microsoft.com/en-us/library/cc779522.aspx.

Installing Subsystem for UNIX-based Applications

Microsoft Subsystem for UNIX-based Applications is required for installing Sun Grid Engine on Windows Vista, Windows Server 2008, and Windows 2003 R2. Subsystem for UNIX-based Applications is partially delivered with these versions of Windows, but you also need to download some components from the Microsoft web site.

Steps

  1. Install the components of Subsystem for UNIX-based Applications that are delivered with Windows.
    Note

    In this procedure Windows Vista is used as an example. Other supported Windows Versions function similarly. You must have the right administrative privileges to perform the installation.

    1. Click Start.


    2. Click Control Panel.


    3. Click Programs.


    4. Click the Turn Windows features on or off option from the Programs and Features panel.
      The Windows Features screen appears.


    5. Select the Subsystem for UNIX-based Applications option.
    6. You can also open the Services for NFS tree and select the appropriate option, if you prefer to use NFS shares.
      Note

      Ensure that you use SAMBA for networking shares, as you might have trouble setting up an environment that functions correctly with both Subsystem for UNIX-based Applications NFS and Subsystem for Unix NFS clients.


    7. Click OK.
      Windows installs the new features and might prompt you to insert the Windows installation DVD.

  2. Download and install the remaining components of Subsystem for UNIX-based Applications.

    1. Click Start > All Programs.
      You will notice a new folder named Subsystem for UNIX-based Applications in the Windows Start menu. This folder contains the link to the web page where the remaining components of Subsystem for UNIX-based Applications can be downloaded.


    2. Download the remaining components of Subsystem for UNIX-based Applications and double-click to open the file.
      The file will open in a WinZip Self-Extractor dialog box.
    3. Click Unzip.
      The utility unzips the files.
    4. Click OK.
      The Installer is started and the Subsystem for UNIX-based Applications setup wizard appears.


    5. Click Next.
      The Customer Information screen appears.


    6. Enter user name and organization name and click Next.
      The License and Support Information screen appears.


    7. Accept the terms of the license and click Next.
      The Installation Options screen appears.


    8. Select the Custom Installation option.
      The Selecting Components screen appears.


    9. Use the preset selections and select GNU Utilities. Ensure that you also select GNU SDK.
      The Security Settings screen appears.


    10. Depending on the Windows operating system that you are using, you might be presented with the above Subsystem for UNIX-based Applications security settings. Click Next.
      The Summary screen appears.


    11. Select the required disk volume and click Install.
      The installation wizard appears.


    12. Click Finish to exit the installation wizard.
    13. Reboot the host.
      After rebooting, you will notice a C Shell, a Korn Shell, and some additional links and documentation in the folder named Subsystem for UNIX-based Applications in the Windows Start menu.
      Note

      You must set the proper firewall rules to access the host.


  3. Ensure that the Interix Subsystem starts up during system booting.
    If you intend to use NFS shares, start the client for NFS. The mapping between UNIX and Windows user IDs is done by the Windows Active Domain Server; consult the Subsystem for UNIX-based Applications documentation for more information. Depending on the installation options and your version of the Windows operating system, one or more of these services are disabled by default.

Post Installation Tasks

You need to perform the following steps after installing Subsystem for UNIX-based Applications.

  1. Before you start using Subsystem for UNIX-based Applications and install Grid Engine, you need to check that the user mapping is working correctly.
    1. Open an Interix shell locally on the Interix host.
    2. Use the login command to switch to a known user who is not an Administrator.
    3. Verify the access permissions for network shares that should be accessible to that user.
    4. Try to access these network resources. If the user cannot access a network drive, most likely the User Name Mapping is not working correctly.

  2. Check the users' home directories.
    To enable the automounting of the users' home directories, click Start > Control Panel > Administrative Tools > Computer Management > Users > Properties > Profile.
    Click Connect to and select the required drive letter. Enter the path of the user's home directory in UNC notation, \\<server>\<share>\<user home>.
    Within the Interix subsystem, you can access all network shares through the special directory, /net/server/share.
    You can also create links to these directories to access the shares directly, for example, ln -s /net/myserver/export/share00/home /home.

  3. Enable Administrator names on your machines.
    Ensure that the administrator accounts on all machines that are enabled as execution hosts for Sun Grid Engine use the same account name, such as Administrator.
    Ensure that this user has manager privileges in your Sun Grid Engine cluster. If this is not the case, add the privileges using qconf -am administrator before the installation of the execution daemon.

  4. Set the editor for the CLI commands.
    Some CLI commands open an editor. Ensure that you set the EDITOR environment variable to vi, or your preferred UNIX editor, within the Interix subsystem before you start using UNIX commands.

  5. Mount network shares.
    There are two ways to mount network shares to the Interix host:
    • The recommended way is to use the automount functionality of Interix. All network shares that the Computer Browser service of Windows can find are automatically mounted to the /net directory of Interix. Even though only some of these shares might be listed with ls /net, all shares are accessible. The syntax of the automount is /net/server/share, such as /net/myserver/home. A link to an automounted share can be created to make it accessible under exactly the same name as on a UNIX host. For example, ln -s /net/myserver/home /home makes the users' UNIX home directories accessible through /home/username. Automounted shares are available at boot time for all users who have the permissions to access these shares. The shares cannot get lost by misconfiguration.
    • Network shares can also be mapped to drive letters by using the net use command of Windows. The syntax is /dev/fs/C/Windows/System32/net.exe use <drive letter>: \\<computername>\<sharename>. For example:
      /dev/fs/C/Windows/System32/net.exe use Z: \\\\myserver\\home
      

      This drive is now accessible through /dev/fs/Z. A link can be created to this drive to use the same path as on a UNIX host.

      Note
      As shown in the example above, all backslashes must be written twice because the shell interprets a single backslash as an escape character.


  6. Configure the users' home directories.
    If the users' home directories are located on a network share, follow these steps to configure the users' home directories in Windows.
    1. In the Profile tab of the users properties dialog box, select Connect.
    2. Select a free drive letter.
    3. Enter the path to the user's home directory in the UNC notation \\<server>\<share>\<directory>, for example, \\myserver\home\Peter.
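
Step 3 of the post-installation tasks asks you to confirm that the administrator account has manager privileges before you install the execution daemon. The following sketch checks a manager list for a given name. It assumes the list contains one name per line, as printed by qconf -sm, and reads it from standard input so that it can be tried without a running cluster. The function name is_manager and the sample names in the example are hypothetical.

```shell
# Hypothetical helper: succeed if the account name given as $1 appears as a
# whole line in the manager list read from standard input.
is_manager() {
    grep -qxF -- "$1"
}

# Example with a captured sample list. On a live cluster the input would
# come from "qconf -sm" instead of printf.
if printf 'root\nsgeadmin\n' | is_manager administrator; then
    echo "administrator is already a manager"
else
    echo "administrator is missing; add it with: qconf -am administrator"
fi
```

The match is exact, so the name must be spelled the same way it was registered with qconf -am.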


Changing Default Behavior to Case Sensitivity

You might have to choose between default behavior and case sensitivity for object names, such as file names. Your choice will affect system security as well as how Microsoft Services for UNIX (SFU) and Microsoft Subsystem for UNIX-based Applications (SUA) function.

With Microsoft Windows, the names of most objects are case preserving, but case insensitive. So, you cannot have two files in the same directory named sample.txt and Sample.txt because Windows regards the names as identical.

However, the UNIX operating system is fully case sensitive. So, UNIX systems distinguish between object names even when the only difference between those names is the case of the object name characters. Therefore, sample.txt and Sample.txt could appear in the same directory and the UNIX system would distinguish between them when performing operations on the files. For example, the command rm S*.txt would delete Sample.txt but not sample.txt. To implement typical UNIX behavior, the server for NFS and the Interix subsystem are normally case sensitive when working with file names.
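
The rm example above can be reproduced in a scratch directory. This is a minimal sketch that assumes a case-sensitive file system and the availability of mktemp. On a case-insensitive file system, the two names would refer to the same file and the result would differ.

```shell
# Create both files in a scratch directory, then delete by pattern.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample.txt" "$demo_dir/Sample.txt"

rm "$demo_dir"/S*.txt            # the pattern matches Sample.txt only

remaining=$(ls "$demo_dir")      # what is left after the rm
echo "$remaining"
```

On a case-sensitive file system, only sample.txt should remain in the directory.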

This behavior can present security issues, particularly for users who are accustomed to the case insensitive conventions of Windows. For example, a Trojan horse version of edit.exe, named EDIT.EXE, could be stored in the same directory as the original. If a user were to type edit at a Windows command prompt, the Trojan horse version (EDIT.EXE) could be executed instead of the standard version.

Caution
If case sensitivity is enabled, Windows users should be made aware of the security issues.

For Windows XP (Professional) and the Windows Server 2003 family, the default behavior of subsystems (other than the Win32 subsystem) is to preserve case but be case insensitive. In previous versions of Windows, such subsystems were fully case sensitive by default. To support standard UNIX behavior, the SFU and SUA setups allow you to change the default Windows XP and Windows Server 2003 family behavior for non-Win32 subsystems when installing the base utilities (the Interix subsystem) or Server for NFS. If you enable case sensitivity and then subsequently uninstall the base ut


Disabling DEP

How to Disable DEP for Windows XP Professional, Windows 2000 Server, and Windows Server 2003

  1. Right-click the My Computer icon on your desktop.
  2. Click Properties.
  3. In the Properties dialog box, click the Advanced tab.
  4. Click Settings in the Startup and Recovery section.
  5. In the next dialog box, click the Edit button to edit the boot command line of your Windows installation.
  6. Add /noexecute=alwaysoff or modify an existing /noexecute option.

How to Disable DEP for Windows Vista (Enterprise & Ultimate) and Windows Server 2008

  1. Click Start > All Programs > Accessories.
  2. Right-click Command Prompt.
  3. Left-click Run as Administrator.
  4. Click Allow, if the system prompts you for permission.
  5. Type the following text in the command prompt window.
    bcdedit.exe /set {current} nx AlwaysOff

Enabling suid Behavior for Interix Programs

According to the POSIX standard, a file has permissions that include bits to set both a UID (setuid) and a GID (setgid) when the file is executed. If either or both bits are set on a file, and a process executes that file, the process gains the UID or GID of the file.

When used carefully, this mechanism allows a non-privileged user to execute programs that run with the higher privileges of the file's owner or group.
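
The setuid bit itself can be inspected with standard POSIX tools. The following sketch sets the bit on a scratch file and tests for it with the -u operator of the test command. It only illustrates the permission bit and does not show how to enable setuid support in Services for UNIX or Subsystem for UNIX-based Applications. The variable names are chosen for illustration.

```shell
# Set and inspect the setuid bit on a scratch file.
demo_file=$(mktemp)
chmod 0755 "$demo_file"
if [ -u "$demo_file" ]; then before=set; else before=clear; fi

chmod u+s "$demo_file"           # turn on the setuid bit (mode 4755)
if [ -u "$demo_file" ]; then after=set; else after=clear; fi

echo "setuid before: $before, after: $after"
ls -l "$demo_file"               # the owner execute position shows "s"
```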

When used incorrectly, however, this behavior can present security risks by allowing non-privileged users to perform actions that should only be performed by an administrator. For this reason, Windows Services for UNIX and Windows Subsystems for UNIX-based Applications setup does not enable support for this mechanism by default.

You should enable support for setuid behavior because Grid Engine runs programs that require this support. If you do not enable support for setuid behavior when installing Windows Services for UNIX, you can enable it later.


User Management on Windows Hosts

Every user of the Grid Engine execution environment of a Windows machine must have a user account that has the same name as on the UNIX hosts. User accounts contain information about the user, including name, password, and various optional entries that determine when and how users log on and how their desktop settings are stored.

The following sections describe how you would use Windows user management to support Grid Engine.
Windows machines are referred to here using three different terms. The following table lists the terms and the operating systems which might run on each corresponding host:

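Term                 |  Operating Systems
---------------------+------------------------------------------------------------------------------------------------------------
Windows Host         |  Microsoft Windows 2000, Microsoft Windows XP, Microsoft Windows 2000 Server, Microsoft Windows Server 2003
Windows Server       |  Microsoft Windows 2000 Server, Microsoft Windows Server 2003
Windows Workstation  |  Microsoft Windows 2000, Microsoft Windows XP

Topic                                                 |  Description
------------------------------------------------------+-------------------------------------------------------------------
Managing Users on Windows Hosts                       |  Learn how to administer user accounts on Windows hosts.
Using Grid Engine in a Microsoft Windows Environment  |  Learn how to use Sun Grid Engine in a Microsoft Windows environment.
How to Add Windows Hosts Later                        |  Learn how to add Microsoft Windows hosts at a later point in time.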

Managing Users on Windows Hosts

It is possible to administer user accounts on all Windows hosts individually. Each Windows Host has an authentication center which validates user names and corresponding user rights. User accounts which are defined on a Windows workstation are referred to here as local user accounts or local users.

Each Windows Host has its own local domain, and each Windows Server has the ability to make that domain available to other hosts. Account names within a local domain and account names within a server domain can collide. To avoid such collisions, you must specify the correct user account by prefixing the user account name with the domain name and a + (plus sign) character.

Windows User Example

The following example illustrates the potential complexity of Windows host accounts interacting with Windows Domain accounts. Suppose a Windows Workstation host named CRUNCH has a local user account named Peter. This Windows Workstation is part of the domain named ENGINEERING. This domain is provided by a Windows Server which also has a user account named Peter. In this example, the ENGINEERING domain is the default domain of the host named CRUNCH. The following table shows what would happen if a person tried to log in to CRUNCH.

Table – Using Domain Accounts
Login Name         |  Result
-------------------+----------------------------------------------------------------------------------------------------------
CRUNCH+Peter       |  Peter is logged in with his account as a local user of the machine CRUNCH.
ENGINEERING+Peter  |  Peter is logged in with the account provided by the Windows Server hosting the ENGINEERING domain.
Peter              |  This approach is equivalent to using ENGINEERING+Peter because CRUNCH has ENGINEERING as its default domain. Otherwise, the local account would be used.
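
The resolution rule in the table can be sketched as a small shell function: a login name that already contains a plus sign is used as is, and a bare name is qualified with the default domain of the host. The function name qualify_login is hypothetical and only models the naming convention. It does not perform any authentication.

```shell
# Hypothetical helper: expand a login name to DOMAIN+user form.
#   $1 = login name as typed
#   $2 = default domain of the host
qualify_login() {
    case "$1" in
        *+*) printf '%s\n' "$1" ;;            # already domain-qualified
        *)   printf '%s+%s\n' "$2" "$1" ;;    # use the default domain
    esac
}

qualify_login Peter ENGINEERING           # prints ENGINEERING+Peter
qualify_login CRUNCH+Peter ENGINEERING    # prints CRUNCH+Peter
```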

Each domain has a special user account that provides superuser access. The default name for that account is Administrator. For native Windows, the members of the Administrators group and of the Domain Admins group in the server domain also have superuser access. However, for Interix, only the user Administrator of the local domain is the superuser of the local host.

The local Administrator can start applications in an account without knowing the password of the user for that account. However, the application would not be able to access network resources because even the local Administrator is not fully trusted by the network, unlike the Unix super user root. Therefore, the Sun Grid Engine administrator uses the sgepasswd tool to register the users' passwords, as explained in Using Grid Engine in a Microsoft Windows Environment.

UNIX User Management

UNIX has no equivalent to the Windows domain concept. With UNIX, each user has a local account and is authenticated as a local account even if the underlying account information lies on an LDAP or NIS server. The UNIX super user root is similar to the local Windows super user Administrator. The UNIX super user can start applications and processes on behalf of UNIX accounts without knowing each corresponding password.


Using Sun Grid Engine in a Microsoft Windows Environment

The Grid Engine execution environment starts jobs on behalf of the submitting user. The execution daemon (sge_execd) on UNIX hosts runs as root so that it can start jobs on behalf of all users.

On Windows hosts, the execution daemon runs as the local Administrator user so that it can start jobs on behalf of users without knowing their password, but these jobs would not have the permissions to access network resources. Only fully authenticated users can access network resources. For a full authentication, the user's password is needed. Therefore, all users who want to submit jobs to a Windows execution host have to register their passwords with Grid Engine. The execution daemon still needs to run as the local Administrator to have the permissions to do several administrative tasks.

Registering Windows User Passwords

Users who want to start Grid Engine jobs on Windows execution hosts use the sgepasswd client application to register their Windows passwords. The following example shows Peter who has a user account in the domain ENGINEERING. Because ENGINEERING is the principal domain of the Windows execution host CRUNCH, Peter does not need to register his password for a specific domain. This should be the default in any properly set up single domain environment. In multiple domain environments, it might be necessary to register the password explicitly for a specific domain.

Note
You must run the sgepasswd command on a non-Windows host.
> sgepasswd
   Changing password for Peter
   New password:
   Re-enter new password:
   Password changed

Using the sgepasswd Command

The sgepasswd command changes the Grid Engine password file sgepasswd(5). This file contains a list of user names and their Windows passwords in encrypted form.

You can use sgepasswd to perform the following tasks:

  • To add a new entry for your user account.
  • To change your existing password, if you know your stored password.
    Caution
    If Grid Engine tries to run several of your jobs at once on a Windows execution host and is unable to access a correct password for your account, the Windows intrusion detection system could disable your account. To keep your account from being disabled, you must prevent your pending jobs from being run before you attempt to change your Windows user password. Once you have changed your password using sgepasswd on a non-Windows host and then on your Windows domain, you can allow your jobs to be run again.

Additionally, the root user can change or delete the password entries for other user accounts. sgepasswd is only available on non-Windows hosts.

The sgepasswd command uses one of the following syntaxes:

sgepasswd [[ -D <domain> ] -d <user> ]

sgepasswd [ -D <domain> ] [ <user> ]

This command supports the following options:

-D domain By default, sgepasswd adds or modifies the current UNIX user name without a domain specification. You can use this switch to add a domain specification in front of the current user name. Consult your Microsoft Windows documentation for more information about domain users.
-d user Only root can use this parameter to delete entries from the sgepasswd(5) file.
-help Prints a listing of all options.

Additionally, the following environment variables affect the operation of this command.

SGE_CERTFILE If set, this specifies the location of the public key file. The default file is $SGE_ROOT/$SGE_CELL/common/sgeCA/certs/cert.pem.
SGE_KEYFILE If set, this specifies the location of the private key file. The default file is /var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/private/key.pem.
SGE_RANDFILE If set, this specifies the location of the rand.seed file. The default file is /var/sgeCA/port$SGE_QMASTER_PORT/$SGE_CELL/private/rand.seed.
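
The fallback behavior of these three variables can be sketched with shell parameter expansion. This is only an illustration of the defaults listed above, not code taken from sgepasswd. The function name sgepasswd_key_paths is hypothetical, and the sketch assumes that SGE_ROOT, SGE_CELL, and SGE_QMASTER_PORT are already set.

```shell
# Hypothetical helper (not part of Grid Engine): prints the files that
# sgepasswd would use according to the documented defaults.
# Assumes SGE_ROOT, SGE_CELL and SGE_QMASTER_PORT are set.
sgepasswd_key_paths() {
    ca_local="/var/sgeCA/port${SGE_QMASTER_PORT}/${SGE_CELL}/private"
    # ${VAR:-default} uses the default when VAR is unset or empty.
    echo "cert: ${SGE_CERTFILE:-$SGE_ROOT/$SGE_CELL/common/sgeCA/certs/cert.pem}"
    echo "key:  ${SGE_KEYFILE:-$ca_local/key.pem}"
    echo "rand: ${SGE_RANDFILE:-$ca_local/rand.seed}"
}
```

Calling sgepasswd_key_paths after sourcing the cluster settings file shows which files are in effect before you run sgepasswd.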

Adding Windows Hosts to Existing Grid Engine Systems

If you have a running Grid Engine system on which Windows support is not enabled, you can enable the support manually. The following steps provide a Windows-enabled Grid Engine system that allows additional Windows execution hosts.


How to Add Windows Hosts Later

  1. Copy Windows binaries to the $SGE_ROOT directory.

  2. Type the following command:
    qconf -mconf
    

    Set the execd_params to enable_windomacc=true.

  3. Type the following command:
    qconf -am <win_admin_name>
    


  4. Run the following command:
    $SGE_ROOT/util/sgeCA/sge_ca -init -days 365
    


  5. For a CSP installation, run the following command:
    $SGE_ROOT/util/sgeCA/sge_ca -user <win_admin_name>
    


  6. Type the following command:
    qconf -ah <new_win_hosts>
    


  7. Copy certificates to each Windows host.

  8. Set the owner of the certificates to ADMINUSER.
    Use a command similar to the following example:
    chown -R foo:bar /var/sgeCA/port<SGE_QMASTER_PORT>
    


  9. Run the normal execution daemon installation on each execution host.

Other Installation Issues

Additional considerations for installing Sun Grid Engine software are identified in this section. These include the following topics:

Topic Description
How to Verify and Install Linux Motif Libraries Learn how to verify and install Linux Motif libraries.
How to Install the Software on a System With IPMP Describes how to install the Sun Grid Engine software on hosts with the Solaris Operating Environment IP Multipathing (IPMP) technology.

How to Verify and Install Linux Motif Libraries

On newer Linux systems, the libXm.so.2 Motif libraries are not always installed, which results in the inability to run the precompiled Linux qmon binary.

To correct this problem, follow these steps:

  1. Check if the libraries are already present.
    % ls -l /usr/X11R6/lib/libXm*
    

    If the /usr/X11R6/lib/libXm.so.2 points to a libXm.so.2.x version, you are done. Note that a symbolic link to /usr/X11R6/lib/libXm.so.3 does not work.
    If the libraries are not present, then continue following these steps.

  2. Download the corresponding openmotif libraries from http://www.ist.co.uk/DOWNLOADS/motif_download.html or from the SUSE 9.1 distribution (an additional rpm file called openmotif21-* is available).

  3. Install the missing libraries as root.
    For SUSE 9.1, you install the openmotif21-* package like any other package. For packages downloaded from http://www.ist.co.uk, install the libraries as shown in the following example.
    # rpm -i --prefix /tmp/test --force \
          openmotif-2.1.31-2_IST-JDS2003.i386.rpm
    # cd /tmp/test/OpenMotif-2.1.31/lib
    # cp libXm.so.2.1 /usr/X11R6/lib
    # cd /usr/X11R6/lib
    # ln -s libXm.so.2.1 libXm.so.2
    


  4. Test qmon.
    % ldd `which qmon`
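
To make the ldd check in the last step easier to read, you can filter its output for unresolved libraries. The following is a minimal sketch that assumes the usual Linux ldd output format, in which a missing library is reported as "not found". The function name missing_libs is hypothetical.

```shell
# Hypothetical helper: reads ldd output on standard input and prints the
# names of shared libraries that the dynamic linker could not resolve.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Example (requires qmon in PATH):
#   ldd "$(which qmon)" | missing_libs
```

If the command prints nothing, all libraries that qmon needs, including libXm.so.2, were resolved.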
    

How to Install the Software on a System with IPMP

This section describes how to install the Grid Engine software on hosts with the Solaris Operating Environment IP Multipathing (IPMP) technology.

What Is IP Multipathing?

IP Multipathing is a technology that allows TCP/IP interfaces to be grouped for failover and load balancing purposes. If an interface within an IP Multipathing group fails, the interface is disabled and its IP address is relocated to another interface in the group. Outbound IP traffic is distributed across the interfaces of a group. For further details on IP Multipathing, refer to the Solaris Operating Environment documentation at http://docs.sun.com/app/docs/doc/816-4554/ipmptm-1.

Issues Between IPMP and Grid Engine

When you start the Grid Engine daemons on a machine whose main interface is part of an IPMP group, error messages appear. When IPMP load balancing distributes connections across the interfaces in the group, IP packets can arrive at the receiving end as if they came from a different host than the one associated with the main interface. For example, consider a machine with three interfaces named qfe0, qfe1, and qfe2, with the IP addresses 10.1.1.1, 10.1.1.2, and 10.1.1.3 respectively. (IPMP also needs an extra test address for each interface, but this example ignores that requirement.) Each of these addresses has a host name associated with it. The hosts table looks like the following example:

10.1.1.1 sge
10.1.1.2 sge-qfe1
10.1.1.3 sge-qfe2

The machine's host name is sge. When a connection is established from sge to another machine, it might go through sge, sge-qfe1, or sge-qfe2. Upon installation, Grid Engine will only recognize sge. When Grid Engine receives a connection request from sge-qfe2, it closes the connection because the request is not from one of the authorized (or known) nodes.

To solve this problem, use the host_aliases file to tell Grid Engine that sge, sge-qfe1, and sge-qfe2 all belong to the same machine. See the sge_h_aliases man page for details. The host_aliases file in this case would look like this:

sge sge-qfe1 sge-qfe2
Note
If you make any changes to the $SGE_ROOT/$SGE_CELL/common/host_aliases file, you must stop and restart all running Grid Engine daemons (sge_qmaster and sge_execd). To do this, log in as root to all your Grid Engine hosts and enter these commands:
/etc/init.d/sgemaster stop
/etc/init.d/sgeexecd stop
/etc/init.d/sgemaster start
/etc/init.d/sgeexecd start
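
The effect of a host_aliases entry can be illustrated with a small lookup. The sketch below assumes that the first name on each line is the name Grid Engine uses for the host and that the remaining names on the line are its aliases. The function name canonical_host is hypothetical and is not a Grid Engine command.

```shell
# Hypothetical helper: prints the first name on the host_aliases line that
# contains the given host name, or the name itself if no line matches.
#   $1 = host name to look up
#   $2 = path to a host_aliases file
canonical_host() {
    awk -v name="$1" '
        NF == 0 { next }                      # skip blank lines
        {
            for (i = 1; i <= NF; i++) {
                if ($i == name) { print $1; found = 1; exit }
            }
        }
        END { if (!found) print name }
    ' "$2"
}

# Example:
#   canonical_host sge-qfe2 "$SGE_ROOT/$SGE_CELL/common/host_aliases"
```

With the one-line file shown above, a lookup of sge-qfe1 or sge-qfe2 prints sge.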

Installing the Grid Engine Master Node With IPMP

There are two ways that you can fix this problem:

  • Ignore the error messages during installation. This method is operating system independent (except for MS Windows).
  • Temporarily disable IPMP on the interface associated with the machine's host name. This method only works on systems running at least Version 8 of the Solaris OS.

Ignoring the Error Messages

To ignore the error messages, follow these steps:

  1. Run the inst_sge -m command while ignoring the error messages during the start up of the daemons.

  2. Shut down the daemons with the /etc/init.d/sgemaster stop and /etc/init.d/sgeexecd stop commands.
    Due to the networking errors, some daemons fail to shut down and must be killed with the kill -9 command. To see which daemons failed to shut down, use this command: ps -e | grep sge_.

  3. Install the host_aliases file in the $SGE_ROOT/$SGE_CELL/common directory.

  4. Restart the daemons with the /etc/init.d/sgemaster start and /etc/init.d/sgeexecd start commands.

Temporarily Disabling IPMP

To temporarily disable IPMP, follow these steps:

  1. Identify the interface associated with the machine's host name.

  2. Verify that the interface has IPMP enabled by using the ifconfig interface | grep groupname command.

  3. Take note of the group name.

  4. Disable IPMP with this command: ifconfig interface group "" .

  5. Install the Grid Engine master node.

  6. Install the host_aliases file in the $SGE_ROOT/$SGE_CELL/common directory.

  7. Restart the daemons with the /etc/init.d/sgemaster and /etc/init.d/sgeexecd commands.

  8. Re-enable IPMP using the following command: ifconfig interface group IPMP_group, where IPMP_group is the group name that you noted in Step 3.
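
The group-name handling in the steps above can be sketched as follows. The sketch assumes that ifconfig reports the group on a line of the form "groupname <name>", which is what the grep in Step 2 relies on. The function name ipmp_group is hypothetical, and the interface name qfe0 is taken from the earlier example.

```shell
# Hypothetical helper: reads "ifconfig <interface>" output on standard input
# and prints the IPMP group name, or nothing if no group is reported.
ipmp_group() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "groupname") { print $(i + 1); exit }
        }
    }'
}

# Example on a Solaris host, run as root:
#   group=$(ifconfig qfe0 | ipmp_group)   # Steps 2 and 3: find and note the group
#   ifconfig qfe0 group ""                # Step 4: disable IPMP
#   ...                                   # Steps 5 to 7: install, add host_aliases, restart
#   ifconfig qfe0 group "$group"          # Step 8: re-enable IPMP
```

Saving the group name in a variable avoids retyping it when you re-enable IPMP.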

Installing Grid Engine on an Execution Host With IPMP

Once the host_aliases file is installed and the Grid Engine daemons are restarted, you can simply start the execution host installation without further problems.

Enabling Administrative and Submit Hosts With IPMP

You have two choices when enabling these hosts with IPMP:

  • Follow the same procedure used for the execution host (updating the host_aliases file before installation).
  • Add all the host names associated with the administrative or submit host with one of the following commands:
    • For the administrative host:
      qconf -ah <hostname> <alias 1> <alias 2> ...
      
    • For the submit host:
      qconf -as <hostname> <alias 1> <alias 2> ...
      

Upgrading From a Previous Release of the Software

Upgrading Guide (Printable)

Note
  • The following instructions will work only on the Sun Grid Engine 6.2 RR release.
  • The upgrade procedure is only able to upgrade your software from version 6.0 update 2 or higher. If you are running an older version of the Sun Grid Engine software, such as 5.3 or 6.0, you must upgrade to version 6.0 update 2 or higher and then upgrade again to version 6.2 as explained below. See How to Upgrade from 5.3 to 6.0.

About Upgrading the Software

Note
  • The upgrade procedure is now partly destructive. See the constraints.
  • The LD_LIBRARY_PATH variable is not set in Grid Engine 6.2 software. Remove the existing LD_LIBRARY_PATH settings from 6.0 before you start a 6.2 installation.
  • Before you begin the upgrade process, make sure that you source the existing $SGE_ROOT/$SGE_CELL/common/settings.sh or $SGE_ROOT/$SGE_CELL/common/settings.csh file.

The upgrade procedure uses the cluster configuration information from the older version of the software to install the Grid Engine 6.2 software on the master host. Beginning with the Sun Grid Engine 6.2 release, you can install 6.2 to a different $SGE_ROOT or $SGE_CELL and transfer the old configuration to this cluster. This method is called cloned cluster configuration. You might want to use this method to accomplish the following:

  • To test the upgrade before making the real upgrade.
  • To keep the old cluster running.

Before You Upgrade

Choose one of the following methods to upgrade to 6.2:

  • New 6.2 installation (different $SGE_ROOT or $SGE_CELL) using the same configuration as was used for the old cluster (cloned cluster configuration).
    If you use the cloned cluster configuration, you do not have to stop or in any way affect the original cluster. You simply install a new qmaster and transfer the configuration from the old cluster to the new one. Then, you manually restart the new execution daemons on all the original execution hosts.
    The disadvantage of the cloned configuration method is that you have to install the new qmaster and might lose some of the configuration information during the upgrade (see the constraints). Another disadvantage is that each original execution host will now have twice as many slots: one set for the old cluster and one for the new one.
  • Real upgrade of the existing cluster (same $SGE_ROOT and $SGE_CELL).

Constraints

The following constraints apply to both upgrade methods:

  • Dynamic and static load values will be lost (only static values will be recreated).
  • The sharetree usage will be lost.
  • Neither jobs nor advanced reservations (ARs) will be replicated.
  • There might be running or pending jobs in the cluster when the configuration is saved. If you decide to install the new Sun Grid Engine version in the same $SGE_ROOT and $SGE_CELL, then you must remove all jobs from the old cluster before the old cluster is shut down and the new software is installed.
  • The previous state of a disabled queue will be lost if the queue config initial_state is set to default.



How to Back Up the Configuration of the Old Cluster

You can create this backup at any time before you start the upgrade procedure. The backup procedure is the same for both upgrade methods. To create the backup, at least the qmaster daemon must be running.

What the Backup Contains

The backup saves the following files:

  • arseqnum
  • jobseqnum
  • act_qmaster
  • bootstrap
  • cluster_name
  • host_aliases
  • qtask
  • sge_aliases
  • sge_ar_request
  • sge_request
  • sge_qstat
  • sge_qquota
  • shadow_masters
  • accounting
  • dbwriter.conf
  • jmx directory
Caution
  • During the upgrade procedure, you can select the next job ID. Do not select a job ID that is less than the last job ID in the accounting file in the backup. If you do, the accounting file will contain some job IDs twice. This leads to unexpected behaviors.
  • To avoid the problem, accept the suggested default for the next job ID. The upgrade procedure calculates a safe value for the default.

The backup process creates the following files:

  • sge_root - old $SGE_ROOT
  • sge_cell - old $SGE_CELL
  • ports - old $SGE_QMASTER_PORT and $SGE_EXECD_PORT
  • win_hosts - A list of registered Windows execution hosts at the time of the backup

The standard qconf client is used to save the complete cluster configuration.

How to Back Up the Cluster

  1. Either download the backup script or get the backup script from the Sun Grid Engine 6.2 common package (util/upgrade_modules/save_sge_config.sh).
  2. (Optional) Verify that the script is executable.
  3. Source the $SGE_ROOT/$SGE_CELL/common/settings.sh (or .csh) file of the original cluster.
  4. Run the backup script.
    The backup script has one argument, which is the path to the directory in which to store the backup. The directory must not already exist, but the user must have permission to create it.
    Note
    You must run the backup script on an admin host (qconf -sh) as a manager or operator user (typically sgeadmin).
    # ./save_sge_config.sh /backups/sge_6.1_June10_2008
    

    The backup process displays a message confirming that the backup succeeded.
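
Because the backup directory must not exist yet, a small pre-flight check can catch a wrong path before the backup script starts. The following is a sketch only. The function name backup_dir_ok is hypothetical and is not part of the Grid Engine distribution.

```shell
# Hypothetical pre-flight check: succeeds only if the backup directory does
# not exist yet and its parent directory exists and is writable.
backup_dir_ok() {
    dir=$1
    if [ -e "$dir" ]; then
        echo "error: $dir already exists" >&2
        return 1
    fi
    parent=$(dirname "$dir")
    if [ ! -d "$parent" ] || [ ! -w "$parent" ]; then
        echo "error: cannot create $dir in $parent" >&2
        return 1
    fi
    return 0
}

# Example:
#   backup_dir_ok /backups/sge_6.1_June10_2008 &&
#       ./save_sge_config.sh /backups/sge_6.1_June10_2008
```

The check does not replace the confirmation message from the backup script; it only avoids starting a backup that cannot create its target directory.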


How to Install the 6.2 Software Using the Cloned Configuration Method

Additional Constraints for the New 6.2 Installation with Cloned Configuration

For the cloned cluster configuration, you must also define several new variables and directories that must be different from the original settings:

  • $SGE_ROOT
  • $SGE_CELL
  • $SGE_CLUSTER_NAME
  • $SGE_QMASTER_PORT
  • $SGE_EXECD_PORT
  • Master daemon spooling directory (qmaster_spool_dir)
  • Execution daemon spooling directory (execd_spool_dir)
  • Group ID range for the jobs (gid_range)
Caution
Only one SGE_Helper_Service.exe can run on an execution host. You cannot use the same Windows execution host for a 6.0 or 6.1 cluster and a 6.2 cluster.
Note
  • Because there have been significant changes in the Grid Engine 6.2 software, loading the configuration adds and removes some configuration attributes. Adding and removing configuration attributes might affect the operation of the cluster.
  • To ensure stability, you should always follow this process:
    1. Upgrade to the new $SGE_ROOT or $SGE_CELL (cloned cluster configuration).
    2. Test that the original cluster configuration did not change and that the functionality of the cluster remains intact.
    3. Perform the real upgrade of the original cluster, if desired.

Caution
Do not make both the new cluster and the old cluster available to your users. If you do, execution hosts would offer the original number of slots to both clusters and might become overloaded.
  1. Back up the original cluster settings as described in How to Back Up the Cluster.
  2. (Optional) ARCo Upgrade Prerequisites
    If you use ARCo and want the data from the old and new clusters in the same ARCo database, do not install the dbwriter on the new cluster with the old dbwriter's database parameters until the dbwriter on the old cluster is stopped and all the data from the old cluster has been inserted into the database. After you install the dbwriter (with the same database parameters) on the new cluster, do not start the dbwriter on the old cluster again. Otherwise, your database will be compromised.
    1. Wait to install ARCo on the new cluster until all the jobs are drained from the old cluster, the cluster is stopped and the old reporting file is processed completely.
      There should be no reporting or reporting.processing file in the $SGE_ROOT/$SGE_CELL/common directory of the old cluster.
      Note
      Jobs can be submitted and the reporting file generated on the new cluster, as long as there is no dbwriter installed on the new cluster.
      Caution
      • There cannot be more than one dbwriter process writing into the same ARCo database and schema.
      • If you create a new ARCo database for the new cluster, you cannot later merge it with the old ARCo database, due to the primary key constraints.

      Once the reporting file on the old cluster is processed, perform the following steps on the dbwriter host:

    2. Source the cluster settings.sh (or .csh) file.
    3. Stop the dbwriter:
      # $SGE_ROOT/$SGE_CELL/common/sgedbwriter stop
      



  3. Extract the new 6.2 binaries and common files to the new $SGE_ROOT directory.

  4. Start the new upgrade installation of the qmaster from the new $SGE_ROOT directory.
    # ./inst_sge -upd
    

    This starts the upgrade procedure. See the Example Upgrade for Cloned Cluster Configuration.

    Tip
    To enable or disable additional features such as JMX or CSP, or to use the old IJS, you must pass additional flags to the upgrade script in the same way as you would for a qmaster installation. For example, to upgrade a cluster and enable the JMX thread in qmaster and CSP mode, run:
    ./inst_sge -upd -jmx -csp


  5. Accept the displayed license.

  6. Enter the complete path to the backup directory.
    For example, /backups/sge_6.1_June10_2008. See Step 6 in the example.

  7. Enter the new $SGE_ROOT directory.
    The default is the current directory. For more information, see SGE_ROOT. See Step 7 in the example.

  8. Select a new $SGE_CELL directory.
    The default is the $SGE_CELL directory from the backup. For more information, see SGE_CELL. See Step 8 in the example.

  9. Select a new SGE_QMASTER_PORT number.
    The default is the $SGE_QMASTER_PORT number from the backup + 2. See Step 9 in the example.

  10. Select a new SGE_EXECD_PORT number.
    The default is the $SGE_EXECD_PORT number from the backup + 2. See Step 10 in the example.

  11. Select a new qmaster spooling directory.
    The default is $SGE_ROOT/$SGE_CELL/spool/qmaster. See Step 11 in the example.

  12. Select a new $SGE_CLUSTER_NAME.
    The default is p$SGE_QMASTER_PORT. For more information, see SGE_CLUSTER_NAME. See Step 12 in the example.

  13. (Optional) Choose the JMX configuration.
    For more information about JMX, see JMX guide.
    If you started the upgrade using the -jmx option, one of the following choices appears:
    • Choose if you want to use JMX settings from the backup or use new settings.
      This question appears when JMX exists in the backup.
    • Choose a JMX port.
      This question appears when JMX does not exist in the backup.

  14. Select a spooling method.
    For more information on choosing a spooling mechanism, see Choosing Between Classic Spooling and Database Spooling. See Step 14 in the example.

  15. Choose if you want to use interactive jobs support (IJS) settings from the backup or use the new defaults for 6.2.
    In most cases, you should use the new defaults which enable the new interactive jobs support. Step 15 in the example shows the new defaults.
    Caution
    If you changed the QLOGIN_DAEMON, QLOGIN_COMMAND, RLOGIN_DAEMON, RLOGIN_COMMAND, RSH_DAEMON, or RSH_COMMAND configuration attributes, you should verify that the new IJS will not break your site-specific settings.



  16. Choose the group ID range.
    The default range starts at the last group ID from the backup plus 100 and has the same size as the old range. See Step 16 in the example.

  17. Select the next job ID.
    The default is old jobseqnum + 1000, rounded up to the nearest 1000. See Step 17 in the example.

  18. (Optional) Select the next AR ID.
    This question appears only if arseqnum is in the backup. The default is old arseqnum + 1000, rounded up to the nearest 1000. See Step 18 in the example.

  19. Select automatic startup options.
    See Step 19 in the example.
    One of the following choices appears:
    • Choose whether to run qmaster as an SMF service.
      This question appears only on systems that run at least version 10 of the Solaris OS.
    • Choose whether to use RC scripts for qmaster.
      This question appears on platforms that are not running at least version 10 of the Solaris OS or if you started the upgrade using the -nosmf option.

  20. Load the old configuration.
    See Step 20 in the example.
    If this step fails with a critical error:
    1. Check the log file /tmp/sge_backup_date.log.
    2. Try to reload the configuration through the $SGE_ROOT/util/upgrade_modules/load_sge_config.sh script and the arguments displayed in the previous step.
    3. If the preceding steps do not resolve the problem, stop the upgrade process.

  21. (Optional) Upgrade ARCo.
    If you use ARCo, you need to upgrade it. To keep using the same ARCo database, copy the $SGE_ROOT/$SGE_CELL/common/dbwriter.conf file from the old cluster into the same directory on the new cluster. The dbwriter installation sources this file and prompts you only for missing information. See Upgrading ARCo step 6.

  22. Run the post-upgrade procedures.
    Info
    The post-upgrade procedures are easier when you have root access to all machines through ssh or rsh without having to enter a password. To use rsh instead of the default ssh, run the ./inst_sge command with the -rsh argument. Example:
    # ./inst_sge -upd-execd -rsh
    1. Initialize the local execd spool directories
      This step creates the local execd spool directories on the execd hosts with the correct permissions. Run the following command as root from the master host in the $SGE_ROOT directory:
      # ./inst_sge -upd-execd
      
    2. (Optional) Create new RC scripts for the whole cluster.
      Caution
      This command removes old RC scripts. To keep the old RC scripts, do not run this command.

      To start the services automatically after a reboot, run the following command as root from the master host in the $SGE_ROOT directory:

      # ./inst_sge -upd-rc
      
    3. (Optional) Install or update the Windows helper service.
      Perform this step to use the Windows execution hosts with the 6.2 cluster. When connecting to each Windows execution host, you are prompted for an administrator user to connect to the Windows host. If all your Windows hosts share the same administrative user, set the environment variable SGE_WIN_ADMIN to that user to access all Windows hosts without additional user intervention. Example:
      (sh, bash)# export SGE_WIN_ADMIN=Administrator
      (csh,tcsh)# setenv SGE_WIN_ADMIN Administrator
      

      To install or update the Windows helper service, run the following command as root from the master host in the $SGE_ROOT directory:

      # ./inst_sge -upd-win
      
      Caution
      Only one SGE_Helper_Service.exe can run on an execution host. You cannot use the same Windows execution host for a 6.0 or 6.1 cluster and a 6.2 cluster.



  23. Start the new execution daemons.
    Optionally, if you can log in without typing a password, you can start the whole cluster as the root user from the $SGE_ROOT directory with a single command:
    # ./inst_sge -start-all
    

    This command starts the master daemon, shadow daemons, and all execution daemons.

Upgrade is complete.
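
The suggested defaults for the port numbers (Steps 9 and 10) and for the next job ID and AR ID (Steps 17 and 18) are simple arithmetic on the values in the backup. The sketch below reproduces that arithmetic as it is described in this procedure. It is not taken from the upgrade script, and the function names are hypothetical.

```shell
# Default port: the port number from the backup plus 2 (Steps 9 and 10).
default_port() {
    echo $(( $1 + 2 ))
}

# Default next ID: the old sequence number plus 1000, rounded up to the
# nearest multiple of 1000 (Steps 17 and 18).
default_next_id() {
    echo $(( ( ($1 + 1000 + 999) / 1000 ) * 1000 ))
}
```

For a backup whose last job ID is 1, default_next_id 1 prints 2000, which matches the value suggested in Step 17 of the example upgrade.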


Example Upgrade for Cloned Cluster Configuration

The following upgrade example uses a copy of the existing cluster configuration with a different $SGE_CELL. This example does not use JMX and there are no Service Tags. The steps in this example are referred to from the software upgrade description at How to Install the 6.2 Software Using the Cloned Configuration Method.

Steps 4 and 5
# ./inst_sge -upd

Welcome to the Grid Engine Upgrade Procedure
--------------------------------------------

Before you continue with the upgrade, read these hints:

   - Your terminal window should have a size of at least
     80x24 characters

   - At any time during the upgrade process, use your standard
     interrupt key to abort the upgrade. Typically, the interrupt
     key combination is Ctrl-C.

The upgrade procedure will take approximately 1-2 minutes.

Hit <RETURN> to continue >>

Step 6
Type the complete path to the Grid Engine configuration backup directory.
-------------------------------------------------------------------------
Backup directory  >> /tmp/bck

Found backup from GE 6.1u4 version created on 2008-06-10_10:56:29
Continue with this backup directory (y/n) [y] >>

Step 7
The Grid Engine root directory is:

   $SGE_ROOT = /sge

If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN>
to use default [/sge] >>

Your $SGE_ROOT directory: /sge

Hit <RETURN> to continue >>

Step 8
Grid Engine cells
-----------------

Grid Engine supports multiple cells.

If you are not planning to run multiple Grid Engine clusters or if you don't
know yet what is a Grid Engine cell it is safe to keep the default cell name

   default

If you want to install multiple cells you can enter a cell name now.

The environment variable

   $SGE_CELL=<your_cell_name>

will be set for all further Grid Engine commands.

Enter cell name [default] >> new_cell

Using cell >new_cell<.
Hit <RETURN> to continue >>

Step 9
Grid Engine TCP/IP communication service
----------------------------------------

The port for sge_qmaster is currently set by the shell environment.

   SGE_QMASTER_PORT = 21640

Now you have the possibility to set/change the communication ports by using the
>shell environment< or you may configure it via a network service, configured
in local >/etc/service<, >NIS< or >NIS+<, adding an entry in the form

    sge_qmaster <port_number>/tcp

to your services database and make sure to use an unused port number.

How do you want to configure the Grid Engine communication ports?

Using the >shell environment<:                           [1]

Using a network service like >/etc/service<, >NIS/NIS+<: [2]

(default: 1) >>

Grid Engine TCP/IP communication service
----------------------------------------

Using the environment variable

   $SGE_QMASTER_PORT=21640

as port for communication.

Do you want to change the port number? (y/n) [n] >>

Step 10
Grid Engine TCP/IP communication service
----------------------------------------

The port for sge_execd is currently set by the shell environment.

   SGE_EXECD_PORT = 21641

Now you have the possibility to set/change the communication ports by using the
>shell environment< or you may configure it via a network service, configured
in local >/etc/service<, >NIS< or >NIS+<, adding an entry in the form

    sge_execd <port_number>/tcp

to your services database and make sure to use an unused port number.

How do you want to configure the Grid Engine communication ports?

Using the >shell environment<:                           [1]

Using a network service like >/etc/service<, >NIS/NIS+<: [2]

(default: 1) >>

Grid Engine TCP/IP communication service
----------------------------------------

Using the environment variable

   $SGE_EXECD_PORT=21641

as port for communication.

Do you want to change the port number? (y/n) [n] >>

Step 11
Grid Engine qmaster spool directory
-----------------------------------

The qmaster spool directory is the place where the qmaster daemon stores
the configuration and the state of the queuing system.

The admin user >sgeadmin< must have read/write access
to the qmaster spool directory.

If you will install shadow master hosts or if you want to be able to start
the qmaster daemon on other hosts (see the corresponding section in the
Grid Engine Installation and Administration Manual for details) the account
on the shadow master hosts also needs read/write access to this directory.

The following directory

[/sge/new_cell/spool/qmaster]

will be used as qmaster spool directory by default!

Do you want to select another qmaster spool directory (y/n) [n] >>

Step 12
Unique cluster name
-------------------

The cluster name uniquely identifies a specific Sun Grid Engine cluster.
The cluster name must be unique throughout your organization. The name
is not related to the SGE cell.

The cluster name must start with a letter ([A-Za-z]), followed by letters,
digits ([0-9]), dashes (-) or underscores (_).

Enter new cluster name or hit <RETURN>
to use default [p21640] >>

Your $SGE_CLUSTER_NAME: p21640

Hit <RETURN> to continue >>

Step 14
creating directory: /sge/new_cell/spool/qmaster/job_scripts

Setup spooling
--------------
Your SGE binaries are compiled to link the spooling libraries
during runtime (dynamically). So you can choose between Berkeley DB
spooling and Classic spooling method.
Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> classic

Initializing spooling database

Hit <RETURN> to continue >>

Step 15
Interactive Job Support (IJS) Selection
---------------------------------------

The backup configuration includes information for running
interactive jobs. Do you want to use the IJS information from
the backup ('y') or use new default values ('n') (y/n) [y] >> n


Using new interactive job support default setting for a new installation.
Hit <RETURN> to continue >>

Creating >act_qmaster< file

Step 16
Grid Engine group id range
--------------------------

When jobs are started under the control of Grid Engine an additional 
group id is set on platforms which do not support jobs. This is done 
to provide maximum control for Grid Engine jobs.

This additional UNIX group id range must be unused group id's in your 
system. Each job will be assigned a unique id during the time it is 
running. Therefore you need to provide a range of id's which will 
be assigned dynamically for jobs.

The range must be big enough to provide enough numbers for the 
maximum number of Grid Engine jobs running at a single moment on 
a single host. E.g. a range like >20000-20100< means, that Grid Engine 
will use the group ids from 20000-20100 and provides a range for 
100 Grid Engine jobs at the same time on a single host.

You can change at any time the group id range in your cluster configuration.

Please enter a range [34299-34498] >>

Using >34299-34498< as gid range. Hit <RETURN> to continue >>

Grid Engine cluster configuration
---------------------------------

Please give the basic configuration parameters of your Grid Engine
installation:

   <execd_spool_dir>

The pathname of the spool directory of the execution hosts. User >sgeadmin<
must have the right to create this directory and to write into it.

Default: [/sge/new_cell/spool] >>

Grid Engine cluster configuration (continued)
---------------------------------------------

<administrator_mail>

The email address of the administrator to whom problem reports are sent.

It is recommended to configure this parameter. You may use >none<
if you do not wish to receive administrator mail.

Please enter an email address in the form >user@foo.com<.

Default: [sgeadmin@qmaster.com] >>

The following parameters for the cluster configuration were configured:

   execd_spool_dir        /sge/new_cell/spool
   administrator_mail     sgeadmin@qmaster.com

Do you want to change the configuration parameters (y/n) [n] >>

Step 17
Provide a value to use for the next job ID.
-------------------------------------------

Backup contains last job ID 1. As a suggested value, we added 1000
to that number and rounded it up to the nearest 1000.
Increase the value, if appropriate.
Choose the new next job ID [2000] >> 

Hit <RETURN> to continue >>

Step 18
Provide a value to use for the next AR ID.
------------------------------------------

Backup contains last AR ID 1. As a suggested value, we added 1000
to that number and rounded it to the nearest 1000.
Increase the value, if appropriate.
Choose the new next AR ID [2000] >>

Hit <RETURN> to continue >>

Step 19
Creating >sgemaster< script
Creating >sgeexecd< script
Creating settings files for >.profile/.cshrc<

Hit <RETURN> to continue >>

qmaster startup script
----------------------

Do you want to start qmaster automatically at machine boot?
NOTE: If you select "n" SMF will be not used at all! (y/n) [y] >> n


Grid Engine qmaster startup
---------------------------

Starting qmaster daemon. Please wait ...
   starting sge_qmaster
Hit <RETURN> to continue >>

Step 20
Last step - load configuration from the backup
----------------------------------------------

load command: /sge/util/upgrade_modules/load_sge_config.sh /tmp/bck -mode "copy" -log C -newijs "false" -gid_range "34299-34498" -admin_mail "sgeadmin@qmaster.com" -execd_spool_dir "/sge/new_cell/spool"


Hit <RETURN> to continue >>


Loading saved cluster configuration from /tmp/bck (log in /tmp/sge_backup_load_2008-06-13_17:42:28.log)...
Done

If loading the configuration succeeded run these additional commands:
REQUIRED:
inst_sge -upd-execd
   This command initializes all execd spool directories.

inst_sge -upd-win
   This command connects to all Windows execution hosts and installs
   the new Windows helper service on each host.
   WARNING: If a helper service from a previous release is running
            on this host, the new helper service overwrites it. The
            host will run only in a 6.2 cluster.
   TIP: This action requires to enter a windows administrator user for each
        host interactively. If all your systems share the same administrator you
        can set the environment variable SGE_WIN_ADMIN to that user name.
        E.g.: (sh, bash) export SGE_WIN_ADMIN=Administrator
              (csh,tcsh) setenv SGE_WIN_ADMIN Administrator

OPTIONAL:
inst_sge -upd-rc
   This command creates new autostart scripts for the new cluster
   and removes any conflicting files.
   TIP: To disable SMF on Solaris systems, use the command
        inst_sge -upd-rc -nosmf

TIP: Use inst_sge -post-upd to do all above actions

How to Upgrade the Original Cluster to the 6.2 Software (Real Upgrade)

  1. (Optional) Test the cloned cluster, if you used the cloned cluster configuration method to transfer the configuration to a new 6.2 cluster.

  2. Back up the original cluster settings as described in How to Back Up the Cluster.

  3. Stop the scheduler:
    # qconf -ks
    


  4. Verify that no jobs are running on the cluster.

  5. Stop the old cluster:
    # qconf -ke all
    # $SGE_ROOT/$SGE_CELL/common/sgemaster stop
    


  6. (Optional) Stop the Berkeley DB server, if your cluster uses Berkeley DB server spooling.
    On the BDB server host:
    1. Source the cluster settings.sh (or .csh) file.
    2. Type the following command:
      # $SGE_ROOT/$SGE_CELL/common/sgebdb stop
      


  7. (Optional) If you use ARCo, ensure that the reporting file has been completely processed by the dbwriter.
    There should be no reporting or reporting.processing file in the $SGE_ROOT/$SGE_CELL/common directory.
    After the reporting file has been processed, do the following on the dbwriter host:
    1. Source the cluster settings.sh (or .csh) file.
    2. Stop the dbwriter:
      # $SGE_ROOT/$SGE_CELL/common/sgedbwriter stop
      
      Warning
      If you use ARCo, you must completely process the reporting file and stop the dbwriter before you continue.
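
The check described in step 7 can be scripted. The following is a minimal sketch and not part of the Grid Engine distribution: the function name reporting_processed and the fallback values /sge and default are assumptions made for this example. The function succeeds only when the given directory exists and contains neither a reporting nor a reporting.processing file.

```shell
# Sketch: exit status 0 only when the directory exists and holds no
# pending reporting files. $1 is the directory to inspect, normally
# $SGE_ROOT/$SGE_CELL/common.
reporting_processed() {
    common_dir=$1
    [ -d "$common_dir" ] &&
        [ ! -e "$common_dir/reporting" ] &&
        [ ! -e "$common_dir/reporting.processing" ]
}

# The /sge and default fallbacks are assumptions for this example only.
if reporting_processed "${SGE_ROOT:-/sge}/${SGE_CELL:-default}/common"; then
    echo "No pending reporting file: the dbwriter can be stopped."
else
    echo "Reporting file still present (or directory missing): do not continue yet."
fi
```

Treating a missing directory as a failure is a deliberate choice in this sketch, so that a mistyped path is not mistaken for a fully processed reporting file.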


  8. Extract the new 6.2 binaries and common files to the $SGE_ROOT directory.
    Caution
    Do not remove any of the $SGE_ROOT directory contents unless the new Sun Grid Engine 6.2 binaries differ from the existing installation. For example, you might have used custom lx26-amd64 binaries, but Sun Grid Engine 6.2 uses lx24-amd64 for 2.6 kernels. In that case, you must remove the old binaries manually. Ensure that the binaries for every architecture in use have been updated and that no architecture with the old version remains in the $SGE_ROOT directory.



  9. Start the new upgrade on the original qmaster host from the $SGE_ROOT directory.
    # ./inst_sge -upd
    
    Tip
    To enable or disable some additional features like JMX, CSP, or to use the old IJS, you must provide additional flags to the upgrade script in the same way that you would for qmaster installation. For example, to upgrade a cluster and enable the JMX thread in qmaster and use CSP mode, run the following command: ./inst_sge -upd -jmx -csp


  10. Accept the displayed license.

  11. Enter the complete path to the backup directory.
    For example, /backups/sge_6.1_June10_2008.

    Caution
    If you do not specify the original $SGE_ROOT and $SGE_CELL in the next two steps, the installer does not perform a real upgrade. Instead, it uses the clone cluster configuration method.
  12. Enter the $SGE_ROOT directory.
    The default is the current directory. For more information, see SGE_ROOT.

  13. Enter the $SGE_CELL directory.
    The default is default. For more information, see SGE_CELL.

  14. Select a new $SGE_CLUSTER_NAME.
    The default value is one of the following, depending on which is found first:
    • The existing SGE_CLUSTER_NAME ($SGE_ROOT/$SGE_CELL/common/cluster-name)
    • The SGE_CLUSTER_NAME from the backup
    • p$SGE_QMASTER_PORT
      For more information, see SGE_CLUSTER_NAME.
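
The selection order in step 14 can be pictured as a simple fallback chain. The following is a sketch under stated assumptions, not the installer's actual code: the function name default_cluster_name and its three arguments are hypothetical, and the port 536 is taken from the /etc/services example elsewhere in this guide.

```shell
# Sketch of the "first one found wins" order for the default cluster name.
# Arguments (hypothetical names for this example):
#   $1  path to the existing cluster-name file (may not exist)
#   $2  cluster name stored in the backup (may be empty)
#   $3  qmaster port number
default_cluster_name() {
    name_file=$1
    backup_name=$2
    qmaster_port=$3
    if [ -s "$name_file" ]; then
        cat "$name_file"          # 1. existing SGE_CLUSTER_NAME
    elif [ -n "$backup_name" ]; then
        echo "$backup_name"       # 2. SGE_CLUSTER_NAME from the backup
    else
        echo "p$qmaster_port"     # 3. p$SGE_QMASTER_PORT
    fi
}

# With no existing file and no name in the backup, the port-based name is used.
default_cluster_name /nonexistent/cluster-name "" 536   # prints p536
```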

  15. (Optional) Select the JMX configuration.
    For more information about JMX, see JMX guide.
    If you started the upgrade using the -jmx option, one of the following choices appears:
    • Choose if you want to use JMX settings from the backup or use new settings.
      This question appears when JMX exists in the backup.
    • Choose a JMX port.
      This question appears when JMX does not exist in the backup.

  16. Choose if you want to keep the spooling method from the backup.

  17. (Optional) Select a spooling method.
    This prompt is displayed only if you chose not to keep the spooling method from the backup in the previous step. See example. For more information on choosing a spooling mechanism, see Choosing Between Classic Spooling and Database Spooling.

  18. Choose if you want to use interactive jobs support (IJS) settings from the backup or use the new defaults for 6.2.
    In most cases, you should use the new defaults which enable the new interactive jobs support.
    Caution
    If you changed the QLOGIN_DAEMON, QLOGIN_COMMAND, RLOGIN_DAEMON, RLOGIN_COMMAND, RSH_DAEMON, or RSH_COMMAND configuration attributes, you should verify that the new IJS will not break your site-specific settings.


  19. Select the next job ID.
    The default is old jobseqnum + 1000, rounded up to the nearest 1000.

  20. (Optional) Select the next AR ID.
    This question appears only if arseqnum is in the backup. The default is old arseqnum + 1000, rounded up to the nearest 1000.
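
The default described in steps 19 and 20 is plain integer arithmetic. The following sketch reproduces it. The function name suggest_next_id is hypothetical, and the assumption that a sum that is already a multiple of 1000 is left unchanged is this example's, not a documented installer behavior.

```shell
# Sketch: add 1000 to the last ID from the backup, then round the sum
# up to a multiple of 1000 using integer division.
suggest_next_id() {
    last_id=$1
    echo $(( (last_id + 1000 + 999) / 1000 * 1000 ))
}

suggest_next_id 1       # prints 2000 (1 + 1000 = 1001, rounded up)
suggest_next_id 4321    # prints 6000 (4321 + 1000 = 5321, rounded up)
```

The first call matches the installer example in this guide, where a backup with last job ID 1 leads to a suggested next job ID of 2000.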

  21. Choose automatic startup options.
    One of the following choices appears:
    • Choose whether to run qmaster as an SMF service.
      This question appears only on systems that run at least version 10 of the Solaris OS.
    • Choose whether to use RC scripts for qmaster.
      This question appears on platforms that are not running at least version 10 of the Solaris OS or if you started the upgrade using the -nosmf option.

  22. Load the old configuration.
    If this step fails with a critical error:
    1. Check the log file /tmp/sge_backup_date.log.
    2. Try to reload the configuration through the $SGE_ROOT/util/upgrade_modules/load_sge_config.sh script and the arguments displayed in the previous step.
    3. If the preceding steps do not resolve the problem, stop the upgrade process.

  23. (Optional) Copy the binaries and the common directory to all the hosts in the cluster, if they are not on a shared file system.
    If you use local binaries or a local common directory for each host, you must copy all the new binaries and the common directory locally to each host. Ensure that all binaries are updated and no architecture with the old version remains in the $SGE_ROOT directory.
    Note
    If you do not perform this operation the qmaster host will have Sun Grid Engine 6.2 binaries, while the rest of the cluster will still have the old version and will not work as desired.



  24. (Optional) Upgrade ARCo.
    If you use ARCo, you need to upgrade it. See Upgrading ARCo step 6.

  25. Run the post-upgrade procedures.
    Info
    The post-upgrade procedures are easier when you have root access to all machines through ssh or rsh without having to enter a password. To use rsh instead of the default ssh, run the ./inst_sge command with the -rsh argument. Example:
    # ./inst_sge -upd-execd -rsh
    1. Initialize the local execd spool directories.
      This step creates the local execd spool directories on the execd hosts with the correct permissions. Run the following command as root from the master host in the $SGE_ROOT directory:
      # ./inst_sge -upd-execd
      
    2. (Optional) Create new RC scripts for the whole cluster.
      Caution
      This command removes old RC scripts. To keep the old RC scripts, do not run this command.

      To start the services automatically after a reboot, run the following command as root from the master host in the $SGE_ROOT directory:

      # ./inst_sge -upd-rc
      
    3. (Optional) Install or update the Windows helper service.
      Perform this step to use the Windows execution hosts with the 6.2 cluster. When connecting to each Windows execution host, you are prompted for an administrator user to connect to the Windows host. If all your Windows hosts share the same administrative user, set the environment variable SGE_WIN_ADMIN to that user to access all Windows hosts without additional user intervention. Example:
      (sh, bash)# export SGE_WIN_ADMIN=Administrator
      (csh,tcsh)# setenv SGE_WIN_ADMIN Administrator
      

      To install or update the Windows helper service, run the following command as root from the master host in the $SGE_ROOT directory:

      # ./inst_sge -upd-win
      
      Caution
      Only one SGE_Helper_Service.exe can run on an execution host. You cannot use the same Windows execution host for a 6.0 or 6.1 cluster and a 6.2 cluster.



  26. Start the new execution daemons.
    Optionally, if you can log in without typing a password, you can start the whole cluster as the root user from the $SGE_ROOT directory with a single command:
    # ./inst_sge -start-all
    

    This command starts the master daemon, shadow daemons, and all execution daemons.

Upgrade is complete.


How to Upgrade from 5.3 to 6.0

Before You Begin

Be sure to review Planning the Installation for the information that you will need during the upgrade process. If you have decided to use an administrative user, as described in User Names, you should create that user now. This procedure assumes that you have already extracted the Grid Engine software, as described in Loading the Distribution Files on a Workstation.

Note
While you can run Grid Engine 6.0 software concurrently with your older version of Grid Engine software, you should run the upgrade procedure when there are no running jobs.
Steps
  1. Log in to the master host as root.

  2. Load the distribution files.
    For details, see Loading the Distribution Files on a Workstation.

  3. Ensure that you have set the $SGE_ROOT environment variable by typing:
    # echo $SGE_ROOT
    

    If the $SGE_ROOT environment variable is not set, set it now by typing:

    # SGE_ROOT=sge-root; export SGE_ROOT
    


  4. Change to the sge-root installation directory.
    Select one of the two following options:
    • If the directory where the installation files reside is visible from the master host, change directories (cd) to the installation directory sge-root, and then proceed to Step 4 of How to Install the Master Host.
    • If the directory is not visible and cannot be made visible, do the following:
      • Create a local installation directory, sge-root, on the master host.
      • Copy the installation files to the local installation directory sge-root across the network (for example, by using ftp or rcp).
      • Change directories (cd) to the local sge-root directory.
  5. Run the upgrade command on the master host, and respond to the prompts.
    This command starts the master host installation procedure. You are asked several questions, and you might be required to run some administrative actions.

    The syntax of the upgrade command is:
    inst_sge -upd 5.3-sge-root-directory 5.3-cell-name
    

    In the following example, the 5.3 sge-root directory is /sge/gridware and the cell name is default.

    # ./inst_sge -upd /sge/gridware default
    Welcome to the Grid Engine Upgrade
    ----------------------------------
    
    Before you continue with the installation please read these hints:
    
       - Your terminal window should have a size of at least
         80x24 characters
    
       - The INTR character is often bound to the key Ctrl-C.
         The term >Ctrl-C< is used during the upgrade if you
         have the possibility to abort the upgrade
    
    The upgrade procedure will take approximately 5-10 minutes.
    After this upgrade you will get a running qmaster and schedd with
    the configuration of your old installation. If the upgrade was
    successfully completed it is necessary to install your execution hosts
    with the install_execd script.
    
    Hit <RETURN> to continue >>
    


  6. Choose an administrative account owner.
    In the following example, the value of sge-root is /opt/n1ge6, and the administrative user is sgeadmin.
    Grid Engine admin user account
    ------------------------------
    
    The current directory
    
       /opt/n1ge6
    
    is owned by user
    
       sgeadmin
    
    If user >root< does not have write permissions in this directory on *all*
    of the machines where Grid Engine will be installed (NFS partitions not
    exported for user >root< with read/write permissions) it is recommended to
    install Grid Engine that all spool files will be created under the user id
    of user >sgeadmin<.
    
    IMPORTANT NOTE: The daemons still have to be started by user >root<.
    
    Do you want to install Grid Engine as admin user >sgeadmin< (y/n) [y] >>
    


  7. Verify the $SGE_ROOT directory setting.
    In the following example, the value of $SGE_ROOT is /opt/n1ge6.
    Checking $SGE_ROOT directory
    ----------------------------
    
    The Grid Engine root directory is:
    
       $SGE_ROOT = /opt/n1ge6
    
    If this directory is not correct (e.g. it may contain an automounter
    prefix) enter the correct path to this directory or hit <RETURN>
    to use default [/opt/n1ge6] >>
    


  8. Set up the TCP/IP services for the Grid Engine software.
    1. If the TCP/IP services have not been configured, respond to the installation messages.
      Grid Engine TCP/IP service >sge_qmaster<
      ----------------------------------------
      
      There is no service >sge_qmaster< available in your >/etc/services< file
      or in your NIS/NIS+ database.
      
      You may add this service now to your services database or choose a port number.
      It is recommended to add the service now. If you are using NIS/NIS+ you should
      add the service at your NIS/NIS+ server and not to the local >/etc/services<
      file.
      
      Please add an entry in the form
      
        sge_qmaster <port_number>/tcp
      
      to your services database and make sure to use an unused port number.
      
      Please add the service now or press <RETURN> to go to entering a port number >>
      
    2. Start a new terminal session or window to add information to the /etc/services file or your NIS maps.
    3. Add the correct ports to the /etc/services file or your NIS services map, as described in Network Services.
      The following example shows how you might edit your /etc/services file.
      ...
      sge_qmaster     536/tcp
      sge_execd       537/tcp
      
      Note
      In this example, the entries for both sge_qmaster and sge_execd are added to /etc/services. Subsequent steps in this example assume that both entries have been made.
    4. Save your changes and return to the window where the installation script is running.
      Please add the service now or press <RETURN> to go to entering a port number >>
      

      Press <RETURN>. The installation procedure displays the following output:

      sge_qmaster 536
      
      Service >sge_qmaster< is now available.
      
      Hit <RETURN> to continue >>
      
      Grid Engine TCP/IP service >sge_execd<
      --------------------------------------
      
      Using the service
      
         sge_execd
      
      for communication with Grid Engine.
      
      Hit <RETURN> to continue >>
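
Before you return to the installer window in substep 4, you can confirm that the new entries are in place. The following is a minimal sketch that assumes the services are listed in a local file in /etc/services format; the function name service_port is hypothetical. Sites that publish services through NIS would need to query the name service instead, for example with getent services.

```shell
# Sketch: print the TCP port registered for a service name.
# $1 = service name, $2 = file to search (defaults to /etc/services).
service_port() {
    awk -v svc="$1" '
        $1 == svc && $2 ~ /\/tcp$/ {
            sub(/\/tcp$/, "", $2)   # strip the protocol suffix
            print $2
            exit
        }' "${2:-/etc/services}"
}

# Example calls (output depends on the local services file):
#   service_port sge_qmaster
#   service_port sge_execd
```

An empty result means that no TCP entry with that name was found in the file.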
      


  9. Enter the name of your cell or press Return to use the default.
    The use of Grid Engine system cells is described in Cells.
    Grid Engine cells
    -----------------
    
    Grid Engine supports multiple cells.
    
    If you are not planning to run multiple Grid Engine clusters or if you don't
    know yet what is a Grid Engine cell it is safe to keep the default cell name
    
       default
    
    If you want to install multiple cells you can enter a cell name now.
    
    The environment variable
    
       $SGE_CELL=<your_cell_name>
    
    will be set for all further Grid Engine commands.
    
    Enter cell name [default] >>
    

    If you have decided not to use cells, the installation process displays the following information:

    Using cell >default<.
    Hit <RETURN> to continue >>
    


  10. Specify a spool directory.
    For guidelines on disk space requirements for the spool directory, see Disk Space Requirements. For information on where the spool directory is installed, see Spool Directories Under the Root Directory.
    Grid Engine qmaster spool directory
    -----------------------------------
    
    The qmaster spool directory is the place where the qmaster daemon stores
    the configuration and the state of the queuing system.
    
    The admin user >sgeadmin< must have read/write access
    to the qmaster spool directory.
    
    If you will install shadow master hosts or if you want to be able to start
    the qmaster daemon on other hosts (see the corresponding section in the
    Grid Engine Installation and Administration Manual for details) the account
    on the shadow master hosts also needs read/write access to this directory.
    
    The following directory
    
    [/opt/n1ge6/default/spool/qmaster]
    
    will be used as qmaster spool directory by default!
    
    Do you want to select another qmaster spool directory (y/n) [n] >>
    
    • If you want to accept the default spool directory, press Return to continue.
    • If you do not want to accept the default spool directory, then answer y. In the following example the /my/spool directory is specified as the master host spool directory.
      Do you want to select another qmaster spool directory (y/n) [n] >> y
      
      Please enter a qmaster spool directory now! >>/my/spool
      


  11. Set the correct file permissions.
    Verifying and setting file permissions
    --------------------------------------
    
    Did you install this version with >pkgadd< or did you already
    verify and set the file permissions of your distribution (y/n) [y] >> n
    
    Verifying and setting file permissions
    --------------------------------------
    
    We may now verify and set the file permissions of your Grid Engine
    distribution.
    
    This may be useful since due to unpacking and copying of your distribution
    your files may be unaccessible to other users.
    
    We will set the permissions of directories and binaries to
    
       755 - that means executable are accessible for the world
    
    and for ordinary files to
    
       644 - that means readable for the world
    
    Do you want to verify and set your file permissions (y/n) [y] >> y
    
    Verifying and setting file permissions and owner in >3rd_party<
    Verifying and setting file permissions and owner in >bin<
    Verifying and setting file permissions and owner in >ckpt<
    Verifying and setting file permissions and owner in >examples<
    Verifying and setting file permissions and owner in >install_execd<
    Verifying and setting file permissions and owner in >install_qmaster<
    Verifying and setting file permissions and owner in >mpi<
    Verifying and setting file permissions and owner in >pvm<
    Verifying and setting file permissions and owner in >qmon<
    Verifying and setting file permissions and owner in >util<
    Verifying and setting file permissions and owner in >utilbin<
    Verifying and setting file permissions and owner in >catman<
    Verifying and setting file permissions and owner in >doc<
    Verifying and setting file permissions and owner in >man<
    Verifying and setting file permissions and owner in >inst_sge<
    Verifying and setting file permissions and owner in >bin<
    Verifying and setting file permissions and owner in >lib<
    Verifying and setting file permissions and owner in >utilbin<
    
    Your file permissions were set
    
    Hit <RETURN> to continue >>
    


  12. Specify whether all of your Grid Engine system hosts are located in a single DNS domain.
    Select default Grid Engine hostname resolving method
    ----------------------------------------------------
    
    Are all hosts of your cluster in one DNS domain? If this is
    the case the hostnames
    
       >hostA< and >hostA.foo.com<
    
    would be treated as eqal, because the DNS domain name >foo.com<
    is ignored when comparing hostnames.
    
    Are all hosts of your cluster in a single DNS domain (y/n) [y] >>
    
    1. If all of your Grid Engine system hosts are located in a single DNS domain, then answer y.
      Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
      
      Ignoring domainname when comparing hostnames.
      
      Hit <RETURN> to continue >>
      
    2. If not all of your Grid Engine system hosts are located in a single DNS domain, then answer n.
      Are all hosts of your cluster in a single DNS domain (y/n) [y] >> n
      
      The domainname is not ignored when comparing hostnames.
      
      
      Hit <RETURN> to continue >>
      
      Default domain for hostnames
      ----------------------------
      
      Sometimes the primary hostname of machines returns the short hostname
      without a domain suffix like >foo.com<.
      
      This can cause problems with getting load values of your execution hosts.
      If you are using DNS or you are using domains in your >/etc/hosts< file or
      your NIS configuration it is usually safe to define a default domain
      because it is only used if your execution hosts return the short hostname
      as their primary name.
      
      If your execution hosts reside in more than one domain, the default domain
      parameter must be set on all execution hosts individually.
      
      Do you want to configure a default domain (y/n) [y] >>
      
    3. Press Return to continue.
      • If you want to specify a default domain, then answer y. In the following example, sun.com is specified as the default domain.
        Do you want to configure a default domain (y/n) [y] >> y
        
        
        Please enter your default domain >> sun.com
        
        Using >sun.com< as default domain. Hit <RETURN> to continue >>
        
      • If you do not want to specify a default domain, then answer n. In the following example, no default domain is specified.
        Do you want to configure a default domain (y/n) [y] >> n
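
The effect of answering y to the single DNS domain question can be illustrated with a short comparison. This is an illustration of the rule stated in the prompt only, not Grid Engine's own host name resolution code, and the function name same_host_ignoring_domain is hypothetical.

```shell
# Illustration: compare two host names after dropping everything
# from the first dot onward, so the DNS domain is ignored.
same_host_ignoring_domain() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

if same_host_ignoring_domain hostA hostA.foo.com; then
    echo "treated as equal"
fi
```

With the answer n, the full names would be compared instead, so hostA and hostA.foo.com would count as different hosts.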
        


  13. Press Return to continue.
    Making directories
    ------------------
    
    creating directory: default/common
    creating directory: /opt/n1ge6/default/spool/qmaster
    creating directory: /opt/n1ge6/default/spool/qmaster/job_scripts
    Hit <RETURN> to continue >>
    


  14. Specify whether you want to use classic spooling or Berkeley DB.
    For more information on choosing the spooling mechanism, see Database Server and Spooling Host.
    Setup spooling
    --------------
    Your SGE binaries are compiled to link the spooling libraries
    during runtime (dynamically). So you can choose between Berkeley DB
    spooling and Classic spooling method.
    Please choose a spooling method (berkeleydb|classic) [berkeleydb] >>
    
    1. If you want to specify Berkeley DB spooling, press Return to continue.
      Please choose a spooling method (berkeleydb|classic) [berkeleydb] >>
      
      The Berkeley DB spooling method provides two configurations!
      
      1) Local spooling:
      The Berkeley DB spools into a local directory on this host (qmaster host)
      This setup is faster, but you can't setup a shadow master host
      
      2) Berkeley DB Spooling Server:
      If you want to setup a shadow master host, you need to use
      Berkeley DB Spooling Server!
      In this case you have to choose a host with a configured RPC service.
      The qmaster host connects via RPC to the Berkeley DB. This setup is more
      failsafe, but results in a clear potential security hole. RPC communication
      (as used by Berkeley DB) can be easily compromised. Please only use this
      alternative if your site is secure or if you are not concerned about
      security. Check the installation guide for further advice on how to achieve
      failsafety without compromising security.
      
      Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >>
      
      • If you want to use a Berkeley DB spooling server, enter y.
        Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> y
        
        Berkeley DB Setup
        
        -----------------
        Please, log in to your Berkeley DB spooling host and execute "inst_sge -db"
        Please do not continue, before the Berkeley DB installation with
        "inst_sge -db" is completed, continue with <RETURN>
        
        Note
        Do not press Return until you have completed the Berkeley DB installation on the spooling server.

        Follow these steps to set up a Berkeley DB spooling server:

      1. Start a new terminal session or window.
      2. Log in to the spooling server.
      3. Install the software as described in How to Install the Berkeley DB Spooling Server.
      4. After you have installed the software on the spooling server, return to the master installation window, and press Return to continue.
      5. Enter the name of the spooling server. In the following example, vector is the host name of the spooling server.
        Berkeley Database spooling parameters
        -------------------------------------
        
        Please enter the name of your Berkeley DB Spooling Server! >> vector
        
      6. Enter the name of the spooling directory. In the following example, /opt/n1ge6/default/spooldb is the spooling directory.
        Please enter the Database Directory now!
        
        Default: [/opt/n1ge6/default/spooldb] >>
        Dumping bootstrapping information
        Initializing spooling database
        
        Hit <RETURN> to continue >>
        
      • If you do not want to use a Berkeley DB spooling server, enter n.
        Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> n
        
        
        Hit <RETURN> to continue >>
        
        Berkeley Database spooling parameters
        -------------------------------------
        
        Please enter the Database Directory now, even if you want to spool locally
        it is necessary to enter this Database Directory.
        
        Default: [/opt/n1ge6/default/spool/spooldb] >>
        

        Then specify an alternate directory, or press Return to continue.

        creating directory: /opt/n1ge6/default/spool/spooldb
        Dumping bootstrapping information
        Initializing spooling database
        
        Hit <RETURN> to continue >>
        
    2. If you want to specify classic spooling, then enter classic.
      Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> classic
      
      Dumping bootstrapping information
      Initializing spooling database
      
      
      Hit <RETURN> to continue >>
      


  15. Enter a group ID range.
    For more information, see Group IDs.
    Grid Engine group id range
    --------------------------
    
    When jobs are started under the control of Grid Engine an additional group id
    is set on platforms which do not support jobs. This is done to provide maximum
    control for Grid Engine jobs.
    
    This additional UNIX group id range must be unused group id's in your system.
    Each job will be assigned a unique id during the time it is running.
    Therefore you need to provide a range of id's which will be assigned
    dynamically for jobs.
    
    The range must be big enough to provide enough numbers for the maximum number
    of Grid Engine jobs running at a single moment on a single host. E.g. a range
    like >20000-20100< means, that Grid Engine will use the group ids from
    20000-20100 and provides a range for 100 Grid Engine jobs at the same time
    on a single host.
    
    You can change at any time the group id range in your cluster configuration.
    
    Please enter a range >> 20000-20100
    
    Using >20000-20100< as gid range. Hit <RETURN> to continue >>
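
When you choose a group ID range, it helps to know how many IDs the range contains and whether any of them are already taken. The following is a minimal sketch with hypothetical function names. It assumes that both ends of the range are usable and that groups are defined in a local file in /etc/group format; groups served by NIS or another directory service would not be seen by this check.

```shell
# Sketch: count the group IDs in a range such as 20000-20100,
# with both ends included.
gid_range_size() {
    first=${1%-*}
    last=${1#*-}
    echo $(( last - first + 1 ))
}

# Sketch: list the names of groups whose GID falls inside the range.
# $1 = range, $2 = file to scan (defaults to /etc/group).
gids_in_use() {
    awk -F: -v first="${1%-*}" -v last="${1#*-}" \
        '$3 >= first && $3 <= last { print $1 }' "${2:-/etc/group}"
}

gid_range_size 20000-20100   # prints 101
```

Counted this way, the range 20000-20100 holds 101 IDs, slightly more than the round figure of 100 jobs that the installer text mentions. Running gids_in_use 20000-20100 on a host should print nothing if the range is free in the local group file.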
    


  16. Verify the spooling directory for the execution daemon.
    For information on spooling, see Spool Directories Under the Root Directory.
    Grid Engine cluster configuration
    ---------------------------------
    
    Please give the basic configuration parameters of your Grid Engine
    installation:
    
       <execd_spool_dir>
    
    The pathname of the spool directory of the execution hosts. User >sgeadmin<
    must have the right to create this directory and to write into it.
    
    Default: [/opt/n1ge6/default/spool] >>
    


  17. Enter the email address of the user who should receive problem reports.
    In this example, the user who will receive problem reports is me@my.domain.
    Grid Engine cluster configuration (continued)
    ---------------------------------------------
    
    <administator_mail>
    
    The email address of the administrator to whom problem reports are sent.
    
    It's is recommended to configure this parameter. You may use >none<
    if you do not wish to receive administrator mail.
    
    Please enter an email address in the form >user@foo.com<.
    
    Default: [none] >> me@my.domain
    

    Once you answer this question, the installation process is complete. The system displays several screens of information before the script exits.

    The upgrade process uses your existing configuration to customize the installation. Output similar to the following is displayed:

    Creating >act_qmaster< file
    Creating >sgemaster< script
    Creating >sgeexecd< script
    creating directory: /tmp/centry
    Reading in complex attributes.
    Reading in administrative hosts.
    Reading in execution hosts.
    Reading in submit hosts.
    Reading in users:
        User "as114086".
        User "md121042".
    Reading in usersets:
        Userset "defaultdepartment".
        Userset "deadlineusers".
        Userset "admin".
        Userset "bchem1".
        Userset "bchem2".
        Userset "bchem3".
        Userset "bchem4".
        Userset "damtp7".
        Userset "damtp8".
        Userset "damtp9".
        Userset "econ1".
        Userset "staff".
    Reading in calendars:
        Calendar "always_disabled".
        Calendar "always_suspend".
        Calendar "test".
    Reading in projects:
        Project "ap1".
        Project "ap2".
        Project "high".
        Project "low".
        Project "p1".
        Project "p2".
        Project "staff".
    Reading in parallel environments:
        PE "bench_tight".
        PE "make".
    Creating settings files for >.profile/.cshrc<
    
    Caution
    Do not rename any of the binaries of the distribution. If you use any scripts or tools in your cluster that monitor the daemons, make sure to check for the new names.


  6. Create the environment variables for use with the Grid Engine software.
    Note
    If no cell name was specified during installation, the value of $SGE_CELL is default.
    • If you are using a C shell, type the following command:
      % source $SGE_ROOT/$SGE_CELL/common/settings.csh
      
    • If you are using a Bourne shell or Korn shell, type the following command:
      $ . $SGE_ROOT/$SGE_CELL/common/settings.sh
      


  7. Install or upgrade the execution hosts.
    There are two ways that you can install the Sun Grid Engine software on your execution hosts: installation or upgrade. If you install the execution hosts, the local spool directory configuration and some execd parameters will be overwritten. If you upgrade the execution hosts, those settings will remain untouched.
    1. To upgrade the software on the execution host, you need to log into each execution host and run the following command:
      # $SGE_ROOT/inst_sge -x -upd
      
    2. To install the software on the execution host:
      • If you only have a few execution hosts, you can install them interactively. You need to log into each execution host, and run the following command:
        # $SGE_ROOT/inst_sge -x
        

        Complete instructions for installing execution hosts interactively are in How to Install Execution Hosts.

      • If you have a large number of execution hosts, you should consider installing them non-interactively.
        Instructions for installing execution hosts in an automated way are in Using the inst_sge Utility and a Configuration Template.

  8. If you have configured load sensors on your execution hosts, you will need to copy these load sensors to the new directory location.

  9. Check your complexes.
    Both the structure of complexes and the rules for configuring complexes have changed. You can use qconf -sc to list your complexes. Review the log file that was generated during the master host upgrade, update.pid. The update.pid file will be placed in the master host spool directory, which is $SGE_ROOT/$SGE_CELL/spool/ by default.

    If necessary, you can use qconf -mc to reconfigure your complexes. For details, see Configuring Resource Attributes.

  10. Reconfigure your queues.
    During the upgrade process, a single default cluster queue is created. Within this queue you will find all of your installed execution hosts. It is recommended that you reconfigure your queues. For details, see Configuring Queues.

Administering Sun Grid Engine

Administering Guide (Printable)

Managing Your Cluster
Topic Description
Interacting With Sun Grid Engine as an Administrator Learn how you can use the command line interface, the graphical user interface (QMON), and the Distributed Resource Management Application API (DRMAA) to interact with the Sun Grid Engine system.
Understanding the Bootstrap File The bootstrap file contains parameters that are needed for starting up the Grid Engine components. The bootstrap file is created during the sge_qmaster installation. You cannot modify the bootstrap file in a running system.
Using qconf Learn how you can use qconf to manage most components of the Sun Grid Engine system. The qconf Cheatsheet lists the most commonly-used qconf commands.
Configuring Clusters Learn how to display, add, modify and delete cluster configurations. (Note: Adapt the global configuration and local host configurations to your site's needs immediately after installation.)
Configuring Hosts Learn how to configure the master host, the shadow master host, execution hosts, administration hosts, submit hosts, and host groups.
Configuring Queues Learn how to configure queues.
Configuring Queue Calendars Learn how to configure queue calendars.
Managing User Access Learn how to configure user access, configure projects, configure default requests, and use path aliasing.
Using Job Submission Verifiers Learn how to use Job Submission Verifiers to verify, modify or reject a job during the time of job submission.
Managing Resource Allocation and Scheduling
Topic Description
Configuring Resource Attributes Learn how to configure complex resource attributes to match the requirements of your environment.
Configuring Load Parameters Learn how to add site-specific load parameters and write your own load sensors.
Managing Policies Learn about the types of user policies that are available and how to match these policies to your scheduling and resource allocation needs.
Managing the Scheduler Learn how to modify the scheduler.
Managing Advance Reservations Learn how to maximize your computing resources by using advance reservations.
Managing Resource Quotas Learn how to use the resource quotas feature to limit resources by user, project, host, cluster queue, or parallel environment.
Managing Special Environments
Topic Description
Managing Parallel Environments Learn how to set up concurrent computing on parallel platforms in networked environments using Sun Grid Engine.
Managing Checkpointing Environments Learn how to use user-level checkpointing or kernel-level transparent checkpointing to manage your environment.
Other Administrative Tasks
Topic Description
Monitoring and Controlling SMF Services Learn how to monitor and control SMF services.
Generating Accounting Statistics Learn how to generate accounting information according to real time, user time, or system time.
Backing Up and Restoring System Configuration Learn how to back up your Grid Engine system configuration files automatically.
Fine-Tuning Your Environment Learn how to fine-tune your environment.
Using DTrace for Performance Tuning Learn how to use DTrace for performance tuning.

To print this section, see the Administering Guide (Printable).


Interacting With Sun Grid Engine as an Administrator

As an administrator, you can interact with the Sun Grid Engine system using the command line interface, the graphical user interface (QMON), or the Distributed Resource Management Application API (DRMAA).

The Command Line Interface

The command line interface provides more flexibility than the graphical user interface in configuring, monitoring, and controlling the Grid Engine system. Many experienced administrators find that using files and scripts is a more flexible, quicker, and more powerful way to change settings.

The following commands are central to Sun Grid Engine administration:

  • qconf – Add, delete, and modify the current Grid Engine configuration. For more information, see Using qconf.
  • qhost – View current status of the available Grid Engine hosts, the queues, and the jobs associated with the queues. For more information, see the qhost(1) man page.
  • qsub and qalter – Submit jobs (qsub) and modify pending jobs (qalter). For more information, see the submit(1) man page.
  • qstat – Show the status of Grid Engine jobs and queues. For more information, see the qstat(1) man page.
  • qquota – List each resource quota that is being used at least once or that defines a static limit. For more information, see the qquota(1) man page.

For information on the ancillary programs that Sun Grid Engine provides and which users have access to these commands, see Command Line Interface Ancillary Programs.

QMON - The Graphical User Interface

You can use QMON, the graphical user interface (GUI) tool, to accomplish most Grid Engine system tasks. Only administrators can configure QMON using the specifically designed resource file. Reasonable defaults are compiled in $SGE_ROOT/qmon/Qmon. This file also includes a sample resource file. Refer to the comment lines in the sample Qmon file for detailed information on the possible customizations.

As the cluster administrator, you can do any of the following:

  • Install site-specific defaults in standard locations such as /usr/lib/X11/app-defaults/Qmon
  • Include QMON-specific resource definitions in the standard .Xdefaults or .Xresources files
  • Put a site-specific Qmon file in a location referenced by standard search paths such as XAPPLRESDIR

The Distributed Resource Management Application API (DRMAA)

You can automate Sun Grid Engine functions by writing scripts that run Sun Grid Engine commands and parse the results. However, for more consistent and efficient results, you can use the Distributed Resource Management Application API (DRMAA). For more information about the DRMAA concept and how to use it with the C and Java languages, see Automating Grid Engine Functions Through DRMAA.

For general information about these administration tools, see Choosing a User Interface.


Understanding the Bootstrap File

The bootstrap file contains parameters that are needed for starting up the Grid Engine components. The bootstrap file is created during the sge_qmaster installation. You cannot modify the bootstrap file in a running system.

Note
Any changes made to the bootstrap file become effective only after restarting the qmaster.

Bootstrap File Parameters

The following section provides a brief description of the individual parameters that compose the bootstrap configuration for a Sun Grid Engine cluster.

admin_user

The admin_user parameter is the administrative user account used by Sun Grid Engine for all internal file handling operations such as status spooling and message logging. This parameter can be used in cases where the root user account does not have the corresponding file access permissions, for example on a shared file system without global root read/write access.

As the admin_user parameter is set at installation time, you cannot change the parameter in a running system. You can manually change the admin_user parameter on a shutdown cluster. However, if access to the Sun Grid Engine spooling area is interrupted, it will result in unpredictable behavior.

The admin_user parameter has no built-in default value. Its value is defined during the master installation procedure.

default_domain

The default_domain parameter is needed if your Sun Grid Engine cluster covers hosts belonging to more than a single DNS domain. In this case, it can be used if your host name resolving yields both qualified and unqualified host names for the hosts in one of the DNS domains. The value of the default_domain parameter is appended to the unqualified host name to define a fully qualified host name.

The default_domain parameter will have no effect if the ignore_fqdn parameter is set to True.

As the default_domain parameter is set at installation time, you cannot change the parameter in a running system.

The default value for the default_domain parameter is None.

ignore_fqdn

The ignore_fqdn parameter is used to ignore the fully qualified domain name component of host names. This parameter should be set if all hosts belonging to a Sun Grid Engine cluster are part of a single DNS domain. The ignore_fqdn parameter is enabled if it is set to either True or 1. Enabling the ignore_fqdn parameter can solve problems with load reports that are caused by different host name resolutions across the cluster.

As the ignore_fqdn parameter is set at installation time, you cannot change the parameter in a running system.

The default value for the ignore_fqdn parameter is True.

spooling_method

The spooling_method parameter defines how sge_qmaster(8) writes its configuration and the status information of a running cluster. The available spooling methods are berkeleydb and classic.

spooling_lib

The spooling_lib parameter is the name of a shared library that contains the spooling method to be loaded at sge_qmaster(8) initialization time. The file extension that characterizes a shared library, such as .so, .sl, or .dylib, is not part of the spooling_lib value.

If the spooling_method parameter is set to berkeleydb during installation, the spooling_lib parameter is set to libspoolb. If the classic option is chosen as spooling_method during installation, the spooling_lib parameter is set to libspoolc.

Not all operating systems allow the dynamic loading of libraries. On such operating systems, one spooling method (berkeleydb by default) is compiled into the binaries and the spooling_lib parameter is ignored.

spooling_params

The spooling_params parameter defines parameters to the chosen spooling method. These parameters are required to initialize the spooling framework. For example, you can define parameters to open database files or to connect to a certain database server.

The spooling parameter value for the berkeleydb spooling method is [rpc_server:]database directory. For example, /sge_local/default/spool/qmaster/spooldb for spooling to a local file system or myhost:sge for spooling over a Berkeley DB RPC server.

The spooling parameter value for the classic spooling method is <common_dir>;<qmaster spool dir>. For example, /sge/default/common;/sge/default/spool/qmaster.
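
The two value formats described above can be taken apart with plain shell parameter expansion. The following is a minimal sketch, not part of Grid Engine: describe_spooling_params is a hypothetical helper name, and the output labels are chosen only for this illustration.

```shell
#!/bin/sh
# describe_spooling_params METHOD VALUE
# Split a spooling_params value into its parts, following the two
# formats described in the text. Hypothetical helper for illustration.
describe_spooling_params() {
    case "$1" in
        classic)
            # Format: <common_dir>;<qmaster spool dir>
            printf '%s\n' "common_dir=${2%%;*}"
            printf '%s\n' "qmaster_spool_dir=${2#*;}"
            ;;
        berkeleydb)
            # Format: [rpc_server:]database directory
            case "$2" in
                *:*)
                    printf '%s\n' "rpc_server=${2%%:*}"
                    printf '%s\n' "database=${2#*:}"
                    ;;
                *)
                    printf '%s\n' "database=$2"
                    ;;
            esac
            ;;
        *)
            echo "unknown spooling method: $1" >&2
            return 1
            ;;
    esac
}

# Example use with the values quoted in the text:
# describe_spooling_params classic '/sge/default/common;/sge/default/spool/qmaster'
# describe_spooling_params berkeleydb myhost:sge
```

Note that the classic value contains a semicolon, so it has to be quoted when it is passed on a shell command line.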

binary_path

The directory path where the Sun Grid Engine binaries reside. It is used within the Sun Grid Engine components to locate and start up other Sun Grid Engine programs.

The path name given here is searched for binaries as well as any directory below with a directory name equal to the current operating system architecture. Therefore, /usr/SGE/bin will work for all architectures, if the corresponding binaries are located in subdirectories named aix43, cray, lx24-x86, hp11, irix65, tru64, sol-sparc, and so on.

The default location for the binary path is <sge_root>/bin.
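
The lookup described above can be sketched as a small shell function. This is only an illustration: find_sge_binary is a hypothetical helper, the search order shown (the directory itself, then the architecture subdirectory) is an assumption, and the architecture string is supplied by the caller.

```shell
#!/bin/sh
# find_sge_binary BINARY_PATH ARCH NAME
# Print the full path of program NAME if it is found either directly in
# BINARY_PATH or in the subdirectory named after the architecture ARCH.
# Hypothetical helper; the search order is an assumption of this sketch.
find_sge_binary() {
    for fsb_dir in "$1" "$1/$2"; do
        if [ -f "$fsb_dir/$3" ] && [ -x "$fsb_dir/$3" ]; then
            printf '%s\n' "$fsb_dir/$3"
            return 0
        fi
    done
    return 1
}

# Example use (architecture name taken from the list in the text):
# find_sge_binary /usr/SGE/bin sol-sparc qstat
```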

qmaster_spool_dir

The location where the master spool directory resides. sge_qmaster(8) and sge_shadowd(8) need to have access to this directory. The master spool directory, in particular the job_scripts directory and the messages log file, may become quite large depending on the size of the cluster and the number of jobs. Ensure that you allocate enough disk space and regularly clean up the log files. For example, you can achieve this with the help of a cron(8) job.

As the qmaster_spool_dir parameter is set at installation time, you cannot change the parameter in a running system.

The default location for the qmaster_spool_dir parameter is <sge_root>/<cell>/spool/qmaster.

security_mode

The security_mode parameter defines the set of security features used by the Sun Grid Engine cluster. The possible security mode settings are None, AFS, DCE, KERBEROS, and CSP.

Note
Modifying the security_mode parameter generally will require you to reinstall the Sun Grid Engine cluster. For more information on editing the security_mode parameter, contact a Sun Grid Engine support specialist.

listener_threads

The listener_threads parameter defines the number of listener threads. The default number is set during installation.

worker_threads

The worker_threads parameter defines the number of worker threads. The default number is set during installation.

scheduler_threads

The scheduler_threads parameter defines the number of scheduler threads. The allowed values for this parameter are 0 and 1. The default value (1) is set during installation. For more information, see the -kt and -at options in the qconf(1) man page.

jvm_threads

The jvm_threads parameter defines the number of JVM threads. The allowed values are 0 and 1. The default value is set during installation.

Note
The bootstrap file parameters admin_user, default_domain, ignore_fqdn, and binary_path also affect the behavior of the execution daemon execd and require that execd be restarted at the same time as qmaster.
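
A script that needs one of the parameters described above could read it as follows. This is a minimal sketch under stated assumptions: bootstrap_get is a hypothetical helper, the file is assumed to hold one whitespace-separated name and value per line with # marking comment lines, and the path in the usage comment is an assumption about the installation layout.

```shell
#!/bin/sh
# bootstrap_get FILE NAME
# Print the value of parameter NAME from a bootstrap-style file.
# Hypothetical helper; assumes "name value" lines and "#" comment lines.
bootstrap_get() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }          # skip comment lines
        $1 == name {                 # the first field is the parameter name
            $1 = ""                  # drop the name, keep the value
            sub(/^[ \t]+/, "")       # trim the separator left behind
            print
            exit
        }
    ' "$1"
}

# Example use (the file location is an assumption):
# bootstrap_get "$SGE_ROOT/$SGE_CELL/common/bootstrap" spooling_method
```

The function only reads the file, which fits the rule that the bootstrap file is not modified in a running system.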

Using qconf

qconf is the central configuration command. As a system administrator, you can use qconf to add, delete, and modify the current Grid Engine configuration.

To learn how to use qconf most effectively, familiarize yourself with the syntax, the command execution methods, and the automation technique described in the following sections.

See the qconf Cheatsheet for a reference list of the most commonly used qconf commands. For more information, see the qconf(1) man page.

qconf Syntax

Note
While the syntax for qconf is fairly consistent, there are a few exceptions. For example, qconf -aprj does not take an argument. See the qconf Cheatsheet for more information.

The following options consistently apply when using qconf:

Function Syntax
Add a
Modify a value setting m
Delete d
Show s
Show complete list of currently-configured objects s<obj>l
Replace the entire list of settings r
Add from file A
Modify a value setting from file M

The following object specifications are used:

Object Specification
Administrative Host h
Calendar cal
Checkpointing Environment ckpt
Cluster conf
Complex c
Execution Host e
Host Group hgrp
Manager m
Operator o
Parallel Environment p
Project prj
Queue q
Submit Host s
User user
User Access List u

See the qconf Cheatsheet for a reference list of commonly used commands.
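
The two tables above combine into option names by concatenating a function letter and an object specification. The sketch below shows that pattern only; qconf_option is a hypothetical helper, and, as the note above says, some real options deviate from the pattern.

```shell
#!/bin/sh
# qconf_option FUNCTION OBJECT [list]
# Build a qconf option from a function letter (a, m, d, s, r, A, M) and
# an object specification (cal, ckpt, conf, q, ...). With the third
# argument "list", append the trailing "l" of the s<obj>l form.
# Hypothetical helper; it does not know about the exceptions.
qconf_option() {
    if [ "$3" = "list" ]; then
        printf '%s\n' "-$1$2l"
    else
        printf '%s\n' "-$1$2"
    fi
}

# Example use:
# qconf "$(qconf_option s cal list)"          # runs: qconf -scall
# qconf "$(qconf_option M q)" queue.conf      # runs: qconf -Mq queue.conf
```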

qconf Command Execution Methods

The qconf command can be used to add new objects or modify existing objects from the specification in a file.

You can use qconf to do the following:

  • Add new objects
    qconf -a<obj_spec> <object name>
    
  • Add new objects from file
    qconf -A<obj_spec> <filename>
    
  • Modify existing objects using an editor
    qconf -m<obj_spec> <object name>
    
  • Modify existing objects from file
    qconf -M<obj_spec> <filename>
    
  • Modify many objects at once (or modify an object non-interactively)
    qconf -{a,m,r,d}attr <obj_spec> <attrib> <value> <queue_list>|<host_list>
    
  • Modify many objects at once from file (or modify an object non-interactively)
    qconf -{A,M,R,D}attr <obj_spec> <filename>
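
As an illustration of the file-based method, the sketch below dumps a queue with qconf -sq, rewrites one attribute line, and loads the result with qconf -Mq. The names set_queue_attr and QCONF are hypothetical, chosen for this example, and the function assumes that the attribute sits on a single "name value" line.

```shell
#!/bin/sh
# QCONF names the command to run. Overriding it allows the function to
# be tried without a cluster (hypothetical variable for this sketch).
QCONF=${QCONF:-qconf}

# set_queue_attr QUEUE ATTR VALUE
# Dump the queue configuration, replace the line whose first field is
# ATTR, and load the modified file back.
# Values that continue over several lines are not handled.
set_queue_attr() {
    qa_tmp=$(mktemp) || return 1
    if "$QCONF" -sq "$1" > "$qa_tmp.in" &&
       awk -v a="$2" -v v="$3" \
           '$1 == a { print a, v; next } { print }' "$qa_tmp.in" > "$qa_tmp"
    then
        "$QCONF" -Mq "$qa_tmp"
        qa_status=$?
    else
        qa_status=1
    fi
    rm -f "$qa_tmp" "$qa_tmp.in"
    return "$qa_status"
}

# Example use (queue name and value are placeholders):
# set_queue_attr all.q slots 4
```

Writing the dump to a file before editing it means a failed qconf -sq call is noticed and nothing is loaded back.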
    

Automating the qconf Command Using the Editor Environment Variable

You can use the EDITOR environment variable to automate the behavior of the qconf command. Change the value of this variable to point to an editor program that modifies a file whose name is given by the first argument. After the editor modifies the temporary file and exits, the system reads in the modifications, which take effect immediately.

Note
If the modification time of the file does not change after the edit operation, the system sometimes incorrectly assumes that the file was not modified. Therefore you should insert a sleep 1 instruction before writing the file, to ensure a different modification time.

You can use this technique with any qconf -m... command. However, the technique is especially useful for administration of the scheduler and the global configuration, as you cannot automate the procedure in any other way.
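
The technique can be sketched as follows. The helper write_editor_script and the environment variables ATTR and VALUE are hypothetical names chosen for this example, and schedule_interval is used only as a sample attribute. The generated script receives the temporary file name as its first argument, waits one second as recommended in the note above, and then rewrites one line.

```shell
#!/bin/sh
# write_editor_script PATH
# Create a small non-interactive "editor" at PATH. The generated script
# replaces the value of the attribute named by $ATTR with $VALUE in the
# file given as its first argument.
write_editor_script() {
    cat > "$1" <<'EOF'
#!/bin/sh
# $1 is the temporary file that qconf asks the editor to change.
sleep 1   # ensure that the modification time differs
awk -v a="$ATTR" -v v="$VALUE" \
    '$1 == a { print a, v; next } { print }' "$1" > "$1.tmp" &&
    cat "$1.tmp" > "$1"
rm -f "$1.tmp"
EOF
    chmod +x "$1"
}

# Example use (commented out because it changes a live cluster):
# write_editor_script /tmp/set_attr.sh
# ATTR=schedule_interval VALUE=0:0:30 EDITOR=/tmp/set_attr.sh qconf -msconf
```

The script writes its result back into the original file rather than renaming a new file over it, so the path that qconf reads afterwards is unchanged.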


qconf Cheatsheet

For a detailed list of available qconf commands, see the qconf(1) man page.

Complex Configuration

Command Description
qconf -sc Show the current complex configuration.
qconf -mc Modify the complex configuration using an editor.
qconf -Mc filename Modify the current complex configuration by overwriting the current configuration as specified by the file filename.

For more information, see How to Configure the Complex From the Command Line.

Calendar Configuration

Command Description
qconf -scal calendarname Show the configuration for the specified calendar.
qconf -scall Show a list of all currently configured calendars.
qconf -acal calendarname Add a new calendar definition to the Sun Grid Engine environment.
qconf -Acal filename Add a new calendar definition to the Sun Grid Engine environment from the file filename.
qconf -mcal calendarname Modify the specified calendar configuration using an editor.
qconf -Mcal filename Modify the current calendar configuration by overwriting the current configuration as specified by filename.
qconf -dcal calendarname Delete a calendar.

For more information, see How to Configure Queue Calendars From the Command Line.

Checkpointing Environment Configuration

Command Description
qconf -sckpt checkpointname Show the configuration of the specified checkpointing environment.
qconf -sckptl Show a list of all currently configured checkpointing environments.
qconf -ackpt checkpointname Add a checkpointing environment to the Sun Grid Engine environment.
qconf -Ackpt filename Add a checkpointing environment from the file filename.
qconf -mckpt checkpointname Modify the specified checkpointing environment using an editor.
qconf -Mckpt filename Modify a checkpointing environment from file.
qconf -dckpt checkpointname Delete a checkpointing environment.

For more information, see How to Configure Checkpointing Environments From the Command Line.

Cluster Configuration

Command Description
qconf -sconf [host,...|global] Show a local host configuration or the global cluster configuration.
qconf -sconfl Show a list of all currently configured hosts.
qconf -aconf host,... Add a host configuration.
qconf -Aconf filelist Add the configurations specified in the files listed in the comma-separated filelist.
qconf -mconf [host,...|global] Modify one or more local host configurations or the global cluster configuration.
qconf -Mconf filelist Modify the configurations specified in the files listed in the comma-separated filelist.
qconf -dconf host,... Delete one or more hosts from the configuration list.

For more information, see How to Configure Clusters From the Command Line.

Event Client Configuration

Command Description
qconf -secl Show event client list.
qconf -kec [eventclientname,...|all] Shut down event clients registered at the master daemon.

Host Configuration

Command Description
qconf -se host Show the configuration for the specified execution host.
qconf -sel Show a list of all currently-configured execution hosts.
qconf -ae Add an execution host.
qconf -Ae filename Add an execution host from file filename.
qconf -me host Modify the specified execution host using an editor.
qconf -Me filename Modify an execution host from file filename.
qconf -de host,... Delete one or more execution hosts.
qconf -sh Show a list of all currently-configured administrative hosts.
qconf -ah host,... Add one or more administrative hosts.
qconf -dh host,... Delete one or more administrative hosts from the administrative host list.
qconf -ss Show a list of all currently-configured submit hosts.
qconf -as host,... Add one or more submit hosts.
qconf -ds host,... Delete one or more submit hosts.

For more information, see Configuring Hosts.

Host Group Configuration

Command Description
qconf -shgrp groupname Show the configuration for the specified host group.
qconf -shgrpl Show a list of all currently configured host groups.
qconf -ahgrp hostgroupname Add a host group.
qconf -Ahgrp filename Add a host group configuration from file filename.
qconf -mhgrp hostgroupname Modify the specified host group using an editor.
qconf -Mhgrp filename Modify a host group configuration from file filename.

For more information, see How to Configure Host Groups From the Command Line.

Parallel Environment Configuration

Command Description
qconf -sp pename Show the configuration for the specified parallel environment.
qconf -spl Show a list of all currently configured parallel environments.
qconf -ap pename Add a new parallel environment.
qconf -Ap filename Add a parallel environment from file filename.
qconf -mp pename Modify the specified parallel environment using an editor.
qconf -Mp filename Modify a parallel environment from file filename.
qconf -dp pename Delete the specified parallel environment.

For more information, see How to Configure Parallel Environments From the Command Line.

Project Configuration

Command Description
qconf -sprj projectname Show the configuration for the specified project.
qconf -sprjl Show a list of all currently configured projects.
qconf -aprj Add a new project.
qconf -Aprj filename Add a new project from file filename.
qconf -mprj projectname Modify the specified project using an editor.
qconf -Mprj filename Modify a project from file filename.
qconf -dprj projectname Delete a project.

For more information, see How to Configure Projects From the Command Line.

Queue Configuration

The most flexible way to automate the configuration of queue instances is to use the qconf command with the qselect command. With the combination of these commands, you can build up your own custom administration scripts.
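
A custom administration script of this kind might start from a helper like the one below. This is a sketch under assumptions: join_queue_instances is a hypothetical helper, and the commented usage assumes that qselect prints one queue instance per line and that qconf -mattr accepts a comma-separated list of queue instances. Check both against the qselect(1) and qconf(1) man pages before relying on them.

```shell
#!/bin/sh
# join_queue_instances
# Read one queue instance name per line on standard input and print
# them as a single comma-separated list.
join_queue_instances() {
    paste -s -d, -
}

# Example use (untested against a live cluster; the resource request,
# attribute, and value are placeholders):
# queues=$(qselect -l arch=sol-sparc | join_queue_instances)
# qconf -mattr queue load_thresholds np_load_avg=2.0 "$queues"
```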

Command Description
qconf -sq wc_queue_list Show one or multiple cluster queues or queue instances.
qconf -sql Show a list of all currently configured cluster queues.
qconf -aq queuename Add a new queue.
qconf -Aq filename Add a queue from file filename.
qconf -mq queuename Modify the specified queue using an editor.
qconf -Mq filename Modify a queue from file filename.
qconf -dq queuename Delete a queue.
-purge allows you to delete an entire list attribute from the underlying queue instances of a cluster queue, whereas -dattr only deletes a value or an item from a list attribute.

For more information, see Configuring Queues.

Scheduler Configuration

Command Description
qconf -ssconf Show the current scheduler configuration.
qconf -msconf Modify the scheduler configuration using an editor.
qconf -tsm Trigger scheduler monitoring.

For more information, see Managing the Scheduler.

Sharetree Configuration

Command Description
qconf -sstree Show the sharetree.
qconf -sstnode nodelist Show one or more sharetree nodes.
qconf -rsstnode nodelist Show one or more sharetree nodes and their children.
qconf -astree Add a sharetree.
qconf -Astree filename Add a sharetree from filename.
qconf -astnode nodelist Add one or more sharetree nodes.
qconf -mstree Modify a sharetree.
qconf -Mstree filename Modify a sharetree from filename.
qconf -dstree Delete a sharetree.
qconf -dstnode nodelist Delete one or more sharetree nodes.

User Configuration

Command Description
qconf -suser username,... Show the configuration for one or more users.
qconf -suserl Show a list of all currently-configured users.
qconf -auser Add a user to the list of registered users.
qconf -Auser filename Add a user from file filename.
qconf -muser username Modify the specified user configuration using an editor.
qconf -Muser filename Modify a user from file filename.
qconf -duser username,... Delete one or more users.
qconf -sm Show a list of all currently-configured managers.
qconf -am username Add a user to the manager list.
qconf -dm username,... Delete one or more managers.
qconf -so Show a list of all currently-configured operators.
qconf -ao username Add a user to the operator list.
qconf -do username,... Delete one or more operators.

For more information, see Configuring User Access.

User Access List Configuration

Command Description
qconf -su aclname [,...] Displays the configuration for the specified access list.
qconf -sul Displays a list of all currently configured user access lists.
qconf -au username [,...] accesslistname [,...] Adds a user or users to an access list or lists.
qconf -Au filename Adds a user access list from file.
qconf -mu aclname Opens an editor to modify the specified access list configuration.
qconf -Mu filename Modifies a user access list from file.
qconf -du username [,...] aclname [,...] Deletes a user or users from an access list or lists.
qconf -dul aclname [,...] Deletes one or more user access lists.

For more information, see How to Configure User Access Lists From the Command Line.


Configuring Clusters

The basic cluster configuration is a set of information that is configured to reflect site dependencies and to influence Grid Engine system behavior. Site dependencies include valid paths for programs such as mail or xterm. A global configuration is provided for the master host as well as for every host in the Grid Engine system pool. In addition, you can configure the system to use a configuration local to each host to override particular entries in the global configuration.

The cluster administrator should adapt the global configuration and local host configurations to the site's needs immediately after the installation. The configurations should then be kept up to date.

Cluster Configuration Recommendations

The following recommendations apply to all clusters:

  • Only in the smallest clusters does it make sense for the execution daemon spooling directories to be located on an NFS-mounted file system.
  • Configure the default priority for all users' jobs to be less than 0. If the default priority is left at 0, users are only able to reduce the priorities of their jobs because they are not allowed to submit jobs with a priority greater than 0. If the default priority is set at less than 0, then users have room to increase the priority of their jobs. A suggested value is -100.
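
One way to apply the suggested default priority is through a cluster-wide default request file. The sketch below rests on assumptions that this section does not state: that such a file exists at $SGE_ROOT/$SGE_CELL/common/sge_request, that it accepts the -p submit option, and that the option sits on a line of its own. set_default_priority is a hypothetical helper.

```shell
#!/bin/sh
# set_default_priority FILE PRIORITY
# Remove any line that starts with "-p " from FILE and append a line
# "-p PRIORITY". Creates FILE if it does not exist.
# Hypothetical helper; does not handle "-p" combined with other options
# on the same line.
set_default_priority() {
    sdp_file=$1
    sdp_new="$1.new"
    if [ -f "$sdp_file" ]; then
        # grep exits with status 1 when no line is left over, which is fine here.
        grep -v '^-p ' "$sdp_file" > "$sdp_new" || :
    else
        : > "$sdp_new"
    fi
    printf '%s\n' "-p $2" >> "$sdp_new"
    mv "$sdp_new" "$sdp_file"
}

# Example use (the path is an assumption about the installation layout):
# set_default_priority "$SGE_ROOT/$SGE_CELL/common/sge_request" -100
```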

Task User Interface Description
How to Configure Basic Clusters CLI or QMON Learn how to display and modify the global configuration and local host configurations and how to add and delete local host configurations.

How to Configure Clusters From the Command Line

Note
You must be an administrator to use the qconf command to change cluster configurations.

To configure clusters from the command line, use the following arguments for the qconf command:

  • To show the global cluster configuration, type the following command:
    % qconf -sconf [global]
    

    The -sconf and -sconf global options are equivalent. They both show the global configuration.

  • To show a local host configuration, type the following command:
    % qconf -sconf <host,...>
    

    The -sconf option shows the specified local host's configuration.

  • To add a new local host configuration, type the following command:
    % qconf -aconf <host,...>
    

    The -aconf option adds new local host configurations. For each host, an editor is invoked that is used to enter the configuration for that host. The configuration is then registered with the sge_qmaster. The following host names are invalid, reserved, or otherwise not allowed to be used:

    • global
    • template
    • all
    • default
    • unknown
    • none
  • To modify a local host configuration, type the following command:
    % qconf -mconf <host,...>
    

    The -mconf option modifies the local configuration of the specified execution host or master host. For more information, see Cluster Configuration Parameters.

  • To modify the global configuration, type the following command:
    % qconf -mconf [global]
    

    The -mconf option modifies the global configuration. For more information, see Cluster Configuration Parameters.

  • To delete a local host configuration, type the following command:
    % qconf -dconf <host,...>
    

    The -dconf option deletes the configuration of the specified execution host or master host.

For more information on how to configure the cluster from file or how to modify many objects at once, see Using qconf. For more examples of available qconf commands, see the qconf(1) man page.
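
The reserved host names listed above can be checked before qconf -aconf is invoked. The following is a minimal sketch, not part of the product: the function names is_reserved_host_name and add_local_conf are hypothetical, and the exact, case-sensitive comparison is an assumption.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# Return 0 (true) if the name is one of the reserved host names.
# Assumption: names are compared exactly and case-sensitively.
is_reserved_host_name() {
    case "$1" in
        global|template|all|default|unknown|none) return 0 ;;
        *) return 1 ;;
    esac
}

# Add a local host configuration only if the host name is allowed.
add_local_conf() {
    if is_reserved_host_name "$1"; then
        echo "error: '$1' is a reserved host name" >&2
        return 1
    fi
    # Opens an editor for the new local configuration.
    qconf -aconf "$1"
}
```

For example, add_local_conf template prints an error and returns without calling qconf, while add_local_conf fangorn proceeds to qconf -aconf fangorn.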


Cluster Configuration Parameters

For detailed information about these parameters, see the sge_conf(5) man page.

Unless otherwise mentioned, the global configuration entry for the following parameters can be overwritten by the execution host local configuration.
Parameter Description
execd_spool_dir The execution daemon spool directory path. Note: If you would like to change the global execd_spool_dir that was set during installation, you must restart all affected execution daemons. Default path: $SGE_ROOT/$SGE_CELL/spool
mailer The absolute pathname to the electronic mail delivery agent on your system. It must accept mailer -s <email subject> <recipient>. The default depends on the operating system of the host on which the Sun Grid Engine master installation was run. Common values are /bin/mail or /usr/bin/Mail.
xterm The absolute pathname to the X Window System terminal emulator, xterm(1). Default path: /usr/bin/X11/xterm
load_sensor A list of executable shell script paths or programs that are started by the execution daemons to retrieve site configurable load information.
prolog The executable path of a shell script that is started before the execution of Sun Grid Engine jobs with the same environment setting as the Sun Grid Engine jobs to be started afterwards. This procedure automates the execution of general site specific tasks like the preparation of temporary file systems with the need for the same context information as the job. For a complete list of special variables that can be used with prolog, see the sge_conf(5) man page. Default value: NONE
epilog The executable path of a shell script that is started after the execution of Sun Grid Engine jobs with the same environment setting as the Sun Grid Engine jobs that have just completed. This procedure automates the execution of general site specific tasks like the cleaning up of temporary file systems with the need for the same context information as the job. Default value: NONE
shell_start_mode Deprecated.
login_shells A list of the executable names of the command interpreters that should be started as login shells. login_shells is a global configuration parameter only. Default value: sh,csh,tcsh,ksh
min_uid Places a minimum value on user IDs that may use the cluster. min_uid is a global configuration parameter only. Default value: 0
min_gid Places a minimum value on group IDs that may use the cluster. min_gid is a global configuration parameter only. Default value: 0
user_lists A list of user access lists that grant users access to the cluster. user_lists is a global configuration parameter only. Default value: NONE
xuser_lists A list of user access lists that deny users access to the cluster. xuser_lists is a global configuration parameter only. Default value: NONE
projects A list of all projects that are granted access to Sun Grid Engine. projects is a global configuration parameter only. Default value: NONE
xprojects A list of all projects that are denied access to Sun Grid Engine. xprojects is a global configuration parameter only. Default value: NONE
enforce_project If set to true, users are required to request a project whenever submitting a job. enforce_project is a global configuration parameter only. Default value: false
enforce_user If set to true, a user must exist to allow for job submission. If set to auto, a user object for the submitting user will automatically be created. enforce_user is a global configuration parameter only. Default value: false
load_report_time The time interval between system load reports. Note: Reporting load too frequently may block the master daemon, especially if your cluster contains a large number of execution hosts. Default value: 00:00:40
max_unheard The amount of time (in seconds) during which the master daemon could not contact, or was not contacted by, a host's execution daemon. When this amount of time passes, all queues residing on that host are set to status unknown. max_unheard is a global configuration parameter only. Default value: 00:05:
reschedule_unknown Determines whether the jobs on a host set to unknown should be rescheduled and then sent to other hosts. reschedule_unknown controls the time (in hours) that Sun Grid Engine will wait to reschedule jobs after a host becomes unknown. Default value: 00:00:00
finished_jobs Deprecated.
gid_range A list of range expressions that identify processes belonging to the same job. Since there is no default value, the administrator will need to assign one.
qlogin_command Executed on the client side of a qlogin request. Usually, this is the builtin qlogin mechanism.
qlogin_daemon Specifies the mechanism that is to be started on the server side of a qlogin request. Usually, this is the builtin mechanism. It is also possible to configure an external executable by specifying the full qualified pathname.
rlogin_command Executed on the client side of a qrsh request without a command argument to be executed remotely. Usually, this is the builtin mechanism. If no value is given, a special Grid Engine component is used. The command is automatically started with the target host and port number as parameters. The Grid Engine rlogin client has been extended to accept and use the port number argument. You can only use clients, such as ssh, which also understand this syntax.
rlogin_daemon Specifies the mechanism that is to be started on the server side of a qrsh request without a command argument to be executed remotely. Usually, this is the builtin mechanism.
rsh_command Executed on the client side of a qrsh request with a command argument to be executed remotely. Usually, this is the builtin mechanism. If no value is given, a specialized Grid Engine component is used. The command is automatically started with the target host and port number as parameters, similar to those required for telnet(1), plus the command with its arguments to be executed remotely. The Grid Engine rsh client has been extended to accept and use the port number argument. You can only use clients, such as ssh, that understand this syntax.
rsh_daemon Specifies the mechanism that is to be started on the server side of a qrsh(1) request with a command argument to be executed remotely. Usually, this is the builtin mechanism. If no value is given, a specialized Grid Engine component is used.
max_aj_instances Defines the maximum number of array tasks that can be scheduled to run simultaneously per array job. An instance of an array task is created within the master daemon when it gets a start order from the scheduler. The instance is destroyed when the array task finishes. This feature is most useful for very large clusters and very large array jobs. max_aj_instances is a global configuration parameter only. Default value: 2000
max_aj_tasks Defines the maximum number of array job tasks within an array job. The master daemon will reject all array job submissions that request more than max_aj_tasks array job tasks. Default value: 75000
max_u_jobs The number of active jobs that each user can have in the system simultaneously. A value greater than 0 defines the limit. max_u_jobs is a global configuration parameter only. Default value: 0 (unlimited)
max_jobs The number of active jobs simultaneously allowed in the Sun Grid Engine system. A value greater than 0 defines this limit. max_jobs is a global configuration parameter only. Default value: 0 (unlimited)
max_advance_reservations The number of active advance reservations simultaneously allowed in Sun Grid Engine. A value greater than 0 defines this limit. max_advance_reservations is a global configuration parameter only. Default value: 0 (unlimited)
auto_user_oticket The number of override tickets assigned to automatically created user objects. User objects are created automatically if the enforce_user attribute is set to auto. auto_user_oticket is a global configuration parameter only. Default value: 0
auto_user_fshare The number of functional shares assigned to automatically created user objects. User objects are created automatically if the enforce_user attribute is set to auto. auto_user_fshare is a global configuration parameter only. Default value: 0
auto_user_default_project The default project assigned to automatically created user objects. User objects are created automatically if the enforce_user attribute is set to auto. auto_user_default_project is a global configuration parameter only. Default value: NONE
auto_user_delete_time The number of seconds of inactivity after which automatically created user objects will be deleted. User objects are created automatically if the enforce_user attribute is set to auto. If the user has no active or pending jobs for the specified amount of time, the object will automatically be deleted. A value of 0 can be used to indicate that the automatically created user object is permanent and should not be automatically deleted. auto_user_delete_time is a global configuration parameter only. Default value: 86400
delegated_file_staging This flag must be set to true when the prolog and epilog are ready for delegated file staging.
reprioritize Deprecated.
qmaster_params A list of additional parameters that can be passed to the Sun Grid Engine master daemon. For a complete list of these parameters, see the sge_conf(5) man page.
reporting_params A list of parameters that are used to define the behavior of reporting modules in the Sun Grid Engine master daemon. For a complete list of these parameters, see the sge_conf(5) man page.
jsv_url This setting defines a server JSV instance which will be started and triggered by the sge_qmaster(8) process.
jsv_allowed_mod If a server JSV script is defined with the jsv_url parameter, then all qalter(1) or qmon(1) modification requests for jobs are rejected by the master daemon. With the jsv_allowed_mod parameter, an administrator can allow a set of switches that can then be used with clients to modify certain job attributes.
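
To read a single parameter from a script rather than from the editor, the output of qconf -sconf can be filtered. The following sketch assumes that the output contains one parameter per line in the form "name value" and that the value fits on one line. The helper names conf_value and hms_to_seconds are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# conf_value PARAM: print the value of PARAM from configuration text on stdin.
# Assumption: each line has the form "<name> <value>"; runs of blanks inside
# the value are collapsed to single spaces.
conf_value() {
    awk -v key="$1" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }'
}

# hms_to_seconds HH:MM:SS: convert a time value such as 00:00:40 to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

A possible use is qconf -sconf | conf_value load_report_time, whose result can then be passed to hms_to_seconds. With the default value 00:00:40, hms_to_seconds prints 40.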

How to Configure Clusters With QMON

Note
Any change to a local host configuration supersedes the global configuration parameters.
  1. On the QMON Main Control window, click the Cluster Configuration button, as shown below.
    cluster configuration button on the QMON main control window

  2. You can view all currently configured clusters and perform the following tasks from the Cluster Configuration dialog box:
    • To add a new local host configuration, do the following:
      1. Click the Add button.
        The Cluster Settings dialog box appears.
      2. Enter the name of the host and configuration information.
        The following host names are invalid, reserved, or otherwise not allowed to be used:
        • global
        • template
        • all
        • default
        • unknown
        • none
      3. When you finish making changes, click the OK button to save your changes and close the dialog box.
        Click the Cancel button to close the dialog box without saving changes.
    • To modify the global host configuration, do the following:
      1. Select the name global and then click the Modify button.
        The Cluster Settings dialog box appears, as shown in the following figure. You can modify all parameters. For more information, see Cluster Configuration Parameters.
        Dialog box titled Cluster Settings. Shows General Settings tab with global configuration parameters you can set. Shows OK and Cancel buttons.
      2. When you finish making changes, click the OK button to save your changes and close the dialog box.
        Click the Cancel button to close the dialog box without saving changes.
    • To modify a local host configuration, do the following:
      1. Select a host name and then click the Modify button.
        The Cluster Settings dialog box appears. You can modify only those parameters that are feasible for local host changes. For more information, see Cluster Configuration Parameters.
      2. When you finish making changes, click the OK button to save your changes and close the dialog box.
        Click the Cancel button to close the dialog box without saving changes.
        Note
        The Advanced Settings tab shows a corresponding behavior, depending on whether you are modifying a configuration or are adding a new configuration. The Advanced Settings tab provides access to more rarely used cluster configuration parameters.
    • To delete a local host configuration, select the host whose configuration you want to delete, and then click Delete.

See the sge_conf(5) man page for a complete description of all cluster configuration parameters.


Configuring Hosts

For more information on hosts and daemons, see How the System Operates.

Note
The master host requires no further configuration other than that performed by the installation procedure. For information about how to initially set up the master host, see How to Install the Master Host. For information about how to configure dynamic changes to the master host, see How to Configure the Shadow Master Host below.
Task User Interface Description
How to Configure Execution Hosts CLI or QMON Learn how to display, add, modify, and delete execution hosts. An execution host is a system that has permission to run Grid Engine system jobs.
How to Configure Administration Hosts CLI or QMON Learn how to display, add, and delete administration hosts. An administration host is a system that has permission to carry out administrative activity for the Grid Engine system.
How to Configure Submit Hosts CLI or QMON Learn how to display, add, and delete submit hosts. A submit host is a system that can submit and control batch jobs only.
How to Configure Host Groups CLI or QMON Learn how to display, add, modify, and delete host groups. Host groups enable you to use a single name to refer to multiple hosts.
Configuring a Shadow Master Host From a Command Line CLI Learn how to start and configure shadow master hosts. Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host.
How to Kill Daemons CLI or QMON Learn how to kill the master daemon and execution daemons from the command line and how to kill execution daemons with QMON.
How to Restart Daemons CLI Learn how to restart the master and execution daemons from the command line.
How to Migrate the Master Daemon to Another Host CLI Learn how to migrate the master daemon to another host from the command line.
How to Migrate the Master Daemon to Another Host Using a Script CLI Learn how to migrate the master daemon to another host using a script.

How to Configure Execution Hosts From the Command Line

Before You Begin

Before you configure an execution host, you must first install the software on the execution host as described in How to Install Execution Hosts.

Note
Administrative or submit commands are allowed from execution hosts only if the execution hosts are also declared to be administration or submit hosts. See How to Configure Administration Hosts and How to Configure Submit Hosts.
Configuration Commands

To configure execution hosts from the command line, use the following arguments for the qconf command:

  • To show a specific execution host configuration, type the following command:
    % qconf -se <hostname>
    

    The -se option (show execution host) shows the configuration of the specified execution host as defined in host_conf.

  • To show a list of all configured execution hosts, type the following command:
    % qconf -sel
    

    The -sel option (show execution host list) shows a list of hosts that are configured as execution hosts.

  • To add an execution host, type the following command:
    % qconf -ae [ <hostname> ]
    

    The -ae option (add execution host) displays an editor that contains a configuration template for an execution host. The editor is either the default vi editor or the editor that corresponds to the EDITOR environment variable. If you specify <hostname>, which is the name of an already configured execution host, the configuration of this execution host is used as a template. To configure the execution host, change and save the template. See the host_conf(5) man page for a detailed description of the template entries to be changed. Note: Before you configure an execution host, you must first install the software on the execution host as described in How to Install Execution Hosts.

  • To modify an execution host, type the following command:
    % qconf -me <hostname>
    

    The -me option (modify execution host) displays an editor that contains the configuration file template for the specified execution host. The editor is either the default vi editor or the editor that corresponds to the EDITOR environment variable. To modify the execution host configuration, change and save the configuration file template. See the host_conf(5) man page for a detailed description of the template entries to be changed. For information on host configuration parameters, see Host Configuration Parameters.

  • To modify an execution host from file, type the following command:
    % qconf -Me <filename>
    

    The -Me option (modify execution host from file) uses the content of filename as an execution host configuration template. The configuration in the specified file must refer to an existing execution host. The configuration of this execution host is replaced by the file content. This qconf option is useful for changing the configuration of offline execution hosts, for example, in cron jobs, because the -Me option requires no manual interaction. For information on host configuration parameters, see Host Configuration Parameters.

  • To delete an execution host, type the following command:
    % qconf -de <hostname>
    

    The -de option (delete execution host) deletes the specified host from the list of execution hosts. All entries in the execution host configuration are lost.

For more information on how to configure hosts from file or modify many objects at a time, see Using qconf.
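
The -sel and -se options can be combined to save the configuration of every execution host for review. The following is a minimal sketch. The function name backup_exec_hosts and the .conf file suffix are hypothetical, and the sketch assumes that qconf -sel prints one host name per line. Whether a saved file can be passed back to qconf -Me unchanged is not verified here.

```shell
#!/bin/sh
# Sketch only: backup_exec_hosts is a hypothetical helper name.

# backup_exec_hosts DIR: write the configuration of each execution host
# to DIR/<hostname>.conf.
# Assumption: "qconf -sel" prints one execution host name per line.
backup_exec_hosts() {
    dir=$1
    mkdir -p "$dir" || return 1
    qconf -sel | while read -r host; do
        # Skip empty lines, if any.
        [ -n "$host" ] || continue
        # Keep qconf from consuming the host list on stdin.
        qconf -se "$host" < /dev/null > "$dir/$host.conf"
    done
}
```

For example, backup_exec_hosts /tmp/exec_hosts creates one file per execution host in /tmp/exec_hosts.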


How to Configure Execution Hosts With QMON

Before You Begin

Before you configure an execution host, you must first install the software on the execution host as described in How to Install Execution Hosts.

Note
Administrative or submit commands are allowed from execution hosts only if the execution hosts are also declared to be administration or submit hosts. See How to Configure Administration Hosts and How to Configure Submit Hosts.
Steps
  1. Click the Host Configuration button on the QMON Main Control window.

  2. Click the Execution Host tab.
    shows the execution host tab
    Note the following in the Execution Host tab:
    • The Hosts list displays the execution hosts that are already defined.
    • The Load Scaling list displays the currently configured load-scaling factors for the selected execution host. See Configuring Load Parameters for information about load parameters.
    • The Access Attributes list displays access permissions. See Managing User Access for information about access permissions.
    • The Consumables/Fixed Attributes list displays resource availability for consumable and fixed resource attributes associated with the host. See Configuring Resource Attributes for information about resource attributes.
    • The Reporting Variables list displays the variables that are written to the reporting file when a load report is received from an execution host. See Defining Reporting Variables for information about reporting variables.
    • The Usage Scaling list displays the current scaling factors for the individual usage metrics CPU, memory, and I/O for different machines. Resource usage is reported by sge_execd periodically for each currently running job. The scaling factors indicate the relative cost of resource usage on the particular machine for the user or project running a job. These factors could be used, for instance, to compare the cost of a second of CPU time on a 400 MHz processor to that of a 600 MHz CPU. Metrics that are not displayed in the Usage Scaling list have a scaling factor of 1.
  3. You can show, add, modify, or delete an execution host from the Execution Host tab.
    • To show an execution host, select a host name from the Hosts list.
      The execution host's attributes are displayed.
    • To add an execution host, click the Add button and then type its name in the Host field. To modify an execution host, select it and then click the Modify button.
      The Add/Modify Exec Host dialog box appears as shown in the following figure. For more information, see Host Configuration Parameters.
      Dialog box titled Add/Modify Exec Host. Shows Scaling tab with Load Scaling and Usage Scaling tables. Shows OK and Cancel buttons.
      • To define scaling factors, click the Scaling tab.
        • The Load column of the Load Scaling table lists all available load parameters. The Scale Factor column lists the corresponding definitions of the scaling. You can edit the Scale Factor column. Valid scaling factors are positive floating-point numbers in fixed-point notation or scientific notation.
        • The Usage column of the Usage Scaling table lists the current scaling factors for the usage metrics CPU, memory, and I/O. The Scale Factor column lists the corresponding definitions of the scaling. You can edit the Scale Factor column. Valid scaling factors are positive floating-point numbers in fixed-point notation or scientific notation.
      • To define the resource attributes to associate with the host, click the Consumables/Fixed Attributes tab.
        The Consumables/Fixed Attributes table lists all resource attributes for which a value is currently defined.
        Dialog box titled Add/Modify Exec Host. Shows Consumables/Fixed Attributes tab with table of attributes. Shows Ok and Cancel buttons.
        Tip
        Use the Complex Configuration dialog box if you need more information about the current complex configuration, or if you want to modify it. For details about complex resource attributes, see Configuring Resource Attributes.

        You can enhance the list by clicking either the Name or the Value column name. The Attribute Selection dialog box appears, which includes all resource attributes that are defined in the complex.

        • To add an attribute to the Consumables/Fixed Attributes table, select the attribute, and then click OK.
        • To modify an attribute value, double-click a Value field, and then type a value.
        • To delete an attribute, select the attribute, and then press Control-D or click mouse button 3.
          Click Ok to confirm that you want to delete the attribute.
      • To define user access permissions to the execution host based on previously configured user access lists, click the User Access tab.
        Dialog box titled Add/Modify Exec Host. Shows User Access tab with user access lists. Shows Ok and Cancel buttons.
      • To define reporting variables, click the Reporting Variables tab.
        Dialog box titled Add/Modify Exec Host. Shows Reporting Variables tab with variable lists. Shows Ok and Cancel buttons.
        The Available list displays all the variables that can be written to the reporting file when a load report is received from the execution host.
        • To add a variable to the Selected list, select a reporting variable from the Available list, and then click the red right arrow.
        • To remove a reporting variable from the Selected list, select the variable, and then click the left red arrow.
      • To define project access permissions to the execution host based on previously configured projects, click the Project Access tab.
        Dialog box titled Add/Modify Exec Host. Shows Project Access tab with project access lists. Shows Ok and Cancel buttons.
    • To delete an execution host, select the execution host name from the Hosts list and then click the Delete button.

After you have configured the execution hosts, see Monitoring Hosts.


How to Configure Administration Hosts From the Command Line

To configure administration hosts from the command line, use the following arguments for the qconf command:

  • To show a list of all currently configured administration hosts, type the following command:
    % qconf -sh
    

    The -sh option (show administration hosts) displays a list of all currently configured administration hosts.

  • To add an administration host, type the following command:
    % qconf -ah <hostname>
    

    The -ah option (add administration host) adds the specified host to the list of administration hosts.

  • To delete an administration host, type the following command:
    % qconf -dh <hostname>
    

    The -dh option (delete administration host) deletes the specified host from the list of administration hosts.

For more information on how to configure hosts from file or modify many objects at a time, see Using qconf and the qconf(1) man page.
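
In setup scripts, it can be convenient to add an administration host only if it is not already configured. The following sketch combines the -sh and -ah options for that purpose. The function name ensure_admin_host is hypothetical. The sketch assumes that qconf -sh prints one host name per line and that the name is written exactly as qconf reports it (for example, fully qualified if qconf reports fully qualified names).

```shell
#!/bin/sh
# Sketch only: ensure_admin_host is a hypothetical helper name.

# ensure_admin_host HOST: add HOST as an administration host unless it is
# already in the list shown by "qconf -sh".
# Assumption: HOST is spelled exactly as qconf reports it.
ensure_admin_host() {
    host=$1
    # -F: fixed string, -x: match the whole line, -q: no output.
    if qconf -sh | grep -Fxq "$host"; then
        echo "$host is already an administration host"
    else
        qconf -ah "$host"
    fi
}
```

Calling ensure_admin_host fangorn twice runs qconf -ah only the first time, provided the first call succeeds.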


How to Configure Administration Hosts With QMON

  1. On the QMON Main Control window, click the Host Configuration button.
    The Host Configuration dialog box appears and displays the Administration Host tab as shown in the following figure.
    Dialog box titled Host Configuration. Shows Administration Host tab with Hosts list.
    Note
    The Administration Host tab is displayed by default when you click the Host Configuration button for the first time.


  2. From the Host Configuration dialog box, you can choose from the following functions:
    • To add a new administration host, type its name in the Host field, and then click Add, or press the Return key.
    • To delete an administration host, select the administrative host name from the list, and then click Delete.

How to Configure Submit Hosts From the Command Line

Note
No administrative commands are allowed from submit hosts unless the hosts are also declared to be administration hosts. See How to Configure Administration Hosts for more information.

To configure submit hosts from the command line, use the following arguments for the qconf command:

  • To show a list of the names of all currently configured submit hosts, type the following command:
    % qconf -ss
    

    The -ss option (show submit hosts) displays a list of the names of all currently configured submit hosts.

  • To add a host to the list of submit hosts, type the following command:
    % qconf -as <hostname>
    

    The -as option (add submit host) adds the specified host to the list of submit hosts.

  • To delete a specified host from the list of submit hosts, type the following command:
    % qconf -ds <hostname>
    

    The -ds option (delete submit host) deletes the specified host from the list of submit hosts.

For more information on how to configure hosts from file or modify many objects at a time, see Using qconf or the qconf(1) man page.
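
Because administrative commands are allowed from a submit host only if it is also an administration host, it can help to list the submit hosts that are not administration hosts. The following sketch compares the output of the -ss and -sh options. The function name submit_only_hosts is hypothetical, and the sketch assumes that both options print one host name per line with consistent spelling.

```shell
#!/bin/sh
# Sketch only: submit_only_hosts is a hypothetical helper name.

# Print every submit host that is not also an administration host.
# Assumption: "qconf -ss" and "qconf -sh" print one host name per line.
submit_only_hosts() {
    admin_list=$(qconf -sh)
    qconf -ss | while read -r host; do
        [ -n "$host" ] || continue
        # Report the host if no line of the admin list matches it exactly.
        if ! printf '%s\n' "$admin_list" | grep -Fxq "$host"; then
            echo "$host"
        fi
    done
}
```

Each host printed by submit_only_hosts can submit jobs but cannot run administrative commands until it is added with qconf -ah.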


How to Configure Submit Hosts With QMON

Note
No administrative commands are allowed from submit hosts unless the hosts are also declared to be administration hosts. See How to Configure Administration Hosts for more information.
  1. On the QMON Main Control window, click the Host Configuration button.
    The Host Configuration button is shown below.


  2. Click the Submit Host tab in the Host Configuration dialog box.
    The Submit Host tab is shown below.
    Dialog box titled Host Configuration. Shows Submit Host tab with Host list.

  3. You can view all currently configured submit hosts and perform the following tasks from the Submit Host tab:
    • To add a submit host, type the submit host name in the Host field, and then click Add or press the Return key.
    • To delete a submit host, select the submit host name from the list, and then click Delete.

How to Configure Host Groups From the Command Line

Host groups enable you to use a single name to refer to multiple hosts. A host group can include other host groups as well as multiple individual hosts. Host groups that are members of another host group are subgroups of that host group.

For example, you might define a host group called @bigMachines that includes the following members:

  • @solaris64
  • @solaris32
  • fangorn
  • balrog

The initial @ sign indicates that the name is a host group. The host group @bigMachines includes all hosts that are members of the two subgroups @solaris64 and @solaris32. @bigMachines also includes two individual hosts, fangorn and balrog.

Commands

To configure host groups from the command line, use the following arguments for the qconf command:

  • To show a host group configuration, type the following command:
    % qconf -shgrp <hostgroupname>
    

    The -shgrp option (show host group) shows the configuration of the specified host group.

  • To show the host group as a tree, type the following command:
    % qconf -shgrp_tree <hostgroupname>
    

    The -shgrp_tree option (show host group as a tree) shows the configuration of the specified host group and its sub-hostgroups as a tree.

  • To show a host group with a resolved host list, type the following command:
    % qconf -shgrp_resolved <hostgroupname>
    

    The -shgrp_resolved option (show host group with resolved host list) shows the configuration of the specified host group with a resolved host list.

  • To show the host group list, type the following command:
    % qconf -shgrpl
    

    The -shgrpl option (show host group list) displays a list of all host groups.

  • To add a host group to the list of host groups, type the following command:
    % qconf -ahgrp <hostgroupname>
    

    The -ahgrp option (add host group) adds a new host group to the list of host groups. See the hostgroup(5) man page for a detailed description of the configuration format.

  • To add a host group from a file, type the following command:
    % qconf -Ahgrp <filename>
    

    The -Ahgrp option (add host group from a file) adds a new host group that uses the configuration defined in filename. No editor is displayed. See the hostgroup(5) man page for a detailed description of the configuration format.

  • To modify a host group, type the following command:
    % qconf -mhgrp <hostgroupname>
    

    The -mhgrp option (modify host group) displays an editor that contains the configuration of the specified host group as template. You can modify the groupname, add a host to the hostlist, add a host group as a subgroup, and remove a host or host group. The editor is either the default vi editor or the editor that corresponds to the EDITOR environment variable. To modify the host group configuration, change and save the configuration file template.

  • To modify a host group from a file, type the following command:
    % qconf -Mhgrp <filename>
    

    The -Mhgrp option (modify host group from a file) uses the content of filename as host group configuration template. The configuration in the specified file must refer to an existing host group. You can modify the groupname, add a host to the hostlist, add a host group as a subgroup, and remove a host or host group. The configuration of this host group is replaced by the file content.

  • To delete a host group, type the following command:
    % qconf -dhgrp <hostgroupname>
    

    The -dhgrp option (delete host group) deletes the specified host group from the list of host groups. All entries in the host group configuration are lost.

For more information on how to configure host groups from file or modify many objects at a time, see Using qconf or the qconf(1) man page.
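
The @bigMachines example can be written to a file and then added with the -Ahgrp option. The following sketch generates such a file. The function name hostgroup_config is hypothetical, and the two-line layout (group_name and hostlist, with NONE for an empty list) is an assumption based on the hostgroup(5) format. Check the man page for your release before relying on it.

```shell
#!/bin/sh
# Sketch only: hostgroup_config is a hypothetical helper name.

# hostgroup_config NAME [MEMBER...]: print a host group configuration.
# Assumption: the file format is "group_name <name>" followed by
# "hostlist <members>", as described in hostgroup(5).
hostgroup_config() {
    name=$1
    shift
    # Host group names must begin with an @ sign.
    case "$name" in
        @?*) ;;
        *) echo "error: host group name must begin with @: $name" >&2
           return 1 ;;
    esac
    echo "group_name $name"
    if [ $# -eq 0 ]; then
        echo "hostlist NONE"
    else
        echo "hostlist $*"
    fi
}
```

A possible use is to redirect the output of hostgroup_config @bigMachines @solaris64 @solaris32 fangorn balrog to a file, and then run qconf -Ahgrp with that file name. The subgroups @solaris64 and @solaris32 would need to exist already.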


How to Configure Host Groups With QMON

Host groups enable you to use a single name to refer to multiple hosts. A host group can include other host groups as well as multiple individual hosts. Host groups that are members of another host group are subgroups of that host group.

For example, you might define a host group called @bigMachines that includes the following members:

  • @solaris64
  • @solaris32
  • fangorn
  • balrog

The initial @ sign indicates that the name is a host group. The host group @bigMachines includes all hosts that are members of the two subgroups @solaris64 and @solaris32. @bigMachines also includes two individual hosts, fangorn and balrog.

Steps
  1. On the QMON Main Control window, click the Host Configuration button.
    The Host Configuration dialog box appears.

  2. Click the Host Groups tab.
    The Host Groups tab is shown below.
    Dialog box titled Host Configuration. Shows Host Groups tab with Hostgroup and Members lists.

  3. You can perform the following tasks from the Host Groups tab:
    • To add a host group, do the following:
      1. Click Add.
        The Add/Modify Host Group dialog box appears, as shown in the following figure.
        Dialog box titled Add/Modify Host Group. Shows fields for defining host groups and their members. Shows Ok and Cancel buttons.
      2. Type a host group name in the Hostgroup field.
        The host group name must begin with an @ sign.
      3. Click Ok to save your changes.
        Click Cancel to leave the dialog box without saving your changes.
    • To modify a host group, do the following:
      1. Select the host group in the Hostgroup list and click Modify.
      2. From the Add/Modify Host Group dialog box, you can do the following:
        • To add a host to the Members list for the selected host group, type the host name in the Host field and click the red arrow.
        • To add a host group as a subgroup, select a host group name from the Defined Host Groups list and click the red arrow.
        • To remove a host or a host group, select it from the Members list and click the trash icon.
      3. Click Ok to save your changes.
        Click Cancel to leave the dialog box without saving your changes.
    • To delete a host group, select the host group and click Delete.

Configuring a Shadow Master Host From the Command Line

Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host. When the shadow master daemon detects that the master daemon has failed abnormally, it starts up a new master daemon on the host where the shadow master daemon is running.

Note
The file $SGE_ROOT/$SGE_CELL/common/act_qmaster contains the name of the host that is actually running the sge_qmaster daemon.

If the master daemon is shut down gracefully, the shadow master daemon does not start up. If you want the shadow master daemon to take over after you shut down the master daemon gracefully, remove the lock file that is located in the sge_qmaster spool directory. The default location of this spool directory is $SGE_ROOT/$SGE_CELL/spool/qmaster.

The automatic failover start of a master daemon on a shadow master host takes approximately one minute. Meanwhile, you get an error message whenever a Grid Engine system command is run.

Shadow Master Host File

The shadow master host file, $SGE_ROOT/$SGE_CELL/common/shadow_masters, contains the following:

  • The name of the primary master host, which is the machine where the master daemon initially runs
  • The names of the shadow master hosts

The format of the shadow master host file is as follows:

  • The first line of the file defines the primary master host
  • The following lines define the shadow master hosts, one host per line

The order of the shadow master hosts is significant. The primary master host is the first line in the file. If the primary master host fails to proceed, the shadow master defined in the second line takes over. If this shadow master also fails, the shadow master defined in the third line takes over, and so forth.
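
The file format described above can be sketched as follows. The host names are hypothetical, and a temporary directory stands in for $SGE_ROOT/$SGE_CELL/common so that the fragment does not touch a real cell.

```shell
# Sketch: create a shadow_masters file with the primary master on line 1
# and the shadow masters below it in takeover order.
# Host names and the target directory are hypothetical placeholders.
common_dir="${TMPDIR:-/tmp}/sge_example_common"   # stands in for $SGE_ROOT/$SGE_CELL/common
mkdir -p "$common_dir"

cat > "$common_dir/shadow_masters" <<'EOF'
master1
shadow1
shadow2
EOF

# Print the role of each entry; line 1 is the primary master host.
awk 'NR == 1 { print $1 " (primary master)" }
     NR > 1  { print $1 " (shadow master, takeover position " (NR - 1) ")" }' \
    "$common_dir/shadow_masters"
```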

Starting Shadow Master Hosts

To start a shadow master host, the system must be sure either that the old master daemon has terminated, or that it will terminate without performing actions that interfere with the newly started shadow master.

In very rare circumstances, you might not be able to determine whether the old master daemon has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts.

If every attempt to open a TCP connection to the master daemon fails permanently, make sure that no master daemon is running, and then restart the master daemon manually on any of the shadow master machines. See How to Restart Daemons From the Command Line for further details.

Configuring Shadow Master Hosts Environment Variables

Three environment variables affect the takeover time for a shadow master:

Variable Description
SGE_DELAY_TIME This variable controls the interval in which sge_shadowd pauses if a takeover bid fails. This value is used only when there are multiple sge_shadowd instances that are contending to be the master (the default is 600 seconds).
SGE_CHECK_INTERVAL This variable controls the interval in which the sge_shadowd checks the heartbeat file (the default is 60 seconds).
SGE_GET_ACTIVE_INTERVAL This variable controls the interval when a sge_shadowd instance tries to take over when the heartbeat file has not changed.

These variables interact in the following ways:

  1. The master host updates the heartbeat file every 30 seconds.
  2. The sge_shadowd daemon checks for changes to the heartbeat file every number of seconds defined by the SGE_CHECK_INTERVAL variable. This value must be greater than 30 seconds.
    • If the heartbeat file has been updated, the sge_shadowd daemon restarts the waiting clock.
    • If the heartbeat file has not been updated, the sge_shadowd daemon continues to wait until the number of seconds defined by the SGE_CHECK_INTERVAL variable expires. This action ensures that the sge_shadowd daemon is not too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.
  3. When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file is still not updated.

A reasonable configuration might be to set the SGE_CHECK_INTERVAL to 45 seconds and the SGE_GET_ACTIVE_INTERVAL to 90 seconds. So, after about 2 minutes, the takeover will occur. If you want to check the operation of the shadow host after you have configured these environment variables, you will have to disconnect the master host's network cable to simulate a failure.
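
The example configuration can be sketched in shell as follows. The arithmetic assumes that the two intervals simply add up, which matches the "about 2 minutes" figure above but is only a rough estimate. The variables must be present in the environment of sge_shadowd when it starts.

```shell
# Sketch: set the sge_shadowd timing variables from the example above and
# estimate the delay before a takeover.
# Assumption: the two intervals add up; real timing also depends on where
# in the 30-second heartbeat cycle the master fails.
SGE_CHECK_INTERVAL=45
SGE_GET_ACTIVE_INTERVAL=90
export SGE_CHECK_INTERVAL SGE_GET_ACTIVE_INTERVAL

heartbeat_interval=30   # the master updates the heartbeat file every 30 seconds

if [ "$SGE_CHECK_INTERVAL" -le "$heartbeat_interval" ]; then
    echo "SGE_CHECK_INTERVAL must be greater than $heartbeat_interval seconds" >&2
fi

takeover_estimate=$((SGE_CHECK_INTERVAL + SGE_GET_ACTIVE_INTERVAL))
echo "Estimated takeover delay: ${takeover_estimate} seconds"
```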


How to Configure a Shadow Master Host From the Command Line

Shadow master hosts are machines in the cluster that can detect a failure of the master daemon and take over its role as master host. When the shadow master daemon detects that the master daemon has failed abnormally, it starts up a new master daemon on the host where the shadow master daemon is running.

If the master daemon is shut down gracefully, the shadow master daemon does not start up. If you want the shadow master daemon to take over after you shut down the master daemon gracefully, remove the lock file that is located in the sge_qmaster spool directory. The default location of this spool directory is $SGE_ROOT/$SGE_CELL/spool/qmaster.

The automatic failover start of a master daemon on a shadow master host takes approximately one minute. Meanwhile, you get an error message whenever a Grid Engine system command is run.

  1. Create the shadow_masters file.
    The shadow master host file, $SGE_ROOT/$SGE_CELL/common/shadow_masters, contains the following:
    • The first line of the file defines the primary master host.
    • The following lines define the shadow master hosts, one host per line.
      If the primary master host fails to proceed, the shadow master defined in the second line takes over. If this shadow master also fails, the shadow master defined in the third line takes over, and so forth.
  2. Verify correct permissions.
    All shadow master hosts must have both read and write permission to the master daemon spool directory.

  3. Start the shadow daemon on all shadow master hosts.
    To start a shadow master host, the system must be sure either that the old master daemon has terminated, or that it will terminate without performing actions that interfere with the newly started shadow master. In very rare circumstances, you might not be able to determine whether the old master daemon has terminated or that it will terminate. In such cases, an error message is logged to the messages log file of the sge_shadowd daemons on the shadow master hosts.
    If every attempt to open a TCP connection to the master daemon fails permanently, make sure that no master daemon is running, and then restart the master daemon manually on any of the shadow master machines. See How to Restart Daemons From the Command Line for further details.
    As root on each host, run the following:
    $SGE_ROOT/$SGE_CELL/common/sgemaster -shadowd
    
    Note
    The file $SGE_ROOT/$SGE_CELL/common/act_qmaster contains the name of the host that is actually running the sge_qmaster daemon.


  4. You can configure the takeover time for a shadow master host in the following ways:
    • To configure how long sge_shadowd pauses after a failed takeover bid, set the SGE_DELAY_TIME environment variable. This value is used only when multiple sge_shadowd instances are contending to be the master. The default is 600 seconds.
    • To configure the interval after which a sge_shadowd instance tries to take over when the heartbeat file has not changed, set the SGE_GET_ACTIVE_INTERVAL environment variable.
    • To configure how often sge_shadowd checks the heartbeat file, set the SGE_CHECK_INTERVAL environment variable. The default is 60 seconds.
      A reasonable configuration might be to set the SGE_CHECK_INTERVAL to 45 seconds and the SGE_GET_ACTIVE_INTERVAL to 90 seconds. So, after about 2 minutes, the takeover will occur. If you want to check the operation of the shadow host after you have configured these environment variables, you will have to disconnect the master host's network cable to simulate a failure.
      These variables interact in the following ways:
      1. The master host updates the heartbeat file every 30 seconds.
      2. The sge_shadowd daemon checks for changes to the heartbeat file every number of seconds defined by the SGE_CHECK_INTERVAL variable. This value must be greater than 30 seconds.
        • If the heartbeat file has been updated, the sge_shadowd daemon restarts the waiting clock.
        • If the heartbeat file has not been updated, the sge_shadowd daemon continues to wait until the number of seconds defined by the SGE_CHECK_INTERVAL variable expires. This action ensures that the sge_shadowd daemon is not too aggressive in trying to take over and allows the master host some leeway in updating the heartbeat file.
      3. When the SGE_GET_ACTIVE_INTERVAL has expired, the sge_shadowd daemon takes over if the heartbeat file is still not updated.

Host Configuration Parameters

host_conf reflects the format of the template file for the execution host configuration.

For detailed information about these parameters, see the host_conf(5) man page.

Parameter Description
hostname Name of the execution host.
load_scaling A list of scaling values to be applied to each or part of the load values being reported by the execution daemon on the host and being defined in the cluster global host complex. The load scaling factors are intended to level hardware or operating system specific differences between execution hosts.
complex_values Defines quotas for resource attributes managed via this host.
load_values A list of load values. This entry cannot be configured but is only displayed in reaction to the qconf -se command.
processors Deprecated.
usage_scaling Equivalent to load_scaling, except that the only attributes that can be scaled are cpu for CPU time consumption, mem for memory consumption aggregated over the lifetime of jobs, and io for data transferred via any I/O devices. Default: NONE (no scaling)
user_lists A list of user access lists that grant users access to the host. Default value: NONE
xuser_lists A list of user access lists that deny users access to the host. Default value: NONE
projects A list of all projects that are granted access to the host. Default value: NONE
xprojects A list of all projects that are denied access to the host. Default value: NONE
report_variables A list of variables that are written to the reporting file when a load report arrives from an execution host.

How to Kill Daemons From the Command Line

Before You Begin
Note
You must have manager or operator privileges to use these commands. See Managing User Access for more information about manager and operator privileges.

To keep scheduled and running jobs from being affected by a shutdown procedure, you must first disable all cluster queues, queue instances, and queue domains. Then, wait for all active jobs to finish.

To disable cluster queues, queue instances, and queue domains, type the following command:

% qmod -d {<clusterqueuename> | <queueinstancename> | <queuedomainname>}

Use the qmod -d command for each cluster queue, queue instance, or queue domain before you run the qconf sequence described below. The qmod -d command prevents new jobs from being scheduled to the disabled queue instances.

Next, you should wait until no jobs are running in the queue instances before you kill the daemons. For information about cluster queues, queue instances, and queue domains, see About Configuring Queues.
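
The disable-and-wait procedure above might be scripted along the following lines. This is a sketch, not a supported tool: the function names, the queue name all.q, and the polling interval are illustrative, and it assumes that "qstat -s r -u '*'" prints two header lines followed by one line per running job, and prints nothing when no job is running.

```shell
# Sketch: disable queues, then poll until no running jobs remain.
# Assumption: qstat -s r -u '*' lists the running jobs of all users after
# two header lines and prints nothing when no job is running.

count_running_jobs() {
    qstat -s r -u '*' | awk 'NR > 2' | wc -l | tr -d ' '
}

drain_queues() {
    poll_seconds=$1
    shift
    # Disable each cluster queue, queue instance, or queue domain given.
    for queue in "$@"; do
        qmod -d "$queue"
    done
    # Wait until the running-job count drops to zero.
    while [ "$(count_running_jobs)" -gt 0 ]; do
        sleep "$poll_seconds"
    done
}

# Example call (commented out so that nothing is disabled by accident):
#   drain_queues 30 all.q
```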

Kill Commands

To kill daemons from the command line, use the following arguments for the qconf command:

  • To shut down the master daemon, type the following command:
    % qconf -km
    

    The qconf -km command forces the sge_qmaster process to terminate.

  • To shut down the execution daemons and not cancel active jobs, type the following command:
    % qconf -ke {<hostname>[,...] | all}
    

    The qconf -ke command shuts down the execution daemons. However, it does not cancel or lose active jobs. Jobs that finish while no sge_execd is running on a system are not reported to sge_qmaster until sge_execd is restarted. Use a comma-separated list of the execution hosts you want to shut down, or specify all to shut down all execution hosts in the cluster.

  • To shut down the execution daemons and kill all active jobs, type the following command:
    % qconf -kej {<hostname>[,...] | all}
    

    The qconf -kej command kills all currently active jobs and brings down all execution daemons. Use a comma-separated list of the execution hosts that you want to shut down, or specify all to shut down all execution hosts in the cluster.

For more information on how to configure hosts from file or modify many objects at a time, see Using qconf or the qconf(1) man page.
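
As a sketch of how the qconf -ke host list is formed, the following helper joins host names with commas and falls back to all when no host is given. The function names and host names are hypothetical. The ordering note in the comments assumes that qconf needs a running sge_qmaster to deliver the shutdown request, so the master daemon is stopped last.

```shell
# Join host names with commas, the list format used by qconf -ke.
join_hosts() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Shut down the execution daemons on the given hosts, or on all hosts
# when no host name is passed. Active jobs are not cancelled.
stop_execds() {
    if [ "$#" -eq 0 ]; then
        hosts=all
    else
        hosts=$(join_hosts "$@")
    fi
    qconf -ke "$hosts"
}

# Example (commented out): stop two execution daemons, then the master.
#   stop_execds fangorn balrog && qconf -km
```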


How to Kill Daemons With QMON

Before You Begin

To keep scheduled and running jobs from being affected by a shutdown procedure, you must first disable all cluster queues, queue instances, and queue domains. Then, wait for all active jobs to finish.

To disable cluster queues, queue instances, and queue domains, do the following:

  1. On the QMON Main Control window, click the Queue Control button.
    The Cluster Queue dialog box appears.

  2. You can do the following from the Cluster Queue dialog box:
    • To disable cluster queues, click the Cluster Queues tab, select a cluster queue, and then click the Disable button.
    • To disable queue instances and queue domains, click the Queue Instances tab, select a queue instance or queue domain and then click the Disable button.

  3. Click the Ok button to save your changes.
    Click Cancel to close the dialog box without saving any changes.

  4. Wait for all active jobs to finish.

Steps
To kill the master daemon, you must use the command line interface. For more information, see How to Kill Daemons From the Command Line.

To kill an execution daemon with QMON, do the following:

  1. On the QMON Main Control window, click the Host Configuration button.

  2. Click the Execution Host tab.

  3. In the Execution Host dialog box, select a host, and click Shutdown.

For information on how to restart execution daemons, see How to Restart Daemons From the Command Line.


How to Restart Daemons From the Command Line

To restart daemons from the command line, do the following:

  1. Log in as root on the machine on which you want to restart Grid Engine system daemons.

  2. Type one of the following commands:
    • To restart sge_qmaster, type the following command:
      % $SGE_ROOT/$SGE_CELL/common/sgemaster

    • To restart sge_execd, type the following command:
      % $SGE_ROOT/$SGE_CELL/common/sgeexecd
    

How to Migrate qmaster to Another Host From the Command Line

  1. Stop the master daemon on the current master host.
    Type the following command:
    qconf -km
    


  2. Edit the $SGE_ROOT/$SGE_CELL/common/act_qmaster file according to the following guidelines:
    • Confirm the new master host's name. To get the new master host name, type the following command on the new master host:
      $SGE_ROOT/utilbin/$SGE_ARCH/gethostname
      
    • In the act_qmaster file, replace the current host name with the new master host's name returned by the gethostname utility.

  3. On the new master host, start sge_qmaster:
    $SGE_ROOT/$SGE_CELL/common/sgemaster
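
Step 2 of this procedure could be scripted roughly as shown below. The function name is hypothetical, and the backup copy is an extra precaution that is not part of the documented procedure. Pass the host name exactly as the gethostname utility reports it on the new master host.

```shell
# Sketch of step 2: write the new master host name into act_qmaster.
# Arguments: path to the act_qmaster file, new master host name.
set_act_qmaster() {
    act_file=$1
    new_master=$2
    # Keep a copy of the old entry (precaution, not a documented step).
    cp "$act_file" "$act_file.bak"
    printf '%s\n' "$new_master" > "$act_file"
}

# Example (commented out), run after "qconf -km" has completed:
#   set_act_qmaster "$SGE_ROOT/$SGE_CELL/common/act_qmaster" newmaster
```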
    

How to Migrate qmaster to Another Host Using a Script

Note
Because the spooling database cannot be located on an NFS-mounted file system, the following procedure requires that you use the Berkeley DB RPC server for spooling. If you configure spooling to a local file system, you must transfer the spooling database to a local file system on the new sge_qmaster host.

  1. Check that the new master host has read/write access.
    The new master host must have read/write access to the qmaster spool directory and common directory as does the current master. If the administrative user is the root user (check the global cluster configuration for the setting of admin_user), you should verify that the root user can create files in these directories under the root user name.

  2. Run the migration script on the new master host.
    On the new master host, type the following command as the root user:
    # $SGE_ROOT/$SGE_CELL/common/sgemaster -migrate
    

    This command stops sge_qmaster on the old master host and starts it on the new master host. The master host name listed in the file $SGE_ROOT/$SGE_CELL/common/act_qmaster is automatically changed to the new master host. If qmaster is not running, warning messages will appear and a delay of about one minute will occur until qmaster is started on the new host.

  3. Modify the shadow_masters file if necessary.
    1. Check if the $SGE_ROOT/$SGE_CELL/common/shadow_masters file exists.
      If the file exists, you can add the new qmaster host to this file and remove the old master host, depending on your requirements.
    2. Then stop and restart the sge_shadowd daemons by issuing the following commands on the respective machines:
      $SGE_ROOT/$SGE_CELL/common/sgemaster -shadowd stop
      $SGE_ROOT/$SGE_CELL/common/sgemaster -shadowd start
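
One way to edit the shadow_masters file in step 3 is sketched below. It assumes that you want the new master on the first line and the old master removed, which is only one of the possible arrangements. The function name and host names are hypothetical.

```shell
# Sketch: put the new master first and drop the old master from shadow_masters.
# Arguments: path to shadow_masters, old master name, new master name.
update_shadow_masters() {
    file=$1
    old_master=$2
    new_master=$3
    tmp="$file.new"
    {
        printf '%s\n' "$new_master"
        # Keep every other entry in its original order.
        grep -v -x -F -e "$old_master" -e "$new_master" "$file" || true
    } > "$tmp"
    mv "$tmp" "$file"
}

# Example (commented out):
#   update_shadow_masters "$SGE_ROOT/$SGE_CELL/common/shadow_masters" oldmaster newmaster
```

After the file is changed, the sge_shadowd daemons still need the stop and start commands shown in step 3.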
      

Important Notes About Migration

The migration procedure migrates to the host on which the sgemaster -migrate command is issued. If the file primary_qmaster exists, any subsequent calls of sgemaster on the machine contained in the primary_qmaster file will cause a migration back to that machine. To avoid such a situation, change or delete the $SGE_ROOT/$SGE_CELL/common/primary_qmaster file.

Note
Existence of the primary_qmaster file does not imply that the qmaster is actually running.

Although jobs may continue to run during the migration procedure, the grid should be inactive. While the migration is taking place, any Grid Engine commands that are running, such as qsub or qstat, will return an error.

If the current qmaster is down, the scheduler will not shut down until it times out waiting for contact with the qmaster.

The shadow_masters file has no direct effect on the migration procedure. This file only exists if one or more shadow masters have been configured. For more information on how to set up shadow masters, see Configuring a Shadow Master Host From the Command Line.


Managing User Access

You need to perform the following tasks to set up a user for the Grid Engine system:

  • Assign required logins – Users must have identical accounts on all hosts that they use within the Sun Grid Engine system. No login is required on the machine where sge_qmaster runs. For more information, see Configuring User Access.
  • Set access permissions – The Grid Engine software enables you to restrict user access to the entire cluster, to queues, and to parallel environments. For more information, see Configuring User Access. In addition, you can grant users permission to suspend or enable certain queues. See How to Configure User Access Parameters for more information.
  • Declare a Grid Engine System user – To add users to the share tree or to define functional or override policies for users, you must declare those users to the Grid Engine system. For more information, see Managing Policies and Configuring User Access.
  • Set up project access – If projects are used for the definition of share-based, functional, or override policies, you should give the user access to one or more projects. Otherwise the user's jobs might end up in the lowest possible priority class, which would result in the jobs having access to very few resources. See Managing Policies for more information.
  • Set file access restrictions – Users of the Grid Engine system must have read access to the directory $SGE_ROOT/$SGE_CELL/common. Before a job starts, the execution daemon creates a temporary working directory for the job and changes ownership of the directory to the job owner. The execution daemon runs as root. The temporary directory is removed as soon as the job finishes. The temporary working directory is created under the path defined by the queue configuration parameter tmpdir. See the queue_conf(5) man page for more information. Make sure that temporary directories can be created under the tmpdir location. The directories should be set to Grid Engine system user ownership. Users should be able to write to the temporary directories.
  • Set up site dependencies – By definition, batch jobs do not have a terminal connection. Therefore UNIX commands like stty in the command interpreter's startup resource file (for example, .cshrc for csh) can lead to errors. Check for the occurrence of stty in startup files. Avoid the commands that are described in Verifying the Installation. Because batch jobs are usually run off line, the Grid Engine system notifies users of error events in the following ways:
    • Logs error messages to the Grid Engine system log file.
    • Sends an email to the job owner – If the error log file can't be opened, email is the only way to notify the user of an error event. Therefore, the email system should be properly installed for Grid Engine users.
  • Set up Grid Engine system definition files – You can set up the following definition files for Grid Engine users:

Topic Description
Configuring User Access Learn how to configure manager accounts, operator accounts, user access lists, and user objects.
Configuring Projects Learn how to provide a means to organize joint computational tasks from multiple users. A project also defines resource usage policies for all jobs that belong to such a project.
Configuring Default Requests Learn what default requests are and how they are formatted.
Using Path Aliasing Learn about path aliasing, the format of path-aliasing files, and how path-aliasing files are interpreted.

Configuring User Access

There are four categories of users that each have access to their own set of Grid Engine system commands:

  • Managers – Managers have full capabilities to manipulate the Grid Engine system. By default, the superusers of all administration hosts have manager privileges.
  • Operators – Users who can perform the same commands as managers except that they cannot change the configuration. Operators are supposed to maintain operation.
  • Users – People who can submit jobs to the grid and run them if they have a valid login ID on at least one submit host and one execution host. Users have no cluster management or queue management capabilities.
  • Owners – Users who can suspend or resume and disable or enable the queues they own. Typically, users are declared owners of the queue instances that reside on their desktop workstations. Queue owners can be managers, operators, or users. These privileges are necessary for successful use of qidle. See How to Configure Owners Parameters for more information.

For information on which command capabilities are available to the different user categories, see Command Line Interface Ancillary Programs.

Task User Interface Description
How to Configure Manager Accounts CLI or QMON Learn how to display, add, and delete managers.
How to Configure Operator Accounts CLI or QMON Learn how to display, add, and delete operators.
How to Configure User Access Lists CLI or QMON Learn how to display, add, and delete user access lists.
How to Configure User Objects CLI or QMON Learn how to display, add, modify, and delete user objects.

How to Configure Manager Accounts From the Command Line

To configure a manager account from the command line, use the following arguments for the qconf command:

  • To display a list of all managers, type the following command:
    qconf -sm
    

    The -sm option (show managers) displays a list of all Grid Engine system managers.

  • To add a manager, type the following command:
    qconf -am <username>
    

    The -am option (add manager) adds one or more users to the list of Grid Engine system managers. By default, the root accounts of all trusted hosts are Grid Engine system managers. See About Hosts and Daemons for more information.

  • To delete a manager, type the following command:
    qconf -dm <username>
    

    The -dm option (delete manager) deletes the specified users from the list of Grid Engine system managers.

For more information on how to configure the cluster from file or how to modify many objects at once, see Using qconf.


How to Configure Manager Accounts With QMON

  1. On the QMON Main Control window, click the User Configuration button.
    The Manager tab appears, shown in the following figure, and lists all accounts that have administrative permission.
    Dialog box titled User Configuration. Shows Manager tab with list of managers.

  2. You can perform the following functions from the Manager tab:
    • To add a manager to the Grid Engine system, do the following:
      1. Type the user name in the field above the manager account list.
      2. Click Add or press the Return key.
    • To delete a manager account, select it, and then click Delete.

How to Configure Operator Accounts From the Command Line

To configure an operator account from the command line, use the following arguments for the qconf command:

  • To display a list of operators, type the following command:
    qconf -so 
    

    The -so option (show operators) displays a list of all Grid Engine system operators.

  • To add an operator, type the following command:
    qconf -ao <username>
    

    The -ao option (add operator) adds one or more users to the list of Grid Engine system operators.

  • To delete an operator, type the following command:
    qconf -do <username>
    

    The -do option (delete operator) deletes the specified users from the list of Grid Engine system operators.

For more information on how to configure the cluster from file or how to modify many objects at once, see Using qconf.


How to Configure Operator Accounts With QMON

  1. On the QMON Main Control window, click the User Configuration button, and then click the Operator tab.
    The Operator tab lists all accounts that currently have restricted administrative permission. If an account also has manager access, manager access overrides operator access. See How to Configure Manager Accounts With QMON.

  2. You can perform the following functions from the Operator tab:
    • To add a new operator account, type its name in the field above the operator account list.
      Click Add or press the Return key.
    • To delete an operator account, select it, and then click Delete.

How to Configure User Access Lists From the Command Line

The administrator can restrict access to queues and other facilities, such as parallel environment interfaces. Access can also be restricted to certain users or user groups.

For the purpose of restricting access permissions, the administrator creates and maintains access lists (ACLs). The ACLs contain user names and UNIX group names. The ACLs are then added to access-allowed or access-denied lists in the queue or in the parallel environment interface configurations. For more information, see the queue_conf(5) or sge_pe(5) man pages.

Users who belong to ACLs that are listed in access-allowed-lists have permission to access the queue or the parallel environment interface. Users who are members of ACLs in access-denied-lists cannot access the resource in question.

ACLs can also be used to define projects, to which assigned users can submit their jobs. The administrator can also restrict access to cluster resources on a per-project basis. For more information on projects, see Configuring Projects.

Client Commands

To configure user access lists from the command line, use the following arguments for the qconf command:

  • To display a user access list, type the following command:
    qconf -su <accesslistname> [,...]
    

    The -su option (show user access list) displays the specified access lists.

  • To display all user access lists, type the following command:
    qconf -sul
    

    The -sul option (show user access lists) displays all access lists currently defined.

  • To add one or more users to an access list, type the following command:
    qconf -au <username> [,...] <accesslistname> [,...]
    

    The -au option (add user) adds one or more users to the specified access lists. The Grid Engine software provides the following default accesslistnames:

    • arusers – The arusers access list restricts the use of qrsub to request an advance reservation. Only users listed in this access list may use this option. For more information, see How to Enable a User to Create Advance Reservations.
    • deadlineusers – The deadlineusers access list restricts the use of the qsub -dl date_time switch, which specifies the deadline initiation time. The deadline initiation time is the time at which a job has to reach top priority in order to be completed within a given deadline. Only usernames listed in the deadlineusers access list are allowed to request this option. For more information, see Configuring the Urgency Policy.
    • defaultdepartment – The defaultdepartment access list automatically includes all users who are not assigned to an administrator-created department. For more information, see Configuring the Functional Policy.
  • To add a user access list from a file, type the following command:
    qconf -Au <filename> 
    

    The -Au option (add user access list from file) uses a configuration file, filename, to add an access list.

  • To modify a user access list, type the following command:
    qconf -mu <accesslistname>
    

    The -mu option (modify user access list) modifies the specified access lists. For more information, see User Access List Configuration Parameters.

  • To modify a user access list from a file, type the following command:
    qconf -Mu <filename>
    

    The -Mu option (modify user access list from file) uses a configuration file, filename, to modify the specified access lists. For more information, see User Access List Configuration Parameters.

  • To delete one or more users from an access list, type the following command:
    qconf -du <username> [,...] <accesslistname> [,...]
    

    The -du option (delete user) deletes one or more users from the specified access lists.

  • To delete a user access list, type the following command:
    qconf -dul <accesslistname> [,...]
    

    The -dul option (delete user list) completely removes the specified access lists.

For more information on how to configure the cluster from file or how to modify many objects at once, see Using qconf.
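The qconf options above can be combined into a short script. The following is a minimal sketch rather than a definitive procedure. The user names jdoe and asmith and the access list name crashtest are hypothetical, and run is an illustrative helper that only prints each command unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch: manage a user access list with the qconf options described above.
# Hypothetical names: users jdoe and asmith, access list crashtest.
# By default the commands are only printed. Set DRY_RUN=0 on a host with
# Grid Engine installed to execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

acl_workflow() {
    run qconf -au jdoe,asmith crashtest   # add two users to the access list
    run qconf -su crashtest               # show the resulting access list
    run qconf -du asmith crashtest        # remove one user again
    run qconf -sul                        # list all access lists
}

acl_workflow
```

Running the script in its default dry-run mode lets you review the exact commands before they change the cluster configuration.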


User Access List Configuration Parameters

For more information, see the access_list(5) man page.

Access lists are used in Grid Engine to define access permissions of users to queues or parallel environments.

Parameter Description
name Name of the access list.
type The type of access list. Can be defined as an ACL, DEPT, or both.
oticket Amount of override tickets currently assigned to the department.
fshare Current functional share of the department.
entries A list of UNIX users or primary UNIX groups that are assigned to the access list or department.
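As an illustration of how these parameters fit together, the following sketch writes an access list definition to a file that could be passed to qconf -Au or qconf -Mu. The list name crashtest, the user jdoe, and the group @staff are hypothetical, and the exact file syntax should be checked against the access_list(5) man page.

```shell
#!/bin/sh
# Sketch: write a user access list definition to a file for qconf -Au.
# Hypothetical values: list name crashtest, user jdoe, UNIX group @staff.
# The keys are the parameters from the table above. Check the exact file
# syntax against access_list(5) before use.
acl_file=${TMPDIR:-/tmp}/crashtest.acl

cat > "$acl_file" <<'EOF'
name    crashtest
type    ACL
fshare  0
oticket 0
entries jdoe,@staff
EOF

cat "$acl_file"

# On a host with Grid Engine installed, the file could then be loaded with:
#   qconf -Au "$acl_file"
```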

How to Configure User Access Lists With QMON

The administrator can restrict access to queues and other facilities, such as parallel environment interfaces. Access can also be restricted to certain users or user groups.

For the purpose of restricting access permissions, the administrator creates and maintains access lists (ACLs). The ACLs contain user names and UNIX group names. The ACLs are then added to access-allowed or access-denied lists in the queue or in the parallel environment interface configurations. For more information, see the queue_conf(5) or sge_pe(5) man pages.

Users who belong to ACLs that are listed in access-allowed-lists have permission to access the queue or the parallel environment interface. Users who are members of ACLs in access-denied-lists cannot access the resource in question.

ACLs can also be used to define projects, to which assigned users can submit their jobs. The administrator can also restrict access to cluster resources on a per project basis. For more information on projects, see Configuring Projects.

Steps
  1. On the QMON Main Control window, click the User Configuration button, and then click the Userset tab.
    The Userset tab appears as shown in the following figure.
    [Figure: User Configuration dialog box showing the Userset tab with a list of usersets.]
    In the Grid Engine system, a userset can be either an access list, a Department, or both. The check boxes below the Usersets list indicate the type of the selected userset. This section describes access lists. Departments are explained in How to Configure User Objects.
    The Usersets list displays all available access lists. The Grid Engine software provides the following default access lists:
    • arusers – The arusers access list restricts the use of qrsub to request an advance reservation. Only users listed in this access list may use this option. For more information, see How to Enable a User to Create Advance Reservations.
    • deadlineusers – The deadlineusers access list restricts the use of the qsub -dl date_time switch, which specifies the deadline initiation time. The deadline initiation time is the time at which a job has to reach top priority in order to be completed within a given deadline. Only usernames listed in the deadlineusers access list are allowed to request this option. For more information, see Configuring the Urgency Policy.
    • defaultdepartment – The defaultdepartment access list automatically includes all users who are not assigned to an administrator-created department. For more information, see Configuring the Functional Policy.

To display currently defined users and groups, select the userset.

Note
The names of groups are prefixed with an @ sign.


  2. You can perform the following tasks from the Userset tab:
    • To add a new userset, click Add.
      An Access List Definition dialog box appears, as shown in the figure below. The Users/Groups list displays all currently defined users and groups.
      [Figure: QMON dialog box showing the Userset Name and User/Group fields.]
      • To add a new access list definition, type the name of the access list in the Userset Name field and click Ok.
      • To add a new user or group to the access list, type a user or group name in the User/Group field and then click Ok. Be sure to prefix group names with an @ sign.
      • To delete a user or group from the Users/Groups list, select it and then click the trash icon.

    • To modify an existing userset, select it, and then click Modify.
      An Access List Definition dialog box appears with the name of the current userset in the Userset Name field.

    • To delete a userset, select it, and then click Delete.
  3. To save your changes and close the dialog box, click OK.
    Click Cancel to close the dialog box without saving changes.

How to Configure User Objects From the Command Line

To configure a user object from the command line, use the following arguments for the qconf command:

  • To display the configuration for a specific user, type the following command:
    qconf -suser <username>
    

    The -suser option (show user) displays the configuration of the specified user.

  • To display a list of all currently defined users, type the following command:
    qconf -suserl
    

    The -suserl option (show user list) displays a list of all currently defined users.

  • To add a user, type the following command:
    qconf -auser
    

    The -auser option (add user) opens a template user configuration in an editor. See the user(5) man page. The editor is either the default vi editor or the editor specified by the EDITOR environment variable. After you save your changes and exit the editor, the changes are registered with sge_qmaster.

  • To add a user from file, type the following command:
    qconf -Auser <filename>
    

    The -Auser option (add user from file) parses the specified file and adds the user configuration. The file must have the format of the user configuration template.

  • To modify a user, type the following command:
    qconf -muser <username>
    

    The -muser option (modify user) enables you to modify an existing user entry. The option loads the user configuration in an editor. From the editor, you can modify a user's configuration, including the user's default project. After you save your changes and exit the editor, the changes are registered with sge_qmaster. For more information, see User Configuration Parameters.

  • To modify a user from file, type the following command:
    qconf -Muser <filename>
    

    The -Muser option (modify user from file) parses the specified file and modifies the user configuration. The file must have the format of the user configuration template. For more information, see User Configuration Parameters.

  • To delete a user, type the following command:
    qconf -duser <username> [,...]
    

    The -duser option (delete user) deletes one or more user objects.

For more information on how to configure the cluster from file or how to modify many objects at once, see Using qconf.


User Configuration Parameters

For detailed information about these parameters, see the user(5) man page.

A user entry is used to store ticket and usage information on a per user basis. Maintaining user entries for all users participating in a Grid Engine system is required if Grid Engine is operated under a user share tree policy. For more information, see Configuring the Share-Based Policy.

Parameter Description
name User name
oticket Amount of override tickets currently assigned to the user.
fshare Functional share of the user.
default_project Default project of the user.
delete_time Deprecated.
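The following sketch shows what a user configuration file for qconf -Auser or qconf -Muser might look like. The user name jdoe is hypothetical, and the layout is an assumption based on the parameters listed above, so compare it with the template described in the user(5) man page.

```shell
#!/bin/sh
# Sketch: write a user object definition to a file for qconf -Auser.
# Hypothetical user name: jdoe. The keys are the parameters from the table
# above. delete_time is deprecated and is left at 0 here.
user_file=${TMPDIR:-/tmp}/jdoe.user

cat > "$user_file" <<'EOF'
name            jdoe
oticket         0
fshare          0
delete_time     0
default_project NONE
EOF

cat "$user_file"

# On a host with Grid Engine installed, the file could then be loaded with:
#   qconf -Auser "$user_file"
```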

How to Configure User Objects With QMON

  1. On the QMON Main Control window, click the User Configuration button.

  2. Click the User tab.
    The User tab appears as shown in the following figure:
    [Figure: User Configuration dialog box showing the User tab with a list of users and the User field.]

  3. You can view all currently configured users and perform the following functions from the User Tab:
    • To add a new user, type a user name in the field above the User list, and then click Add or press the Return key.
    • To delete a user, select the user name in the User list, and then click Delete.
      The Delete Time column is read-only. The column indicates the time at which automatically created users are to be deleted from the Grid Engine system. Zero indicates that the user will never be deleted.
    • To assign a default project, do the following:
      1. Select a user, and then click the Default Project column heading.
        A Project Selection dialog box appears, as shown below.
        [Figure: Select an Item dialog box showing the Available Projects list and the Select a Project field.]
        You can assign a default project to each user. The default project is attached to each job that users submit, unless those users request another project to which they have access.
        Departments are used for the configuration of the functional policy and the override policy. Departments differ from access lists in that a user can be a member of only one department, whereas one user can be included in multiple access lists. For more details, see Configuring the Functional Policy and Configuring the Override Policy.
        A Userset is identified as a department by the Department flag. A Userset can be defined as both a department and an access list at the same time. However, the restriction of only a single appearance by any user in any department applies.
      2. Select a project for the highlighted user entry.
      3. Click OK to assign the default project and close the dialog box.
        Click Cancel to close the dialog box without assigning the default project.

Configuring Projects

Projects provide a means to organize joint computational tasks from multiple users. A project also defines resource usage policies for all jobs that belong to such a project.

Projects must be declared before they can be used in any of the three scheduling policies. Projects are used in three policy areas: share-based, functional, and override.

Grid Engine system managers define projects by giving them a name and some attributes. Grid Engine users can attach a job to a project when they submit the job. Attachment of a job to a project influences the job's dispatching, depending on the project's share of share-based, functional, or override tickets.

Task User Interface Description
How to Define Projects CLI or QMON Learn how to display, add, modify, and delete projects.

How to Configure Projects From the Command Line

Note
If you would like to give all users access to a project, leave the default ACL setting at NONE. If the ACL is set to defaultdepartment, users cannot submit jobs to the project.

To define projects from the command line, use the following arguments for the qconf command:

  • To display the configuration for a specific project, type the following command:
    qconf -sprj <projectname>
    

    The -sprj option (show project) displays the configuration of a particular project.

  • To display a list of all currently configured projects, type the following command:
    qconf -sprjl
    

    The -sprjl option (show project list) displays a list of all currently defined projects.

  • To add a project, type the following command:
    qconf -aprj
    

    The -aprj option (add project) opens a template project configuration in an editor. See the project(5) man page. The editor is either the default vi editor or the editor specified by the EDITOR environment variable. After you save your changes and exit the editor, the changes are registered with sge_qmaster.

  • To add a project from file, type the following command:
    qconf -Aprj <filename>
    

    The -Aprj option (add project from file) parses the specified file and adds the new project configuration. The file must have the format of the project configuration template.

  • To modify a project, type the following command:
    qconf -mprj <projectname>
    

    The -mprj option (modify project) enables you to modify an existing project. The option loads the project configuration in an editor. The editor is either the default vi editor or the editor specified by the EDITOR environment variable. After you save your changes and exit the editor, the changes are registered with sge_qmaster. For more information, see Project Configuration Parameters.

  • To modify a project from file, type the following command:
    qconf -Mprj <filename>
    

    The -Mprj option (modify project from file) parses the specified file and modifies the existing project configuration. The file must have the format of the project configuration template. For more information, see Project Configuration Parameters.

  • To delete a project, type the following command:
    qconf -dprj <projectname> [,...]
    

    The -dprj option (delete project) deletes one or more projects.

For more information on how to configure projects from file or how to modify many objects at once, see Using qconf.


Project Configuration Parameters

For detailed information about these parameters, see the project(5) man page.

Jobs can be submitted to projects, and a project can be assigned a certain level of importance through the functional or the override policy. This level of importance is then inherited by the jobs executing under that project.

Parameter Description
name The name of the project.
oticket The amount of override tickets currently assigned to the project.
fshare The current functional share of the project.
acl A list of user access lists referring to users who are allowed to submit jobs to the project.
xacl A list of user access lists referring to users who are not allowed to submit jobs to the project.
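The following sketch generates a project definition file that could be passed to qconf -Aprj or qconf -Mprj. The project name simulation and the access list name contractors are hypothetical, write_project_file is an illustrative helper, and the file layout is an assumption based on the parameters above that should be checked against the project(5) man page.

```shell
#!/bin/sh
# Sketch: generate a project definition file for qconf -Aprj.
# Hypothetical names: project simulation, access list contractors.
write_project_file() {
    # $1 = project name, $2 = acl value, $3 = xacl value, $4 = output file
    cat > "$4" <<EOF
name    $1
oticket 0
fshare  0
acl     $2
xacl    $3
EOF
}

prj_file=${TMPDIR:-/tmp}/simulation.prj
# An acl value of NONE leaves the project open to all users except those
# in the access lists named by xacl.
write_project_file simulation NONE contractors "$prj_file"
cat "$prj_file"

# On a host with Grid Engine installed, the file could then be loaded with:
#   qconf -Aprj "$prj_file"
```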

How to Configure Projects With QMON

Grid Engine system managers can define and update definitions of projects by using the Project Configuration dialog box.

  1. On the QMON Main Control window, click the Project Configuration button.
    The Project Configuration dialog box appears, as shown below. The currently defined projects are displayed in the Projects list. The project definition of a selected project is displayed under Configuration.
    [Figure: Project Configuration dialog box showing the Projects and Configuration lists.]

  2. You can perform the following tasks from the Project Configuration dialog box:
    • To add a new project, do the following:
      1. Click Add.
        The Add/Modify Project dialog box appears.
      2. Click Ok to save changes.
        Click Cancel to leave the dialog box without saving your changes.
    • To modify a project, do the following:
      Note
      If you would like to give all users access to a project, leave the default ACL setting at NONE. If the ACL is set to defaultdepartment, users cannot submit jobs to the project.
      1. Select a project, and then click Modify.
        The Add/Modify Project dialog appears, as shown below.
        [Figure: Add/Modify Project dialog box.]
        The Add/Modify Project dialog box shows the following access lists:
        • The User Lists column shows access lists that contain users who have permission to access the project.
        • The Xuser Lists column shows access lists that contain users who do not have permission to access the project.
        • If both lists are empty, all users can access the project. If a user belongs to an access list that is in the User Lists and to an access list that is in the Xuser Lists, the user is denied access to the project.
          The name of the selected project is displayed in the Name field. The project defines the access lists of users who are permitted access or who are denied access to the project. See Configuring Users for more information.
      2. To add or remove users from the User Lists or Xuser Lists, click the button at the right of the User Lists or the Xuser Lists.
        The Select Access Lists dialog box, shown in the figure below, appears. Currently defined access lists are displayed under Available Access Lists. Currently selected access lists are displayed under Chosen Access Lists. You can select access lists in either list. You can move access lists from one list to the other by using the red arrows.
        [Figure: Select Access Lists dialog box showing the Available Access Lists and Chosen Access Lists.]
      3. Click Ok to save your changes.
        Click Cancel to close the dialog box without saving any changes.
    • To delete a project, select it, and then click Delete.
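The access rule for the User Lists and Xuser Lists columns can be sketched as a small shell function. This is an illustration of the rule as described in this section, not the Grid Engine implementation. The handling of the cases that the text does not spell out (a user who is in neither column's lists) is an assumption, and the access list names used in the calls are hypothetical.

```shell
#!/bin/sh
# Sketch of the project access rule described above.
# Arguments (space-separated lists of access list names, "" = empty):
#   $1  access lists in the project's User Lists column (acl)
#   $2  access lists in the project's Xuser Lists column (xacl)
#   $3  access lists that the user belongs to
# Prints "allowed" or "denied".
in_any() {
    # Succeeds if any name in $1 also appears in $2.
    for a in $1; do
        for b in $2; do
            [ "$a" = "$b" ] && return 0
        done
    done
    return 1
}

project_access() {
    acl=$1
    xacl=$2
    member_of=$3
    if in_any "$xacl" "$member_of"; then
        # Membership in a denied list always wins.
        echo denied
    elif [ -z "$acl" ] || in_any "$acl" "$member_of"; then
        # Assumption: an empty User Lists column does not restrict access.
        echo allowed
    else
        echo denied
    fi
}

project_access "" "" "crashtest"
project_access "crashtest" "contractors" "crashtest contractors"
```

The first call models a project with both columns empty, and the second models a user who belongs to lists in both columns.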

Configuring Default Requests

Typically, a user defines a request profile for a particular job. Batch jobs are normally assigned to queues with respect to this request profile, and the scheduler considers only those queues that satisfy the set of requests for the job. If the user does not specify any requests for a job, the scheduler considers any queue to which the user has access without further restrictions.

The Grid Engine software enables administrators to configure default requests that define resource requirements for jobs even when the user does not specify resource requirements explicitly.

You can configure default requests globally for all users of a cluster, as well as privately for any user. The default request configuration is stored in default request files. The global request file is located under $SGE_ROOT/$SGE_CELL/common/sge_request. The user-specific request file can be located either in the user's home directory or in the current working directory. The working directory is where the qsub command is run. The user-specific request file is called .sge_request.

If these files are present, they are evaluated for every job. The order of evaluation is as follows:

  1. The global default request file
  2. The user default request file in the user's home directory
  3. The user default request file in the current working directory
Note
The requests specified in the job script or supplied with the qsub command take precedence over the requests in the default request files. See Submitting Jobs for details about how to request resources for jobs explicitly.

You can prevent the Grid Engine system from using the default request files by using the qsub -clear command, which discards any previous requirement specifications.

Format of Default Request Files

The format of both the local and the global default request files is as follows:

  • Default request files can contain any number of lines. Blank lines and lines that begin with a # sign are skipped.
  • Each line not to be skipped can contain any qsub option, as described in the qsub(1) man page. More than one option per line is allowed. The batch script file and the argument options to the batch script are not considered to be qsub options. Therefore these items are not allowed in a default request file.
  • The qsub -clear command discards any previous requirement specifications in the currently evaluated request file or in request files processed earlier.

Suppose a user's local default request file is configured as shown in the following example, and the user submits a script named test.sh.

Example of a Default Request File

# Local Default Request File
# exec job on a solaris64 queue offering 5h cpu
-l arch=solaris64,s_cpu=5:0:0
# exec job in current working dir
-cwd

To run the script, the user types the following command:

% qsub test.sh

The effect of submitting test.sh with this default request file in place is the same as if the user specified all qsub options directly on the command line, as follows:

% qsub -l arch=solaris64,s_cpu=5:0:0 -cwd test.sh
Note
Like batch jobs submitted using qsub, interactive jobs submitted using qsh consider default request files also. Interactive or batch jobs submitted using QMON also take these request files into account.
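The evaluation order and the file format described in this section can be sketched as follows. This is an approximation for illustration only, not the logic that qsub actually runs. The function name effective_requests and the demo file names are hypothetical.

```shell
#!/bin/sh
# Sketch: compute the effective default requests from up to three files,
# read in the order described above (global, home directory, working
# directory). Missing files are skipped.
effective_requests() {
    opts=""
    set -f                               # no filename expansion on words
    for f in "$@"; do
        [ -r "$f" ] || continue
        while IFS= read -r line || [ -n "$line" ]; do
            case $line in
                ''|'#'*) continue ;;     # skip blank lines and comments
            esac
            for word in $line; do
                if [ "$word" = "-clear" ]; then
                    opts=""              # discard everything collected so far
                else
                    opts="$opts $word"
                fi
            done
        done < "$f"
    done
    set +f
    printf '%s\n' "${opts# }"
}

demo=$(mktemp -d)
printf '%s\n' '# global defaults' '-l arch=solaris64' > "$demo/sge_request"
printf '%s\n' '-clear' '-l s_cpu=5:0:0 -cwd' > "$demo/home_request"
effective_requests "$demo/sge_request" "$demo/home_request" "$demo/cwd_request"
```

In this demo the -clear switch in the second file discards the arch request from the first file, so only the s_cpu and -cwd requests remain.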

Using Path Aliasing

In Solaris and in other networked UNIX environments, users often have the same home directory, or part of it, on different machines. For example, consider user home directories that are available across NFS and automounter. A user might have a home directory /home/foo on the NFS server. This home directory is accessible under this path on all properly installed NFS clients that are running automounter. However, /home/foo on a client is just a symbolic link to /tmp_mnt/home/foo. /tmp_mnt/home/foo is the actual location on the NFS server from where automounter physically mounts the directory.

A user on a client host might use the qsub -cwd command to submit a job from somewhere within the home directory tree. The -cwd flag requires the job to be run in the current working directory. The current working directory on the submit host is /tmp_mnt/home/foo, which is the physical location on the submit host, and this path is passed to the execution host. However, if the execution host is the NFS server, the Grid Engine system might not be able to locate the current working directory on that host, because the physical home directory path there is /home/foo, not /tmp_mnt/home/foo.

Other occasions that can cause similar problems are the following:

  • Fixed NFS mounts with different mount point paths on different machines. An example is the mounting of home directories under /usr/people on one host and under /usr/users on another host.
  • Symbolic links from outside into a network-available file system.

To prevent such problems, the Grid Engine software enables both the administrator and the user to configure a path-aliasing file. The locations of the two files are as follows:

  • $SGE_ROOT/$SGE_CELL/common/sge_aliases – A global path-aliasing file for the cluster
  • $HOME/.sge_aliases – A user-specific path-aliasing file
    Note
    Only an administrator should modify the global file.

Format of Path-Aliasing Files

Both path-aliasing files share the same format:

  • Blank lines and lines that begin with a # sign are skipped.
  • Each line, other than a blank line or a line preceded by #, must contain four strings separated by any number of blanks or tabs. The strings are as follows:
    • The first string specifies a source path.
    • The second string specifies a submit host.
    • The third string specifies an execution host.
    • The fourth string specifies the source path replacement.
  • Both the submit host and the execution host strings can be an * (asterisk), which matches any host.

How Path-Aliasing Files Are Interpreted

The files are interpreted in the following order:

  1. After qsub retrieves the physical current working directory path, the global path-aliasing file is read, if present.
    The user path-aliasing file is read afterwards, as if the user path-aliasing file were appended to the global file.

  2. Lines not to be skipped are read from the top of the file, one by one. The translations specified by those lines are stored, if necessary.
    A translation is stored only if both of the following conditions are true:
    • The submit host string matches the host on which the qsub command is run.
    • The source path forms the initial part either of the current working directory or of the source path replacements already stored.
  3. After both files are read, the stored path-aliasing information is passed to the execution host along with the submitted job.

  4. On the execution host, the path-aliasing information is evaluated. The source path replacement replaces the leading part of the current working directory if the execution host string matches the execution host. In this case, the current working directory string is changed. To be applied, subsequent path aliases must match the replaced working directory path.
    The following is an example of how the NFS automounter problem described earlier can be resolved with an aliases file entry.
Example – Path Aliasing File
# cluster global path aliases file
# src-path  subm-host   exec-host   dest-path
/tmp_mnt/   *           *           /
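The interpretation rules above can be approximated with a short shell function. This sketch collapses the submit-side and execution-side phases into a single pass and is not the Grid Engine implementation. The function name apply_aliases and the host names client1 and nfsserver are hypothetical.

```shell
#!/bin/sh
# Sketch: apply path aliases from one aliasing file to a working directory.
# Arguments: $1 = alias file, $2 = submit host, $3 = execution host,
#            $4 = current working directory on the submit host.
apply_aliases() {
    alias_file=$1
    submit_host=$2
    exec_host=$3
    path=$4
    while read -r src subm exech dest || [ -n "$src" ]; do
        case $src in
            ''|'#'*) continue ;;         # skip blank lines and comments
        esac
        # The host fields must match, or be the * wildcard.
        [ "$subm" = "*" ] || [ "$subm" = "$submit_host" ] || continue
        [ "$exech" = "*" ] || [ "$exech" = "$exec_host" ] || continue
        # Replace the leading part of the path if the source path matches.
        case $path in
            "$src"*) path=$dest${path#"$src"} ;;
        esac
    done < "$alias_file"
    printf '%s\n' "$path"
}

alias_file=$(mktemp)
cat > "$alias_file" <<'EOF'
# cluster global path aliases file
# src-path  subm-host   exec-host   dest-path
/tmp_mnt/   *           *           /
EOF

apply_aliases "$alias_file" client1 nfsserver /tmp_mnt/home/foo
```

With the example aliases file, the working directory /tmp_mnt/home/foo is rewritten to /home/foo, which is the path that exists on the NFS server.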

Configuring Queue Calendars

Queue calendars define the availability of queues according to the day of the year, the day of the week, or the time of day. You can configure queues to change their status at specified times. You can change the queue status to disabled, enabled, suspended, or resumed (unsuspended).

The Grid Engine system enables you to define a site-specific set of calendars, each of which specifies status changes and the times at which the changes occur. These calendars can be associated with queues. Each queue can attach a single calendar, thereby adopting the availability profile defined in the attached calendar.

The syntax of the calendar format is described in detail in the calendar_conf(5) man page. A few examples are given in the next sections, along with a description of the corresponding administration facilities.

For information about configuring queues, see Configuring Queues.

Task User Interface Description
How to Configure Queue Calendars CLI or QMON Learn how to display, add, modify, and delete queue calendars.

How to Configure Queue Calendars From the Command Line

To configure queue calendars from the command line, use the following arguments for the qconf command:

  • To display the configuration for a specific calendar, type the following command:
    qconf -scal <calendarname>
    
  • To display a list of all configured calendars, type the following command:
    qconf -scall
    
  • To add a new calendar configuration, type the following command:
    qconf -acal <calendarname>
    

    The -acal option (add calendar) adds a new calendar configuration named calendarname to the cluster. An editor with a template configuration appears, enabling you to define the calendar.

  • To add a calendar from file, type the following command:
    qconf -Acal <filename>
    

    The -Acal option (add calendar from file) adds a new calendar configuration to the cluster. The added calendar is read from the specified file.

  • To modify a calendar, type the following command:
    qconf -mcal <calendarname>
    

    The -mcal option (modify a calendar configuration) modifies an existing calendar configuration named calendarname. An editor opens calendarname, enabling you to make changes to the definition.

  • To modify a calendar from file, type the following command:
    qconf -Mcal <filename>
    

    The -Mcal option (modify calendar from file) modifies an existing calendar configuration. The calendar to modify is read from the specified file.

  • To delete a calendar, type the following command:
    qconf -dcal <calendarname> [,...]
    

    The -dcal option (delete calendar) deletes the specified calendar.

For more information on how to configure the queue calendars from file or how to modify many objects at once, see Using qconf.
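The following sketch writes a calendar definition to a file that could be passed to qconf -Acal. The calendar name workhours_off is hypothetical, and the year and week syntax shown is an assumption based on a reading of the calendar_conf(5) man page, so verify it there before use.

```shell
#!/bin/sh
# Sketch: write a calendar definition to a file for qconf -Acal.
# Hypothetical calendar name: workhours_off. The syntax below is an
# assumption based on calendar_conf(5). It is meant to disable queues on
# weekdays between 6:00 and 20:00.
cal_file=${TMPDIR:-/tmp}/workhours_off.cal

cat > "$cal_file" <<'EOF'
calendar_name  workhours_off
year           NONE
week           mon-fri=6-20=off
EOF

cat "$cal_file"

# On a host with Grid Engine installed:
#   qconf -Acal "$cal_file"      # add the calendar from the file
#   qconf -scal workhours_off    # display the stored definition
```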



Using Job Submission Verifiers

Job Submission Verifiers (JSVs) allow users and administrators to define rules that determine which jobs are allowed to enter into a cluster and which jobs should be rejected immediately. A JSV is a script or binary that can be used to verify, modify, or reject a job during the time of job submission or on the master host.

The following are examples of how an administrator might use JSVs:

  • To verify that a user has write access to certain file systems.
  • To make sure that jobs do not contain certain resource requests, such as memory resource requests (h_vmem or h_data).
  • To add resource requests to a job that the submitting user may not know are necessary in the cluster.
  • To attach a user's job to a specific project or queue to ensure that cluster usage is accounted for correctly.
  • To inform a user about certain job details like queue allocation, account name, parallel environment, total number of tasks per node, and other job requests.

A verification can be performed by a client JSV instance at the time of job submission, by a server JSV instance on the master host, or by a combination of client JSVs and server JSVs. In general, client JSVs should meet your cluster's needs. See below for more information on what client JSVs and server JSVs have to offer.

Job Submission Verifier Topics

Topic Description
Understanding the Differences Between Client JSVs and Server JSVs Before you get started, it is important that you learn the differences between client JSVs and server JSVs.
Writing JSV Scripts Learn how different programming languages can impact the performance of your cluster, about JSV script-based functions, and how to write JSV scripts.
JSV Verification Process Learn about how the Sun Grid Engine system executes JSVs.
Configuring JSVs Learn how to configure JSVs.
JSV Communication Protocol Learn about the communication protocol used by JSV instances to communicate with a client process and/or the master daemon.

Understanding the Differences Between Client JSVs and Server JSVs

To maximize this feature's usefulness, it is important to understand the differences between client JSVs and server JSVs.

Configured by
  • Client JSV: Any user (see below under Configuration Location(s) for information on the exception)
  • Server JSV: An administrator or a user who has been given administrative access in the global configuration
Executed by
  • Client JSV: Client application
  • Server JSV: Master daemon
Lifetime of JSV Process
  • Client JSV: After the verification is completed, regardless of the result, the JSV instance is stopped.
  • Server JSV: The JSV remains active and unchanged as long as the master daemon remains running.
Configuration Location(s) You can configure client JSVs in the following four locations:
  • By adding the -jsv submit parameter and a jsv_url to a submit client (qsub, qrsh, qsh, and qlogin only) at the time of job submission.
    For more information on jsv_url syntax, see the sge_types(1) man page. For more information on -jsv, see the qsub(1) man page.
  • By adding the -jsv switch to the $cwd/.sge_request file.
    For more information on sge_request file syntax, see the sge_request(5) man page.
  • By adding the -jsv switch to the $HOME/.sge_request file.
  • By adding the -jsv switch to the $SGE_ROOT/$SGE_CELL/common/sge_request file.
You can only configure server JSVs in the following location:
  • By including a jsv_url in the global configuration.
    For more information on the cluster configuration parameters, see the sge_conf(5) man page. For more information on jsv_url syntax, see the sge_types(1) man page.
Additional Information Note the following about client JSVs:
  • The execution context for all client JSV configuration options is the same.
  • Client JSVs do not consume cluster resources because they are executed during job submission. In clusters where hundreds of jobs are submitted per second, server JSVs can reduce the submit rate by 30% or more.
  • Client JSVs log activity to stdout.
  • Client JSVs are an excellent way to test server JSV scripts.
  • A client JSV defined in the global sge_request file can only be defined by an administrator.
  • A client JSV defined in the global sge_request file is executed last of all client JSVs. This means this JSV script can touch a job after all user-defined JSV scripts have been executed.
Note the following about server JSVs:
  • Server JSVs cannot be bypassed by end users.
  • Server JSV scripts can perform administrative functions.
  • Server JSVs can log activity to the global message file.
  • Server JSVs can perform tasks specific to one host.
  • Server JSVs are always executed last, after all client JSVs have been executed.
  • If you use a server JSV, then all qalter and qmon modification requests are rejected by the master daemon. As an administrator, you can configure the jsv_allowed_mod parameter in the global configuration to allow a set of switches to be used with submit clients.

Configuring JSVs

You can configure JSVs in up to five different locations. Each configured JSV will be executed by the Grid Engine system when a job is submitted, which means that a job can be verified up to five different times.

Generally, a client JSV, which consumes fewer cluster resources, will meet your computing needs. For more information on the differences between client JSVs and server JSVs, see Understanding the Differences Between Client JSVs and Server JSVs.

JSV Configuration Tasks

Task Interface Description
How to Configure a Client JSV CLI Learn how to configure a client JSV.
How to Configure a Server JSV CLI Learn how to configure a server JSV.

How to Configure a Client JSV From the Command Line

Generally, client JSVs will meet your cluster's needs. For more information on the differences between client JSVs and server JSVs, see Understanding the Differences Between Client JSVs and Server JSVs.
  1. Write a JSV script that meets your cluster's job verification needs.

  2. Configure a client JSV in one or more of the following locations:
    You may configure a client JSV in any or all of these locations. The only distinction is that the JSV configured in the last location (the global sge_request file) is always executed last among the client JSVs.
  • By adding the -jsv submit parameter to a submit client (qsub, qrsh, qsh, and qlogin only) specifying a jsv_url at the time of job submission.
    qsub -jsv <jsv_url>
    
  • By adding the -jsv switch to the $cwd/.sge_request file.
    For more information on sge_request file syntax, see the sge_request(5) man page.
  • By adding the -jsv switch to the $HOME/.sge_request file.
  • By adding the -jsv switch to the $SGE_ROOT/$SGE_CELL/common/sge_request file.
    Since this option modifies the global sge_request file, it must be configured by an administrator. This option is always executed after the first three options are executed by the Sun Grid Engine System.

If a server JSV is configured, it will be executed after all configured client JSVs have been executed. For more information, see the JSV Verification Process page.

For more information on -jsv, see the qsub(1) man page. For more information on jsv_url syntax, see the sge_types(1) man page.
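
As a rough illustration of the request-file options above, the following sketch appends a -jsv line to a default request file. The function name add_client_jsv and the script path in the usage comment are hypothetical; the only assumption about the file format is that each line holds submit options, as described in sge_request(5).

```shell
# add_client_jsv FILE JSV_URL   (hypothetical helper, not part of Grid Engine)
# Appends a "-jsv JSV_URL" line to a default request file,
# unless an identical line is already present.
add_client_jsv() {
    request_file=$1
    jsv_url=$2
    line="-jsv $jsv_url"
    # -F fixed string, -x whole-line match, -q quiet, -e protects the leading dash
    if [ -f "$request_file" ] && grep -Fxq -e "$line" "$request_file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$request_file"
}

# Example use (path is a placeholder):
#   add_client_jsv "$HOME/.sge_request" /path/to/my_jsv.sh
```

The same helper could be pointed at $SGE_ROOT/$SGE_CELL/common/sge_request by an administrator; the JSV configured there is the one that runs last among the client JSVs.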


How to Configure a Server JSV From the Command Line

Unless you need to write a script that requires administrative privileges, it generally makes sense to use client JSVs for your verification needs because they will consume less of your cluster's resources. For more information, see Understanding the Differences Between Client JSVs and Server JSVs.
  1. Write a JSV script that meets your cluster's job verification needs.

  2. Include a jsv_url in the global configuration.
    For more information on the cluster configuration parameters, see the sge_conf(5) man page. For more information on jsv_url syntax, see the sge_types(1) man page.

A configured server JSV will always be executed after all configured client JSVs are executed.
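
The sketch below shows one way to prepare the jsv_url line for the global configuration. It assumes the server form of jsv_url is script:[user@]path; treat that syntax as an assumption and confirm it in sge_types(1). The helper name build_jsv_url is hypothetical, and the script path is borrowed from the verification example elsewhere in this guide.

```shell
# build_jsv_url PATH [USER]   (hypothetical helper)
# Prints a jsv_url in the assumed form script:[user@]path.
build_jsv_url() {
    path=$1
    user=${2:-}
    if [ -n "$user" ]; then
        printf 'script:%s@%s\n' "$user" "$path"
    else
        printf 'script:%s\n' "$path"
    fi
}

# Print the line to place in the global configuration.
printf 'jsv_url %s\n' "$(build_jsv_url /sge_root/jsvC.sh)"

# Then, as a manager, open the global configuration in an editor
# and add or change that line:
#   qconf -mconf global
```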


Writing JSV Scripts

JSV scripts can be written in any scripting language, including Unix shells, Perl or Tcl.

The following tools are available to assist you in writing a JSV script:

  • The JSV Script Functions are available in Bourne shell, Tcl or Perl scripts after sourcing the files jsv_include.sh, jsv_include.tcl, or JSV.pm.
  • The files and corresponding JSV script templates are located in the directory $SGE_ROOT/util/resource/jsv/.

To maximize the performance of your script, the following suggestions should be considered:

  • Use a scripting language that supports precompilation or other performance improvement methods during runtime.
  • Use a scripting language that avoids forking processes. Startup of additional applications is expensive and will slow down JSV scripts heavily.
  • Avoid accessing files and other input/output devices in your scripts.

Performance Considerations for Server JSV Scripts

Since server JSV instances are executed by the master daemon, it is important to choose a scripting language that minimizes the resources consumed from the master daemon process. Simple scripting languages, like the Bourne shell, execute many external commands to perform very basic operations. If such a language is used in a server JSV script, the job acceptance rate could decrease by more than 90%. By contrast, more efficient scripting languages, like Perl and Tcl, should impact the job acceptance rate by less than 10% with a well-designed JSV script.

Performance Considerations for Client JSV Scripts

A client-side JSV script is started whenever a user submits a job. If a series of job submissions is made from the same terminal or script, it is important to use an efficient scripting language, such as Perl or Tcl, to minimize the script's impact on the client submit rate.
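
To make the structure of a JSV script concrete, here is a minimal Bourne shell sketch that rejects binary jobs and accepts everything else. The function names jsv_on_start, jsv_on_verify, jsv_get_param, jsv_accept, jsv_reject, and jsv_main are written as I recall them from the shipped jsv_include.sh; check them against the JSV Script Functions page before relying on them. The helper decide_on_binary is hypothetical and exists only so that the decision logic can be exercised without a Grid Engine installation.

```shell
#!/bin/sh

# Hypothetical helper: prints "reject" for binary jobs (b = y), "accept" otherwise.
decide_on_binary() {
    if [ "$1" = "y" ]; then
        echo reject
    else
        echo accept
    fi
}

# Invoked once when the JSV starts; nothing to prepare in this sketch.
jsv_on_start() {
    return
}

# Invoked for each job that has to be verified.
jsv_on_verify() {
    # The b parameter is only sent when its value is y, so an empty value means "script job".
    if [ "$(decide_on_binary "$(jsv_get_param b)")" = "reject" ]; then
        jsv_reject "Binary jobs are not allowed."
    else
        jsv_accept "Job is accepted."
    fi
    return
}

# Load the helper functions and enter the main loop only when the include file exists,
# so the file can also be sourced on a machine without Grid Engine.
if [ -r "${SGE_ROOT:-}/util/resource/jsv/jsv_include.sh" ]; then
    . "${SGE_ROOT}/util/resource/jsv/jsv_include.sh"
    jsv_main
fi
```

Note that this sketch forks subshells for command substitution, which is exactly the overhead the performance notes above warn about; it is meant for learning and for client-side testing rather than as a high-throughput server JSV.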

Topic Description
JSV Script Functions A list of JSV script functions and parameters.
Example - Writing a JSV Script Using the Bourne Shell A hypothetical example of how a JSV script can be written using the Bourne shell.
Example - Writing a JSV Script Using Tcl A hypothetical example of how a JSV script can be written using Tcl.





Submit Parameters

The JSV submit parameters communicated in the JSV Communication Protocol via the PARAM command are initially specified at the command line during the time of job submission or by a corresponding functionality in QMON. The parameters are almost identical to the switches used by qsub. The values for these parameters are described below:

Parameter Description
a The value of the a parameter defines or redefines the time and date at which a job is eligible for execution. If the -a command line switch or the corresponding value in QMON was specified during the submission of the job, then the parameter named a will be sent to configured JSV instances. The value for this parameter has the format [CCYY]MMDDhhmm.SS. This parameter can be changed by using JSV scripts.
ac The value for the ac parameter represents the job context. The context can be specified with the qsub command line switches -ac, -dc and -sc or corresponding functionality in QMON. The outcome of the evaluation of all -ac, -dc, and -sc options or corresponding values in QMON are passed to defined JSV instances as parameters with the name ac. The value for this attribute is a comma separated list of variable/value pairs (ac_list).
ar The value of the ar parameter contains the advance reservation ID. Jobs can be assigned to Advance Reservations using the -ar command line switch or corresponding functionality in QMON.
A The value of the A parameter identifies the account to which the resource consumption for the job should be charged. If this option or a corresponding value in QMON is specified, then this value will be passed to defined JSV instances as a parameter with the name A.
b The value of the b parameter indicates explicitly whether the command should be treated as a binary or a script. The value specified with this option or the corresponding value specified by QMON will only be passed to defined JSV instances if the value is yes or y.
c_interval c_occasion The value of the qsub -c parameter defines or redefines whether the job should be checkpointed, and, if so, what the circumstances are. The value specified with this option or the corresponding value specified in QMON will be passed to defined JSV instances. The interval will be available as a parameter with the name c_interval. The character sequence specified will be available as a parameter with the name c_occasion. Please note that changing c_interval will overwrite previous settings of c_occasion and vice versa.
ckpt The value of the ckpt parameter selects the checkpointing environment. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name ckpt.
cwd The value of the cwd parameter is the absolute path to the current working directory. JSV scripts can remove the path from jobs during the verification process by setting the value of this parameter to an empty string. As a result, the job behaves as if -cwd was not specified during job submission. If this option or a corresponding value in QMON is specified (and a script does not remove the path), then this value will be passed to all defined JSV instances as parameters with the name cwd.
dc The dc parameter removes the given variables from the job's context. The outcome of the evaluation of all -ac, -dc, and -sc options or corresponding values in QMON are passed to defined JSV instances as parameters with the name ac.
display The display parameter directs xterm to use display_specifier to contact the X server. If this option or a corresponding value in QMON is specified, then this value will be passed to defined JSV instances as a parameter with the name display. This value will also be available in the job environment, which can be passed to JSV scripts.
dl The dl parameter specifies the deadline initiation time. The format for the date_time value is [CCYY]MMDDhhmm.SS. If this option or a corresponding value in QMON is specified, then this value will be passed to defined JSV instances as a parameter with the name dl.
e The e parameter defines or redefines the path used for the standard error stream of the job. If this option or a corresponding value in QMON is specified, then this value will be passed to defined JSV instances as a parameter with the name e.
h The h parameter places holds on a job. If this option is specified with qsub or during job submission in QMON, then the parameter h with the value u will be passed to the defined JSV instances indicating that the job will be in user hold after the submission finishes.
hold_jid The hold_jid parameter defines or redefines the job dependency list of the submitted job. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name hold_jid.
hold_jid_ad The hold_jid_ad parameter defines or redefines the job array dependency list of the submitted job. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name hold_jid_ad.
i The i parameter defines or redefines the file used for the standard input stream of the job. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name i.
j The j parameter specifies whether or not the standard error stream of the job is merged into the standard output stream. The value specified with this option or the corresponding value specified in QMON will only be passed to defined JSV instances if the value is yes or y. The name of the parameter will be j.
js The js parameter defines or redefines the job share of the job relative to other jobs. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as parameters with the name js.
l_hard l_soft The qsub -l parameter launches the job in a Grid Engine queue meeting the given resource request list. If this option or a corresponding value in QMON is specified, then these hard and soft resource requirements will be passed to defined JSV instances as parameters with the names l_hard and l_soft. If regular expressions are used for resource requests, then these expressions will be passed as they are. Also, shortcut names will not be expanded.
m The m parameter defines or redefines under which circumstances mail is to be sent to the job owner or to the users defined with the -M option described below. If this option or a corresponding value in QMON is specified, then this value will be passed to defined JSV instances as a parameter with the name m.
M The M parameter defines or redefines the list of users to which the server that executes the job has to send mail. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name M.
masterq The masterq parameter defines or redefines a list of cluster queues, queue domains, and queue instances that may be used to become the master queue of this parallel job. If this option or a corresponding value in QMON is specified, then this hard resource requirement will be passed to defined JSV instances as a parameter with the name masterq.
notify The notify parameter sends warning signals to a running job prior to sending the signals themselves. This option provides the running job a configured time interval to do cleanup operations. If this option is used, the parameter named notify with the value y will be passed to defined JSV instances.
now The now parameter tries to start the job immediately or not at all. The value specified with this option or the corresponding value specified in QMON will only be passed to defined JSV instances if the value is yes. The name of the parameter will be now. The value will also be y when the long form yes was specified during submission.
N The N parameter provides the name of the job. The value specified with this option or the corresponding value specified in QMON will be passed to defined JSV instances as the parameter with the name N.
o The o parameter provides the path used for the standard output stream for the job. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name o.
p The p parameter defines or redefines the priority of the job relative to other jobs. If this option or a corresponding value in QMON is specified and the priority is not 0, then this value will be passed to all defined JSV instances as a parameter with the name p.
pe_name pe_min pe_max The qsub -pe parameter specifies a parallel environment and a slot or slot range n-m. If this option or a corresponding value in QMON is specified, then the parameters pe_name, pe_min, and pe_max will be passed to all configured JSV instances. The values pe_min and pe_max represent the values n and m, which have been provided with the -pe option. A missing specification of m will have the value 9999999 in JSV scripts, which represents infinity.
P The P parameter specifies the project to which the job is assigned. If this option or a corresponding value in QMON is specified, then this value will be passed to defined JSV instances as a parameter with the name P.
q_hard q_soft The qsub -q parameter defines or redefines a list of cluster queues, queue domains or queue instances that may be used to execute this job. If this option or a corresponding value in QMON is specified, then these hard and soft resource requirements will be passed to defined JSV instances as parameters with the names q_hard and q_soft. If regular expressions are used for resource requests, then these expressions will be passed as they are. Also, shortcut names will not be expanded.
R The R parameter indicates whether a reservation for this job should be done. The value specified with this option or the corresponding value specified in QMON will only be passed to defined JSV instances if the value is yes. The name of the parameter is R. The value is y also when the long form yes was specified during submission.
r The r parameter identifies the ability of a job to be rerun or not. The value specified with this option or the corresponding value specified in QMON will only be passed to defined JSV instances if the value is yes. The name of the parameter will be r. The value will be y also when the long form yes was specified during submission.
sc The sc parameter sets the given name/value pairs as the job's context. The outcome of the evaluation of all -ac, -dc, and -sc options or corresponding values in QMON is passed to all defined JSV instances as parameters with the name ac.
shell The shell parameter determines whether or not to use a command shell. The value specified with this option or the corresponding value specified in QMON will only be passed to defined JSV instances if the value is yes. The name of the parameter will be shell. The value will be y also when the long form yes was specified during submission.
soft The soft parameter signifies that all resource requirements following in the command line will be soft requirements. If this option or a corresponding value in QMON is specified, then the corresponding -q and -l resource requirements will be passed to all defined JSV instances as parameters with the names q_soft and l_soft. For more information, see descriptions for -q and -l.
S The S parameter specifies the interpreting shell for the job. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as a parameter with the name S.
t The t parameter submits an array job. If this option or a corresponding value in QMON is specified, then this value will be passed to all defined JSV instances as parameters with the names t_min, t_max, and t_step.
v The v parameter defines or redefines the environment variables to be exported to the execution context of the job. All environment variables specified with -v, -V, or the DISPLAY variable provided with -display are exported to defined JSV instances only when this is requested explicitly during the job submission verification.
V The V parameter specifies that all environment variables active within the submit client are exported to the context of the job. All environment variables specified with -v, -V, or the DISPLAY variable provided with -display are exported to defined JSV instances only when this is requested explicitly during the job submission verification.
wd The wd parameter executes the job from the directory specified in the working_directory. The parameter value will be available in defined JSV instances as a parameter with the name cwd.

For detailed information on JSV submit parameters, see the qsub(1) man page.
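
The mapping from the -pe switch to the pe_name, pe_min, and pe_max parameters can be sketched as follows. The function name pe_params is hypothetical. The sketch handles only the forms n, n-m, and n-; that a single value n yields pe_min and pe_max both equal to n is an assumption, and other range forms accepted by qsub are not covered.

```shell
# pe_params PE_NAME RANGE   (hypothetical helper)
# Prints the JSV parameters that correspond to "-pe PE_NAME RANGE".
pe_params() {
    name=$1
    range=$2
    case $range in
        *-*)
            min=${range%%-*}          # text before the first dash
            max=${range#*-}           # text after the first dash
            # An open-ended range "n-" has no m; JSVs see 9999999 (infinity).
            [ -n "$max" ] || max=9999999
            ;;
        *)
            # Assumption: a single slot count sets both bounds.
            min=$range
            max=$range
            ;;
    esac
    printf 'pe_name=%s pe_min=%s pe_max=%s\n' "$name" "$min" "$max"
}

pe_params mype 4-8
pe_params mype 4-
```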


Pseudo Parameters

Parameter Description
CLIENT The corresponding value for the CLIENT parameter is either qmaster or the command name of a submit client like qsub, qsh, qrsh, or qlogin and so on. This parameter value can't be changed by JSV instances. CLIENT is always sent as part of a job verification.
CMDARGS CMDARGS displays a count of the number of arguments that are passed to a job script or command. CMDARGS is always sent during a job verification. There is also a CMDARG<id> for each argument, where <id> is the zero-based position of the argument, starting with CMDARG0. If no arguments are being passed to the job script or command, CMDARGS will have the value 0. To change an argument, you must change its CMDARG<id> value. To add an argument, add a new CMDARG<id>. For example, if your job is myjob.sh test 4, then CMDNAME is myjob.sh, CMDARG0 is test, and CMDARG1 is 4. To change myjob.sh test 4 to myjob.sh test 5, you must set CMDARG1 to 5. To change myjob.sh test 4 to myjob.sh test 5 true, you must also set CMDARG2 to true.
Note
Currently, there is an unresolved issue that prevents users from removing arguments.
CMDNAME Either the path to the script or the command name in the case of binary submission. It will always be sent as part of a job verification.
CONTEXT The value is client if the JSV that receives this PARAM command was started by a command line client like qsub, qsh, qrsh, or qlogin and so on. It is master if the JSV was started by the master daemon process. It will always be sent as part of a job verification. Changing the value of this parameter is not possible within JSV instances.
GROUP This is the primary group of the user who is submitting the job to be verified. It cannot be changed, and it is always sent as part of the verification process. The group name is passed as a parameter with the name GROUP.
JOB_ID Not available in the client context (see CONTEXT). Otherwise it contains the job number of the job which will be submitted to Grid Engine when the verification process is successful. JOB_ID is an optional parameter which can't be changed by JSV instances.
USER This is the username of the user who is submitting the job to be verified. It cannot be changed, and it is always sent as part of the verification process. The username is passed as a parameter with the name USER.
VERSION VERSION will always be sent as part of a job verification process and it will always be the first parameter that is sent. It will contain a version number of the format <major>.<minor>. In version 6.2u2 and higher, the value will be 1.0. The value of this parameter can't be changed.
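
The relationship between a job's command line and the CMDNAME, CMDARGS, and CMDARG<id> pseudo parameters can be illustrated with a short stand-alone sketch. The function show_cmd_params is hypothetical and only prints the values a JSV would be expected to see; it does not talk to Grid Engine.

```shell
# show_cmd_params COMMAND [ARG...]   (hypothetical helper)
# Prints the pseudo parameters that describe the given job command line.
show_cmd_params() {
    printf 'CMDNAME=%s\n' "$1"
    shift
    # After the shift, $# is the number of job arguments.
    printf 'CMDARGS=%s\n' "$#"
    i=0
    for arg in "$@"; do
        printf 'CMDARG%s=%s\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# The example from the table: myjob.sh test 4
show_cmd_params myjob.sh test 4
```

For myjob.sh test 4 this prints CMDNAME=myjob.sh, CMDARGS=2, CMDARG0=test, and CMDARG1=4, one per line.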

JSV Verification Process

After you have configured the necessary JSV(s) and you submit a job, the verification process begins. All configured JSVs are executed in the order described below. Each configured client JSV instance communicates directly with the client process and each configured server JSV instance communicates directly with the master daemon process. For more information on how JSVs communicate with the client process or the master daemon, see the JSV Communication Protocol page. The verification process is described below using an example.

Order that the System Executes JSVs

Note
Only administrators can configure client JSVs in the global request file or server JSVs in the global configuration. These JSVs are always executed last.

Sun Grid Engine executes JSV instances in the following order:

Order of Execution | JSV Type | Configuration Location | JSV Trigger | Submitted By
1 | client JSV | command line | -jsv jsv_url used with a submit client (qsub, qrsh, qsh, and qlogin) | any user
2 | client JSV | $cwd/.sge_request file | -jsv jsv_url | any user
3 | client JSV | $HOME/.sge_request file | -jsv jsv_url | any user
4 | client JSV | $SGE_ROOT/$SGE_CELL/common/sge_request file | -jsv jsv_url | administrator or user with administrative privileges
5 | server JSV | global configuration | jsv_url parameter | administrator or user with administrative privileges

Duration of the JSV Verification Process

The verification process stops when one of the following occurs:

  • All configured JSVs deliver an accept result.
  • All configured JSVs deliver either an accept or accept with correction result.
  • One JSV delivers a reject or a reject later result.

Example – JSV Verification Process

For the purposes of this example, the following JSVs are configured:

Order of Execution | JSV Type | Configuration Location | JSV Trigger | Submitted By | Description
1 | client JSV | $HOME/.sge_request file | -jsv /home/Mike/jsvA.sh | any user | The script /home/Mike/jsvA.sh adds user Mike to each job specification (using -l attr=$USER) before it is accepted into the Grid Engine system.
2 | client JSV | $SGE_ROOT/$SGE_CELL/common/sge_request file | -jsv /sge_root/jsvB.sh | administrator or user with administrative privileges | The script /sge_root/jsvB.sh rejects binary jobs (submitted with -b y). All other jobs are accepted.
3 | server JSV | global configuration | jsv_url /sge_root/jsvC.sh | administrator or user with administrative privileges | The script /sge_root/jsvC.sh accepts all jobs that don't contain an h_vmem resource request.
Scenario A
  1. A job is submitted using qsub script.sh.
  2. The first JSV instance, which runs the /home/Mike/jsvA.sh script, changes the job submission from qsub script.sh to qsub -l attr=Mike script.sh. The job is then passed to the second JSV instance.
  3. The second JSV instance, which runs the /sge_root/jsvB.sh script, accepts the job because it is not a binary job. The job is then passed to the third JSV instance.
  4. The third JSV instance, which runs the /sge_root/jsvC.sh script, also accepts the job because it has no h_vmem in its specification.
  5. The job is created.
Scenario B
  1. A job is submitted using qsub -b y command.
  2. The first JSV instance, which runs the /home/Mike/jsvA.sh script, changes the job specification from qsub -b y command to qsub -l attr=Mike -b y command. The job is then passed to the second JSV instance.
  3. The second JSV instance, which runs the /sge_root/jsvB.sh script, rejects the job because it is a binary job. The verification process stops.
  4. The user who submitted the job receives an error message.
Scenario C
  1. A job is submitted using qsub -l attr=PETER,h_vmem=3G script.sh.
  2. The first JSV instance, which runs the /home/Mike/jsvA.sh script, changes the job specification from qsub -l attr=PETER,h_vmem=3G script.sh to qsub -l attr=Mike,h_vmem=3G script.sh. The job is then passed to the second JSV instance.
  3. The second JSV instance, which runs the /sge_root/jsvB.sh script, accepts the job because it is not a binary job. The job is then passed to the third JSV instance.
  4. The third JSV instance, which runs the /sge_root/jsvC.sh script, rejects the job because it contains h_vmem.
  5. The user who submitted the job receives an error message.
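
The three scenarios can be replayed with a small stand-alone simulation. Everything in this sketch is hypothetical: the functions jsv_a, jsv_b, and jsv_c merely imitate the behavior described for jsvA.sh, jsvB.sh, and jsvC.sh, and the job is reduced to two shell variables. It demonstrates the ordering and the stop-on-first-reject rule, not the real JSV protocol.

```shell
# The simulated job is held in two variables:
#   JOB_BINARY  "y" for a binary job (-b y), otherwise empty
#   JOB_L_HARD  comma-separated hard resource request list (-l)

# Imitates jsvA.sh: forces attr=Mike, replacing any existing attr request.
jsv_a() {
    new="attr=Mike"
    old_ifs=$IFS
    IFS=,
    for item in $JOB_L_HARD; do
        case $item in
            attr=*) ;;                    # drop the old attr request
            *) new="$new,$item" ;;
        esac
    done
    IFS=$old_ifs
    JOB_L_HARD=$new
}

# Imitates jsvB.sh: fails (rejects) for binary jobs.
jsv_b() {
    [ "$JOB_BINARY" != "y" ]
}

# Imitates jsvC.sh: fails (rejects) when h_vmem is requested.
jsv_c() {
    case ",$JOB_L_HARD" in
        *,h_vmem=*) return 1 ;;
    esac
    return 0
}

# verify_job BINARY_FLAG L_HARD: runs the chain in order and stops at the first reject.
verify_job() {
    JOB_BINARY=$1
    JOB_L_HARD=$2
    for jsv in jsv_a jsv_b jsv_c; do
        if ! "$jsv"; then
            echo "rejected by $jsv"
            return 1
        fi
    done
    echo "accepted: -l $JOB_L_HARD"
}

verify_job ""  ""                      # Scenario A
verify_job "y" ""                      # Scenario B
verify_job ""  "attr=PETER,h_vmem=3G"  # Scenario C
```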

Configuring Resource Attributes

Resource attribute definitions can be associated with a queue, a host, or the entire cluster. A set of default resource attributes is already attached to each queue and host. Default resource attributes are built in to the system and cannot be deleted, nor can their type be changed. These resource attribute definitions are stored in an entity called the Grid Engine system complex and make up the complex configuration.

The complex configuration provides all pertinent information about the resource attributes that users can request for jobs with the qsub -l or qalter -l commands. The complex configuration also provides information about how the Grid Engine system should interpret these resource attributes.

Configuring the Complex

Task User Interface Description
How to Configure the Complex CLI or QMON Learn how to display, add, modify, and delete currently-configured complex resource attributes. Although you can define complex resource attributes from the command line, it is easier to use the QMON Complex Configuration dialog box.

Adding Resource Attributes to the Complex

Topic Description
Adding Resource Attributes to the Complex Learn how to add resource attributes to the complex.

Assigning Resource Attributes to Queues, Hosts or to the Global Complex

Topic Description
Assigning Queue Resource Attributes Learn about how to assign resource attributes to queues.
Assigning Host Resource Attributes Learn about how to assign resource attributes to hosts.
Assigning Global Resource Attributes Learn about how to assign global resource attributes.

Defining Consumable Resources

Topic Description
Defining Consumable Resources Learn how to define consumable resources, which provide an efficient way to manage limited resources.

How to Configure the Complex From the Command Line

To configure the complex from the command line, use the following arguments for the qconf command:

  • To display the current complex configuration, type the following command:
    qconf -sc
    

    The -sc option prints the current complex configuration to the standard output stream in the file format defined in the complex(5) man page. A sample output is shown in the following example.

    #name      shortcut  type  relop  requestable   consumable  default  urgency
    #---------------------------------------------------------------------------
    nastran    na        INT   <=     YES           NO          0        0
    pam-crash  pc        INT   <=     YES           YES         1        0
    permas     pm        INT   <=     FORCED        YES         1        0
    #---- # start a comment but comments are not saved across edits -----------
    
  • To modify the complex configuration, type the following command:
    qconf -mc 
    

    The -mc option opens an editor that contains either a template file for the complex configuration or an existing complex configuration file that you can modify. The changed complex configuration is then registered with sge_qmaster. You must have root or manager privileges to run this command.

  • To modify the complex configuration from a file, type the following command:
    qconf -Mc <filename>
    

    The -Mc option takes a complex configuration file as an argument. The argument file must comply with the format specified in the complex(5) man page. You must have root or manager privileges to run this command.

See the qconf(1) man page for a detailed definition of the qconf command format and the valid syntax.
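
Before loading a file with qconf -Mc, it can help to check that every entry has the eight columns shown in the sample output above. The following sketch does only that; the function name check_complex_file and the file path in the usage comment are hypothetical, and the check says nothing about whether the individual values are valid.

```shell
# check_complex_file FILE   (hypothetical helper)
# Exits non-zero if any non-comment, non-blank line does not have exactly
# 8 whitespace-separated columns (name, shortcut, type, relop, requestable,
# consumable, default, urgency).
check_complex_file() {
    awk '
        BEGIN { bad = 0 }
        /^[[:space:]]*#/ || NF == 0 { next }   # skip comments and blank lines
        NF != 8 {
            printf "line %d: expected 8 columns, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Possible workflow (commands left as comments so nothing is changed by accident):
#   qconf -sc > /tmp/complex.txt        # dump the current complex configuration
#   ...edit /tmp/complex.txt...
#   check_complex_file /tmp/complex.txt && qconf -Mc /tmp/complex.txt
```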


How to Configure the Complex With QMON

  1. In the QMON Main Control window, click the Complex Configuration button.
    The Complex Configuration dialog box appears as shown in the following figure.

    The Complex Configuration dialog box enables you to add, modify, or delete complex resource attributes. See the complex(5) man page for details about the meaning of the rows and columns in the table.

  2. To add a new attribute, follow these steps:
    1. Make sure that no line in the Attributes table is selected.
      To deselect a highlighted attribute, hold down the Control key and click mouse button 1.
    2. In the fields above the Attributes table, type or select the values that you want.
    3. Click the Add button.
      Tip
      You can add a new attribute by copying an existing attribute and then modifying it. Make sure that the attribute name and its shortcut are unique.


  3. To modify an attribute, follow these steps:
    1. Select the attribute in the table.
      The values of the selected attribute are displayed above the Attributes table.
    2. Change the attribute values.
    3. Click the Modify button.

  4. To save configuration changes to a file, click Save.

  5. To load values from a file into the complex configuration, click Load, and select the name of a file from the list.

  6. To delete an attribute, select the attribute in the Attributes table, and click Delete.

  7. To register your new or modified complex configuration with sge_qmaster, click Commit.

Adding Resource Attributes to the Complex

By adding resource attributes to the complex, the administrator can extend the set of attributes managed by the Grid Engine system. The administrator can also restrict the influence of user-defined attributes to particular queues, hosts, or both.

User-defined attributes are a named collection of attributes with the corresponding definitions as to how the Grid Engine software is to handle these attributes. You can attach one or more user-defined attributes to a queue, to a host, or globally to all hosts in the cluster. Use the complex_values parameter for the queue configuration and the host configuration. For more information, see Configuring Queues and Configuring Hosts. The attributes defined become available to the queue and to the host, respectively, in addition to the default resource attributes.

The complex_values parameter in the queue configuration and the host configuration must set concrete values for user-defined attributes that are associated with queues and hosts.

For example, say the user-defined resource attributes permas and pam-crash, shown in the following figure, are defined.

For one or more queues, add the resource attributes to the list of associated user-defined attributes as shown in the Complex tab of the Modify queue-name dialog box. For details on how to configure queues, see Configuring Queues and its related sections.

The displayed queue is configured to manage up to 10 licenses of the software package permas as shown in the following figure.
[Figure: Modify <queue-name> dialog box showing the Complex tab with the parameters you can set.]

The attribute permas becomes requestable for jobs, as expressed in the Available Resources list in the Requested Resources dialog box shown below.
[Figure: Requested Resources dialog box showing the lists of requested resources for jobs.]

Consequently, the only eligible queues for these jobs are the queues that are associated with the user-defined resource attributes and that have permas licenses configured and available.

For details about how to submit jobs, see Submitting Jobs.

Alternatively, the user could submit jobs from the command line and could request attributes as follows:

% qsub -l pm=1 permas.sh
Tip
You can use the pm shortcut instead of the full attribute name permas.


Assigning Host Resource Attributes

The default host-related attributes are load values. As an administrator, you can add new resource attributes to the default attributes. Every execution daemon periodically reports load to the master daemon. The reported load values are either the standard load values such as the CPU load average, or the load values defined by the administrator, as described in Configuring Load Parameters.

The definitions of the standard load values are part of the default host resource attributes, whereas administrator-defined load values require extending the host resource attributes.

Host-related attributes are commonly extended to include nonstandard load parameters. Host-related attributes are also extended to manage host-related resources such as the number of software licenses that are assigned to a host, or the available disk space on a host's local file system.

If host-related attributes are associated with a host or with a queue instance on that host, a concrete value for a particular host resource attribute is determined by one of the following items:

  • The queue configuration, if the attribute is also assigned to the queue configuration
  • A reported load value
  • The explicit definition of a value in the complex_values entry of the corresponding host configuration. For details, see Configuring Hosts.

In some cases, none of these values is available. For example, say the value is supposed to be a load parameter, but sge_execd does not report a load value for the parameter. In such cases, the attribute is not defined, and the qstat -F command shows that the attribute is not applicable.

For example, the total free virtual memory attribute h_vmem is defined in the queue configuration as a limit and is also reported as a standard load parameter. The total available amount of virtual memory on a host can be defined in the complex_values list of that host. The total available amount of virtual memory attached to a queue instance on that host can be defined in the complex_values list of that queue instance. If you also define h_vmem as a consumable resource, you can efficiently exploit the memory of a machine without risking memory over-subscription, which often results in reduced system performance that is caused by swapping. For more information about consumable resources, see Defining Consumable Resources.

Note
Only the Shortcut, Relation, Requestable, Consumable, and Default columns can be changed for the default resource attributes. No default attributes can be deleted.

Assigning Global Resource Attributes

Global resource attributes are cluster-wide resource attributes, such as available network bandwidth of a file server or the free disk space on a network-wide available file system.

Global resource attributes can also be associated with load reports if the corresponding load report contains the GLOBAL identifier, as described in Configuring Load Parameters. Global load values can be reported from any host in the cluster. No global load values are reported by default, so there are no default global resource attributes.

Concrete values for global resource attributes are determined by the following items:

  • Global load reports.
  • Explicit definition in the complex_values parameter of the global host configuration. See Configuring Hosts.
  • Association with a particular host or queue, together with an explicit definition in the corresponding complex_values list.

Sometimes none of these cases apply. For example, a load value might not yet be reported. In such cases, the attribute does not exist.


Defining Consumable Resources

Introducing Consumable Resources

Consumable resources, or consumables, provide an efficient way to manage limited resources. The complex also builds the framework for the system's consumable resources facility. The resource attributes that are defined in the complex can be attached to the global cluster, to a host, or to a queue instance. The attached attribute identifies a resource with the associated capability. Attribute definitions in the Grid Engine complex define how resource attributes should be interpreted. During the scheduling process, the availability of resources and the job requirements are taken into account. The Grid Engine system also performs the bookkeeping and the capacity planning that is required to prevent over-subscription of consumable resources.

Typical consumable resources include:

  • Available free memory
  • Unoccupied software licenses
  • Free disk space
  • Available bandwidth on a network connection

The definition of a resource attribute includes the following:

  • Name of the attribute
  • Shortcut to reference the attribute name
  • Value type of the attribute, for example, STRING, RESTRING, TIME, or any other complex(5) type
  • Relational operator used by the scheduler
  • Requestable flag, which determines whether users can request the attribute for a job
  • Consumable flag, which identifies the attribute as a consumable resource
  • Default request value that is taken into account for consumable attributes if jobs do not explicitly specify a request for the attribute
  • Urgency value, which determines job priorities on a per resource basis

Defining Consumable Resources

To enable consumable resource management, you must define the total capacity of a resource. You can define resource capacity globally for the cluster, for specified hosts, and for specified queues. These categories can supersede each other in the given order. Thus a host can restrict availability of a global resource, and a queue can restrict host resources and global resources.

The consumption of the resource is then monitored by Grid Engine software internal bookkeeping. Jobs are dispatched only if the internal bookkeeping indicates that sufficient consumable resources are available.

You define resource capacities by using the complex_values attribute in the queue and host configurations. The complex_values definition of the global host specifies global cluster consumable settings. For more information, see the host_conf(5) and queue_conf(5) man pages, as well as Configuring Queues and Configuring Hosts.

To each consumable attribute in a complex_values list, a value is assigned that denotes the maximum available amount for that resource. The internal bookkeeping subtracts from this total the assumed resource consumption by all running jobs as expressed through the jobs' resource requests.
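
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Grid Engine scheduler's implementation; the function names and the sample figures (a capacity of 1000 Mbytes and three running jobs) are hypothetical.

```python
def remaining_capacity(capacity, running_requests):
    """Capacity left after debiting the request of every running job.

    The debit is based on what jobs requested, not on what they actually use.
    """
    return capacity - sum(running_requests)


def can_dispatch(capacity, running_requests, new_request):
    """Return True if the new request fits into the remaining capacity."""
    return new_request <= remaining_capacity(capacity, running_requests)


# Hypothetical host whose complex_values entry sets a capacity of 1000 Mbytes,
# with three running jobs that requested 300, 200, and 400 Mbytes.
running = [300, 200, 400]
print(remaining_capacity(1000, running))  # 100
print(can_dispatch(1000, running, 100))   # True
print(can_dispatch(1000, running, 150))   # False
```

Because the debit uses requests rather than measured usage, the result is only as reliable as the requests that jobs make.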

Multiplied Resource Requests Versus Non-Multiplied Resource Requests

By default, Sun Grid Engine performs multiplied resource requests, which means that a consumable resource request is multiplied by the number of slots allocated to a parallel job. The configuration for multiplied resource requests is designated by a YES flag in the consumable column of the jobs row in the complex definition.

The following multiplied resource request is explained below:

qsub -l mem=100M -pe make 8

Sun Grid Engine multiplies the consumable resource request (100 Mbytes) by the number of slots allocated for the parallel job (8). The consumable usage is split across the queues and hosts on which the job runs. If four tasks run on host A and four tasks run on host B, the job consumes 400 Mbytes on each host.

While multiplied resource requests typically work well, in the case of software licenses, it is more practical to make a per job request, or a non-multiplied resource request, which debits the exact amount requested. Starting in Sun Grid Engine 6.2u2, you can configure the complex to accept non-multiplied resource requests by changing the jobs consumable flag from YES to JOB, as shown below:

#name   shortcut   type   relop   requestable   consumable   default   urgency 
#-----------------------------------------------------------------------------
jobs       j        INT    <=          YES           JOB        0        0 

For more on the complex configuration, see the complex(5) man page.
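
The difference between the two consumable flags comes down to simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch only computes the amounts debited; it makes no claim about how Grid Engine implements the accounting internally or where a per-job debit is booked.

```python
def total_debit(request, slots, consumable_flag):
    """Amount debited for one job, depending on the consumable flag."""
    if consumable_flag == "YES":
        # Multiplied request: one debit per allocated slot.
        return request * slots
    if consumable_flag == "JOB":
        # Non-multiplied request: the exact amount requested, once per job.
        return request
    raise ValueError("expected consumable flag YES or JOB")


def per_host_debit(request, slots_by_host):
    """Split a multiplied request across the hosts on which the job runs."""
    return {host: request * slots for host, slots in slots_by_host.items()}


# A request of 100 (Mbytes) for a parallel job with 8 slots.
print(total_debit(100, 8, "YES"))             # 800
print(total_debit(100, 8, "JOB"))             # 100
print(per_host_debit(100, {"A": 4, "B": 4}))  # {'A': 400, 'B': 400}
```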

Examples – Defining Consumable Resources

Only numeric attributes can be configured as consumables. Numeric attributes are attributes whose type is INT, DOUBLE, MEMORY, or TIME.

In the QMON Main Control window, click the Complex Configuration button. The Complex Configuration dialog box appears.

To enable the consumable management for an attribute, set the Consumable flag for the attribute in the complex configuration.

To set up other consumable resources, follow these examples:


Example 1 - Floating Software License Management

Suppose you are using the software package pam-crash in your cluster, and you have access to 10 floating licenses. You can use pam-crash on every system as long as no more than 10 invocations of the software are active. The goal is to configure the Grid Engine system to prevent scheduling pam-crash jobs while all 10 licenses are occupied by other running pam-crash jobs.

With consumable resources, you can achieve this goal easily. First you must add the number of available pam-crash licenses as a global consumable resource to the complex configuration, as shown in the following figure.
Dialog box titled Complex Configuration. Shows pam-crash resource attribute definition.

In the figure above:

  • The name of the consumable attribute is set to pam-crash. You can use pc as a shortcut in the qalter -l, qselect -l, qsh -l, qstat -l, or qsub -l commands instead of the full name.
  • The attribute type is defined to be an integer counter.
  • The Requestable flag is set to FORCED. The user must request how many pam-crash licenses a job will occupy when the job is submitted.
  • The Consumable flag specifies that the attribute is a consumable resource.
  • The Default setting is irrelevant because Requestable is set to FORCED, which means that a request value must be received for this attribute with any job.

Consumables receive their value from the global, host, or queue configurations through the complex_values lists. See the host_conf(5) and queue_conf(5) man pages, as well as Configuring Queues and Configuring Hosts.

To activate resource planning for this attribute and for the cluster, the number of available pam-crash licenses must be defined in the global host configuration, as shown in the following figure.

Dialog box titled Add/Modify Exec Host. Shows Consumables/Fixed Attributes tab with pam-crash value definition. Shows OK and Cancel buttons.

In this figure, the value for the attribute pam-crash is set to 10, corresponding to 10 floating licenses.

Note
The table Consumables/Fixed Attributes corresponds to the complex_values entry that is described in the host configuration file format host_conf(5).

Assume that a user submits the following job:

% qsub -l pc=1 pam-crash.sh

The job starts only if fewer than 10 pam-crash licenses are currently occupied. The job can run anywhere in the cluster, however, and the job occupies 1 pam-crash license throughout its run time.

One of the hosts in your cluster might not be suitable for inclusion in the floating license pool. For example, you might not have pam-crash binaries for that host. In such a case, you can exclude the host from the pam-crash license management by setting the host's capacity for the consumable attribute pam-crash to zero. To exclude the host, use the Execution Host tab of the Host Configuration dialog box as shown in the following figure.
Dialog box titled Add/Modify Exec Host. Shows Consumables/Fixed Attributes tab with pam-crash value definition. Shows OK and Cancel buttons.

Note
The pam-crash attribute is implicitly available to the execution host because the global attributes of the complex are inherited by all execution hosts. Instead of setting the capacity to zero, you could also restrict the number of licenses that a host can manage to a nonzero value such as two. In this case, a maximum of two pam-crash jobs could coexist on that host.

Similarly, you might want to prevent a certain queue from running pam-crash jobs. For example, the queue might be an express queue with memory and CPU-time limits not suitable for pam-crash. In this case, set the corresponding capacity to zero in the queue configuration, as shown in the following figure.
Dialog box titled Modify <queue-name>. Shows Complex tab with pam-crash value definition.

Note
The pam-crash attribute is implicitly available to the queue because the global attributes of the complex are inherited by all queues.
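
The way the global, host, and queue capacities combine in this example can be modeled roughly as follows. This Python sketch is an assumption-laden illustration rather than Grid Engine code: the function name is hypothetical, None stands for a level that has no complex_values entry for pam-crash, and the figure of seven licenses in use is invented for the example.

```python
def fits(request, remaining_by_level):
    """Check a request against the global, host, and queue levels in turn.

    Each entry is the remaining capacity at one level. None means that the
    level defines no capacity of its own and therefore adds no restriction.
    """
    return all(
        remaining is None or request <= remaining
        for remaining in remaining_by_level
    )


GLOBAL_FREE = 10 - 7  # 10 floating licenses, 7 assumed to be in use

print(fits(1, [GLOBAL_FREE, None, None]))  # True: ordinary host and queue
print(fits(1, [GLOBAL_FREE, 0, None]))     # False: host capacity set to zero
print(fits(1, [GLOBAL_FREE, 2, 0]))        # False: queue capacity set to zero
print(fits(1, [0, None, None]))            # False: all licenses are occupied
```

A capacity of zero at the host or queue level excludes that host or queue even while global licenses remain free.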

Example 2 - Space Sharing for Virtual Memory

Administrators must often tune a system to avoid performance degradation caused by memory over-subscription, and consequently swapping of a machine. The Grid Engine software can support you in this task through the Consumable Resources facility.

The standard load parameter virtual_free reports the available free virtual memory, that is, the combination of available swap space and the available physical memory. To avoid swapping, the use of swap space must be minimized. In an ideal case, all the memory required by all processes running on a host should fit into physical memory.

The Grid Engine software can guarantee the availability of required memory for all jobs started through the Grid Engine system, given the following assumptions and configurations:

  • virtual_free is configured as a consumable resource, and its capacity on each host is set to the available physical memory, or lower.
  • Jobs request their anticipated memory usage, and the value that jobs request is not exceeded during run time.

An example of a possible virtual_free resource definition is shown in the Complex Configuration Dialog Box: virtual_free. A corresponding execution host configuration for a host with one Gbyte of main memory is shown in Add-Modify Exec Host: virtual_free.

In the virtual_free resource definition example, the Requestable flag is set to YES instead of to FORCED, as in the example of a global configuration. This means that users need not indicate the memory requirements of their jobs. The value in the Default field is used if an explicit memory request is missing. The value of 1 Gbyte as default request in this case means that a job without a request is assumed to occupy all available physical memory.

Note
virtual_free is one of the standard load parameters of the Grid Engine system. The additional availability of recent memory statistics is taken into account automatically by the system in the virtual memory capacity planning. If the load report for free virtual memory falls below the value obtained by Grid Engine software internal bookkeeping, the load value is used to avoid memory over-subscription. Differences in the reported load values and the internal bookkeeping can occur easily if jobs are started without using the Grid Engine system.

If you run different job classes with different memory requirements on one machine, you might want to partition the memory that these job classes use. This functionality is called space sharing. You can accomplish this functionality by configuring a queue for each job class. Then you assign to each queue a portion of the total memory on that host.

In the example, the queue configuration attaches half of the total memory that is available on host carc to the queue fast.q for the host carc. Hence the accumulated memory consumption of all jobs that are running in queue fast.q on host carc cannot exceed 500 Mbytes. Jobs in other queues are not taken into account. Nonetheless, the total memory consumption of all running jobs on host carc cannot exceed 1 Gbyte.
Dialog box titled Modify <queue-name>. Shows Complex tab with virtual_free memory definition.

Note
The attribute virtual_free is available to all queues through inheritance from the complex.

Users might submit jobs to a system configured similarly to the example in either of the following forms:

% qsub -l vf=100M honest.sh
% qsub dont_care.sh

The job submitted by the first command can be started as soon as at least 100 Mbytes of memory are available. This amount is taken into account in the capacity planning for the virtual_free consumable resource. The second job runs only if no other job is on the system, as the second job implicitly requests all the available memory. In addition, the second job cannot run in the queue fast.q because the job exceeds the queue's memory capacity.
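
The outcome for the two submissions can be checked with a small Python model of the configuration in this example. The constants mirror the text (1 Gbyte on host carc, 500 Mbytes for fast.q, and a default request of 1 Gbyte, all expressed in round Mbytes); the function names and the amounts already in use are hypothetical, and the model is a sketch, not the scheduler's algorithm.

```python
HOST_CAPACITY = 1000    # Mbytes of virtual_free on host carc (1 Gbyte)
FAST_Q_CAPACITY = 500   # Mbytes attached to fast.q on host carc
DEFAULT_REQUEST = 1000  # Mbytes assumed when a job makes no request


def effective_request(requested):
    """Use the default request when the job did not specify one."""
    return DEFAULT_REQUEST if requested is None else requested


def can_start(requested, host_used, queue_capacity=None, queue_used=0):
    """Check the host capacity and, if one is given, the queue capacity."""
    need = effective_request(requested)
    if need > HOST_CAPACITY - host_used:
        return False
    if queue_capacity is not None and need > queue_capacity - queue_used:
        return False
    return True


# honest.sh requests 100 Mbytes and fits once 100 Mbytes are free.
print(can_start(100, host_used=900))   # True
# dont_care.sh implicitly requests 1 Gbyte, so it needs an empty host.
print(can_start(None, host_used=100))  # False
print(can_start(None, host_used=0))    # True
# dont_care.sh can never run in fast.q, which holds only 500 Mbytes.
print(can_start(None, host_used=0, queue_capacity=FAST_Q_CAPACITY))  # False
```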


Example 3 - Managing Available Disk Space

Some applications need to manipulate huge data sets stored in files. Such applications therefore depend on the availability of sufficient disk space throughout their run time. This requirement is similar to the space-sharing of available memory, as discussed in the preceding example. The main difference is that the Grid Engine system does not provide free disk space as one of its standard load parameters. Free disk space is not a standard load parameter because disks are usually partitioned into file systems in a site-specific way. Site-specific partitioning does not allow identifying the file system of interest automatically.

Nevertheless, available disk space can be managed efficiently by the system through the consumable resources facility. You should use the host resource attribute h_fsize for this purpose.

First, the attribute must be configured as a consumable resource, as shown in the following figure.
Dialog box titled Complex Configuration. Shows h_fsize attribute definition.

In the case of local host file systems, a reasonable capacity definition for the disk space consumable can be put in the host configuration, as shown in the following figure.
Dialog box titled Add/Modify Exec Host. Shows h_vmem and h_fsize attribute values. Shows OK and Cancel buttons.

Submission of jobs to a Grid Engine system that is configured as described here works similarly to the previous examples:

% qsub -l hf=5G big-sort.sh

The h_fsize attribute is recommended because h_fsize also is used as the hard file size limit in the queue configuration. The file size limit restricts the ability of jobs to create files that are larger than what is specified during job submission. The qsub command in this example specifies a file size limit of 5 Gbytes. If the job does not request the attribute, the corresponding value from the queue configuration or host configuration is used. If the Requestable flag for h_fsize is set to FORCED in the example, a request must be included in the qsub command. If the Requestable flag is not set, a request is optional in the qsub command.

By using the queue limit as the consumable resource, you control requests that the user specifies instead of the real resource consumption by the job scripts. Any violation of the limit is sanctioned, which eventually aborts the job. The queue limit ensures that the resource requests on which the Grid Engine system internal capacity planning is based are reliable. See the queue_conf(5) and the setrlimit(2) man pages for details.

Note
Some operating systems provide only per-process file size limits. In this case, a job might create multiple files with a size up to the limit. On systems that support per-job file size limitation, the Grid Engine system uses this functionality with the h_fsize attribute. See the queue_conf(5) man page for further details.

You might want applications that are not submitted to the Grid Engine system to occupy disk space concurrently. If so, the internal bookkeeping might not be sufficient to prevent application failure due to lack of disk space. To avoid this problem, you can periodically receive statistics about disk space usage. These statistics indicate the total disk space consumption, including any space that is consumed outside of the Grid Engine system.

The load sensor interface enables you to enhance the set of standard load parameters with site-specific information, such as the available disk space on a file system. See Adding Site-Specific Load Parameters for more information.

By adding an appropriate load sensor and reporting free disk space for h_fsize, you can combine consumable resource management and resource availability statistics. The Grid Engine system compares job requirements for disk space with the available capacity and with the most recent reported load value. Available capacity is derived from the internal resource planning. Jobs get dispatched to a host only if both criteria are met.
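
As a rough illustration, a load sensor that reports free disk space for h_fsize might look like the Python sketch below. It assumes the usual load sensor protocol, which is covered in Adding Site-Specific Load Parameters rather than in this section: the sensor waits for a line on standard input, exits on quit, and otherwise prints a report of host:name:value lines between begin and end. All function names, the monitored path, and the use of the local host name are assumptions. The last function models the "both criteria" check described above and is not Grid Engine code.

```python
import shutil
import socket


def format_load_report(host, values):
    """Build one load report block in the assumed begin/end format."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines)


def free_disk_report(path="/"):
    """Report the free space of the file system holding path as h_fsize."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    return format_load_report(socket.gethostname(), {"h_fsize": f"{free_mb}M"})


def run_sensor(stdin, stdout, make_report=free_disk_report):
    """Answer each input line with a report until the line quit arrives."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(make_report() + "\n")
        stdout.flush()


def can_dispatch(request, bookkept_free, reported_free):
    """A job fits only if both the bookkeeping and the load value allow it."""
    return request <= bookkept_free and request <= reported_free
```

A script that is registered as a load sensor would call run_sensor(sys.stdin, sys.stdout) after importing sys. With 10 Gbytes free according to the bookkeeping but only 4 Gbytes reported by the sensor, can_dispatch(5, 10, 4) returns False, so a job that requests 5 Gbytes stays pending.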


Backing Up and Restoring System Configuration

You can back up your Grid Engine system configuration files automatically. The automatic backup process uses a configuration file called backup_template.conf. The backup configuration file is located by default in $SGE_ROOT/util/install_modules/backup_template.conf.

The backup configuration file must define the following elements:

  • The Grid Engine system root directory ($SGE_ROOT).
  • The Grid Engine system cell directory ($SGE_CELL).
  • The Grid Engine system backup directory.
  • Type of backup. Your backup can be just the Grid Engine system configuration files, or the backup can be a compressed tar file that contains the configuration files.
  • The file name of the backup file.

The backup template file looks like the following example:

##################################################
# Autobackup Configuration File Template
##################################################

# Please, enter your $SGE_ROOT here (mandatory)
SGE_ROOT=""

# Please, enter your $SGE_CELL here (mandatory)
SGE_CELL=""

# Please, enter your Backup Directory here
# After backup you will find your backup files here (mandatory)
# The autobackup will add a time /date combination to this dirname
# to prevent an overwriting!
BACKUP_DIR=""

# Please, enter true to get a tar/gz package
# and false to copy the files only (mandatory)
TAR="true"

# Please, enter the backup file name here. (mandatory)
BACKUP_FILE="backup.tar" 

To start the automatic backup process, type the following command on the sge_qmaster host:

inst_sge -bup -auto <backup-conf>

backup-conf is the full path to the backup configuration file.

Note
You do not need to shut down any of the Grid Engine system daemons before you back up your configuration files.

Your backup is created in the directory specified by BACKUP_DIR. A backup log file called install.pid is also created in this directory, where pid is the process ID number.

Topic Description
How to Perform a Manual Backup Learn how to perform a manual backup.
How to Restore from a Backup Learn how to restore from a backup.

How to Perform a Manual Backup

  1. Type the following command to start a manual backup:
    inst_sge -bup
    


  2. Enter the $SGE_ROOT directory or use the default.
    SGE Configuration Backup
    ------------------------
    
    This feature does a backup of all configuration you made
    within your cluster.
    Please enter your $SGE_ROOT directory.
    Default: [/home/user/ts/u10]
    


  3. Enter the $SGE_CELL name or use the default.
    Please enter your $SGE_CELL name. Default: [default]
    


  4. Enter the backup destination directory or use the default.
    Where do you want to save the backup files?
    Default: [/home/user/ts/u10/backup]
    


  5. Choose whether to create a compressed tar backup file.
    Caution
    If you create a compressed tar file, use the same tar binary to pack and unpack the files. Using different tar versions (for example, GNU tar and Solaris tar) might result in corrupt tar packages.
    Shall the backup function create a compressed tar package with your files? (y/n) [y] >>
    


  6. Enter the file name of the backup file or use the default.
    ... starting with backup
    
    Please enter a filename for your backupfile. Default: [backup.tar] >>
    

    Once the filename is specified, the backup process completes. Output similar to the following is displayed.

    2007-01-11_22_43_22.dump
    bootstrap
    qtask
    settings.sh
    act_qmaster
    sgemaster
    settings.csh
    sgeexecd
    jobseqnum
    
    ... backup completed
    All information is saved in
    [/home/user/ts/u10/backup/backup.tar.gz[Z]]
    

How to Restore From a Backup

Caution
Shut down the qmaster daemon before you start the restore process. During the restore process, the spooling database is changed. If the qmaster and restore processes try to access the same data concurrently, data loss might result.
  1. Type the following command to start the restore process:
    inst_sge -rst 
    


  2. Read the messages on the screen and press Return.
    SGE Configuration Restore
    -------------------------
    
    This feature restores the configuration from a backup you mad