Sun Grid Engine 6.2 Release Notes
New Features

Multi-Clustering With Service Domain Manager

The Service Domain Manager (SDM) is a completely new module that extends the scope of Sun Grid Engine to a new domain of use cases. The SDM module distributes resources between different services according to configurable Service Level Objectives (SLOs), as measured by Key Performance Indicators (KPIs) of the managed resource. Manageable resources can include almost anything, such as physical or virtual hosts or software licenses. In the context of SDM, a service is defined as a scalable and manageable software product that performs a specific function for its users. Through service adapters, you can add virtually any type of service, such as a Grid Engine cluster, an application server farm, or some other type of Grid middleware application. The first release of SDM includes one service adapter that enables you to manage two or more Grid Engine clusters.

The supported platforms for the Service Domain Manager (SDM) are:
In addition, Service Domain Manager requires the following software
For more information, see Service Domain Manager (SDM).

Improved Scalability and Job Throughput

In Sun Grid Engine 6.2, scalability and job throughput have been substantially improved. Massively parallel jobs that span thousands of hosts now start much faster and more robustly. The scheduler now runs as a thread in the qmaster daemon, which enables faster job starts and improved job throughput. Enhancements to the communications protocol and architectural changes to the request handling in the qmaster daemon now enable Sun Grid Engine to effectively operate clusters that have more than 60,000 CPU cores.

Advance Reservations

Sometimes you might need to reserve a particular resource so that the resource is available for a specific purpose. Sun Grid Engine now provides advance reservation capabilities for this purpose. See Managing Advance Reservations.

New Support for Interactive Jobs

The mechanism that starts interactive jobs (submitted by qrsh or qlogin) has been redesigned. Instead of starting the system's or a user-defined mechanism (such as "rlogind" and "rlogin") and delegating the forwarding of input and output to it, Sun Grid Engine now transfers input and output data over Grid Engine itself. This redesign provides the following benefits:
Unlike SSH, this mechanism does not provide special features, such as X forwarding. Therefore, you can still configure a user-defined mechanism to provide these features. For more information, see the following sections in the sge_conf(5) man page: qlogin_daemon, qlogin_command, rlogin_daemon, rlogin_command, rsh_daemon, and rsh_command.

Multi-Cluster Support for Accounting and Reporting Console

If you have multiple Sun Grid Engine clusters, you can log in to one instance of ARCo from which you can run reports on all ARCo instances that use the same database vendor and structure. With the ARCo multi-cluster support, one dbwriter instance connected to one database per qmaster is still required, but a single reporting installation is sufficient for all qmasters. You can use one database with separate schemas for each cluster, separate databases on one database server for each cluster, or separate databases on separate database servers for each cluster. The reporting tool enables you to run the same queries on separate clusters using the configured database connections.

Solaris SMF Support

Service Management Facility (SMF) is a new feature in Solaris 10 that provides a unified model for service management. The new version of Sun Grid Engine can run the former daemons as SMF services, which greatly improves their availability and reliability with less effort from the administrator.

Sun Service Tags Support

Service Tags is a new Sun Connection technology for automatic discovery of systems, software, and services. The new Service Tags integration makes it possible to discover systems that are running Sun Grid Engine 6.2 software.

Ability to Request the Master and Slave Queues for Parallel Jobs

In previous releases of Sun Grid Engine, you could not request a master queue independently from hard and soft queue requests for slave tasks of parallel jobs.
The following command would not schedule a job because it requested distinct queues for the master and slave tasks:

% qsub -pe make 2 -masterq q1,q2 -q q3 script.sh

Similarly, the following job would have been scheduled but would not have considered the queue q3:

% qsub -pe make 2 -masterq q1,q2 -soft -q q3 script.sh

In Sun Grid Engine 6.2, the scheduler considers the resource requests for the master and slave queues independently. For more information, see the qsub(1) man page.

New Unix Resource Limits Support

The resource limits "descriptors", "locks", "maxproc" and "memorylocked" can now be set in the execd_params of the global or host configuration. See the sge_conf(5) man page for details.

New Upgrade Procedure

A new upgrade procedure has been introduced with the 6.2 release. It introduces one new upgrade method. You can now also upgrade to a different $SGE_ROOT or $SGE_CELL and transfer the old configuration to this cluster. The upgrade procedure supports upgrading from 6.0u2 or later releases. For more information, see Upgrading Sun Grid Engine Software.

Supported Platforms and Patches

Changes to Platform Support

The following operating platforms are no longer supported in Grid Engine 6.2:
The following operating platforms are newly supported in Grid Engine 6.2:
See the complete list of supported platforms.

Required Patches for Solaris 8 and 9

If you use any release of Solaris 8 or Solaris 9, you must have the following patches installed to run Grid Engine:
Changes to DRMAA Support

In Sun Grid Engine 6.2, DRMAA no longer supports the following versions of the language bindings:
For more information about the Distributed Resource Management Application API, see Automating Grid Engine Functions Through DRMAA.

Known Issues and Limitations

Known Limitations of Sun Grid Engine 6.2 Software

reporting_param log_consumables

This parameter controls the writing of consumable resources to the reporting file. When log_consumables is set to true, all consumable resources (their current usage and their capacity) are written to the reporting file whenever a consumable resource changes in definition or capacity, or when the usage of a consumable resource changes. When log_consumables is set to false (the default), only those variables that are configured in the report_variables of the exec host configuration are written to the reporting file. For more information about report_variables, see the host_conf(5) man page. For Sun Grid Engine versions before 6.2, the default was log_consumables=true. To limit the amount of data being written to the reporting file and into the ARCo database, use the new default log_consumables=false.

Incomplete Accounting Record for a Tightly Integrated Parallel Job

Sometimes, the accounting record for a completed parallel job is missing data. This error has been a known issue since the release of Sun Grid Engine 6.0. However, due to the improved performance in the protocol between sge_qmaster and the execution daemons in Sun Grid Engine 6.2, it is now much more likely to occur. To work around this issue, set the global configuration parameter load_report_time to a value shorter than the shortest runtime of a parallel task. This ensures that sge_qmaster gets at least one load report from this task and then waits for the final load report before it closes the job account.

Missing OpenMotif Library for QMON on Mac OS X

The default Mac OS X installation does not include the OpenMotif library that QMON needs.
You can get the OpenMotif library for the PowerPC and x86 architectures from various web sites, such as http://www.ist-inc.com/DOWNLOADS/openmotif_download.html. You can also find information about how to install packages that have been ported to Mac OS X at http://www.macports.org.

LD_LIBRARY_PATH Settings and DRMAA

When you use Java bindings with DRMAA, verify that the LD_LIBRARY_PATH is set correctly.
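As a rough illustration of the LD_LIBRARY_PATH check above, the following sketch defines a small helper that adds a library directory to LD_LIBRARY_PATH only if it is not already listed. The function name prepend_libdir is hypothetical, and the assumption that the DRMAA shared library lives under $SGE_ROOT/lib/$SGE_ARCH should be checked against your own installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): put a directory at the
# front of LD_LIBRARY_PATH unless it is already listed there.
prepend_libdir() {
    libdir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$libdir:"*)
            # Directory is already on the path; leave the value unchanged.
            ;;
        *)
            # Prepend, adding a ':' separator only if the path was non-empty.
            LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# Example call (assumed library location, adjust to your site):
#   prepend_libdir "$SGE_ROOT/lib/$SGE_ARCH"
```

Calling the helper before you start the Java application keeps the setting idempotent, so sourcing the same profile script several times does not grow the path.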
Stack Size for IBM AIX and HP/UX 11

The stack size for sge_qmaster should be set to 16 MBytes. sge_qmaster might not run with the default values for stack size on the following architectures: IBM AIX and HP/UX 11.

File Descriptor Limit for Master Host

You should set a high file descriptor limit in the kernel configuration on hosts that are designated to run the sge_qmaster daemon. You might want to set a high file descriptor limit on the shadow master hosts as well. A large number of available file descriptors enables the communication system to keep connections open instead of having to constantly close and reopen them. If you have many execution hosts, a high file descriptor limit significantly improves performance. Set the file descriptor limit to a number that is higher than the number of intended execution hosts. You should also make room for concurrent client requests, in particular for jobs submitted with qsub -sync or for DRMAA sessions that maintain a steady communication connection with the master daemon. Refer to your operating system documentation for information about how to set the file descriptor limit.

Limiting the Number of Dynamic Client Events

The number of concurrent dynamic event clients is limited by the number of file descriptors. The default is 99. Dynamic event clients are jobs submitted with the qsub -sync command and DRMAA sessions. You can limit the number of dynamic event clients with the qmaster_params global cluster configuration setting. Set this parameter to MAX_DYN_EC=n. See the sge_conf(5) man page for more information.

Berkeley DB Requires That the Database Files Reside on the Local Disk in Certain Situations

Berkeley DB requires that the database files reside on the local disk unless qmaster runs on Solaris 10 and uses an NFSv4 mount. (Fully NFSv4-compliant clients and servers from other vendors are also supported, but have not yet been tested.)
If the sge_qmaster cannot be run on the file server intended to store the spooling data (for example, if you want to use the shadow master facility), a Berkeley DB RPC server can be used. The RPC server runs on the file server and connects with the Berkeley DB sge_qmaster instance. However, Berkeley DB's RPC server uses an insecure protocol for this communication and so it presents a security problem. Do not use the RPC server method if you are concerned about security at your site. Use sge_qmaster local disks for spooling instead and, for fail-over, use a high availability solution such as Sun Cluster, which maintains host local file access in the fail-over case.

Busy QMON With Large Array Task Numbers

If large array task numbers are used, you should use "compact job array display" in the QMON Job Control dialog box customization. Otherwise, the QMON GUI causes high CPU load and performs poorly.

Resource Reservation Only Considers Pending Jobs

Resource reservation currently takes only pending jobs into account. Consequently, jobs that are in a hold state due to the submit options -a time and -hold_jid joblist, and are thus not pending, do not get reservations. Such jobs are treated as if the -R n submit option were specified for them.

Log Files for Automatic Installation

The automatic installation option does not provide full diagnostic information in case of installation failures. If the installation process aborts, check for the presence and the contents of an installation log file.

Configuring Spooling on IBM AIX and HP/UX 11 Systems

On IBM AIX and HP/UX 11 systems, two different binaries are provided for sge_qmaster, spooldefaults, and spoolinit. One of these binaries is for the Berkeley DB spooling method, the other binary is for the classic spooling method. The names of these binaries are binary.spool_db and binary.spool_classic.
# cd $SGE_ROOT/bin/$SGE_ARCH
# rm sge_qmaster
# ln -s sge_qmaster.spool_classic sge_qmaster
# cd $SGE_ROOT/utilbin/$SGE_ARCH
# rm spooldefaults spoolinit
# ln -s spooldefaults.spool_classic spooldefaults
# ln -s spoolinit.spool_classic spoolinit

Automatic Installation Requires Access Without Typing the Password

For a fully-featured automatic installation (not using CSP), you must grant the root user permission to log in remotely through rsh or ssh without being asked for a password. This enables the installation script to start the installation on the remote hosts. If this is not configured correctly, you have to log into each execution host and manually execute the automatic installation using the following command:

inst_sge -x -auto <conf-file> -noremote

Known Limitations of Accounting and Reporting Console

PDF Exports in ARCo Require Lots of Memory

Huge reports can result in an OutOfMemoryException when they are exported to PDF.
# wcadmin add -p -a reporting java.options="... -Xmx512M ..."
# smcwebserver restart

ARCo Platform Support

The ARCo module is available only for the Solaris Sparc, Solaris Sparc 64 bit, Solaris x86, Solaris x64, Linux x86, and Linux 64 bit kernels.

Known Limitations of Service Domain Manager Software

Do Not Install the rpm Package Using the --prefix Option

When you use the rpm package for installing the Service Domain Manager binaries, the --prefix option is ignored. If you want to install the binaries to a location other than the default, /opt/sdm, use the --relocate option. For example:

rpm --install --relocate=/opt=/newlocation1/newlocation2 sun-sdm-core-1.0-0.noarch.rpm

Do Not Use the shutdown_service -fr Command When You Shut Down the Last Running Service

The shutdown_service command stops a service. This command has an optional -fr flag. If this flag is given, the service that is shut down returns all non-static resources to the system, which allows other services to use these resources. If the -fr flag is not specified, resources remain assigned to the service. If you shut down the last running service using the -fr flag, the resources of that service are returned to the system. However, those resources cannot be used by any other service again. If you shut down the last service, use the shutdown_service command without the -fr flag.

# sdmadm shutdown_service -s mysystem
Installation of Master Host on Mac OS X Issue

If you install the SDM master host on a filesystem that is not case sensitive on Mac OS X, the installation fails. The easiest way to resolve the issue is to use a filesystem that is case sensitive. If you cannot do that easily, follow these steps:
Executor could become unresponsive while assigning or unassigning a resource

Under certain circumstances, a thread in the executor component can hang, and a process started by the executor does not finish. From the user's perspective, the problem appears as a resource stuck in the ASSIGNING state for an unreasonably long time (a resource that is stuck in the ASSIGNING state for more than 5 minutes is most probably affected by this bug). Given the nature of the bug, a similar problem may occur when execd is being uninstalled: the resource is stuck in the UNASSIGNING state for an unreasonably long time. If you suspect that you have hit this bug, you can verify and resolve it using the following steps:

1. Confirm that the resource being added to or removed from the GE adapter has been stuck in the ASSIGNING or UNASSIGNING state for an unreasonably long time (more than 5 minutes).
2. Run ps -ef | grep java on the resource host. The output reveals at least two java processes: the first is the JVM that holds the executor component, and the second is the install or uninstall task started by the executor component.
3. Run kill -9 proc_num, where proc_num is the process ID of the process that was started by the executor and revealed in step 2. The executor then unfreezes and the resource state changes to ASSIGNED or UNASSIGNED.

Known Limitations and Workarounds for the Microsoft Windows Platform
Using Sun Grid Engine 6.2 with an Existing Grid Engine Cluster

You can install the Sun Grid Engine 6.2 software in an environment that has an existing Grid Engine cluster. To run the 6.2 software in parallel with an existing Grid Engine environment, follow these rules:
Comments (2)
Aug 21, 2008
rasialdo says:
New Known Limitation of Service Domain Manager
Known Limitations of Service Domain Manager Software
Do not install "rpm" package using "--prefix"
When you use the "rpm" package to install the Service Domain Manager binaries, the option "--prefix" is ignored.
If you want to install the binaries to a location other than the default one ("/opt/sdm"), please use the option "--relocate".
For example:
Oct 13, 2008
surajp says:
I have fixed the comment. Please review the changes.