Troubleshooting
Problem: Execution host reports no load
Description: An execution host is reporting no load. You can use qstat -f or qhost to find out which execution host is reporting no load.
- If you use qstat -f, you will get sample output similar to what is shown below. In this example, exechostB is reporting no load, which is demonstrated by u in the state column and -NA- in load_avg column.
  queuename                      qtype resv/used/tot. load_avg arch          states
  ---------------------------------------------------------------------------------
  all.q@exechostA                BIPC  0/0/1          0.32     sol-sparc64
  ---------------------------------------------------------------------------------
  all.q@exechostB                BIPC  0/0/2          -NA-     sol-sparc64   au
- If you use qhost, you will get sample output similar to what is shown below. In this example, exechostB is reporting no load, which is demonstrated by - in the LOAD column.
  HOSTNAME   ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
  -------------------------------------------------------------------------------
  global     -            -     -     -       -       -       -
  exechostA  sol-sparc64  1     0.32  4.0G    3.4G    4.5G    0.0
  exechostB  sol-sparc64  2     -     8.0G    -       516.0M  -
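On a larger cluster it can be tedious to scan the qhost table by eye. The following is a minimal sketch, not part of Grid Engine, that filters qhost output for hosts showing - in the LOAD column. The function name hosts_without_load is hypothetical, and the sketch assumes the column order of the sample above (LOAD is the fourth column).

```shell
# Print the names of hosts whose LOAD column in qhost output is "-".
# Reads qhost output on standard input. Assumes the column layout of the
# sample above: HOSTNAME ARCH NCPU LOAD ...
hosts_without_load() {
    awk '
        $1 == "HOSTNAME" { next }      # skip the header line
        /^-+$/           { next }      # skip the dashed rule
        $1 == "global"   { next }      # the global pseudo-host shows no load
        $4 == "-"        { print $1 }  # LOAD column is empty for this host
    '
}
```

A typical invocation would be qhost | hosts_without_load, which for the sample above prints exechostB.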
This problem has several possible causes, each of which has a different solution.
- Possible cause: The execd daemon is not running on the host.
Solution: As the root user, start the execd daemon on the execution host by running the $SGE_ROOT/default/common/rcsge script.
- Possible cause: A default domain is incorrectly specified.
Solution: As the Grid Engine system administrator, run the qconf -mconf command and change the default_domain variable to none.
- Possible cause: The qmaster host sees the name of the execution host as different from the name that the execution host sees for itself.
Solution: Determine whether you are using DNS to resolve the host names of your compute cluster.
  - If you are using DNS, configure /etc/hosts and NIS to return the fully qualified domain name (FQDN) as the primary host name. You can still define and use the short name as an alias, for example, 168.0.0.1 myhost.dom.com myhost.
  - If you are not using DNS, make sure that all of your /etc/hosts files and your NIS table are consistent, for example, 168.0.0.1 myhost.corp myhost or 168.0.0.1 myhost.
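For the DNS case, one way to spot entries that do not list the FQDN first is to scan the hosts file for primary names without a domain part. This is only a sketch under that assumption: the function name hosts_missing_fqdn is hypothetical, and entries such as localhost will be reported too, so treat the output as a list of candidates to review.

```shell
# Report /etc/hosts-style entries whose primary name (the first name after
# the address) contains no dot, that is, is not fully qualified.
# Reads hosts-file lines on standard input.
hosts_missing_fqdn() {
    awk '
        { sub(/#.*/, "") }            # strip comments
        NF < 2     { next }           # skip blank or address-only lines
        $2 !~ /\./ { print $1, $2 }   # primary name has no domain part
    '
}
```

It could be run as hosts_missing_fqdn < /etc/hosts on the qmaster host and on each execution host, and the results compared.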
Problem: Pending Jobs Not Being Dispatched
Description: Sometimes a pending job is obviously capable of being run, but the job does not get dispatched.
Solution: To diagnose the reason, the Grid Engine system offers the following utilities and options:
- qalter -w v <job_id> – This command lists the reasons why a job is not dispatchable, in principle. For this purpose, a dry scheduling run is performed. All consumable resources, as well as all slots, are considered to be fully available for this job. Similarly, all load values are ignored because these values vary.
- qalter -w p <job_id> – This command lists the reasons why a job is not dispatchable at the moment. For this purpose, a dry scheduling run is performed in which all consumable resources currently occupied by running jobs are taken into account. The following example shows output for a job with the ID 242059:
  % qalter -w p 242059
  queue "fangorn.q" dropped because it is temporarily not available
  queue "lolek.q" dropped because it is temporarily not available
  queue "balrog.q" dropped because it is temporarily not available
  queue "saruman.q" dropped because it is full
  cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
  cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
  has no permission for host "ori"
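When many queues are listed, it can help to reduce the output to one reason per dropped queue. The sketch below assumes the message wording shown in the example above; the function name dropped_queues is hypothetical, and lines with other wording (such as the "cannot run in queue" lines) are skipped.

```shell
# Summarize `qalter -w p` output: for each dropped queue, print
# "<reason>: <queue>". Reads the command output on standard input and
# matches only lines of the form: queue "NAME" dropped because it is REASON
dropped_queues() {
    awk -F'"' '/^queue ".*" dropped because it is / {
        reason = $3
        sub(/^ dropped because it is /, "", reason)
        print reason ": " $2
    }'
}
```

For example, qalter -w p 242059 | dropped_queues | sort groups the queues that share a reason next to each other.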
Problem: Job or Queue Reported in Error State E
Description: Job or queue errors are indicated by an uppercase E in the qstat output. A job enters the error state when the Grid Engine system tries to run a job but fails for a reason that is specific to the job. A queue enters the error state when the Grid Engine system tries to run a job but fails for a reason that is specific to the queue.
Solution: The Grid Engine system offers a set of possibilities for users and administrators to gather diagnosis information in case of job execution errors. Both the queue and the job error states result from a failed job execution. The following diagnosis possibilities are applicable to both types of error states:
- User abort mail – If jobs are submitted with the qsub -m a command, abort mail is sent to the address specified with the -M user[@host] option. The abort mail contains diagnosis information about job errors. Abort mail is the recommended source of information for users.
- qacct accounting – If no abort mail is available, the user can run the qacct -j command. This command gets information about the job error from the Grid Engine system's job accounting function.
- Administrator abort mail – An administrator can order administrator mails about job execution problems by specifying an appropriate email address. See under administrator_mail on the sge_conf(5) man page. Administrator mail contains more detailed diagnosis information than user abort mail. Administrator mail is the recommended method in case of frequent job execution errors.
- Messages files – If no administrator mail is available, you should investigate the qmaster messages file first. You can find entries that are related to a certain job by searching for the appropriate job ID. In the default installation, the qmaster messages file is $SGE_ROOT/$SGE_CELL/spool/qmaster/messages.
You can sometimes find additional information in the messages of the execd daemon from which the job was started. Use qacct -j job-id to discover the host from which the job was started, and search in $SGE_ROOT/$SGE_CELL/spool/host/messages for the job ID.
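Searching a messages file for a job ID can be sketched as follows. The function name job_messages is hypothetical; the pattern matches the ID only as a whole number, so that a search for 242 does not also return lines about job 242059.

```shell
# Print the lines of a messages file that mention a job ID as a whole number.
# Usage: job_messages <messages-file> <job-id>
job_messages() {
    # The ID must not be preceded or followed by another digit.
    grep -E "(^|[^0-9])$2([^0-9]|\$)" "$1"
}
```

With the default installation path given above, a call might look like job_messages "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages" 242059.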
Problem: Your job script fails when you use the qsub command even though you can run it from the command line
Cause: Process limits might be set for your job. To test whether limits are being set, write a test script that runs the limit and limit -h commands. Run the script both interactively at the shell prompt and through the qsub command, and then compare the results.
Solution: Remove any commands in your configuration files that set limits in your shell.
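A test script of this kind might look like the sketch below. The limit and limit -h commands mentioned above are csh built-ins; this sketch instead uses ulimit, the Bourne-shell counterpart, on the assumption that your job script runs under bash or another shell that supports ulimit -a. The function name print_limits is hypothetical.

```shell
# Print the soft and hard process limits seen by the current shell.
# Uses `ulimit` in place of the csh built-ins `limit` and `limit -h`
# (an assumption about the shell the job runs under).
print_limits() {
    echo "== soft limits =="
    ulimit -S -a
    echo "== hard limits =="
    ulimit -H -a
}

print_limits
```

Saved under a name of your choice, the script can be run once at the prompt and once through qsub (for example with the -o and -j y options to collect the output in one file), and the two outputs compared with diff.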
Problem: qrsh does not seem to work at all
Description: Messages like the following are generated:
host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect: Interrupted system call
error: error reading return code of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$
Cause: Permissions for qrsh are not set properly.
Solution: Check the permissions of the following files, which are located in $SGE_ROOT/utilbin/. Note that rlogin and rsh must be setuid and owned by root.
-r-s--x--x  1 root      root     28856 Sep 18 06:00 rlogin*
-r-s--x--x  1 root      root     19808 Sep 18 06:00 rsh*
-rwxr-xr-x  1 sgeadmin  adm     128160 Sep 18 06:00 rshd*
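The setuid and ownership requirement can also be checked with a small helper. This is a sketch only: the function name is_setuid_root is hypothetical, and it tests just the two conditions named above (setuid bit set, owner is root).

```shell
# Return success only if the file is setuid and owned by root (UID 0).
# Usage: is_setuid_root <path>
is_setuid_root() {
    [ -u "$1" ] || return 1                      # setuid bit must be set
    owner=$(ls -lnd "$1" | awk '{ print $3 }')   # numeric UID of the owner
    [ "$owner" = "0" ]
}
```

Assuming the binaries sit in an architecture subdirectory of $SGE_ROOT/utilbin/, as the solaris64 path in the qrsh messages suggests, you could loop over "$SGE_ROOT"/utilbin/*/rlogin and "$SGE_ROOT"/utilbin/*/rsh and print every path for which is_setuid_root fails.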