Troubleshooting

Problem: Execution host reports no load

Description: An execution host is reporting no load. You can use qstat -f or qhost to find out which execution host is reporting no load.

  • If you use qstat -f, the output is similar to the following sample. In this example, exechostB is reporting no load, which is indicated by au in the states column (a for load alarm, u for unknown) and -NA- in the load_avg column.
    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    all.q@exechostA                BIPC  0/0/1          0.32     sol-sparc64
    ---------------------------------------------------------------------------------
    all.q@exechostB                BIPC  0/0/2          -NA-     sol-sparc64   au
    
  • If you use qhost, the output is similar to the following sample. In this example, exechostB is reporting no load, which is indicated by - in the LOAD column.
    HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global                  -               -     -       -       -       -       -
    exechostA               sol-sparc64     1  0.32    4.0G    3.4G    4.5G     0.0
    exechostB               sol-sparc64     2     -    8.0G       -  516.0M       - 
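The check described above can be automated. The following is a minimal sketch, assuming the qhost column layout shown in the sample (LOAD as the fourth column, two header lines); the function name no_load_hosts is illustrative and not part of Grid Engine.

```shell
# Print the names of hosts whose LOAD column is "-" in qhost-style output.
# Reads the qhost output on standard input.
# Assumption: two header lines, and LOAD is the fourth column.
no_load_hosts() {
    awk 'NR > 2 && $1 != "global" && $4 == "-" { print $1 }'
}

# Sample input copied from the qhost example; on a live cluster the
# assumed usage would be:  qhost | no_load_hosts
no_load_hosts <<'EOF'
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
exechostA               sol-sparc64     1  0.32    4.0G    3.4G    4.5G     0.0
exechostB               sol-sparc64     2     -    8.0G       -  516.0M       -
EOF
# prints: exechostB
```

The global pseudo-host is skipped explicitly because its LOAD column is always shown as -.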
    

This problem has several possible causes, each of which has a different solution.

  1. Possible cause: The execd daemon is not running on the host.
    Solution: As the root user, start the execd daemon on the execution host by running the $SGE_ROOT/default/common/rcsge script.

  2. Possible cause: A default domain is incorrectly specified.
    Solution: As the Grid Engine system administrator, run the qconf -mconf command and change the default_domain variable to none.

  3. Possible cause: The qmaster host resolves the execution host to a different name than the one the execution host uses for itself.
    Solution: Determine whether you are using DNS to resolve the host names of your compute cluster.
    • If you are using DNS, configure /etc/hosts and NIS to return the fully qualified domain name (FQDN) as the primary host name. You can still define and use the short alias name, for example, 168.0.0.1 myhost.dom.com myhost.
    • If you are not using DNS, make sure that all of your /etc/hosts files and your NIS table are consistent, for example, 168.0.0.1 myhost.corp myhost or 168.0.0.1 myhost.
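To compare what the hosts files return on the qmaster host and on the execution host, a small helper can print the primary name for an address. This is a sketch only: the function name primary_name is hypothetical, and it inspects a hosts-format file directly rather than querying DNS or NIS.

```shell
# Print the primary (first) host name that a hosts-format file maps to an
# IP address. Comment lines and trailing comments are ignored.
# Usage: primary_name <ip> [hosts_file]   (hosts_file defaults to /etc/hosts)
primary_name() {
    awk -v ip="$1" '
        { sub(/#.*/, "") }             # strip comments
        $1 == ip { print $2; exit }    # field 2 is the primary name
    ' "${2:-/etc/hosts}"
}

# Demonstration on a temporary file that uses the example entry from the text.
hosts_file=$(mktemp)
printf '%s\n' '# sample entry' '168.0.0.1 myhost.dom.com myhost' > "$hosts_file"
primary_name 168.0.0.1 "$hosts_file"    # prints: myhost.dom.com
rm -f "$hosts_file"
```

Running the same call on both hosts and comparing the results shows whether one side sees the FQDN and the other the short name.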

Problem: Pending Jobs Not Being Dispatched

Description: Sometimes a pending job is obviously capable of being run, but the job does not get dispatched.
Solution: To diagnose the reason, the Grid Engine system offers the following utilities and options:

  • qalter -w v <job_id> – This command lists the reasons why a job is not dispatchable, in principle. For this purpose, a dry scheduling run is performed. All consumable resources, as well as all slots, are considered to be fully available for this job. Similarly, all load values are ignored because these values vary.
  • qalter -w p <job_id> – This command lists the reasons why a job is not dispatchable at the moment. For this purpose, a dry scheduling run is performed that takes into account all consumable resources currently occupied by running jobs. The following example shows output for a job with the ID 242059:
    % qalter -w p 242059
    queue "fangorn.q" dropped because it is temporarily not available
    queue "lolek.q" dropped because it is temporarily not available
    queue "balrog.q" dropped because it is temporarily not available
    queue "saruman.q" dropped because it is full
    cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
    cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
    has no permission for host "ori" 
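When many queues are involved, the output of qalter -w p can be long. The following sketch groups the messages by reason so that the dominant cause stands out. The function name summarize_reasons is illustrative, and the grouping assumes messages of the form "... because <reason>" as in the example.

```shell
# Count how often each reason appears in "qalter -w p" style output.
# Lines of the form "... because <reason>" are grouped by <reason>;
# any other line is counted under its full text.
summarize_reasons() {
    sed -e 's/[[:space:]]*$//' -e '/^$/d' -e 's/^.* because //' |
        sort | uniq -c | sort -rn
}

# Sample input adapted from the example; on a live cluster the assumed
# usage would be:  qalter -w p <job_id> | summarize_reasons
summarize_reasons <<'EOF'
queue "fangorn.q" dropped because it is temporarily not available
queue "lolek.q" dropped because it is temporarily not available
queue "balrog.q" dropped because it is temporarily not available
queue "saruman.q" dropped because it is full
cannot run in queue "bilbur.q" because it is not contained in its hard queue list (-q)
cannot run in queue "dwain.q" because it is not contained in its hard queue list (-q)
has no permission for host "ori"
EOF
```

The most frequent reason is listed first, with its count in the first column.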
    

Problem: Job or Queue Reported in Error State E

Description: Job or queue errors are indicated by an uppercase E in the qstat output. A job enters the error state when the Grid Engine system tries to run a job but fails for a reason that is specific to the job. A queue enters the error state when the Grid Engine system tries to run a job but fails for a reason that is specific to the queue.
Solution: The Grid Engine system gives users and administrators several ways to gather diagnostic information about job execution errors. Both the queue and the job error states result from a failed job execution, so the following sources of diagnostic information apply to both types of error state:

  • User abort mail – If jobs are submitted with the qsub -m a command, abort mail is sent to the address specified with the -M user[@host] option. The abort mail contains diagnosis information about job errors. Abort mail is the recommended source of information for users.
  • qacct accounting – If no abort mail is available, the user can run the qacct -j command. This command gets information about the job error from the Grid Engine system's job accounting function.
  • Administrator abort mail – An administrator can order administrator mails about job execution problems by specifying an appropriate email address. See under administrator_mail on the sge_conf(5) man page. Administrator mail contains more detailed diagnosis information than user abort mail. Administrator mail is the recommended method in case of frequent job execution errors.
  • Messages files – If no administrator mail is available, you should investigate the qmaster messages file first. You can find entries that are related to a certain job by searching for the appropriate job ID. In the default installation, the qmaster messages file is $SGE_ROOT/$SGE_CELL/spool/qmaster/messages.
    You can sometimes find additional information in the messages file of the execd daemon that started the job. Use qacct -j job-id to discover the host on which the job was started, and search $SGE_ROOT/$SGE_CELL/spool/host/messages for the job ID, where host is the name of that execution host.
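Searching a messages file for a job ID with a plain substring match can return unrelated jobs (88 also matches 188). The sketch below matches the ID only as a whole number. The function name job_messages is hypothetical, and the demonstration uses made-up placeholder lines rather than real Grid Engine log text.

```shell
# Print the lines of a messages file that mention a job ID as a whole number,
# so that searching for 88 does not also match 188 or 880.
# Usage: job_messages <messages_file> <job_id>
job_messages() {
    grep -E "(^|[^0-9])$2([^0-9]|\$)" "$1"
}

# Demonstration on made-up placeholder lines (not real Grid Engine log text).
log=$(mktemp)
printf '%s\n' 'placeholder: job 88 failed' 'placeholder: job 188 finished' \
    'placeholder: job 880 finished' > "$log"
job_messages "$log" 88    # prints only the first placeholder line
rm -f "$log"
```

On a default installation, the assumed call would be job_messages "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages" 88. Numbers inside timestamps can still match by coincidence, so review the hits.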

Problem: Your job script fails when you use the qsub command even though you can run it from the command line

Cause: Process limits might be set for your job. To test whether limits are being set, write a test script that runs the limit and limit -h commands (csh built-ins; in Bourne-type shells such as bash, the equivalents are ulimit -a and ulimit -H -a). Run the script both interactively at the shell prompt and through qsub, and then compare the results.
Solution: Remove any commands in your configuration files that set limits in your shell.
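A test script of the kind described under Cause might look like the following. This is a sketch for Bourne-type shells that uses ulimit in place of the csh limit built-in; the function name print_limits is illustrative.

```shell
# Print the soft and hard process limits of the current shell.
print_limits() {
    echo "== soft limits =="
    ulimit -S -a
    echo "== hard limits =="
    ulimit -H -a
}

print_limits
```

Assuming the script is saved as limits.sh (a hypothetical name), run it once at the prompt with sh limits.sh > interactive.txt and once through the batch system with qsub -cwd -j y -o batch.txt limits.sh, then compare the two files with diff. Any difference points to limits that are applied only in the batch environment.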

Problem: qrsh does not seem to work at all

Description: Messages like the following are generated:

host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect: Interrupted system call
error: error reading return code of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$

Cause: Permissions for qrsh are not set properly.
Solution: Check the permissions of the following files, which are located in the architecture-specific subdirectory of $SGE_ROOT/utilbin/ (solaris64 in the output above). Note that rlogin and rsh must be setuid and owned by root.

-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*

Note:
  • The $SGE_ROOT directory also needs to be NFS-mounted with setuid execution allowed. If $SGE_ROOT is mounted with the nosuid option on your submit client, qrsh and associated commands will not work.
  • This problem only occurs if the Grid Engine software is configured to use the binaries mentioned above. The new built-in interactive job support eliminates this problem.
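The permission check for rlogin and rsh can be scripted. The following is a minimal sketch: the function name check_setuid_root is hypothetical, and it tests only the setuid bit and the owner, not the mount options of the file system.

```shell
# Report whether a file has the setuid bit set and is owned by root (uid 0).
# Prints one of: ok, not-setuid, not-root, missing
check_setuid_root() {
    f=$1
    [ -e "$f" ] || { echo "missing"; return 1; }
    [ -u "$f" ] || { echo "not-setuid"; return 1; }
    owner=$(ls -ln "$f" | awk '{ print $3 }')    # numeric owner id
    [ "$owner" = "0" ] || { echo "not-root"; return 1; }
    echo "ok"
}

# Demonstration: a freshly created temporary file has no setuid bit.
tmp=$(mktemp)
check_setuid_root "$tmp" || true    # prints: not-setuid
rm -f "$tmp"
```

On a submit host, the assumed usage is to call the function for the rlogin and rsh binaries in the architecture-specific subdirectory of $SGE_ROOT/utilbin/ and to expect ok for both.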

Copyright 1994-2009 Sun Microsystems, Inc.