DISTRIBUTED QUEUEING

SYSTEM - 3.2.x

October 14, 1998

  • DQS Design Document (454K Postscript Version)
  • DQS Library Interface (295K Postscript)
  • DQS User Interface (342K Postscript)

    USER GUIDE

     

     

     

    SUPERCOMPUTER COMPUTATIONS RESEARCH INSTITUTE

     

     

     

    Table of Contents

    Introduction

    A DQS job

    submitting a job

    querying job status

    modifying a job request

    holding, deleting jobs

    requesting resources

    hard and soft resources

    consumable resources

    forming resource requests

    "potential" resources

    moving jobs

    cells and queues

    Qmove

    suspending queues and jobs

    parallel jobs

    job execution environment

    DQS scheduling strategies

    Problem Solving

     

     

     

     

    Introduction

    DQS is actually a simple system which provides a multitude of options to accommodate the requirements of a wide variety of sites, and users. As the number of options increase, as they do with each succeeding generation of DQS, a user might mistakenly come to view the system as quite complex. This user guide is intended to provide an introduction to DQS for the new user as well as explaining those features most often used by the experienced user. In particular the concept of "resources" is explored with attention focused on the new DQS 3.1.3 feature, "consumable resources".

    A DQS job

    Any job which a user needs to execute on one or more computers can be a "DQS job". For those whose sole contact with computers has been through the means of personal UNIX workstations the concept of running their jobs in a "batch" mode may be somewhat disconcerting. Users accustomed to submitting their jobs to mainframe computers will be more familiar with the attributes of DQS. But, unlike the mainframe system the DQS batch environment customarily includes multiples of autonomous UNIX based computational platforms heterogeneous in hardware architecture and operating system variant.

    In its most fundamental form, a DQS job is an extension of a UNIX script used to run an application, as one might even on their own personal workstation. Let us use the "traditional" example of a FORTRAN compilation and execution of a simple application:

    FTN test.f -o test

    test

    where test simply produces the classical "Hello World" output which is sent to standard error.

    If we then wish to run this same application within a UNIX script we would create a file called "test.run" with the following lines:

    #!/bin/sh

    FTN test.f -o test >& test.errors

    test > test.out

    Note that we redirect the stdout and stderr files to "test.out" and "test.errors" respectively.

    This script would then be executed by the user on a machine of their choice, most likely their own workstation.

    What then is needed to turn this script into a DQS job ? Nothing… as long as one doesn’t care what machine it will be executed on. All that is needed is to "submit" this job to the DQS batch queuing system.

     

    submitting a job

    The simple example becomes a "DQS job" by submitting it to the DQS system with the "qsub" utility :

    qsub test.run

    The qsub utility will contact the qmaster and request that the job be "validated" for execution within the system. This "validation" process of determining whether or not the not the job requires something which does not exist in the current system. Since our test script makes no obvious requests for resources (the FTN command is not recognized as a request for a compiler resource known by DQS) all that is needed is for any host in this hypothetical cell to be idle, and available to execute the job.

    Let us now take advantage of some basic DQS facilities. First we would like to have an email message sent to us upon job termination. We must instruct DQS to perform this task by inserting a "DQS directive" into the test.run script. By default DQS interprets any line of script as "DQS directive" if the first two characters of the line are the string "#$". This can be changed by the user (see qsub -C option in the Reference Manual).

    Thus we add one line to our script:

    #!/bin/sh

    #$ -me

    FTN test.f -o test >& test.errors

    test > test.out

    The DQS directive #$ -me tells the system that a mail message should be sent to the person submitting the job at the end of the job. We could also have directed that we wish to have a mail message sent at the beginning of the job and also if the job aborts with the directive #$ -meab. The order of the symbol ‘e’ ‘a’ and ‘b’ in this list is not significant.

    Note that the directive could also be communicated with DQS on the qsub command line. Instead of inserting the directive in the script, we could perform the submission with:

    qsub -me test.run

    In cases where only a few directives are needed this approach might be used, but as the user will see many job submissions will benefit from more complex sets of DQS directives which are better "captured" in the job script.

    querying job status

    Once a user has relinquished their job to the "welcoming arms" of a queuing system they need a means for monitoring and controlling its destiny. A first step is to query the system to establish the status and "DQS identity" of the job. The "qstat" utility is used to display the state of queues and jobs. There are three forms of this display:

    default (no options) Displays the state of user jobs in summary form

    full listing (-f option) Displays summary queue and job status

    extended listing (-ext option ) Displays the full queue and job descriptions

    The simplest command then to get in touch with our job is to execute the command:

    qstat

    and scan through the output looking for jobs we have submitted. Instead of being deluged with information about every other job in the system one can execute:

    qstat -u <my user name>

    where <my user name> is the login name of the user who submitted the job.

    The output of this variant might look like:

    ---Pending Jobs------------------------------------------------------------------------------------------

    <my user name> my-job-name dqs-job-number 0:0 QUEUED 03/25 20:40

    Which would indicate with some dismay accruing to <my user name> in that the job is not RUNNING on any machine in the system. But it is queued with a priority of zero ( the leftmost digit from "0:0". And our sub-priority is zero (rightmost digit ) indicating that there are no prior jobs for this user.

    or more optimistically the display might offer:

    Queue Name Queue Type Quan Load State

    queue1 batch 1/1 0.14 er UP

    <my user name> 2183 0:1 r RUNNING 02/12/96 19:25:56

    Which would hearten us in our endeavors, because our job is (apparently) executing. The symbols on the output lines may be a bit confusing because the first line shows the status of the queue while the second describes "our job" .

    Let us examine the queue description first:

    Queue Name queue1 each queue is given a unique name by the administrator

    Queue Type batch the default mode of all DQS queues

    Quan 1/1 one resource ("1/ ") of one available (" /1") are utilized

    Load 0.14 the load average measured by the queue1 CPU is 0.14

    er all of the queue states are displayed in single character symbols.. The most important of these is presented under the heading "State" . The "e" shows that the queue is ENABLED. The ""r" shows that the queue is running.

    State UP The normal more of operation will be shown as "UP"

    .

    The job description is a bit less cryptic. The entry begins with <my user name> and followed by the DQS assigned job number (2183). The values 0:1 give the submission priority of the job, defaulted to zero and the sub-priority :1 which indicates that this is the first job running for this user. The submission priority is assigned by the user with the QSUB option flag "-p" while the sub-priority is an internal parameter computed during each scheduling pass for all the queues.

    The command "qstat -ext" produces a comprehensive display of queue and job parameters as well as the status obtained with the "-f" option. Discussion of relevant portions of these extended displays will appear in later sections.

    modifying a job request

    Often a user will find one or more of their jobs in the pending queue awaiting assignment to an execution queue. After review of their pending jobs, this user may decide to change the jobs submission parameters to affect the jobs future scheduling. One method for this would be to delete the job and resubmit it. A more convenient technique is to use the QALTER utility to modify one or more of the parameters which the user assigned at the time of QSUB, or defaulted by DQS when not explicitly designated by the user.

    In the simple example given here, the user provided no parameters to the QSUB command and hence the submission priority has been set to the default value of zero. If the user wishes to increase that priority the QALTER utility would be invoked with :

    qalter -jid <job number> -p <new priority>

    The <job number> is that which DQS assigned to the job in the pending queue, and the <new priority> value must be in the range -1024 to +1023.

    Except for the job number, any parameter which can be employed with the QSUB command can be used with the QALTER command, including replacing the script file which originally accompanied the QSUB command. The QALTER command may not be used for jobs already in the RUNNING state, with exception of the return of "consumable resources" (see below).

    holding, deleting jobs

    The user has a number of tools available to work with their jobs once the those jobs are in the queuing system. For example they may decide to place a "hold" on one of their jobs in the pending queue so that another job may progress ahead of it or to delay scheduling until some other event or job has occurred. First the user may chose to submit a job to the system with a "hold" placed on the job at the time of the submission. This step involves the use of the "-h" option in the QSUB command. Once a job is submitted the user can use the "QHOLD" utility to place a hold on a job if it is still in the PENDING queue. The QHOLD uses the same "-h" option.

    The "-h" option is used for system administration tasks as well as user access. Thus the DQS 3.1.3 Reference Manual describes four alternatives. The user is permitted only the "u" (or user hold) or the "n" (no hold) variants. Thus at job submission the user might place a hold:

    qsub . . -h u .. test.run

    Or if the job is in the pending queue:

    qhold -jid <job number> -h u

    Once a "hold" has been placed on a job in the pending queue it will not be considered eligible for scheduling until it has either been "released" from the hold or it is deleted from the queue entirely. A job can be released from a user invoked "hold" with the QRLS utility:

    qrls -jid <job number> -h u

    or the user may modify the "hold" state by using the QALTER command:

    qalter -jid <job number> .. .. -h n

    Which will set the user accessible hold to "none".

    A use may delete one or more of their own jobs from the queuing system if the jobs are in either the pending queue or the executing queue:

    QDEL <job number>

    or:

    QDEL <job number>,<job number>,…..

    Note that the job numbers are separated by commas and NOT spaces.

    requesting resources

    The simple example we have been using so far (test.run) has made no unusual demands for system resources. It presumes that all queues in the system have a FORTRAN compiler and that the FORTRAN dialect in our test program is consistent with all the compilers. Further, memory, disk-space and data-base locality are also not consequential in this example. These are unrealistic assumptions in most cases. Most sites using DQS contain heterogeneous collections of hardware and software and often subdivide these collections into types of use (long-term jobs , short-term jobs, etc. ) .

    The DQS administrator is supplied with tools for organizing the system and defining resources to be accessible by the user. Typical resources are CPU memory sizes. hardware architecture and operating system versions.

    hard and soft resources

    Most jobs will have one or more imperative requirements. One of the most common is the need for a particular hardware/software system (i.e. AIX-3.2.5). By default requested resources are considered essential (or "hard") unless the user precedes the request in the QSUB command with the option "-soft’.

    Requirements for multiples of various resources in parallel jobs, such as 2 or more CPUs can be either "hard" or "soft". Many users choose to request at least 2 CPUs to run their parallel job and then request more processors following the option "-soft" flag in the QSUB command line or job script. While a non-parallel user might expect to use the "-soft" option for a request of the form "I need at least 32 MB of memory but would be much happier with 64 MB), most site resource allocations will not make effective use of such a request. The most common use of the "-soft" option for non-parallel jobs is to state a preference for a queue without making it a "hard" demand.

    consumable resources

    Site resources are by and large static over periods of time like days or weeks. CPU memory sizes and CPU computing power are not subject to moment-by-moment changes. When they are modified the DQS site manager can adjust the resource descriptions to match the new configurations.

    There is a class of resource which does vary within short periods of time. A very common commercial practice, these days, is to manage software licenses for Compilers, Data Base Managers, etc. dynamically at a given site. Many sites do not purchase licenses for all of their extant platforms. A job submitted to DQS must not be scheduled for execution if that job needs one or more software licenses in order to complete but those licenses are already in use by another job.

    Another common form of a time-varying resource would be the amount of shared memory available to a processor in a shared-memory multi-processor system. Shared local disk space might be another resource which is depleted and restored as jobs startup and terminate. Resources of this type are called, by DQS, "consumable resources".

    forming resource requests

    A user specifies the resources they require in the QSUB command line or in the DQS script file. A most direct method is to identify a specific queue as the place for the submitted job to execute:

    QSUB … -q <my queue>

    That request will require <my queue> for execution. If the user would prefer, but not insist on that queue they might make the command line request:

    QSUB … -soft -q <my queue>

    Note that DQS scans the command line and script commands from left to right. During that process any resource requests to the right of a "-hard" or "-soft" option flag will be interpreted as requiring that type of resource. Hence one could mix hard and soft resources thus :

    QSUB … -soft -q <my queue> .. .. -hard <some other resource>..

    The typical job request will not demand a specific queue. Instead the user will request one or more classes of resources which have been established by the DQS administrator. Let us presume a site with three different hardware platform architectures for which there are several CPUs available each. The site administrator has named the resources with their operating system tags, AIX325, IRIX53, SOLARIS24. In addition this example site will own one FORTRAN license each for the different operating systems. The administrator will name these , XLF, SGIFTN and FORTRAN.

    To further complicate our example, each brand of CPU has a different amount of memory on each of its three separate CPUs, 32 Megabytes, 64 Megabytes and 128 Megabytes.

    The example we have been using (test.run) will now be submitted in a more realistic manner:

    qsub -me -l AIX325.and.(mem.gt.32).and (XLF.eq.1) test.run

    The command line now has the resource request appended to it. Requests for resources other than specific queue names begin with the "-l" flag and consists of a string of resource names, interspersed with logical and relational operators. Since the string must have NO imbedded blanks, parenthesis make be used to aid readability.

    The resource request is interpreted by DQS as follows:

      1. The resource request is a "hard" request as a default.
      2. Any queues which contain the AIX325 resource AND a "mem" resource greater than 32 and at least one XLF license will be eligible for assignment to the job.
      3. When the job is started the XLF available license count will be reduced by 1. When the job terminates the XLF license count will be increased by one.

    A command line or DQS script may contain one or more request strings beginning with the "-l" option flag. Each one of these strings will request at least one queue to meet the requirement. Thus:

    qsub -l AIX325 -l AIX325

    Would request that two queues/CPUs be allocate to this job. This same request can be restated more simply:

    qsub -l (qty.eq.2).and.AIX325

    Depending upon the topology of the DQS site and the requirements of a given job, resource requests can contain a number of elements. Obviously parallel jobs will require more complex resource requests than simple single-processor jobs.

    Note: Relational operators can be given in FORTRAN or "C" syntax (.eq. == , .ne. != , .lt. <, .gt.. > , .le. <=, .ge. >= ). Logical operators can also be given in either language syntax ( .and. &&, .or. ||, .not. !). For compatibility with DQS 3.2.4 the comma (.) may be used in place of the logical ".and." operator.

    The consumable resource "XLF" requested by the job can be returned to the license pool by a RUNNING job by executing the DQS command QALTER with the "-rc" option:

    QALTER -rc XLF=1

    This command would return one XLF license to the system

    "potential" resources

    DQS 3.1.3 performs a pre-validation of jobs before accepting them into the queuing system. This pre-validation consists of searching all queue definitions to see if the "hard" resources requested for the job actually exist, even if they may be in use by some other job at the time this job was submitted. If all of the "hard" resources so not exist, the job is rejected, and an error message with the reason for the rejection returned to the QSUB utility and displayed for the user.

    In some cases a user may be aware that a resource (such as a new) queue will be added or returned to the DQS at some point in the future. They may wish to submit their job and place it into the pending queue to await the appearance of the new resource. This can be done by adding the "FORCE REQUEST" flag (‘-F’)to the QSUB command line or DQS script:

    qsub -F -l (wild_eyed_scheme).and.mem.gt.1000000

    The "-F" flag should be used with care as no pre-validation is performed and a job may have an erroneous resource request which will leave it "orphaned" in the pending queue until someone deletes it at a later time.

    moving jobs

    Once a job has been placed into the RUNNING state and is executing in one or more queues its parameters cannot be modified nor can it be moved to another location in the system. Pending jobs can be moved from one target queue to another by one of the following methods:

      1. Implicitly… using the QALTER command to modify a resource request such that the eligible queues will be changed
      2. Explicitly… when the job was submitted with the "-q" option to identify a specific target queue, that target can be changed by using the QALTER command with a different "-q" target destination-id
      3. Intercell …If a site has more than one DQS cell operating and they are mutually "authenticated" a user may move their jobs from one cell to another if that user has permission to both cells.

    cells and queues

    What is a cell? It is the collection of computer hosts and DQS software which make up a single entity managed by a daemon called the "qmaster". In the following diagram are displayed four CPUs. One of these is executing the qmaster daemon. Two processors are executing the dqs_execd daemons. These two processors are related to the queues shown here and would execute any job assigned to those queues. The computer labeled "dqs host" is not running any of the DQS daemons. It is known to the qmaster because the site administrator has added that name to the qmaster’s host list. This action makes this host a "trusted DQS host" as are any hosts running the daemon.

     

    A DQS site may have more than one "cell". The site administrator may choose to keep each cell independent and separate from the others. On the other hand they may organize the system so that one or more cells will have authorized communications with others.

    A user logged into a host in one cell can submit jobs to the other cells, or they can perform the QSTAT function for the other cells.

    Qmove

    The user can move one of their jobs in a pending queue in one cell to the pending queue in another cell. The qmove utility is provided for this inter-cell transfer purpose only. The usual command would be:

    qmove <job number>@CELL_C2

    Which would move the numbered job from CELL_C2 to the cell in which the qmove utility is being executed. Where a user in CELL_C3 wishes to move a job from CELL_C2 to CELL_C1 the command would be:

    qmove -cell CELL_C1 <job number>@CELL_C2

    The effects of this move process can be somewhat surprising:

      1. The job number is changed,.
      2. The moved job becomes a newly submitted job in the destination queue
      3. The resource requests are interpreted in terms of the new cells characteristics.
      4. A moved job is not pre-validated and thus can become an orphan in the new destination queue.

    suspending queues and jobs

    The user will note that from time to time one or more queue may display the SUSPENDED status. When this occurs any job executing on that queue is suspended also, but NOT terminated. As the queue is un-suspended the job is continued from the point where it was submitted, During the period of its suspension all of its files remain open and all memory and paging space allocated to the job remain in that state.

    When does a queue get suspended? The DQS administrator and anyone designated as the queue’s owner can suspend that queue using the QMOD command. There is one additional method which may appear in some site configurations. If a queue is assigned to a host which is also serving as the personal workstation for some user of the system, they may chose to use the QIDLE command at that workstation. This utility is a X-windows facility which monitors the keyboard and mouse on a workstation. If these devices are being used the QIDLE facility will suspend the queues on that workstation.

    One additional means by which a queue may be suspended is when it is designated as a subordinate queue to another queue, by the DQS administrator. The usual application of this facility is when a host serves both a s a parallel and single processor resource. The single processor queue is made subordinate to the parallel queue. When a parallel job is started the subordinate queue and any job being executed there will be suspended.

    parallel jobs

    A major feature of DQS is its support for the scheduling and management of parallel jobs to be run on two or more of the hosts in a system. There are three components in submitting parallel jobs:

      1. Resource requests must identify the requisite system capabilities and machines for as many hosts as absolutely required ("hard") plus those additional resources that would be desirable ("soft"). This is very important to sound system management because it is possible for a user to request and be assigned a single host, then launch additional processes on other hosts by mistake, or design. Some of the parallel message passing systems do not prevent this abuse(error).
      2. The "-par" flag should be used to specify one of several paradigms known to DQS. P4, MPI, PVM and TCGMSG are popular message passing systems which can be supported with DQS system facilities. If none of these schemes match the user’s requirements, at a minimum the option "-par GENERIC_ALL" should be selected.
      3. Most parallel jobs require the execution of a startup script to initiate the message passing system’s daemons, or other mechanism. The MPI system starts with the function "mpirun" for example. A parallel job submission then should identify the script and any parameters as an element in either the "-master", "-q" or "-l" options of the command line or DQS script. For example:

    qsub -me -l (qty.eq.4).and.(-exec.eq.mpirun).and.AIX325

    This will request four AIX325 hosts to run a parallel job. After the job is put into execution, but before the user’s job script is executed, the function "mpirun" will be executed in the working directory of that user.

    job execution environment

    The simple "test.run" example we have been using so far will have operated with the following characteristics

      1. The job will be executed in the user’s home director. Using the "-cwd" option can change the job execution to the working directory from which the QSUB utility was executed.
      2. The standard output and standard error files are written to the directory in which the program is being executed. The "-o" and "-e" and "-j" options can be used to specify what directory should be used for the outputs. The "-j" option can "join" stderr and stdout and put the results where specified.
      3. The normal environment which the user encounters when logging in to run a batch or interactive job will be used. The "-v" option permits a user to specify new values for existing environment variables or to create new environment variables.

    For detailed instructions on changing the jobs’ environment please see QSUB in the DQS reference manual.

    DQS scheduling strategies

    Once a job has disappeared into the maw of DQS it is subjected to a variety of manipulations which are intended to utilize the entire system resources in the most optimum way while ensuring that each user is given "fair" access to those resources. The default operation of the scheduler is often adapted by each site to its own requirements. The basic process consists of:

      1. First the site administrator may chose one of two options for selecting queues for assignment. The default is to sort queues by the load average they are reporting to the qmaster. The second is to sort the queues according to the internal sequence numbers assigned by the site. This second technique is used where the preference is to allocate host machines on the basis of their computing power instead of their current computing load.
      2. Jobs in the queue are sorted by the priority with which they were submitted , and then their internal sub-priority and finally their job identifier. The sub-priority is calculated each time the qmaster scans the queued jobs for scheduling. It is relative to each user and reflects the total number of jobs in this queue which are ahead of each job. This number includes any of the use’s jobs which are in RUNNING state as well as though which are QUEUED.
      3. Starting with the first PENDING job in the list the qmaster attempts to find a queue which is available, not overloaded and which meets the resource requirements of that job. If a queue is found the job is "handed off" to that queue’s dqs_execd daemon, and the next job is examined.
      4. If the job’s requirements cannot be met, it is passed over and the next one in the list is considered.
      5. Each site has a system wide parameter, set by the DQS administrator, which limits the number of jobs for a single user which may be considered during any given scheduling pass over the queue. Those pending jobs which are in this category they are marked MAXUJOB by the qmaster and ignored during that pass.
      6. There is one additional scheduling filter provided in DQS 3.1.3. Besides the system-wide maximum user job parameter, each queue possesses a maximum job parameter. This is assigned in the queue configuration for each queue and can be changed by the DQS administrator while the system is in operation. In the default DQS 3.1.3. release the job pre-validation process includes a "user eligibility" criteria in addition the available resource tests.

    After a job has been validated as to requesting "real" resources, it is tested against the site’s queues to determine which ones it would be eligible for. Of the eligible queues , the values of the "maximum user jobs" for each queue is extracted and the smallest one selected. At the same time the number of jobs in RUNNING state for this user is computed. If the minimum queue-maximum-user-jobs is not greater than the number of that user’s jobs RUNNING.. the job is rejected at QSUB time and an error message returned to the user.

    This last scheduling pre-validation most certainly may confuse the reader but it is the core of the "fair play" method developed at SCRI and needs to be used for a while to demonstrate its behavior and value.

    Problem Solving

    Even when one starts with the simple test case with which we began this User Guide. It is possible to get into one or more dead-ends on one’s first, second, or whatever ,attempts at using the DQS. We will proceed through a number of typical problems which a user may encounter along the way:

      1. An attempt to execute any DQS utility fails with error messages such as "Unable to contact qmaster".--- The host on which the utility is initiated is not in the authorized list of DQS hosts. The site administrator can verify this with the system error file.
      2. An attempt to execute a DQS utility fails with messages containing "ALARM".. "shutdown". --- The qmaster has died, or is too busy to respond to the utility within a time limit established fro the system The DQS administrator can adjust this value.
      3. The submission of a job with QSUB fails with an error message indicating that one or more resources are not available, or the job exceeds the number permitted for this user at this time. – A first step is to resubmit the job with the "-F" (force job submission) flag set. Otherwise consult the DQS administrator for assistance.
      4. A submitted job, which has not been forced ("-F" flag) remains in the pending queue without being placed into execution while other later jobs go onto execution.--- First use the command "qstat -f" and examine the list of resources shown for this job. It is possible that only one or two queues in the system satisfy the resources requested and that the other jobs which appear to have bypassed this jobs are less demanding. If possible the resources requested can me modified with the QALTER command.
      5. A job appears in the QSTAT display as RUNNING on a queue, but never terminates and a process status ("ps") fails to show the job executing in that host. An attempt to perform a QDEL operation on this job fails.--- This is a serious internal error in DQS and the DQS administrator should be contacted immediately.
      6. A job appears in the QSTAT display as RUNNING and then disappears as if it has completed execution. However no job results appear and the stdout and stderr files are absent.--- This is a serious internal error in DQS and the site administrator should be contacted..

    The DQS error file (err_file) and accounting file (acc_file ) contain valuable information which can assist the knowledgeable user the means for analyzing and correcting their problems with the system. Refer to Appendix A - DQS 3.1.3 Error Messages for further information.