LoadLeveler on JUBL


IBM Tivoli Workload Scheduler LoadLeveler (Version 3.4.0) is used as the batch system on BlueGene/L.

Using LoadLeveler

Job submission to LoadLeveler is done using a job command file. The job command file is a shell script containing keywords embedded in comments beginning with # @. These keywords inform LoadLeveler of the resources required for the job to run, the program to execute, where to write output files and the job environment.
Two sample job scripts (Sample 1 and Sample 2) are available online and on the system in the /bgl/local/samples/LoadL directory.

Most of the keywords are the same as in LoadLeveler scripts on JUMP, but there are some BlueGene-specific points:

  1. You have to mark your job as a BlueGene job with # @ job_type = bluegene. Otherwise the job is executed as a serial job on the login node without allocating a BlueGene partition.

  2. The size of a job has to be specified by using # @ bg_size OR # @ bg_shape.
    • The bg_size keyword specifies the number of compute nodes the job should use. BlueGene/L only allows partitions of 32, 128 or a multiple of 512 compute nodes, so the requested size is rounded up to the next allowed size. Thus a bg_size of 1 specifies a partition of size 32 and a bg_size of 129 specifies a partition of size 512.
    • The bg_shape keyword specifies the shape of the partition at the base partition (midplane) level, not at the compute node level. A bg_shape value of 1x2x1 means 1 base partition in the x direction, 2 in the y direction and 1 in the z direction, that is, two midplanes = 1024 compute nodes. bg_shape defines the logical dimensions of your partition. For efficient scheduling, LoadLeveler may physically allocate any of the three permutations (1x2x1, 2x1x1, 1x1x2) and ensures the correct mapping of the MPI-tasks.

      If, and only if, you are using your own mapfile (-mapfile option in the mpirun command) or your application relies on the exact physical dimensions of the partition, you have to use the bg_rotate = FALSE keyword together with bg_shape. This tells LoadLeveler that only the requested shape satisfies the job requirement.

  3. The topology of the partition can be specified with the bg_connection keyword, which takes one of three values: MESH (default), TORUS or PREFER_TORUS.

    This choice can have a big influence on the performance of your application. In case of doubt always add

    # @ bg_connection = TORUS

    to your job script.
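
The size rules from point 2 can be sketched as two small shell functions. This is only an illustration of the rounding described above, not LoadLeveler's actual implementation; the function names partition_size and shape_nodes are made up for this example.

```shell
#!/bin/sh
# Sketch only: mirrors the size rules stated in the text.

# partition_size N: smallest allowed partition (32, 128 or a multiple of 512)
# that can hold N compute nodes.
partition_size() {
    n=$1
    if [ "$n" -le 32 ]; then
        echo 32
    elif [ "$n" -le 128 ]; then
        echo 128
    else
        # Round up to the next multiple of 512 (one midplane).
        echo $(( (n + 511) / 512 * 512 ))
    fi
}

# shape_nodes XxYxZ: compute nodes in a bg_shape, at 512 nodes per midplane.
shape_nodes() {
    old_ifs=$IFS
    IFS=x
    set -- $1          # split "1x2x1" into the three dimensions
    IFS=$old_ifs
    echo $(( $1 * $2 * $3 * 512 ))
}

partition_size 1      # prints 32
partition_size 129    # prints 512
shape_nodes 1x2x1     # prints 1024
```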

A detailed description of the BlueGene-specific keywords and a table of the core general keywords are given here: Job File Keywords

On a BlueGene/L system, the program to execute is always the mpirun command. In other words, preparing a job for submission requires you to create a job command file that passes the appropriate arguments to the mpirun command. There are two ways to specify the application LoadLeveler should execute:

  1. Adding the mpirun call after the # @ queue statement in the job file (see Sample 1).
  2. Using the executable and arguments keywords:
    # @ executable has to point to mpirun and # @ arguments holds all the arguments for the mpirun call. This case is shown in Sample 2.
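
As a rough illustration of the two variants (these are not the official Sample 1 and Sample 2), the following script writes two minimal job command files. The file names, the job name, the application path /path/to/my_app and the mpirun path /usr/bin/mpirun are placeholders assumed for this sketch and have to be adapted to the system.

```shell
#!/bin/sh
# Variant 1: the mpirun call follows the "# @ queue" statement.
# The quoted 'EOF' keeps the shell from expanding $(job_name) and $(jobid),
# which are meant for LoadLeveler.
cat > bgl_job.cmd <<'EOF'
# @ job_name = bgl_sketch
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ wall_clock_limit = 00:20:00
# @ job_type = bluegene
# @ bg_size = 512
# @ bg_connection = TORUS
# @ queue
mpirun -exe /path/to/my_app -mode CO -np 512
EOF

# Variant 2: the same job using the executable and arguments keywords.
# The mpirun path below is an assumption, not taken from the system.
cat > bgl_job_exec.cmd <<'EOF'
# @ job_name = bgl_sketch
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ wall_clock_limit = 00:20:00
# @ job_type = bluegene
# @ bg_size = 512
# @ bg_connection = TORUS
# @ executable = /usr/bin/mpirun
# @ arguments = -exe /path/to/my_app -mode CO -np 512
# @ queue
EOF
```

In the second variant the # @ queue statement comes last, because the job step is fully described by the keywords before it.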

Since LoadLeveler automatically selects the appropriate partition to run the job on, the -partition option should not be specified in the mpirun command.

The number of MPI-tasks can still be controlled with the -np option. The execution mode is specified with -mode CO (coprocessor mode) or -mode VN (virtual node mode) inside the argument list. A detailed description of the relation between bg_size/bg_shape, the number of allocated compute nodes and the number of MPI-tasks can be found in the FAQs.
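
As a quick orientation, the upper bound for -np can be estimated from the partition size and the mode. The sketch below assumes one MPI-task per compute node in CO mode and two in VN mode, which is the usual BlueGene/L behaviour but is not stated on this page; the FAQs remain the reference. The function name max_tasks is made up for this example.

```shell
#!/bin/sh
# max_tasks NODES MODE: largest sensible value for -np, assuming one MPI-task
# per compute node in CO mode and two per compute node in VN mode.
max_tasks() {
    case $2 in
        CO) echo "$1" ;;
        VN) echo $(( $1 * 2 )) ;;
        *)  echo "unknown mode: $2" >&2; return 1 ;;
    esac
}

max_tasks 512 CO   # prints 512
max_tasks 512 VN   # prints 1024
```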

Submitting a LoadLeveler Job

Jobs are submitted with
    llsubmit <jobfile name>

Some useful LoadLeveler commands are listed here:
Command              Short description
llsubmit <jobfile>   Submits a job to LoadLeveler.
llq                  Shows queued and running jobs.
llq -b               Shows BlueGene jobs.
llcancel <job_id>    Deletes a queued or running job.
llstatus             Displays system information.
llclass              Shows information about defined classes.
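
A typical sequence with these commands might look as follows. The job file name bgl_job.cmd is a placeholder, and the check for llsubmit is only there so that the sketch does nothing harmful on a machine without LoadLeveler.

```shell
#!/bin/sh
# Placeholder name of the job command file to submit.
jobfile=bgl_job.cmd

if command -v llsubmit >/dev/null 2>&1; then
    llsubmit "$jobfile"   # submit the job
    llq -b                # list BlueGene jobs, including the new one
    llclass               # show the defined classes
    # llcancel <job_id>   # delete the job again, using the ID shown by llq
else
    echo "LoadLeveler commands not found; run this on a login node." >&2
fi
```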

Interactive Parallel Applications

To start an interactive parallel application, use the FZJ-specific procedure:

        llrun [llrun_options] <mpirun_options>

For more information, see the llrun documentation.

Documentation

See also: LoadLeveler Documentation

Last change: 13.08.2007 | Michael Stephan