TOPAS - Automatic Performance Statistics Collection on the CRAY T3E

Bernd Mohr
Forschungszentrum Jülich
John von Neumann-Institut für Computing (NIC)
Zentralinstitut für Angewandte Mathematik (ZAM)
52425 Jülich, Germany
b.mohr@fz-juelich.de

Abstract. TOPAS is a tool to automatically and transparently monitor usage and performance of every parallel job executed on a CRAY T3E. We have modified the UNICOS/mk compiler wrapper scripts to automatically link the TOPAS measurement module to every user application whenever it is recompiled. No modification is necessary in the user's program or build procedures. At run-time, two PEs of the parallel application are picked to actually perform the measurement for the parallel job as a whole. The measurement consists of executing special code immediately before and after the execution of the program. So there is no measurement overhead during the execution of the application itself. The TOPAS module is very simple (about 250 lines of code). It is based on the Performance Counter Library (PCL), a common interface for portable performance counting on microprocessors, also developed at NIC/ZAM.

Through environment variables, users can request the printing of the recorded information at the end of the execution, choose to measure integer, load, or store operations instead of floating point, and specify the PEs which should be used for performing the measurement.

In addition to the TOPAS measurement module, we implemented a tool which allows a system administrator to calculate interesting statistics like the typical MFlop rates achieved by user programs, as well as programming language and message passing library usage from this data. Most of this information is not available through regular T3E system accounting.


1 Introduction and Motivation

Almost all currently available tools for analyzing performance of parallel applications are targeting users and programmers, i.e., they focus on providing extensive performance analysis and tuning for a single program. There are no tools for system administrators and managers of large high-performance parallel machines to get an overview of the performance of all user jobs as a whole. Regular job accounting also does not provide this information. But information like this is necessary to be able to make educated decisions for the management, operation, and procurements in a high-performance computing center like the Zentralinstitut für Angewandte Mathematik (ZAM/NIC) of Forschungszentrum Jülich. Currently, there are three CRAY T3Es installed: a 512 node T3E-600, a 256 node T3E-900, and a 256 node T3E-1200.

The only tool currently available is the excellent work of Rolf Rabenseifner of the High-Performance Computing-Center Stuttgart (HLRS) [4, 5, 6]. He implemented an automatic counter instrumentation and profiling module which gets added to the MPI library for CRAY T3E and SGI Origin2000 systems. However, after an extensive review of his work, we decided to implement our own system because

  1. Rabenseifner's approach only works for MPI [7] based programs while on our T3Es the PVM [8] and SHMEM [9] message passing libraries are also used.
  2. Data is only collected between the calls to MPI_Init and MPI_Finalize, and not about the whole program.
  3. His profiling module collects too much data (lots of detailed statistics about single MPI calls) by default.
  4. The measurement of the profiling information influences the execution time of the user program. The introduced overhead is minimal (0.1% to 0.2%) and can probably be neglected, but we want no influence at all.
Therefore, we designed and implemented TOPAS (T3E Observative Performance Analysis System). TOPAS is a tool to automatically and transparently monitor usage and performance of every parallel job executed on a CRAY T3E. It works for all parallel user programs regardless of the message passing library used. The measurement consists of executing special code immediately before and after the execution of the actual program. So there is no measurement overhead during the execution of the application itself. As we are interested only in a coarse statistical overview, and in order to keep the amount of measured data low, only two PEs of the parallel application are picked at run-time to actually perform the measurement for the parallel job as a whole. We currently use TOPAS to gather performance statistics for our three CRAY T3Es.

First, the paper describes the design and implementation of the TOPAS system. Then we give an overview of the results obtained through TOPAS in the first three months of its operation. Last, Section 4 describes an extension to the original TOPAS system which allows users to get an overview of the performance of their application.

2 Implementation

The implementation of TOPAS is very simple yet effective. Its design and implementation took two person days. In addition to the basic measurement and analysis module, some simple modifications to the UNICOS/mk compiler scripts are necessary.

2.1 The TOPAS Measurement Module

The TOPAS measurement module is an object file which is automatically linked to each application whenever users compile and link their program. No modification of the user's source code or build procedures is necessary. The measurement module is implemented in C.

To initialize TOPAS and to start the measurement, we use a little-known (but documented and supported) feature of the UNICOS common start-up code implemented by the function $START$. First, it does all the necessary initializations (e.g., allocation of private and shared heap segments). Just before calling the main routine of the program, $START$ checks for the existence of a sitelocal_start routine [1]. If this routine is linked into the program, it will be called.

Then, in sitelocal_start, it is possible to register another function to be executed at the end of the program by using the ANSI C function atexit. atexit calls registered functions in the reverse order of their registration. Since sitelocal_start runs before the main routine of the program, its function is registered first and will therefore be called last.

Therefore, the measurement module consists of two functions (see pseudo code Listing 1):

void sitelocal_start() {
    /* -- only measure parallel programs -- */
    if ( _num_pes() > 1 ) {
        initialize_and_check_environment();

        if ( i_am_measurement_pe() ) {
            /* -- install exit routines -- */
            atexit(sitelocal_end);
            atabort(sitelocal_end);

            /* -- start measurement -- */
            PCL_initialize_and_start_HW_counter();
            start_UNIX_timer();
        }
    }
}

void sitelocal_end() {
    /* -- end measurement -- */
    end_UNIX_timer();
    PCL_read_HW_counter();
    calculate_elapsed_time();
    get_program_characteristics();

    /* -- display and store results -- */
    print_results_to_logfile();
    if ( batch_or_user_request ) display_results();
}
Listing 1: TOPAS Measurement Module

  1. sitelocal_start(). This routine is automatically executed before the user's main program by the UNICOS start-up code. It first checks whether the program is running in parallel. It then initializes the module. This part can be controlled by the user through special environment variables. If the user does not specify the PEs for which measurements should be performed (through the environment variable TOPAS_PE), two PEs are randomly picked, ignoring the first and last PE if possible. This is done because these PEs are often used as masters in master/slave type applications and therefore may show atypical (performance) behavior.

    These two PEs now actually perform the measurement. First, they register the TOPAS wrap-up code sitelocal_end with atexit. In addition, we use the UNICOS extension atabort which works like atexit but calls the registered functions if the program is aborted.

    Next, the CRAY T3E hardware counters are initialized to measure the number of floating point and the number of integer operations. This is done by using the Performance Counter Library (PCL) [3], a common interface for portable performance counting on microprocessors, also developed at NIC/ZAM. Although the DEC Alpha CPU used in the T3E has two performance counters, the number of floating point and the number of integer operations cannot be measured at the same time. Therefore, the TOPAS measurement module runs on two PEs, each used to count one of the values. The second available hardware counter is used to determine the level 1 data cache misses.

    Finally, timers are initialized and started by the standard UNIX function times which returns wall clock, system, and user time in system hardware clock ticks.

  2. sitelocal_end(). This routine is automatically executed on the PEs chosen for the measurement after the user's main program finishes. It first stops and reads the contents of the UNIX timers and T3E hardware counters. Then all desired information is determined and stored in a system-wide log file. If the program is running as a batch job or, for interactive programs, if the user requests it (by setting the environment variable TOPAS_PRINT to "yes" or "true"), the collected information is also printed to standard error in a nicely formatted, human-readable form.
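
The default choice of the two measurement PEs described in item 1 can be sketched as follows. This is only an illustration under stated assumptions, not the TOPAS source: the function names pick_measurement_pes and next_random are hypothetical, the small random number generator is a stand-in, and the sketch assumes that every PE calls the function with the same seed so that all PEs agree on the result (how TOPAS achieves this agreement is not described here).

```c
#include <assert.h>

/* Hypothetical helper (not part of TOPAS): a tiny linear congruential
   generator, so the sketch does not depend on global rand() state. */
static unsigned next_random(unsigned *state) {
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Pick two distinct measurement PEs out of num_pes (>= 2).
   The first and the last PE are skipped whenever at least two
   other PEs remain, i.e. for num_pes >= 4. */
void pick_measurement_pes(int num_pes, unsigned seed, int picked[2]) {
    assert(num_pes >= 2);

    int lo = 0, hi = num_pes - 1;          /* inclusive candidate range */
    if (num_pes >= 4) { lo = 1; hi = num_pes - 2; }
    int count = hi - lo + 1;               /* number of candidates, >= 2 */

    unsigned state = seed;
    int first = lo + (int)(next_random(&state) % (unsigned)count);

    /* choose the second PE among the remaining count-1 candidates */
    int second = lo + (int)(next_random(&state) % (unsigned)(count - 1));
    if (second >= first) second++;         /* step over the first pick */

    picked[0] = first;
    picked[1] = second;
}
```

On each PE, the result would then be compared with the PE's own number to decide whether it takes part in the measurement. For very small jobs (two or three PEs) the first and last PE cannot both be avoided, which is why the sketch falls back to the full range in that case.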

2.2 Data collected by TOPAS

An overview of the data collected by TOPAS and the method of calculation is shown in the following table:

No.  Description                    Method of Collection          Format
---  -----------------------------  ----------------------------  -------------------
1.   Date and Time                  localtime(time(0))            YYYY/MM/DD HH:MM:SS
2.   user name                      First successful call out of  string
                                      1. cuserid(0)
                                      2. getlogin()
                                      3. getenv("LOGNAME")
                                      4. getenv("USER")
3.   name of the executable         __progname                    string
4.   measuring PE                   _my_pes()                     int
5.   total number of PEs            _num_pes()                    int
6.   MHz of the CPUs                sysconf(_SC_CRAY_CPCYCLE)     t300|t450|t600
7.   execution mode:                getenv("QSUB_HOME") != NULL   B|I
     batch or interactive
8.   programming language           see below                     cc|cxx|kxx|f77|f90
9.   message passing library        loaded(...); see below        mpi|pvm|sma|-
10.  user, system, wall clock time  times()                       float in seconds
11.  number of floating point       PCL                           f=int or i=int
     or integer operations
12.  level 1 data cache misses      PCL                           int
13.  unique identification          getpgrp()                     int
Table 1: Data collected by TOPAS

Items 1 to 5, 7, and 10 are also available through regular UNICOS/mk system accounting but are included in the measurement to allow a simple implementation of the analysis of the data and to be able to relate the measured data to information available through other sources. Item 13 (the UNIX process group) is used to combine measurements from different PEs which belong to the same program execution. Unfortunately, this does not result in a unique id for multiple mpprun commands in a single NQS batch job, but this can be corrected during the off-line analysis.
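
To illustrate how some of the simpler items of Table 1 might be obtained with standard calls, the following sketch covers item 2 (user name), item 7 (execution mode), and item 10 (times). It is not the TOPAS source. The function and type names are hypothetical, cuserid(0) is left out because it is not available on all systems, and the fallback string "unknown" as well as the use of sysconf(_SC_CLK_TCK) for the tick rate are assumptions of this sketch.

```c
#define _POSIX_C_SOURCE 200809L   /* for getlogin(), setenv(), times() */
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/times.h>

/* Item 2: first successful source of the user name.
   cuserid(0) from Table 1 is omitted here for portability. */
const char *topas_user_name(void) {
    const char *name = getlogin();
    if (name == NULL || *name == '\0') name = getenv("LOGNAME");
    if (name == NULL || *name == '\0') name = getenv("USER");
    /* "unknown" is an assumption of this sketch */
    return (name != NULL && *name != '\0') ? name : "unknown";
}

/* Item 7: 'B' if QSUB_HOME is set (NQS batch job), otherwise 'I'. */
char topas_exec_mode(void) {
    return getenv("QSUB_HOME") != NULL ? 'B' : 'I';
}

/* Item 10: convert a tick count as delivered by times() into seconds. */
double ticks_to_seconds(clock_t ticks, long ticks_per_sec) {
    assert(ticks_per_sec > 0);
    return (double)ticks / (double)ticks_per_sec;
}

typedef struct { double user, sys, wall; } TopasTimes;

/* wall_start and wall_end are the return values of two times() calls,
   end_buf is the buffer filled by the second call. */
TopasTimes topas_elapsed(clock_t wall_start, clock_t wall_end,
                         const struct tms *end_buf) {
    long hz = sysconf(_SC_CLK_TCK);
    TopasTimes t;
    t.user = ticks_to_seconds(end_buf->tms_utime, hz);
    t.sys  = ticks_to_seconds(end_buf->tms_stime, hz);
    t.wall = ticks_to_seconds(wall_end - wall_start, hz);
    return t;
}
```

In a measurement module of this kind, the first times() call would be made at start-up and the second one in the exit routine, so that the difference of the two return values gives the wall clock time of the whole program.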

Most data items can be calculated by calling standard UNIX or UNICOS functions or by checking standard environment variables. There are two exceptions.

  1. The determination of the programming language used is done by using different TOPAS measurement modules which differ only in the value for the programming language they record in the log file. The right measurement module is chosen by the corresponding UNICOS compiler driver script automatically (see Section 2.3). This means that for mixed-language applications, the compiler (script) used for linking determines which language gets recorded.

  2. To identify the message passing library used, another little-known (but documented) system routine of UNICOS is used: loaded [1]. With loaded it is possible to check whether a specific function was linked to the executing program. In order to make this work, the function has to be declared as a soft reference using the compiler pragma "#pragma _CRI soft". This causes the linker to ignore this function when resolving external symbols. For example, Listing 2 shows how to check whether MPI_Init was linked to the program or not.
The only remaining difficulty is how to check for the PVM and SHMEM message passing libraries since they do not contain an initialization routine like MPI_Init which has to be called from the user's program in order to use the library. However, after some experimentation and searching, we found that on the CRAY T3E, the PVM and SHMEM libraries contain two undocumented routines called _pvm_initialize and _shmeminit which also get called by the UNICOS start-up code $START$ if necessary.

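#include <stdio.h>
#include <infoblk.h>

/* -- declare MPI_Init as a soft reference, so that mentioning -- */
/* -- it here does not cause the linker to pull it in          -- */
#pragma _CRI soft MPI_Init
extern int MPI_Init(int *argc, char ***argv);

int main(void) {
    /* -- true only if MPI_Init was linked for some other reason -- */
    if (loaded(MPI_Init)) printf("program uses MPI\n");
    return 0;
}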
Listing 2: Example of using loaded()

The order of the checks is important as both MPI and PVM are implemented with SHMEM on the CRAY T3E. The special case that none of the three initialization routines is loaded is also recognized and accordingly recorded.
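
The ordering argument can be made concrete with a small sketch. It is not the TOPAS source: the names MpLib, classify_mp_library, and mp_lib_tag are hypothetical, and the three flags stand for the results of loaded(MPI_Init), loaded(_pvm_initialize), and loaded(_shmeminit) on the T3E. The relative order of the MPI and PVM checks is an assumption; what matters according to the text is that SHMEM is tested last.

```c
#include <assert.h>

typedef enum { MP_NONE, MP_MPI, MP_PVM, MP_SHMEM } MpLib;

/* SHMEM must be tested last: MPI and PVM are built on top of it,
   so its initialization routine is loaded for those programs, too. */
MpLib classify_mp_library(int mpi_loaded, int pvm_loaded, int shmem_loaded) {
    if (mpi_loaded)   return MP_MPI;
    if (pvm_loaded)   return MP_PVM;
    if (shmem_loaded) return MP_SHMEM;
    return MP_NONE;                  /* program does not communicate */
}

/* Tag as written to the log file (see the Format column of Table 1). */
const char *mp_lib_tag(MpLib lib) {
    static const char *const tags[] = { "-", "mpi", "pvm", "sma" };
    assert(lib == MP_NONE || lib == MP_MPI ||
           lib == MP_PVM  || lib == MP_SHMEM);
    return tags[lib];
}
```

With the checks in the opposite order, every MPI or PVM program would be recorded as a SHMEM program.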

2.3 UNICOS/mk System Integration

The last task left is to ensure that the TOPAS measurement module gets linked into each parallel user application. After some consideration, we decided that the easiest way to do this was to modify the UNICOS compiler driver scripts located in /opt/ctl/bin. The compiler commands for C, C++, Fortran77, and Fortran90 are actually Korn shell scripts (named cc, CC, fort77, and f90, respectively) which take care of initializing the environment and module system. Since they are shell scripts, they are easy to modify and maintenance is kept simple. Only a few modifications are necessary. Listing 3 shows the changes necessary for the Fortran90 compiler command f90. The modifications needed for the other compiler scripts are analogous.

# -- rest of f90 compiler script here ...

TOPAS="/usr/local/topas/topas.f90.o -lpcl"
for opt ; do
    if [[ $opt is special ]] ; then unset TOPAS; fi
done

exec $F90_DRIVER ${SEGLDOPTS} ${_F90_OPTS} "$@" ${CMDOPTS} \
     ${INCDIR} ${INCLUDE_PATH} ${LIB_PATH} ${LIBDIRS} ${TOPAS}
Listing 3: Modifications necessary for Fortran90 compiler (f90) driver script

All changes are localized at the end of the corresponding scripts. The key change is to add the variable ${TOPAS} to the end of the last line of the script which executes the "real" compiler. But first, the variable ${TOPAS} is set to the pathname of the corresponding measurement module and the PCL library. In order to avoid problems when the program uses the performance counters itself (either directly or through the CRAY T3E performance tool PAT), the arguments given to the script are checked next and, if necessary, TOPAS is deactivated by unsetting the TOPAS variable. Of course, the real code for this check is more complicated than shown in Listing 3.

2.4 Statistics Calculation and Display

The format for storing the collected information in a log file is also very simple. The data is written as ASCII text. All data of a single measurement is written in one line separated by white space. This allows the use of standard UNIX tools like grep, awk, or perl for the analysis of the data. The format of the single data items is shown in the right-most column of Table 1.

We implemented a short perl script topas-stat.pl which reads TOPAS log files line by line, splits each line into words, and calculates the total CPU time and MFlop rate for each job. At the end, it computes the distribution of the MFlop rates, of the programming language, and of the message passing library used by the applications. The bucket size of the histogram used to display the MFlop distribution can be changed by a command line parameter. The result can be printed in a nicely formatted way (see Listing 4) or in a format suitable to be used by graphics packages like gnuplot or xmgr or statistical packages like R. Also, the format of the data makes it easy to calculate statistics for a subset of the data, e.g.,

    grep ' f90 ' topas.log | grep ' B ' | grep  ' mpi ' | perl topas-stat.pl
computes the results for all Fortran90 batch programs which use MPI for communication. The percentages of the statistics are computed in two ways. The first percentage (Num% in Listing 4) is based on the number of measurements (program executions), while the second one (Time%) takes both the number of PEs used and the execution time into account.
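
The two kinds of percentages can be illustrated with a short sketch of the histogram calculation. The actual tool is a perl script; the C version below is only an illustration with hypothetical names (Job, Bucket, mflop_rate, fill_histogram, percent). It assumes that the MFlop rate of a job is the operation count of the measuring PE divided by the wall clock time, and that the weight behind Time% is the number of PEs multiplied by the wall clock time.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int    num_pes;    /* total number of PEs of the job (item 5)    */
    double wall_sec;   /* wall clock time in seconds (item 10)       */
    double flops;      /* floating point operations counted (item 11) */
} Job;

typedef struct {
    long   count;       /* number of measurements, basis of Num%     */
    double pe_seconds;  /* sum of num_pes * wall_sec, basis of Time% */
} Bucket;

/* MFlop rate based on the total wall clock time of the program. */
double mflop_rate(const Job *job) {
    if (job->wall_sec <= 0.0 || job->flops < 0.0) return 0.0;
    return job->flops / job->wall_sec / 1.0e6;
}

void fill_histogram(const Job *jobs, size_t njobs, double bucket_size,
                    Bucket *buckets, size_t nbuckets) {
    assert(bucket_size > 0.0 && nbuckets > 0);
    for (size_t b = 0; b < nbuckets; b++) {
        buckets[b].count = 0;
        buckets[b].pe_seconds = 0.0;
    }
    for (size_t i = 0; i < njobs; i++) {
        size_t b = (size_t)(mflop_rate(&jobs[i]) / bucket_size);
        if (b >= nbuckets) b = nbuckets - 1;   /* collapse high rates */
        buckets[b].count++;
        buckets[b].pe_seconds += jobs[i].num_pes * jobs[i].wall_sec;
    }
}

double percent(double part, double total) {
    return total > 0.0 ? 100.0 * part / total : 0.0;
}
```

For example, a job on 128 PEs which runs for 50 seconds and counts 6.0e9 floating point operations on the measuring PE has a rate of 120 MFlops and contributes one measurement and 6400 PE-seconds to the 100-150 bucket when the bucket size is 50.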

3 TOPAS Results

In this section we present some results from TOPAS collected on the three CRAY T3Es of the John von Neumann Institute for Computing (NIC): a 512 node T3E-600, a 256 node T3E-900, and a 256 node T3E-1200. NIC is a mutual foundation of Forschungszentrum Jülich and Deutsches Elektronen-Synchrotron DESY to support supercomputer-aided scientific research and development in Germany. It provides supercomputer capacity for projects in university and industry nationwide in the fields of modelling and computer simulation. Typical applications cover chemistry, many-particle physics, high energy physics, astrophysics, and environmental science. Currently, we have about 500 registered users on the CRAY T3Es. The T3E-600 and T3E-1200 only run batch jobs, while the T3E-900 is available for interactive program development during the day on weekdays. The measurements were done between May and August 1999.

It should be clear that the data presented here only describes the mix of applications run on our CRAY T3Es in the last three months. For a detailed analysis more data is necessary. It should not be used to draw conclusions on the performance of CRAY T3E machines in general. Also, note that the MFlop rates are based on the total wall clock execution time of the applications, i.e., they also cover the input, output, initialization, wrap-up, and checkpointing phases of the program and not just inner loops or kernels. This is basically the worst possible way of computing an MFlop rate.

Here is a summary of the overall results as produced by the TOPAS perl statistics script:

==========================================
TOPAS Report      1999/05/04 to 1999/08/23
==========================================

 Mflops    Num%   Num   Time%         Time
------------------------------------------
  0- 50  64.95% 44160  69.13% 643638:27:57
 50-100  18.13% 12323  19.13% 178111:52:48
100-150  11.10%  7544   8.58%  79850:53:59
150-200   2.68%  1820   2.86%  26595:19:35
200-250   0.83%   567   0.13%   1211:11:29
250-300   0.68%   459   0.07%    683:56:15
300-350   0.72%   488   0.07%    618:58:00
350-400   0.63%   431   0.04%    344:37:34
400-450   0.29%   196   0.01%     58:34:26

   Lang    Num%   Num   Time%         Time
------------------------------------------
    cxx   3.03%  2065   3.72%  34612:10:41
    kxx   2.00%  1365   5.10%  47524:35:34
      c  19.93% 13604  19.85% 184816:59:24
    f90  75.00% 51195  71.33% 664160:20:54

 MP-Lib    Num%   Num   Time%         Time
------------------------------------------
    sma   5.05%  3448  25.42% 236699:09:58
    pvm   0.14%    98   0.00%     11:32:21
      -   3.96%  2705  10.91% 101545:22:34
    mpi  90.84% 62012  63.67% 592858:02:52
Listing 4: TOPAS results

Measured by time, about 71% of the usage of our T3Es is by programs written in Fortran, 20% by programs written in C, and the remaining 9% by programs written in C++ (using Cray's CC and KAI KCC). Also, most applications use either MPI (64%) or SHMEM (25%) as their communication library. PVM is basically not used. An interesting fact is that about 11% of the time is used by programs which do not communicate at all!

Figure 1 shows the MFlop distribution graphically but broken down by CPU type and also in a finer resolution. The values for the T3E-600 are much worse than for the other two machines. This is probably because this machine has only 128 MByte of main memory (compared to 512 MByte on the others), its stream buffers are switched off, and it only runs batch jobs which request more than 64 PEs. Also, all MFlop rates higher than 200 are collapsed into one bar to make the graphics more readable. As one can see from Listing 4, there are applications reaching up to 450 MFlops.

Figure 1: MFlop Distribution broken down by CPU Type

Figure 2 shows the same data but broken down by programming language used.

Figure 2: MFlop Distribution broken down by Programming Language used

4 TOPAS as a User-level Performance Tool

So far, user feedback on TOPAS has been very positive. Some users even requested that the measurement cover all PEs of an application because then TOPAS could be used as a primitive performance tool for measuring MFlops and for detecting load imbalance. However, we could not add this feature to the default TOPAS measurement module because the necessary synchronization and communication for its implementation would make the detection of programs which use no message passing library impossible. Therefore, we implemented a separate extended version of the measurement module with the requested features which is only linked to a user application on request. This can be done in three ways.
  1. By setting the environment variable TOPAS_MEASURE_ALL_PE before linking the application. This environment variable is then used in our modified versions of the compiler driver scripts to pick the right version of the measurement module.
  2. By using the command topas for linking. This is just a two line shell script which first sets the environment variable TOPAS_MEASURE_ALL_PE, and then executes the original linker command given to it as parameters.
  3. By using the (empty) linker command file topas.cld which is ignored by the Cray linker but again recognized by our modified versions of the compiler driver scripts.
The extended version of the TOPAS measurement module still writes an entry into the TOPAS log file so the program still shows up in the statistics, but in addition performs the necessary measurements on all PEs of the application. Of course, in this mode, only one instruction counter can be measured, but it is possible for the user to select what kind of operations should be counted by setting the environment variable TOPAS_OPS to fpops, intops, loads or stores.
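
How the value of TOPAS_OPS might be mapped to a counter selection is sketched below. Only the variable name and its four values are taken from the description above. The enum and function names are hypothetical, and the sketch assumes that floating point operations are the default and that unknown values fall back to this default.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { OPS_FP, OPS_INT, OPS_LOAD, OPS_STORE } TopasOps;

/* Map the value of TOPAS_OPS to the kind of operation to count. */
TopasOps parse_topas_ops(const char *value) {
    if (value == NULL)                return OPS_FP;   /* variable not set */
    if (strcmp(value, "fpops") == 0)  return OPS_FP;
    if (strcmp(value, "intops") == 0) return OPS_INT;
    if (strcmp(value, "loads") == 0)  return OPS_LOAD;
    if (strcmp(value, "stores") == 0) return OPS_STORE;
    return OPS_FP;   /* assumption: unknown values select the default */
}

TopasOps topas_ops_from_environment(void) {
    return parse_topas_ops(getenv("TOPAS_OPS"));
}

/* Name of the selection, e.g. for the printed report. */
const char *topas_ops_name(TopasOps ops) {
    static const char *const names[] = { "fpops", "intops", "loads", "stores" };
    assert(ops == OPS_FP || ops == OPS_INT ||
           ops == OPS_LOAD || ops == OPS_STORE);
    return names[ops];
}
```

Keeping the parsing separate from the getenv call makes the mapping easy to check without changing the environment of the running program.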

At the end, the collected data is written to the file topas.out in the user's directory. A separate command topasview is used to analyze the contents of this file. topasview can either print the raw data (wall clock, system, and user time, counter and level 1 data cache miss rate) nicely formatted one line per PE or perform some simple statistical analysis on the counter rates. This includes a statistical summary and a simple cluster analysis. Example output for a Car-Parrinello code running on 16 PEs is shown in Listing 5:

***    95.783 on pe 9
min    95.922 on pe 5
q25    96.103
q50    96.171
q75    96.279
max    96.342 on pe 2
***    97.218 on pe 0

mean   96.215
rho     0.307

95-96:  5, 9, 13
96-97:  1-4, 6-8, 10-12, 14-15
97-98:  0
Listing 5: topasview statistical summary (top) and cluster analysis (bottom)
The statistical summary shows the minimum, the maximum, the 25%, 50%, and 75% quantiles, as well as the mean and the standard deviation. Values further away from the quartiles than 1.5 times the inter-quartile range are marked as extreme values. For these and for the minimum and maximum, the originating PE is also printed. The CRAY T3E performance tool PAT can measure the same values but produces much larger data files (due to the additional collection of profile data for all procedures called in an application). Moreover, it does not provide any kind of statistical analysis as topasview does.
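
The rule for marking extreme values can be sketched as follows. The names are hypothetical, and the quantile definition used here (linear interpolation between neighboring sorted values) is an assumption, since the method used by topasview is not specified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int compare_doubles(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort the per-PE values in ascending order. */
void sort_values(double *values, size_t n) {
    qsort(values, n, sizeof values[0], compare_doubles);
}

/* Quantile p (0 <= p <= 1) of an ascending array, using linear
   interpolation between neighbors (assumed definition). */
double quantile_sorted(const double *sorted, size_t n, double p) {
    assert(n > 0 && p >= 0.0 && p <= 1.0);
    double pos = p * (double)(n - 1);
    size_t lo = (size_t)pos;
    size_t hi = (lo + 1 < n) ? lo + 1 : lo;
    return sorted[lo] + (pos - (double)lo) * (sorted[hi] - sorted[lo]);
}

/* A value is extreme if it lies more than 1.5 times the
   inter-quartile range below q25 or above q75. */
int is_extreme(double value, double q25, double q75) {
    double fence = 1.5 * (q75 - q25);
    return value < q25 - fence || value > q75 + fence;
}
```

Applied to Listing 5, the inter-quartile range is 96.279 - 96.103 = 0.176, so values below 95.839 or above 96.543 count as extreme. This is the case for 95.783 (PE 9) and 97.218 (PE 0), but not for the reported minimum 95.922 and maximum 96.342.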

5 Conclusion

In this paper we presented the design and implementation of TOPAS, a very simple but effective tool to automatically and transparently monitor usage and performance of every parallel job executed on a CRAY T3E. The measurement is done in a way that no overhead is introduced during the execution of the application itself. The data collected gives a coarse overview of the sustained performance of all applications of a T3E system and provides hints as to which codes should be analyzed in more detail. If more precise data is needed, the regular CRAY T3E performance tools Apprentice and PAT [2] or Rabenseifner's methods [4, 5, 6] can be used. Based on the techniques used for TOPAS (sitelocal_start and atexit), his work could even be extended to cover PVM and SHMEM programs, too.

6 Acknowledgements

The author wants to thank Reiner Vogelsang who was instrumental in understanding the secrets of soft references and program startup on the T3E, Mark Potts for providing the initial information on how to access the performance counters, and Rudolf Berrendorf for implementing PCL.

7 References

[1] CRAY Research, Inc., sitelocal_start(3C) and loaded(3C), UNICOS System Libraries Reference Manual SR-2080, 1995.
[2] J. Galarowicz, B. Mohr, Analyzing Message Passing Programs on the Cray T3E with PAT and VAMPIR, Proceedings of 4th European CRAY-SGI MPP Workshop, Eds.: H. Lederer, F. Hertwick, IPP-Report des MPI für Plasmaphysik, Garching, IPP R/46, 29-49 (also Technical Report FZJ-ZAM-IB-9809, Forschungszentrum Jülich), 1998.
[3] R. Berrendorf, H. Ziegler, PCL - The Performance Counter Library: A Common Interface to Access Hardware Performance Counters on Microprocessors, Technical Report FZJ-ZAM-IB-9816, Forschungszentrum Jülich, October 1998.
[4] R. Rabenseifner, Automatic MPI Counter Profiling of All Users: First Results on a CRAY T3E 900-512, in: Proceedings of the Message Passing Interface Developer's and User's Conference 1999 (MPIDC'99), Atlanta, USA, March 10-12, 1999.
[5] R. Rabenseifner, P. Gottschling, W. E. Nagel, and S. Seidl, Effective Performance Problem Detection of MPI Programs on MPP Systems: From the Global View to the Detail, in: Proceedings of Parallel Computing 99 (ParCo99), Delft, The Netherlands, 1999.
[6] R. Rabenseifner, Automatic Profiling of MPI Applications with Hardware Performance Counters, in: Proceedings of the 6th PVM / MPI European Users Meeting, EuroPVM/MPI'99, Barcelona, Spain, 1999.
[7] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, Scientific and Engineering Computation Series, MIT Press, 1994.
[8] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing, Scientific and Engineering Computation Series, MIT Press, 1994.
[9] CRAY Research, Inc., Message Passing Toolkit: Release Overview, RO-5290 1.2, 1998.

Copyright © 1999 Forschungszentrum Jülich, NIC/ZAM

CRAY, UNICOS, UNICOS/mk, CF90, and CRAY T3E are trademarks of Cray Research, Inc.