Abstract. TOPAS is a tool to automatically and transparently monitor usage and performance of every parallel job executed on a CRAY T3E. We have modified the UNICOS/mk compiler wrapper scripts to automatically link the TOPAS measurement module to every user application whenever it is recompiled. No modification is necessary in the user's program or build procedures. At run-time, two PEs of the parallel application are picked to actually perform the measurement for the parallel job as a whole. The measurement consists of executing special code immediately before and after the execution of the program, so there is no measurement overhead during the execution of the application itself. The TOPAS module is very simple (about 250 lines of code). It is based on the Performance Counter Library (PCL), a common interface for portable performance counting on microprocessors, also developed at NIC/ZAM. Through environment variables, users can request the printing of the recorded information at the end of the execution, choose to measure integer, load, or store operations instead of floating point, and specify the PEs which should be used for performing the measurement.
In addition to the TOPAS measurement module, we implemented a tool which allows a system administrator to calculate interesting statistics like the typical MFlop rates achieved by user programs, as well as programming language and message passing library usage from this data. Most of this information is not available through regular T3E system accounting.
The only tool currently available is the excellent work of Rolf Rabenseifner of the High-Performance Computing-Center Stuttgart (HLRS) [4, 5, 6]. He implemented an automatic counter instrumentation and profiling module which gets added to the MPI library for CRAY T3E and SGI Origin2000 systems. However, after an extensive review of his work, we decided to implement our own system because
First, the paper describes the design and implementation of the TOPAS system. Then we give an overview of the results obtained through TOPAS in the first three months of its operation. Last, Section 4 describes an extension to the original TOPAS system which allows users to get an overview of the performance of their application.
To initialize TOPAS and to start the measurement, we use a little-known (but documented and supported) feature of the UNICOS common start-up code implemented by the function $START$. First, it does all the necessary initializations (e.g., allocation of private and shared heap segments). Just before calling the main routine of the program, $START$ checks for the existence of a sitelocal_start routine [1]. If this routine is linked into the program, it will be called.
Then, in sitelocal_start, it is possible to register another function to be executed at the end of the program by using the ANSI C function atexit. Because atexit calls registered functions in the reverse order of their registration, and sitelocal_start runs before the main routine of the program, this function is called last.
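The reverse-order behaviour of atexit can be illustrated with a minimal, portable C sketch. The handler names and the helper register_handlers are hypothetical and only demonstrate the ordering; they are not part of TOPAS.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for wrap-up code registered in a start-up hook. */
static void early_registered(void) { puts("registered first, runs last"); }

/* Hypothetical stand-in for an exit handler installed later by the program. */
static void late_registered(void) { puts("registered second, runs first"); }

/* Registers both handlers; returns 0 if both registrations succeeded. */
int register_handlers(void)
{
    int rc = atexit(early_registered);  /* as done before main starts */
    if (rc != 0)
        return rc;
    return atexit(late_registered);     /* as done later by user code */
}
```

When a program that called register_handlers exits normally, the C standard guarantees that late_registered runs before early_registered.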
Therefore, the measurement module consists of two functions (see pseudo code Listing 1):
    void sitelocal_start() {
        /* -- only measure parallel programs -- */
        if ( _num_pes() > 1 ) {
            initialize_and_check_environment();
            if ( i_am_measurement_pe() ) {
                /* -- install exit routines -- */
                atexit(sitelocal_end);
                atabort(sitelocal_end);
                /* -- start measurement -- */
                PCL_initialize_and_start_HW_counter();
                start_UNIX_timer();
            }
        }
    }

    void sitelocal_end() {
        /* -- end measurement -- */
        end_UNIX_timer();
        PCL_read_HW_counter();
        calculate_elapsed_time();
        get_program_characteristics();
        /* -- display and store results -- */
        print_results_to_logfile();
        if ( batch_or_user_request ) display_results();
    }
These two PEs now actually perform the measurement. First, they register the TOPAS wrap-up code sitelocal_end with atexit. In addition, we use the UNICOS extension atabort which works like atexit but calls the registered functions if the program is aborted.
Next, the CRAY T3E hardware counters are initialized to measure the number of floating point and the number of integer operations. This is done by using the Performance Counter Library (PCL) [3], a common interface for portable performance counting on microprocessors, also developed at NIC/ZAM. Although the DEC Alpha CPU used in the T3E has two performance counters, the number of floating point and the number of integer operations cannot be measured at the same time. Therefore, the TOPAS measurement module runs on two PEs, each used to count one of the values. The second available hardware counter is used to determine the level 1 data cache misses.
Finally, timers are initialized and started by the standard UNIX function times which returns wall clock, system, and user time in system hardware clock ticks.
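As a rough illustration of how tick counts from times can be turned into seconds, the following sketch uses the POSIX interface. The struct elapsed and both helper functions are hypothetical names, and taking the tick rate from sysconf(_SC_CLK_TCK) is an assumption that holds for POSIX systems; the T3E implementation may obtain the hardware clock rate differently.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Elapsed user, system, and wall clock time in seconds. */
struct elapsed {
    double user;
    double sys;
    double wall;
};

/* Converts a tick count to seconds; ticks_per_sec must be positive. */
double ticks_to_seconds(clock_t ticks, long ticks_per_sec)
{
    assert(ticks_per_sec > 0);
    return (double)ticks / (double)ticks_per_sec;
}

/* Computes the elapsed times between two times() samples.
   wall_start and wall_end are the return values of the two times() calls. */
struct elapsed elapsed_between(const struct tms *start, clock_t wall_start,
                               const struct tms *end, clock_t wall_end)
{
    long hz = sysconf(_SC_CLK_TCK);  /* POSIX ticks per second (assumption) */
    struct elapsed e;

    e.user = ticks_to_seconds(end->tms_utime - start->tms_utime, hz);
    e.sys  = ticks_to_seconds(end->tms_stime - start->tms_stime, hz);
    e.wall = ticks_to_seconds(wall_end - wall_start, hz);
    return e;
}
```

A caller would sample `clock_t w0 = times(&t0);` at start-up and `clock_t w1 = times(&t1);` in the wrap-up code, then pass both samples to elapsed_between.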
| No. | Description | Method of Collection | Format |
|---|---|---|---|
| 1. | Date and Time | localtime(time(0)) | YYYY/MM/DD HH:MM:SS |
| 2. | user name | First successful call out of: 1. cuserid(0), 2. getlogin(), 3. getenv("LOGNAME"), 4. getenv("USER") | string |
| 3. | name of the executable | __progname | string |
| 4. | measuring PE | _my_pes() | int |
| 5. | total number of PEs | _num_pes() | int |
| 6. | MHz of the CPUs | sysconf(_SC_CRAY_CPCYCLE) | t300, t450, or t600 |
| 7. | execution mode: batch or interactive | getenv("QSUB_HOME") != NULL | B or I |
| 8. | programming language | see below | cc, cxx, kxx, f77, or f90 |
| 9. | message passing library | loaded(...); see below | mpi, pvm, sma, or - |
| 10. | user, system, wall clock time | times() | float in seconds |
| 11. | number of floating point or integer operations | PCL | f=int or i=int |
| 12. | level 1 data cache misses | PCL | int |
| 13. | unique identification | getpgrp() | int |
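The fallback chain used for item 2 (user name) can be sketched in portable C as follows. This is only an illustration: the function name lookup_user_name and the final "unknown" value are assumptions, and cuserid is left out here because it is no longer part of current POSIX.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns the first non-empty user name found, or "unknown". */
const char *lookup_user_name(void)
{
    const char *name = getlogin();  /* may fail without a controlling terminal */

    if (name == NULL || name[0] == '\0')
        name = getenv("LOGNAME");
    if (name == NULL || name[0] == '\0')
        name = getenv("USER");
    if (name == NULL || name[0] == '\0')
        name = "unknown";           /* fallback value is an assumption */
    return name;
}
```

Trying several sources in turn matters for batch jobs, where a login name is often not available from the terminal.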
Items 1 to 5, 7, and 10 are also available through regular UNICOS/mk system accounting but are included in the measurement to allow a simple implementation of the analysis of the data and to be able to relate the measured data to information available through other sources. Item 13 (the UNIX process group) is used to combine measurements from different PEs which belong to the same program execution. Unfortunately, this doesn't result in a unique id for multiple mpprun commands in a single NQS batch job, but this can be corrected during the off-line analysis.
Most data items can be calculated by calling standard UNIX or UNICOS functions or by checking standard environment variables. There are two exceptions.
    #include <stdio.h>
    #include <infoblk.h>

    /* -- declare MPI_Init as a soft reference -- */
    #pragma _CRI soft MPI_Init
    extern int MPI_Init(int *argc, char ***argv);

    int main() {
        if (loaded(MPI_Init)) printf("program uses MPI\n");
    }
The order of the checks is important as both MPI and PVM are implemented with SHMEM on the CRAY T3E. The special case that none of the three initialization routines is loaded is also recognized and accordingly recorded.
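The effect of the check order can be sketched as a small decision function. The function name and its three flag parameters are hypothetical; each flag stands for the result of a loaded(...) test on the corresponding initialization routine. The text only requires that SHMEM is tested after MPI and PVM, so the relative order of the first two tests is an assumption.

```c
#include <assert.h>

/* Returns the log tag for the message passing library in use.
   Each flag tells whether the corresponding initialization routine
   was found in the executable (hypothetical inputs). */
const char *message_passing_tag(int mpi_loaded, int pvm_loaded, int shmem_loaded)
{
    if (mpi_loaded)   return "mpi";  /* MPI is built on SHMEM: test it first */
    if (pvm_loaded)   return "pvm";  /* PVM is built on SHMEM as well */
    if (shmem_loaded) return "sma";  /* only SHMEM itself remains */
    return "-";                      /* no message passing library found */
}
```

If SHMEM were tested first, every MPI or PVM program would be recorded as a SHMEM program, because both libraries pull in the SHMEM routines.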
    # -- rest of f90 compiler script here ...

    TOPAS="/usr/local/topas/topas.f90.o -lpcl"
    for opt ; do
        if [[ $opt is special ]] ; then unset TOPAS; fi
    done

    exec $F90_DRIVER ${SEGLDOPTS} ${_F90_OPTS} "$@" ${CMDOPTS} \
         ${INCDIR} ${INCLUDE_PATH} ${LIB_PATH} ${LIBDIRS} ${TOPAS}
All changes are localized at the end of the corresponding scripts. The key change is to add the variable ${TOPAS} to the end of the last line of the script which executes the "real" compiler. But first, the variable ${TOPAS} is set to the pathname of the corresponding measurement module and the PCL library. In order to avoid problems when the program uses the performance counters itself (either directly or through the CRAY T3E performance tool PAT), the arguments given to the script are checked next and, if necessary, TOPAS is deactivated by unsetting the TOPAS variable. Of course, the real code for this check is more complicated than in Listing 3.
We implemented a short perl script topas-stat.pl which reads TOPAS log files line by line, splits each line into words, and calculates the total CPU time and MFlop rate for each job. At the end, it computes the distribution of the MFlop rates, of the programming language, and of the message passing library used by the applications. The bucket size of the histogram used to display the MFlop distribution can be changed by a command line parameter. The result can be printed in a nicely formatted way (see Listing 4) or in a format suitable to be used by graphics packages like gnuplot or xmgr or statistical packages like R. Also, the format of the data makes it easy to calculate statistics for a subset of the data, e.g.,
grep ' f90 ' topas.log | grep ' B ' | grep ' mpi ' | perl topas-stat.pl
computes the results for all Fortran90 batch programs which use MPI for
communication. The percentages of the statistics are computed in two ways.
The first percentage (Num% in Listing 4) is
based on the number of measurements
(program executions), while the second one (Time%) takes both
the number of PEs used and the execution time into account.
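The difference between the two percentages can be made concrete with a short C sketch. All struct and function names are hypothetical, and weighting each run by the product of its PE count and its wall clock time is an assumption about how the script combines the two quantities.

```c
#include <assert.h>
#include <stddef.h>

/* One measured program execution (field names are hypothetical). */
struct run {
    int    npes;          /* number of PEs used */
    double wall_seconds;  /* wall clock execution time */
    double mflops;        /* MFlop rate recorded for the run */
};

/* Percentages for one histogram bucket. */
struct share {
    double num_percent;   /* share of program executions */
    double time_percent;  /* share of PE-seconds (npes * wall time) */
};

/* Computes both percentages for runs with lo <= mflops < hi. */
struct share bucket_share(const struct run *runs, size_t n, double lo, double hi)
{
    size_t hits = 0;
    double hit_time = 0.0, total_time = 0.0;
    struct share s = {0.0, 0.0};

    for (size_t i = 0; i < n; i++) {
        double pe_seconds = runs[i].npes * runs[i].wall_seconds;
        total_time += pe_seconds;
        if (runs[i].mflops >= lo && runs[i].mflops < hi) {
            hits++;
            hit_time += pe_seconds;
        }
    }
    if (n > 0)
        s.num_percent = 100.0 * (double)hits / (double)n;
    if (total_time > 0.0)
        s.time_percent = 100.0 * hit_time / total_time;
    return s;
}
```

With this weighting, many short runs on few PEs raise the count-based percentage of a bucket but barely change its time-based percentage.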
It should be clear that the data presented here only describes the mix of applications run on our CRAY T3Es in the last three months. For a detailed analysis more data is necessary. It should not be used to draw conclusions on the performance of CRAY T3E machines in general. Also, note that the MFlop rates are based on the total wall clock execution time of the applications, i.e., they also cover the input, output, initialization, wrap-up, and checkpointing phases of the program and not just inner loops or kernels. It is basically the worst possible way of computing an MFlop rate.
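A rate computed this way can be written down in a few lines of C. The function name is hypothetical, and the sketch assumes the operation count is the value read from the hardware counter on the measuring PE; how the statistics script scales or combines counts is not specified here.

```c
#include <assert.h>

/* MFlop rate over the whole run: counted floating point operations
   divided by the total wall clock time, in millions per second.
   Returns 0.0 when no time has elapsed. */
double mflop_rate(double flop_count, double wall_seconds)
{
    assert(flop_count >= 0.0);
    if (wall_seconds <= 0.0)
        return 0.0;
    return flop_count / wall_seconds / 1.0e6;
}
```

Because wall_seconds spans the whole execution, any time spent in I/O or checkpointing lowers the result even when the compute kernels run fast.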
Here is a summary of the overall results as produced by the TOPAS perl statistics script:
    ==========================================
    TOPAS Report 1999/05/04 to 1999/08/23
    ==========================================
    Mflops    Num%   Num   Time%          Time
    ------------------------------------------
      0- 50 64.95% 44160  69.13%  643638:27:57
     50-100 18.13% 12323  19.13%  178111:52:48
    100-150 11.10%  7544   8.58%   79850:53:59
    150-200  2.68%  1820   2.86%   26595:19:35
    200-250  0.83%   567   0.13%    1211:11:29
    250-300  0.68%   459   0.07%     683:56:15
    300-350  0.72%   488   0.07%     618:58:00
    350-400  0.63%   431   0.04%     344:37:34
    400-450  0.29%   196   0.01%      58:34:26

    Lang      Num%   Num   Time%          Time
    ------------------------------------------
    cxx      3.03%  2065   3.72%   34612:10:41
    kxx      2.00%  1365   5.10%   47524:35:34
    c       19.93% 13604  19.85%  184816:59:24
    f90     75.00% 51195  71.33%  664160:20:54

    MP-Lib    Num%   Num   Time%          Time
    ------------------------------------------
    sma      5.05%  3448  25.42%  236699:09:58
    pvm      0.14%    98   0.00%      11:32:21
    -        3.96%  2705  10.91%  101545:22:34
    mpi     90.84% 62012  63.67%  592858:02:52
Measured by the time used (Time% in Listing 4), about 71% of our T3E usage is by programs written in Fortran, 20% in C, and the remaining 9% in C++ (using Cray's CC and KAI KCC). Also, most of the time is used by applications which use either MPI (64%) or SHMEM (25%) as their communication library. PVM is basically not used. An interesting fact is that about 11% of the time is used by programs which do not communicate at all!
Figure 1 shows the MFlop distribution graphically but broken down by CPU type and also in a finer resolution. The values for the T3E-600 are much worse than for the other two machines. This is probably because this machine has only 128 MByte main memory (compared to 512 MByte on the others). Its stream buffers are switched off, and it only runs batch jobs which request more than 64 PEs. Also, all MFlop rates higher than 200 are collapsed into one bar to make the graphics more readable. As one can see from Listing 4, there are applications reaching up to 450 MFlops.
Figure 2 shows the same data but broken down by programming language used.
At the end, the collected data is written to the file topas.out in the user's directory. A separate command topasview is used to analyze the contents of this file. topasview can either print the raw data (wall clock, system, and user time, counter and level 1 data cache miss rate) nicely formatted one line per PE or perform some simple statistical analysis on the counter rates. This includes a statistical summary and a simple cluster analysis. Example output for a Car-Parrinello code running on 16 PEs is shown in Listing 5:
    ***   95.783 on pe 9
    min   95.922 on pe 5
    q25   96.103
    q50   96.171
    q75   96.279
    max   96.342 on pe 2
    ***   97.218 on pe 0
    mean  96.215
    rho   0.307

    95-96: 5, 9, 13
    96-97: 1-4, 6-8, 10-12, 14-15
    97-98: 0
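The kind of per-PE summary shown in Listing 5 can be approximated by the following C sketch. It is not the topasview implementation: all names are hypothetical, the grouping into unit-width intervals is inferred from the listing, and the outlier marking (the lines starting with ***), the quartiles, and the rho value are not reproduced.

```c
#include <assert.h>

/* Summary of per-PE counter rates (names are hypothetical). */
struct rate_summary {
    double mean;
    int    min_pe;
    int    max_pe;
};

/* Computes the mean and the PEs holding the smallest and largest rate. */
struct rate_summary summarize_rates(const double *rate, int npes)
{
    struct rate_summary s = {0.0, 0, 0};
    double sum = 0.0;

    assert(rate != 0 && npes > 0);
    for (int pe = 0; pe < npes; pe++) {
        sum += rate[pe];
        if (rate[pe] < rate[s.min_pe]) s.min_pe = pe;
        if (rate[pe] > rate[s.max_pe]) s.max_pe = pe;
    }
    s.mean = sum / npes;
    return s;
}

/* Lower bound of the unit-width interval containing a non-negative rate. */
int cluster_of(double rate)
{
    assert(rate >= 0.0);
    return (int)rate;  /* truncation equals floor for non-negative values */
}
```

Applied to the values in Listing 5, cluster_of places the rate 95.783 of PE 9 in the interval 95-96 and the rate 97.218 of PE 0 in the interval 97-98.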
CRAY, UNICOS, UNICOS/mk, CF90, and CRAY T3E are trademarks of Cray Research, Inc.