Universität Karlsruhe (TH)
Rechenzentrum

KaSC Benchmark Suite

Version 1.0

AUGUST 2004





1 DIRECTORY STRUCTURE
1.1 KERNELS
2 BENCHMARK PROGRAMS
2.1 CONFIGURATION OF BENCHMARK PROGRAMS
2.2 COMPILATION OF BENCHMARK PROGRAMS
2.3 RUNNING THE BENCHMARK PROGRAMS
2.4 RUNNING DIFFERENT COMMUNICATION BENCHMARKS
2.4.1 Communication within one SMP node
2.4.1.1 Bandwidth and latency for single ping pong between two tasks within one SMP node
2.4.1.2 Bandwidth and latency for single exchange between two tasks within one SMP node
2.4.1.3 Bisection bandwidth for ping pong within one SMP node
2.4.1.4 Bisection bandwidth for exchange within one SMP node
2.4.2 Communication between two SMP nodes
2.4.2.1 Single ping pong test between two tasks running on different SMP nodes
2.4.2.2 Single exchange between two tasks running on different SMP nodes
2.4.2.3 Multiple ping pong test between all CPUs of two SMP nodes
2.4.2.4 Multiple exchange between all CPUs of two SMP nodes
2.4.3 Bisection Bandwidth
2.4.4 Latency Hiding

3 CONTACT



1 Directory structure

All files belonging to this benchmark are supplied in a compressed tar file benchmark.tar.gz, which
should be uncompressed and untarred using the commands

gunzip benchmark.tar.gz
tar -xvf benchmark.tar

The top-level directory also contains a file make.inc. Within this file, compiler names and options, library
paths etc. are defined that are used to compile the various benchmark programs. The Makefile includes this
file in order to read all these definitions. For some systems, sample files make_<SYSTEM>.inc are included
in the top-level directory.


1.1 Kernels

The top-level directory contains low-level programs that measure specific characteristics of one processor, one
SMP node or the whole system. It contains a set of low-level kernels that are typical for applications in the
field of scientific computing. Some programs run only on a single CPU; others run on a single CPU
as well as on all CPUs of an SMP node. The small benchmark suite also includes programs to
measure the communication performance of typical MPI point-to-point communications.


2 Benchmark Programs

All programs are written in Fortran 90 and call a C function seconds to measure CPU time and wall clock
time. The C function seconds.c itself calls the function gettimeofday from the system library to obtain the
timing data. Depending on whether your Fortran compiler appends an underscore to the names of external
functions, you should add the option -DFTNLINKSUFFIX, and depending on whether it converts the
name seconds to uppercase (SECONDS), you should add the option -DUPPERCASE to the list of
compiler options for the C compiler (CFLAGS in make.inc). All programs contain a structure that is similar
to

call SECONDS(tim1,tdum)
do j=1,repfactor
   call DUMMY(x1,x2,x3,n)
   do i=1,n
      ! simple operation, e.g. x3(i) = x1(i) + x2(i)
   enddo
enddo
call SECONDS(tim2,tdum)
With this loop construct we want to measure the time needed to complete the inner loop do i=1,n. To
lengthen the measured interval and thereby increase the accuracy of the measurement, this loop is
executed repeatedly (outer loop do j=1,repfactor). The call to the external subroutine DUMMY is included
to guarantee that the outer loop is executed as often as requested; otherwise compiler optimization
could transform this loop construct. You should make sure that neither the compiler nor the linker performs an
optimization that changes this sequence and nesting of loops. In particular, interprocedural optimization
may have to be switched off for linking, i.e. the options used in FFLAGS (in make.inc) should not enable any
interprocedural optimization during the compile or link step.
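As an illustration of the name-mangling options mentioned above, a minimal sketch of such a timer wrapper is shown below. It assumes that the first argument receives the wall clock time (via gettimeofday) and the second the CPU time; the seconds.c actually shipped with the suite may handle the arguments and the external name differently.

/* Sketch of a timer wrapper; not the original seconds.c of the suite.   */
#include <sys/time.h>
#include <time.h>

/* Adjust the external name to the Fortran compiler's convention,
   controlled by -DUPPERCASE and -DFTNLINKSUFFIX in CFLAGS.              */
#if defined(UPPERCASE) && defined(FTNLINKSUFFIX)
#  define SECONDS_NAME SECONDS_
#elif defined(UPPERCASE)
#  define SECONDS_NAME SECONDS
#elif defined(FTNLINKSUFFIX)
#  define SECONDS_NAME seconds_
#else
#  define SECONDS_NAME seconds
#endif

void SECONDS_NAME(double *walltime, double *cputime)
{
    struct timeval tv;
    gettimeofday(&tv, 0);                                  /* wall clock time */
    *walltime = (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    *cputime  = (double)clock() / (double)CLOCKS_PER_SEC;  /* CPU time        */
}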

The benchmark suite includes the following serial programs running on one processor (a parallel version
exists for each of these programs except bm116):



bm111: vector add                          x3(i) = x1(i) + x2(i)
bm112: vector multiply                     x3(i) = x1(i) * x2(i)
bm113: vector divide                       x3(i) = x1(i) / x2(i)
bm114: vector triad with scalar (saxpy)    x3(i) = x1(i) + s * x2(i)
bm116: vector compound operation           x7(i) = (x1(i)+x2(i))*x3(i) + (x4(i)-x5(i))/x6(i)
bm117: dot product                         s = s + x1(i) * x2(i)
bm118: vector triad                        x4(i) = x2(i) + x3(i) * x4(i)

The benchmark suite includes the following serial programs running only on one processor:

bm111str: vector add with stride                                      x3(i*istr) = x1(i*istr) + x2(i*istr)
bm111gth: vector add with gather                                      x3(i) = x1(ix(i)) + x2(i)
bm111sct: vector add with scatter                                     x3(ix(i)) = x1(i) + x2(i)
bm117gth: dot product with gather                                     s = s + x1(ix(i)) * x2(i)
bm121: matrix multiply - rowwise version                              C = A * B
bm122: matrix multiply - dot product version                          C = A * B
bm123: matrix multiply - columnwise version                           C = A * B
bm123u4: matrix multiply - columnwise version with 4-fold unrolling   C = A * B
bm124: matrix multiply - library version                              C = A * B
bm131: scalar performance test 1                                      a = a * b + c
bm132: scalar performance test 2                                      a = b*c + d; b = a - c*b; c = a/b

The benchmark suite includes the following programs that usually run in parallel on one node and measure
the performance on the first processor:

bm111smp: vector add on all processors                  x3(i) = x1(i) + x2(i)
bm112smp: vector multiply on all processors             x3(i) = x1(i) * x2(i)
bm113smp: vector divide on all processors               x3(i) = x1(i) / x2(i)
bm114smp: vector triad with scalar on all processors    x3(i) = x1(i) + s * x2(i)
bm117smp: dot product on all processors                 s = s + x1(i) * x2(i)
bm118smp: vector triad on all processors                x4(i) = x1(i) + x2(i) * x3(i)
bm1148smp: vector triad with scalar on the first processor (x3(i) = x1(i) + s * x2(i)) and vector triad
           with large, constant vector length on all other processors (x4(i) = x1(i) + x2(i) * x3(i))
bm1178smp: dot product on the first processor (s = s + x1(i) * x2(i)) and vector triad with large,
           constant vector length on all other processors
bm1188smp: vector triad on the first processor (x4(i) = x1(i) + x2(i) * x3(i)) and vector triad with
           large, constant vector length on all other processors

The programs measuring the communication performance are:

bmmpi1: MPI ping-pong benchmark
bmmpi2: MPI double ping-pong (exchange)
bmmpi3: MPI overlap test (short messages)


bmmpi4: MPI overlap test (long messages)









2.1 Configuration of Benchmark Programs

Before starting the installation process you should check and adapt the file make.inc. A sample make.inc is
shown below:

# make.inc
#
# Utility file used by the KaSC Benchmark.
#
########################################################################
#
# Part 1: Utilities
#
# Name of the make utility that will be used to compile and link the
# benchmarking program.
# Most of the benchmarking programs are based on using the GNU make utility.
MAKE = gmake

# Number of processors of one node
NP =
# Number of processors for the communication benchmarks
NP_COMM =

# Name of the command to run parallel programs, e.g. mpirun
PAR_CMD =

# Option for the number of processors that is placed in front of the executable,
# together with the number of processors for the serial run, e.g. -np 1
PAR_OPTS1 =
# together with the number of processors for the parallel run, e.g. -np $(NP)
PAR_OPTP1 =

# Option for the number of processors that is placed behind the executable,
# together with the number of processors for the serial run, e.g. -procs 1
PAR_OPTS2 =
# together with the number of processors for the parallel run, e.g. -procs $(NP) -nodes 1
PAR_OPTP2 =

# Option for the number of processors that is placed in front of the executable,
# together with the number of processors for the communication benchmarks,
# e.g. -np $(NP_COMM)
PAR_OPTC1 =
# Option for the number of processors that is placed behind the executable,
# together with the number of processors for the communication benchmarks,
# e.g. -procs $(NP_COMM) -nodes 2
PAR_OPTC2 =




######################################################################
#
# Part 2: Compiler and compiler flags
#

# Name of the C compiler
CC =
# C compiler flags including high optimization
CFLAGS =


# Name of Fortran 90/95 compiler, used to compile programs *.f90 written in free
# format
F90 = f90
# Name of the script used to compile MPI programs written in Fortran 90
# (free format), e.g. mpif90
MPF90 =
# Fortran 90 compiler flags including high optimization
F90FLAGS =

#############################################################################
#
# Part 3: Libraries
#

# Flags for the Linker/Loader
LDFLAGS =

# Serial BLAS library, e.g. -lblas
BLASLIB =

# Additional communication libraries for MPI jobs (usually not required)
LIBFLGCOMM =




2.2 Compilation of Benchmark Programs

No program modifications should be necessary to compile and link the programs:

gmake cleanall
gmake all

Within the top-level directory there is one auxiliary file cache_sizes that has to be adapted to the
characteristics of the target system. The first line of this file gives the size of the level 1 data cache, line 2
the size of the level 2 data cache, and so on. The file contains as many lines as there are levels of data cache
in your system. The input format is xxx B, xxx kB or xxx MB.
For a processor with 128 kB of level 1 data cache, 2 MB of level 2 data cache and 16 MB of level 3 data
cache, the contents of this file would be:
128 kB
2 MB
16 MB
If a data cache is shared by several processors of one shared memory node, you should divide the cache
size by the number of processors that share this data cache. If, for example, each processor on your system
has 128 kB of level 1 data cache that is not shared with other processors, 2 MB of level 2 cache shared by
2 processors and 128 MB of level 3 cache shared by 8 processors, the contents of the file cache_sizes
would be:
128 kB # 128 kB level 1 cache per processor
1 MB # 2 MB level 2 cache shared by 2 processors, i.e. 1 MB per processor
16 MB # 128 MB level 3 cache shared by 8 procs., i.e. 16 MB per processor
The program gen_input generates the input files input_3_vectors, input_2_vectors and
input_4_vectors. gen_input is automatically run before the benchmark programs requiring these
files are launched. These benchmark programs are bm111smp, bm112smp, bm113smp, bm114smp,
bm1148smp, bm117smp, bm1178smp, bm118smp and bm1188smp. They compute performances for
vector lengths that just fit into the L1, L2 and L3 cache (if there are 3 cache levels), in addition to the vector
lengths 10000000, 1000000, 100000, 10000, 1000, 100 and 10.
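To illustrate what "just fits into the cache" means, the following sketch derives such a vector length from the cache_sizes file, assuming three vectors of 8-byte REAL elements per loop iteration (as in the triad kernels). This is only an illustration of the idea; gen_input itself may apply a different rule.

/* Sketch: derive per-cache-level vector lengths from cache_sizes.
   Assumes three 8-byte vectors per iteration; gen_input may differ. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("cache_sizes", "r");
    char line[256];
    int level = 1;

    if (f == NULL) {
        fprintf(stderr, "cannot open cache_sizes\n");
        return 1;
    }
    while (fgets(line, sizeof line, f) != NULL) {
        double size;
        char unit[4];
        if (sscanf(line, "%lf %3s", &size, unit) != 2)
            continue;                           /* skip malformed lines */
        if (unit[0] == 'k') size *= 1024.0;     /* kB -> bytes          */
        if (unit[0] == 'M') size *= 1048576.0;  /* MB -> bytes          */
        /* three vectors of 8-byte elements must fit into this level    */
        printf("L%d cache: vector length %ld\n", level++, (long)(size / 24.0));
    }
    fclose(f);
    return 0;
}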


2.3 Running the Benchmark Programs

You can run the programs bm11x (x={1,2,3,4,6,7,8}) with the vector lengths 10000000, 1000000, 100000,
10000, 1000, 100, 10 and 1 on one CPU by calling

gmake run_single.

Run the parallel versions of these programs (i.e. all of them except bm116) on a single CPU by calling

gmake run_single_parallel_version.

Run the programs bm111smp, bm112smp, bm113smp, bm114smp, bm1148smp, bm117smp,
bm1178smp, bm118smp and bm1188smp on all CPUs of a single node by calling

gmake run_parallel.

Run the programs bm111str, bm111gth, bm111sct, bm117gth, bm131, bm132, bm12x
(x={1,2,3,3u4,4}) on a single CPU by calling

gmake run_mono.

To run the parallel programs you must set NP to the number of processors of one node, PAR_CMD to the
command for the parallel execution of programs, either PAR_OPTP1 or PAR_OPTP2 to the parallel options in
the command line, and either PAR_OPTS1 or PAR_OPTS2 to the parallel options in the command line with the
number of processors set to 1.

Run the programs bmmpix (x={1,2,3,4}) on two or more CPUs by calling

gmake run_comm.


gmake run_simple

performs gmake run_single, gmake run_mono and gmake run_comm in one step.

gmake run_simple_new

performs gmake run_single_parallel_version, gmake run_mono and gmake run_comm
in one step.

gmake run or gmake run_all

performs gmake run_single, gmake run_mono, gmake run_parallel and gmake
run_comm in one step.



gmake run_all_new

performs gmake run_single_parallel_version, gmake run_mono, gmake run_parallel
and gmake run_comm in one step.

It is possible to compile only parts of the whole benchmark suite by replacing run with comp, i.e.
gmake comp_comm compiles all programs that are run by the command gmake run_comm.



2.4 Running different Communication Benchmarks

2.4.1 Communication within one SMP node

There are four test cases for communication within one SMP node:
1. Bandwidth and latency for single ping pong between two tasks within one SMP node
2. Bandwidth and latency for single exchange between two tasks within one SMP node
3. Bisection bandwidth for ping pong within one SMP node
4. Bisection bandwidth for exchange within one SMP node


2.4.1.1 Bandwidth and latency for single ping pong between two tasks within one SMP
node

The program bmmpi1 implements a simple ping-pong operation as shown in Fig. 1 and Fig. 3. You
can use this program first to measure the communication between two MPI tasks running within the same
SMP node by setting either PAR_OPTC1 or PAR_OPTC2 in the file make.inc so that the number of processors
is less than or equal to the number of processors of one node and the number of nodes is 1.



Fig. 1: single ping pong communication pattern
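For illustration, a minimal C sketch of this ping-pong pattern is shown below (the programs of the suite themselves are written in Fortran 90); the message size and repetition count used here are arbitrary choices, not those of bmmpi1.

/* Sketch of a single ping-pong between task 0 and task 1; bmmpi1 may differ. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { NBYTES = 1 << 20, REPS = 100 };    /* arbitrary message size / count */
    static char buf[NBYTES];
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {                      /* task 0: send, then receive the reply */
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {               /* task 1: receive, then send back      */
            MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* one-way time; bandwidth follows as NBYTES / time */
        printf("time per one-way transfer: %g s\n", (t1 - t0) / (2.0 * REPS));

    MPI_Finalize();
    return 0;
}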


2.4.1.2 Bandwidth and latency for single exchange between two tasks within one SMP
node

In order to exchange data between two tasks, the communication operations in program bmmpi2 are
implemented using MPI_Isend, MPI_Irecv and MPI_Wait function calls. This communication pattern is
shown in Fig. 2. The main purpose of this benchmark is to see whether the communication is bi-directional
or not.



To run this program on two or more processors of one SMP node, leave PAR_OPTC1 and PAR_OPTC2 unchanged.

Fig. 2: single exchange communication pattern
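A minimal C sketch of this exchange pattern with MPI_Isend, MPI_Irecv and MPI_Wait is shown below for illustration (bmmpi2 itself is written in Fortran 90 and measures a range of message sizes; the single message size and the pairing of only tasks 0 and 1 here are arbitrary simplifications).

/* Sketch of a single exchange between task 0 and task 1; bmmpi2 may differ. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { NBYTES = 1 << 20 };                /* arbitrary message size */
    static char sbuf[NBYTES], rbuf[NBYTES];
    int rank;
    MPI_Request req[2];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank < 2) {
        int partner = 1 - rank;               /* task 0 exchanges with task 1 */
        t0 = MPI_Wtime();
        /* both tasks post their receive and send at the same time */
        MPI_Irecv(rbuf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, NBYTES, MPI_BYTE, partner, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Wait(&req[0], MPI_STATUS_IGNORE);
        MPI_Wait(&req[1], MPI_STATUS_IGNORE);
        t1 = MPI_Wtime();

        /* with truly bi-directional links this takes roughly as long as one one-way send */
        if (rank == 0)
            printf("exchange time: %g s\n", t1 - t0);
    }

    MPI_Finalize();
    return 0;
}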


2.4.1.3 Bisection bandwidth for ping pong within one SMP node

To measure the bisection bandwidth for a ping pong operation within one SMP node, the program bmmpi1
is run on all processors of one SMP node [1].

Fig. 3: communication pattern for multiple ping pong and multiple exchange

Set either PAR_OPTC1 or PAR_OPTC2 in the file make.inc so that the number of processors is equal to the
(even) number of processors of one node.


2.4.1.4 Bisection bandwidth for exchange within one SMP node


[1] In this case we assume that the number of CPUs per SMP node is even. Otherwise the number of MPI tasks should be
reduced by one.


