MADD benchmarks                                        Jan Mandel
                                                       last updated 3/20/99

MILLIONS OF MULTIPLY/ADD (a=a+b*c) OPERATIONS PER SECOND, ONE PROCESSOR
Multiply numbers by 2 to get Mflops numbers as understood by some vendors.

All compiled with the command f77 -O3 unless otherwise noted, and using
64 bit floating point (double precision). The comparisons do not agree
with SPEC benchmarks because I measure the mix I care about, which is
not SPEC. All loops are simple, so we also measure how well the compiler
handles that and unrolls automatically.

------------------------------------------------------------------------------------

1. madd99 benchmark: disables compiler chaining of loops, splits dot and
   axpy into subroutines

                                              matmul    size 1000       size 10000      size 1e+6       size 1e+7
architecture                   name           3x3  6x6  dot1 dot2 axpy  dot1 dot2 axpy  dot1 dot2 axpy  dot1 dot2 axpy
SGI O2000 R10K 195MHz 2)       thunderbird    227  529  178  91   42    141  74   28    55   25   13    51   26   14
SGI O2000 R10K 195MHz          thunderbird    225  527  89   89   42    78   72   28    41   25   14    39   25   14
SUN250 UltraSparc-II 300MHz 1) cog            295  422  143  128  67    127  97   65    27   16.3 9.3   27.1 16.7 9.5
SUN250 UltraSparc-II 300MHz    cog            23   31   38   31   20    25   24   19    15   11   7     15   11   7
SUN450 UltraSparc-II 250MHz    sprocket       19   29   31   25   17    21   12   10    14   11   6     14   11   6
SGI Power Chal. R8000 75MHz    dracula        67   143  18   18   24    19   19   25    11   7    5     11   7    5
-----
1) base flags from specfp95: f77 -fast -xarch=v8plusa -fsimple=2
   (compiler 4.2 does not support -xprefetch)
2) base flags from specfp95: f77 -Ofast=ip27 -LNO:fusion=2

matmul: fastest speed, from cache and registers; depends heavily on
        compiler quality as well. Multiplies matrices of size 3x3 and
        6x6 using completely unrolled code.
dot1, dot2, axpy: cache- and memory-bound computation; large sizes are
        out of cache.
                 dot1    dot2    axpy
computes         s=a.a   s=a.b   c=c+a*b
reads per iter   1       2       2
writes per iter  0       0       1
bytes per MADD   8       16      24      (= MB/s per million MADD/s)

2. 1995/96 benchmark version (madd95): matmul and minmul are the same as
   in madd99, but dot and axpy allow the compiler to chain loops

                                          matmul      ddot  daxpy  ddot   daxpy  ddot  daxpy
run  architecture                name     3x3   6x6   1000  1000   10000  10000  1e+6  1e+6
99   SGI O2K R10K 195MHz      S  thb      226   528   175   177    140    166    152   177
99   SUN250 US-II 300MHz      S  cog      23    31    51    97     33     98     18    98
99   SUN350 US-II 250MHz      S  sprocket 19    29    43    82     26     81     16    83
99   SGI Power Chlg 75MHz     S  dracula  67    143   100   33     101    32.6   39    33
95   SGI Power Chlg 75MHz     S  dracula  20.0  51.9  95.5  93.1   95.9   110
95   SGI Power Chlg 90MHz     S  odin     24    62    102   107    113    110
     ditto, f90 -O3           S  odin     24    61    110   110    106    128
99   pentium 200MHz g77       W  linus    23    23    12.5  14.3   12     14.7   11.2  15.3
95   SUN Ultrasparc 167MHz    W           16.0  22.4  27.4  54.8   18.3   55.2
95   IBM RS/6000 360          W  tiger    8     14    25    25     25     25
95   IBM RS/6000 250          W  putr     8.6   12.3  14.3  12.5   8.3    12.5
95   SUN SPARCenter1000       S  math     5.8   8.0   9.9   11.8   11.7   10
95   SGI Challenge R4400      S  loki     7.8   18    24.4  8.3    14.8   8.3
95   SGI Indy R4400PC         W  jinx     7.5   16    21    14.5   7.1    14.6
95   SGI Indigo 2 (f77 -O2)   W  indigo   5.4   12.7  16.7  5.3    9.1    5.6
95   DEC Alpha 300            S  carbon   12    14.8  23.3  23.8   16.5   25

To get Megaflop speed, multiply all numbers by 2 (1 MADD = 2 flops).
S = server, W = workstation

matmul:      matrix-matrix multiplication using a constant-bounds coded
             loop, stride 1
ddot, daxpy: Fortran loops on a vector of the given size, stride 1

What this measures:
matmul:      floating point overlapped with integer, from cache
ddot, daxpy: small runs from cache, large runs from memory

----------------------------------------------------------------------------------

Code versions
madd95       original 95 madd95
versions 99  had to pretend to use the result to fool the optimizer;
             added the 1e+6 size
madd99       computation separated to prevent loop chaining and to
             measure memory bandwidth

The benchmark code is available in file madd.tar in this directory.
Just untar, edit the makefile if your system does not have etime(2), and type make.