Software measurements

oct2005

     This section includes general software measurements (mainly timing):

FFTW performance

    Performance measurements were taken using the fftw bench routine. FFTs with lengths from 1K (2^10) to 2^20 were timed
with various setups and on several machines:

cpu: summer1:

FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
1024 1 0 18.67 ms 14.87 us 3442.0
1024 4 0 29.46 ms 43.46 us 1178.1
1024 8 0 40.54 ms 68.62 us 746.1
1024 1 1 36.43 ms 8.37 us 6115.2
1024 4 1 61.23 ms 54.53 us 938.9
1024 8 1 107.30 ms 69.15 us 740.4






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
2048 1 0 37.96 ms 35.25 us 3195.5
2048 4 0 65.88 ms 75.48 us 1492.2
2048 8 0 84.82 ms 69.73 us 1615.3
2048 1 1 71.98 ms 19.30 us 5834.9
2048 4 1 97.61 ms 52.47 us 2146.6
2048 8 1 144.96 ms 82.36 us 1367.7






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
4096 1 0 74.14 ms 78.51 us 3130.4
4096 4 0 100.95 ms 81.80 us 3004.2
4096 8 0 117.69 ms 100.98 us 2433.8
4096 1 1 142.61 ms 49.21 us 4993.6
4096 4 1 217.56 ms 72.66 us 3382.5
4096 8 1 310.06 ms 99.26 us 2476.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
8192 1 0 147.49 ms 170.78 us 3117.9
8192 4 0 170.44 ms 134.49 us 3959.2
8192 8 0 216.80 ms 152.18 us 3499.0
8192 1 1 284.35 ms 105.45 us 5049.8
8192 4 1 363.84 ms 132.57 us 4016.6
8192 8 1 530.91 ms 142.82 us 3728.3






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
16384 1 0 292.04 ms 366.41 us 3130.1
16384 4 0 318.86 ms 227.42 us 5043.0
16384 8 0 376.15 ms 175.45 us 6536.7
16384 1 1 555.23 ms 231.91 us 4945.4
16384 4 1 670.96 ms 203.48 us 5636.2
16384 8 1 799.97 ms 224.48 us 5109.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
32768 1 0 601.26 ms 795.00 us 3091.3
32768 4 0 629.55 ms 435.97 us 5637.1
32768 8 0 641.25 ms 296.72 us 8282.6
32768 1 1 1.05 s 509.09 us 4827.4
32768 4 1 1.18 s 431.06 us 5701.3
32768 8 1 1.30 s 276.88 us 8876.2






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
65536 1 0 1.66 s 1.73 ms 3032.8
65536 4 0 1.53 s 884.81 us 5925.4
65536 8 0 1.48 s 589.28 us 8897.1
65536 1 1 2.39 s 1.11 ms 4740.7
65536 4 1 2.47 s 727.94 us 7202.4
65536 8 1 2.61 s 521.56 us 10052.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
131072 1 0 4.46 s 4.92 ms 2263.1
131072 4 0 3.73 s 1.83 ms 6096.8
131072 8 0 3.35 s 1.14 ms 9732.9
131072 1 1 6.01 s 3.73 ms 2984.9
131072 4 1 5.47 s 1.56 ms 7138.3
131072 8 1 5.33 s 1.59 ms 7009.2






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
262144 1 0 13.77 s 12.99 ms 1816.2
262144 4 0 10.27 s 4.68 ms 5041.8
262144 8 0 9.10 s 3.12 ms 7553.4
262144 1 1 18.21 s 10.51 ms 2244.6
262144 4 1 13.95 s 4.02 ms 5872.5
262144 8 1 13.42 s 2.93 ms 8060.5






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
524288 1 0 6.38 s 31.44 ms 1584.1
524288 4 0 3.75 s 11.61 ms 4291.5
524288 8 0 2.99 s 7.40 ms 6731.2
524288 1 1 5.71 s 24.60 ms 2024.9
524288 4 1 3.68 s 10.32 ms 4825.4
524288 8 1 3.12 s 7.03 ms 7083.5






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
1048576 1 0 18.22 s 65.67 ms 1596.7
1048576 4 0 10.73 s 21.61 ms 4853.4
1048576 8 0 8.96 s 14.69 ms 7139.0
1048576 1 1 16.44 s 53.33 ms 1966.2
1048576 4 1 11.50 s 18.18 ms 5766.8
1048576 8 1 8.57 s 13.14 ms 7978.8







processing: /share/megs/phil/x101/fftw/fftw-3.2.2/archsrc/xeon5400/tests/ benchphil, benchhtml.pl
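The MFLOPS column appears to follow the usual FFTW benchmark convention for a complex transform of length N: 5*N*log2(N) nominal floating point operations divided by the run time in microseconds. This is an inference from the listed numbers, not something stated on this page. The sketch below recomputes a few of the summer1 rows (1 thread, UseSSE2 = 0) so the convention can be checked; the function name fft_mflops is only illustrative. Small differences from the listed values are expected because the run times in the tables are rounded.

```python
import math


def fft_mflops(n, run_time_s):
    """Nominal MFLOPS for a length-n complex FFT taking run_time_s seconds."""
    # 5 * n * log2(n) nominal operations, divided by the run time in
    # microseconds, gives millions of operations per second.
    return 5.0 * n * math.log2(n) / (run_time_s * 1e6)


# (FFTLEN, RunTm in seconds, MFLOPS as listed) copied from the summer1 tables.
rows = [
    (2048, 35.25e-6, 3195.5),
    (4096, 78.51e-6, 3130.4),
    (1048576, 65.67e-3, 1596.7),
]

for n, run_s, listed in rows:
    print(f"{n:8d}  computed {fft_mflops(n, run_s):8.1f}  listed {listed:8.1f}")
```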

fftw on cpu aserv11:

Observations:


FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
1024 1 1 36.08 ms 3.20 us 15994.0
1024 4 1 100.03 ms 38.96 us 1314.1
1024 8 1 143.75 ms 36.01 us 1421.8
1024 12 1 166.72 ms 32.17 us 1591.6
1024 16 1 173.61 ms 61.41 us 833.7






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
2048 1 1 58.58 ms 7.32 us 15398.0
2048 4 1 163.33 ms 44.80 us 2514.0
2048 8 1 192.12 ms 45.64 us 2467.8
2048 12 1 240.42 ms 59.04 us 1908.0
2048 16 1 304.54 ms 83.88 us 1342.8






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
4096 1 1 101.56 ms 17.22 us 14268.0
4096 4 1 226.86 ms 56.85 us 4322.8
4096 8 1 311.78 ms 65.09 us 3775.9
4096 12 1 301.65 ms 84.76 us 2899.6
4096 16 1 375.20 ms 67.17 us 3658.7






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
8192 1 1 190.83 ms 48.95 us 10877.0
8192 4 1 346.32 ms 60.65 us 8779.8
8192 8 1 449.22 ms 70.93 us 7506.7
8192 12 1 511.94 ms 114.89 us 4634.7
8192 16 1 564.90 ms 80.33 us 6628.8






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
16384 1 1 349.97 ms 115.52 us 9927.7
16384 4 1 569.72 ms 82.45 us 13911.0
16384 8 1 635.73 ms 122.51 us 9361.7
16384 12 1 660.19 ms 112.47 us 10197.0
16384 16 1 800.49 ms 93.67 us 12244.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
32768 1 1 655.07 ms 257.95 us 9527.3
32768 4 1 1.03 s 133.98 us 18344.0
32768 8 1 1.11 s 162.08 us 15163.0
32768 12 1 1.15 s 135.22 us 18175.0
32768 16 1 1.12 s 139.38 us 17632.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
65536 1 1 1.49 s 577.28 us 9082.0
65536 4 1 2.01 s 314.28 us 16682.0
65536 8 1 1.81 s 188.17 us 27862.0
65536 12 1 1.60 s 185.53 us 28259.0
65536 16 1 1.97 s 293.89 us 17840.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
131072 1 1 3.23 s 1.20 ms 9295.4
131072 4 1 4.11 s 493.94 us 22556.0
131072 8 1 3.55 s 299.66 us 37180.0
131072 12 1 3.06 s 270.83 us 41137.0
131072 16 1 3.26 s 319.06 us 34918.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
262144 1 1 7.03 s 2.68 ms 8792.7
262144 4 1 8.19 s 858.25 us 27490.0
262144 8 1 7.56 s 795.50 us 29658.0
262144 12 1 5.40 s 783.50 us 30112.0
262144 16 1 6.20 s 1.09 ms 21605.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
524288 1 1 2.13 s 11.29 ms 4413.6
524288 4 1 1.27 s 3.17 ms 15708.0
524288 8 1 1.36 s 3.07 ms 16215.0
524288 12 1 12.45 s 2.19 ms 22737.0
524288 16 1 1.06 s 2.56 ms 19441.0






FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS
1048576 1 1 6.26 s 24.73 ms 4239.2
1048576 4 1 3.61 s 7.54 ms 13911.0
1048576 8 1 3.21 s 6.00 ms 17484.0
1048576 12 1 41.21 s 6.19 ms 16933.0
1048576 16 1 2.89 s 5.31 ms 19740.0







processing: /share/megs/phil/x101/fftw/fftw-3.2.2/archsrc/nehalem/tests/ benchphil, benchhtml.pl
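To compare thread counts it can help to turn the RunTm column into a speedup relative to the 1 thread row. The sketch below parses rows in the layout used by these tables (FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS, with times given as a value followed by s, ms or us) and prints the speedup and the per-thread efficiency for the 1048576 point rows on aserv11. The helper names (to_seconds, parse_row, speedups) are made up for this example and are not part of the bench scripts.

```python
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def to_seconds(value, unit):
    """Convert a table time such as ('24.73', 'ms') to seconds."""
    return float(value) * UNIT_SECONDS[unit]


def parse_row(line):
    """Split one row: FFTLEN Nthreads UseSSE2 SetupTm RunTm MFLOPS."""
    f = line.split()
    return {
        "fftlen": int(f[0]),
        "nthreads": int(f[1]),
        "sse2": int(f[2]),
        "setup_s": to_seconds(f[3], f[4]),
        "run_s": to_seconds(f[5], f[6]),
        "mflops": float(f[7]),
    }


def speedups(lines):
    """Return {nthreads: speedup}, relative to the 1 thread row."""
    rows = [parse_row(line) for line in lines]
    base = next(r["run_s"] for r in rows if r["nthreads"] == 1)
    return {r["nthreads"]: base / r["run_s"] for r in rows}


# Rows copied from the aserv11 table for FFTLEN = 1048576.
table = """\
1048576 1 1 6.26 s 24.73 ms 4239.2
1048576 4 1 3.61 s 7.54 ms 13911.0
1048576 8 1 3.21 s 6.00 ms 17484.0
1048576 12 1 41.21 s 6.19 ms 16933.0
1048576 16 1 2.89 s 5.31 ms 19740.0
""".splitlines()

for nthreads, s in speedups(table).items():
    print(f"{nthreads:2d} threads: speedup {s:4.2f}, efficiency {s / nthreads:4.2f}")
```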

intel performance primitives (IPP) lib:

15aug11: IPP fft times on aserv11 (.ps) (.pdf) using the intel benchmark program ps_ipps.

       processing:x101/intel/tests/ sysperf.sc, sysperf.pro

18aug11: table of IPP fft times on aserv11/adslinux using my test program.

    processing:x101/intel/tests/fftbnch.c

10oct05: Some linux kernels see no speedup when running 2 processes on a dual processor cpu.



 (23feb06: We finally looked inside the aolc boxes and they do not have multiple cpus (even though the purchase order claimed they did). So their timing is for a single processor with hyperthreading enabled. So the conclusions about 2.4.21 kernels may not be correct...)

    The idl routine (atmclp) processes the coded long process atm data. It was used to benchmark  some of the dual processor cpus at the observatory. The data set used was:

A single version of the  processing was run and then two copies (two separate idl sessions) were run. The times for the processing are shown in the table below:
 
cpu       cpu type    hyper    Linux kernel           Time 1 copy   Time 2 copies
          freq(ghz)   thread                          (secs)        (secs)
fusion00  xeon 2.4    no       2.4.18-27.8.0smp       59            62
fusion02  xeon 2.2    yes      2.4.21-4.ELsmp         61            99
aolc1*    xeon 2.4    no*      2.4.21-4.ELsmp         58            107
aolc2*    xeon 2.4    no*      2.4.21-4.ELsmp         61            134
pserverK  xeon 3.0    yes      2.6.8-1.521smp         57, 53        57, 53 (repeat)
pserverM  xeon 3.0    yes      2.6.8-1.521smp         52            60
pserverN  pent4 3.2   no       2.6.12-1.1447_FC4smp   61            104 (but cpu was busy)

* see the 23feb06 note above.

You can see that the machines running the 2.4.21-4.ELsmp kernel take 1.6 to 2.2 times as long to run two copies as one copy. This means that there is little or no advantage to using the dual processor (aolc2 actually took longer than twice the single copy time). For most of the measurements top showed no other processes using the cpu. The exception was pserverN, where root was running a cp that took about 30% of the cpu.
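The comparison can be put as a single number per machine: the time for two copies divided by the time for one copy. A ratio near 1.0 means the second copy ran in parallel at no cost, and a ratio near 2.0 means the two copies effectively ran one after the other. The sketch below computes that ratio from the times in the table above; pserverK is left out because its entry also lists a repeat measurement, and the names used here (timings, slowdown) are only illustrative.

```python
# (host, kernel, seconds for 1 copy, seconds for 2 copies), from the table above.
timings = [
    ("fusion00", "2.4.18-27.8.0smp", 59, 62),
    ("fusion02", "2.4.21-4.ELsmp", 61, 99),
    ("aolc1", "2.4.21-4.ELsmp", 58, 107),
    ("aolc2", "2.4.21-4.ELsmp", 61, 134),
    ("pserverM", "2.6.8-1.521smp", 52, 60),
    ("pserverN", "2.6.12-1.1447_FC4smp", 61, 104),
]


def slowdown(t_one, t_two):
    """Ratio of the 2 copy wall time to the 1 copy wall time."""
    return t_two / t_one


ratios = {host: slowdown(t1, t2) for host, _, t1, t2 in timings}

for host, kernel, t1, t2 in timings:
    print(f"{host:9s} {kernel:22s} {ratios[host]:4.2f}")
```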
 

For the aolc computers you should spread jobs out over multiple machines rather than running two copies of the same job on one machine (until arun gets a chance to update the kernels).
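The one copy versus two copies comparison is easy to repeat on a machine, for example after a kernel update. The sketch below is not the idl atmclp test. It assumes a stand-in workload (a short cpu-bound Python expression, named WORKLOAD here) and a hypothetical helper time_copies that starts the requested number of separate processes at once and returns the wall time until the last one finishes.

```python
import subprocess
import sys
import time

# Hypothetical cpu-bound stand-in for the real processing job.
WORKLOAD = "sum(i * i for i in range(2_000_000))"


def time_copies(n_copies, code=WORKLOAD):
    """Run n_copies separate Python processes executing `code` at the same
    time and return the wall-clock seconds until the last one exits."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", code]) for _ in range(n_copies)
    ]
    for p in procs:
        p.wait()
    return time.perf_counter() - start


if __name__ == "__main__":
    one = time_copies(1)
    two = time_copies(2)
    print(f"1 copy: {one:.2f} s, 2 copies: {two:.2f} s, ratio {two / one:.2f}")
```

As with the measurements above, the result is only meaningful if nothing else is using the cpu while the test runs.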
 

processing: x101/atm/testclp.pro