fft_check.F is a benchmark program. It times in-place double-precision complex 3-dimensional FFTs and both in-place and out-of-place double-precision real 3-dimensional FFTs. It is written in Fortran and parallelized with OpenMP.
http://loto.sourceforge.net/feram/src/fft_check.html is the homepage of fft_check.
Its GPLed source code is in feram-X.YY.ZZ/src/ of the feram package. You can freely download a tarball of feram (feram-X.YY.ZZ.tar.xz) from http://sourceforge.net/projects/loto/files/feram/ . feram is a molecular dynamics (MD) simulator for bulk and thin-film ferroelectrics and is GPLed free software.
The home page of feram is http://loto.sourceforge.net/feram/ .
Although fft_check can be built in the usual 'configure && make' manner along with feram (see ../INSTALL), makefiles for several compilers and architectures are also provided as fft_check.Makefile.*. You can build fft_check with GNU Fortran (gfortran), Intel Fortran (ifort), IBM XL Fortran (xlf90_r), etc. You can link FFTW, Intel MKL, or Hitachi MATRIX/MPP as the FFT library. For example,
$ make -f fft_check.Makefile.Intel-gfortran-fftw3_omp
The version number of your FFTW library can be obtained with the fftw-wisdom command:
$ fftw-wisdom -V
If fft_check.F is compiled with the FFTW library and f90_wisdom.f, it imports the 'wisdom' file from the current directory or /etc/fftw/wisdom and exports 'wisdom_new' into the current directory.
To benchmark the FFT in ACML, use fft_acml.f and fft_acml.Makefile.
$ make -f fft_acml.Makefile
Note that the performance of the FFT in ACML is currently not very good.
If the OMP_NUM_THREADS environment variable is not set, fft_check normally uses all cores of your computer. For example,
$ ./fft_check 10000 80 90 100
performs FFTs on an array of size 80x90x100 for 10,000 iterations. Note that the number of iterations does not affect the wisdom_new file.
If you want to benchmark the efficiency of a single processor on a multi-processor system, use taskset(1) on Linux, cpuset(1) on FreeBSD, dplace(1) on SGI, or pbind(1) on Solaris to bind threads to the cores of one processor. For example, on a system with two Xeon X5650 processors with Hyper-Threading turned off,
$ OMP_NUM_THREADS=6 taskset -c 0-5 ./fft_check 10000 80 90 100
numactl(8) on Linux is also a useful command for achieving good performance on Non-Uniform Memory Access (NUMA) systems.
$ numactl --help
$ numactl --show
$ numactl --hardware
$ numactl --cpunodebind=0,1 --interleave=all ./fft_check 100 256 256 256
Verbose reports are written to standard error (STDERR). Formatted results are written to standard output (STDOUT) in the order
N_TIMES Lx Ly Lz N NTHREADS plan_ci time_ci GFLOPS_ci plan_ri time_ri GFLOPS_ri plan_ro time_ro GFLOPS_ro
where plan denotes the time in seconds for planning, time denotes the time in seconds for the FFT, _ci denotes the in-place double-precision complex 3-dimensional FFT, _ri denotes the in-place double-precision real 3-dimensional FFT, and _ro denotes the out-of-place double-precision real 3-dimensional FFT.
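The STDOUT format above can be read back with a short script. The following is only a sketch in Python, not part of fft_check. It assumes that the fifteen fields are separated by whitespace, that the first six are integers, and that floating-point values may carry a Fortran-style D exponent. The function name parse_result_line and the sample line are hypothetical, and the sample numbers are made up, not measurements.

```python
# Field names in the order documented for fft_check's STDOUT line.
FIELDS = [
    "N_TIMES", "Lx", "Ly", "Lz", "N", "NTHREADS",
    "plan_ci", "time_ci", "GFLOPS_ci",
    "plan_ri", "time_ri", "GFLOPS_ri",
    "plan_ro", "time_ro", "GFLOPS_ro",
]
# Assumption: the first six fields are printed as integers.
INT_FIELDS = set(FIELDS[:6])


def parse_result_line(line):
    """Turn one whitespace-separated result line into a dict (hypothetical helper)."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(tokens)))
    record = {}
    for name, token in zip(FIELDS, tokens):
        if name in INT_FIELDS:
            record[name] = int(token)
        else:
            # Accept Fortran-style exponents such as 1.5D+01 (assumption).
            record[name] = float(token.replace("D", "E").replace("d", "e"))
    return record


# Made-up sample line for illustration only; these are not benchmark results.
sample = "10000 80 90 100 720000 12 0.5 3.2 10.0 0.4 1.8 9.0 0.4 1.9 8.5D+00"
record = parse_result_line(sample)
print(record["Lx"], record["Ly"], record["Lz"], record["GFLOPS_ro"])
```

Such a script could be fed with the output of a run like ./fft_check 10000 80 90 100 > result.dat, since the verbose report goes to STDERR and does not mix with the result line.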
GFLOPS values are rough estimates based on 5*N*log_2(N) floating-point operations.
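As an illustration of this estimate, here is a minimal sketch in Python. It assumes that N is the total number of grid points, Lx*Ly*Lz, and that the time passed in is the time of a single transform. How fft_check.F itself normalizes its timings is not stated here, so the function estimate_gflops is hypothetical and not taken from the source code.

```python
import math


def estimate_gflops(lx, ly, lz, seconds_per_fft):
    """Rough GFLOPS estimate from the 5*N*log_2(N) operation count.

    Assumptions (not confirmed by fft_check.F): N = lx*ly*lz, and
    seconds_per_fft is the wall-clock time of one transform.
    """
    n = lx * ly * lz
    operations = 5.0 * n * math.log2(n)
    return operations / seconds_per_fft / 1.0e9


# A 256x256x256 transform taking a (made-up) 0.1 s per FFT.
print(estimate_gflops(256, 256, 256, 0.1))
```

Because the count 5*N*log_2(N) is nominal and does not reflect the number of operations a library actually performs, the result is best used for comparing runs of the same size, not as a measured floating-point rate.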
Fig. 1 and Fig. 2 show results of the 3-dimensional FFT benchmark on a single node of several systems. Computational conditions are listed below.
The raw benchmark data are in 19example-fft-benchmark/fft_check_powr2.*.dat and 19example-fft-benchmark/fft_check_nonp2.*.dat. They are plotted with the gnuplot scripts 19example-fft-benchmark/fft_powr2.gp and 19example-fft-benchmark/fft_nonp2.gp.
Fig. 3 and Fig. 4 show results of the 3-dimensional FFT benchmark on a single chip of the 3.83 GHz POWER7 and of the 2.70 GHz E5-2680. For small FFTs, a single chip may give better efficiency than a single node. Computational conditions are listed below.
Benchmarks with Tesla K20X and Tesla M2090 GPUs are also plotted in Fig. 3 (b) and Fig. 4 (b). Double-precision real↔complex 3-dimensional in-place FFT is performed on the GPU devices with cufft_check.F and cufft_module.f.
The raw benchmark data are in 19example-fft-benchmark/fft_check_powr2chip.*.dat and 19example-fft-benchmark/fft_check_nonp2chip.*.dat. They are plotted with the gnuplot scripts 19example-fft-benchmark/fft_powr2chip.gp and 19example-fft-benchmark/fft_nonp2chip.gp.
If the dimensions of an array are powers of two, for example a(512,512,512), a "bank conflict" may occur in the FFT and reduce computational speed. To avoid bank conflicts, "padding" is commonly introduced, for example a(512,512+3,512). However, padding makes the code complicated. Therefore, padding is not used in fft_check.F.
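The effect of padding on memory strides can be checked with a small calculation. The sketch below, in Python, is only an illustration and is not part of fft_check. It assumes Fortran column-major storage and 16-byte double-precision complex elements, and the helper names are hypothetical. It compares the byte strides of a(512,512,512) and a(512,512+3,512) and reports the largest power of two dividing the stride of the third index. Addresses separated by a large power of two tend to fall into the same memory bank or cache set.

```python
def column_major_strides(shape, elem_bytes=16):
    """Byte stride of each index for a Fortran (column-major) array.

    elem_bytes=16 assumes double-precision complex elements.
    """
    strides = []
    step = elem_bytes
    for extent in shape:
        strides.append(step)
        step *= extent
    return strides


def largest_power_of_two_divisor(n):
    """Largest power of two that divides the positive integer n."""
    return n & -n


plain = column_major_strides((512, 512, 512))
padded = column_major_strides((512, 512 + 3, 512))

for label, strides in (("plain ", plain), ("padded", padded)):
    third = strides[2]
    print(label, "stride of 3rd index:", third, "bytes,",
          "divisible by 2 **", largest_power_of_two_divisor(third).bit_length() - 1)
```

In this calculation, padding the second dimension changes only the stride of the third index, from 4194304 bytes to 4218880 bytes. The strides of the first and second indices stay the same.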
You can find an MPI-parallelized large-scale FFT benchmark program at https://github.com/t-nissie/fft_check_mpi . It is written in Fortran, uses FFTW, and is parallelized with MPI.
Copyright © 2007-2013 by Takeshi Nishimatsu
fft_check is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY. You can copy, modify and redistribute fft_check, but only under the conditions described in the GNU General Public License (the "GPL").
Takeshi Nishimatsu (t-nissie{at}imr.tohoku.ac.jp)