In this paper, we study the parallel algorithm based on CUDA and MPI for the Fast Fourier Transform on the hexagon (FFTH) and its implementation. Firstly, we design a CUDA FFTH algorithm by utilizing the hierachical parallelization mechanism and the build-in CUFFT library for classic rectangular FFTs. With respect to the serial cpu program, our CUDA program achieves 12x speedup for 3×20482 double-precision complex-to-complex FFTH. If we ignore the PCI between main memory and GPU device memory, around 30x-40x speedup can be even achieved. Although the non-tensorial FFTH is much more complicated than the rectangular FFT, our CUDA FFTH program gains the same efficiency as the rectangular CUFFT. Next, efforts are mainly contributed to optimization techniques for parallel array transposition and data sorting, which significantly improve the efficiency of the CUDA-MPI FFTH algorithm. On a 10-node cluster with 60 GPUs, our CUDA-MPI program achieves about 55x speedup with respect to the the serial cpu program for 3×81922 complex-to-complex double-precision FFTH, and it is more efficient than the MPI parallel FFTW. Our research on the CUDA-MPI algorithm for FFTH is beneficial to the exploration and development of new parallel algorithms on large-scale CPU-GPU heterogeneous computer systems.
. A CUDA-MPI ALGORITHM FOR THE FAST FOURIER TRANSFORM ON THE HEXAGON AND ITS IMPLEMENTATION[J]. Journal of Numerical Methods and Computer Applicat, 2012, (1): 59-72.
Chen Y, Cui X, Mei H. Large-scale FFT on GPU clusters[J]. ICS'10, 2010, 2: 315-324.
Cooley J W, Tukey J W. An algorithm for the machine caculation of complex Fourier series[J].Math. Comp., 1965, 19: 297-301.
CUDA C Programming Guide, Version 3.1. NIVIDIA Corp, 2010.
CUDA CUFFT Library, Version 3.1. NIVIDIA Corp, 2010.
FFTW: Manual for FFTW 3.3 Online. http://www.fftw.org/.
Jacobsen D A, Thibault J C, Senocak I. An MPI-CUDA Implementation for Massively ParallelIncompressible Flow Computations on Multi-GPU Clusters[C]. 48th AIAA Aerospace SciencesMeeting, 2010.
Li H, Sun J. Fast generalized Fourier transforms on hexagon domains[J]. RDCPS Annual Reports,2002, 13: 45-57.
Li H, Sun J. Fast Fourier Transform on Hexagon[J]. RDCPS Annual Reports, 2004, 15: 50-63.
Li H, Sun J. Fast Fourier transform on hexagon[C]. In W. Zhang, Z. Chen, R. Glowinskin, W. Tonyeditors, Current Trend in High Performance and Its Application, Lecture Notes in ComputationalScience and Engineering, pp. 357-362. Springer-verlag, Berlin, 2005.
Li H, Sun J, Xu Y. Discrete Fourier analysis, cubature and interpolation on a hexagon and atriangle[J]. SIAM J. Numer. Anal, 2008, 46: 1653-1681.
Luo L, Yang C, Zhao Y, et al. A scalable hybrid algorithm based on domain decomposition and al-gebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs, Proceedingsof 2nd International Workshop on GPUs a nd Scientific Applications[C]. In conjunction with 2011International Conference on Parallel Architectures and Compilation Techniques (PACT 2011).
NVIDIA: NVIDIA's Fermi: The First Complete GPU Computing Architecture.http://www.nvidia.com/content/PDF/fermi white papers/P.Glaskowsky NVIDIA's Fermi-TheFirst Complete GPU Architecture.pdf.
Sun J. Multivariate Fourier series over a class of non tensor-product partition domains[J]. J.Comput. Math., 2003, 21: 53-62.