数值计算与计算机应用
数值计算与计算机应用 (Journal of Numerical Methods and Computer Applications), 2012, Issue (1): 59-72
A CUDA-MPI ALGORITHM FOR THE FAST FOURIER TRANSFORM ON THE HEXAGON AND ITS IMPLEMENTATION
Chen Jiajie, Li Huiyuan, Zhang Xianyi
1. Laboratory of Parallel Computing, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China;
2. Graduate University of Chinese Academy of Sciences, Beijing 100190, China
摘要 (translated): This paper studies a CUDA-MPI algorithm for the fast Fourier transform on the hexagon (FFTH) and its implementation. First, by fully exploiting CUDA's hierarchical parallelism and its library routines, we design an efficient CUDA algorithm for the FFTH. For double-precision complex data of size 3×2048², our CUDA program achieves a 12x speedup over the serial CPU program; excluding the data transfer between main memory and device memory, the speedup reaches 40x, and its efficiency is essentially the same as that of the two-dimensional rectangular-domain FFT routines provided by CUFFT. Building on this, we study transposition and sorting algorithms for distributed parallel data on GPUs and design an optimized CUDA-MPI algorithm for the FFTH. For data of size 3×8192², on 10 nodes with 6 GPUs each, our CUDA-MPI program achieves a 55x speedup over the serial CPU program; its efficiency is much higher than that of the MPI-parallel FFTW and of a rectangular-domain parallel FFT based on CUFFT for local computation and FFTW for the parallel transposition. This study and its tests provide a reference for exploring scalable new algorithms on large-scale CPU+GPU heterogeneous computer systems.
Abstract: In this paper, we study the parallel algorithm based on CUDA and MPI for the fast Fourier transform on the hexagon (FFTH) and its implementation. First, we design a CUDA FFTH algorithm by exploiting the hierarchical parallelization mechanism of CUDA and the built-in CUFFT library for classical rectangular FFTs. Compared with the serial CPU program, our CUDA program achieves a 12x speedup for the 3×2048² double-precision complex-to-complex FFTH; if the PCIe data transfer between main memory and GPU device memory is excluded, a speedup of around 30x-40x can even be achieved. Although the non-tensorial FFTH is considerably more complicated than the rectangular FFT, our CUDA FFTH program attains essentially the same efficiency as the rectangular CUFFT. Next, efforts are devoted mainly to optimization techniques for parallel array transposition and data sorting, which significantly improve the efficiency of the CUDA-MPI FFTH algorithm. On a 10-node cluster with 60 GPUs, our CUDA-MPI program achieves about a 55x speedup over the serial CPU program for the 3×8192² complex-to-complex double-precision FFTH, and it is more efficient than the MPI-parallel FFTW. Our research on the CUDA-MPI algorithm for FFTH is beneficial to the exploration and development of new parallel algorithms on large-scale CPU-GPU heterogeneous computer systems.
Key words: Fast Fourier transform on the hexagon (FFTH); CUDA-MPI algorithm; parallel sorting
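The abstract singles out parallel array transposition as the step that determines the efficiency of the distributed algorithm: between the two 1-D FFT passes, every rank must exchange data blocks with every other rank. The sketch below illustrates that exchange-then-transpose communication pattern only; it is not the authors' code. The MPI ranks are simulated with plain Python lists, and the function name `distributed_transpose` and the row-slab data layout are illustrative assumptions (a real CUDA-MPI implementation would pack the blocks on the GPU and exchange them with MPI_Alltoall).

```python
def distributed_transpose(blocks):
    """Simulated distributed transpose of a square matrix.

    blocks[r] is rank r's local slab: a contiguous list of rows of the
    global n-by-n matrix.  Returns the post-transpose slabs, where
    out[s] holds the rows of the transposed matrix owned by rank s.
    """
    P = len(blocks)                       # number of simulated MPI ranks
    n = sum(len(b) for b in blocks)       # global matrix is n x n
    cpr = n // P                          # columns (future rows) per rank

    # Phase 1 -- pack + all-to-all: rank r cuts its slab into P column
    # blocks and "sends" block s to rank s (this is the MPI_Alltoall step).
    inbox = [[[row[s * cpr:(s + 1) * cpr] for row in blocks[r]]
              for r in range(P)]
             for s in range(P)]

    # Phase 2 -- local transpose: each rank transposes its received blocks
    # and stitches them into contiguous rows of the transposed matrix.
    out = []
    for s in range(P):
        slab = []
        for c in range(cpr):              # c-th owned column -> one new row
            slab.append([blk[i][c] for blk in inbox[s]
                         for i in range(len(blk))])
        out.append(slab)
    return out
```

Each element crosses the network exactly once, so for large n the all-to-all volume, not the local 1-D FFTs, dominates the runtime; this is why the paper concentrates its optimization effort on the transposition and sorting stages.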
Received: 2011-10-20
Funding: Supported by the National Natural Science Foundation of China (10971212, 91130014).

Cite this article:
Chen Jiajie, Li Huiyuan, Zhang Xianyi. A CUDA-MPI ALGORITHM FOR THE FAST FOURIER TRANSFORM ON THE HEXAGON AND ITS IMPLEMENTATION[J]. Journal of Numerical Methods and Computer Applications, 2012, (1): 59-72.
 
Copyright © 2008 数值计算与计算机应用 (Journal of Numerical Methods and Computer Applications). All rights reserved.
Editorial Office, Academy of Mathematics and Systems Science, Chinese Academy of Sciences
P.O. Box 2719, Beijing (100190). Email: szjs@lsec.cc.ac.cn