• Young Scholar Review •

Highly-Scalable, Highly-Performant and Highly-Practical Sparse Matrix Computations: Progress and Challenges

Liu Weifeng   

  1. Super Scientific Software Laboratory, Department of Computer Science and Technology, China University of Petroleum-Beijing, Beijing 102249, China
  • Received: 2020-05-11  Online: 2020-12-15  Published: 2020-12-15
  • About the author: Liu Weifeng, Ph.D., is a professor and doctoral supervisor at China University of Petroleum-Beijing. He received his B.S. and M.S. degrees from the Department of Computer Science and Technology, China University of Petroleum-Beijing, in 2002 and 2006, respectively, and his Ph.D. in computational science from the Niels Bohr Institute, University of Copenhagen, in 2016. His main research interests are numerical linear algebra and parallel computing, with a particular focus on data structures, parallel algorithms, and software for sparse matrices.
  • Funding: Supported by the Science Challenge Project (No. TZZT2016002), the National Natural Science Foundation of China (No. 61972415), and the Science Foundation of China University of Petroleum-Beijing (No. 2462019YJRC004, 2462020XKJS03).

Liu Weifeng. HIGHLY-SCALABLE, HIGHLY-PERFORMANT AND HIGHLY-PRACTICAL SPARSE MATRIX COMPUTATIONS: PROGRESS AND CHALLENGES[J]. Journal on Numerical Methods and Computer Applications, 2020, 41(4): 259-281.

Sparse matrix algorithms are among the most active and challenging research topics in supercomputing. From three perspectives, namely high scalability, high performance, and high practicability, this paper surveys a selection of the major research work on sparse matrix computations carried out in China and abroad over the past 30 years. Drawing on benchmark results for more than ten sparse BLAS algorithms on three GPUs, it then discusses the main difficulties in achieving all three goals simultaneously. Finally, it poses a series of challenges for the next generation of sparse matrix computations.
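The sparse BLAS kernels benchmarked in the paper (SpMV, SpTRSV, SpGEMM and related routines) all operate on compressed sparse storage. As a minimal illustration only (not code from the paper; the function name `spmv_csr` and the toy matrix are this editor's assumptions), sparse matrix-vector multiplication over the standard Compressed Sparse Row (CSR) format can be sketched as follows:

```python
# Minimal CSR SpMV sketch: compute y = A*x for a matrix A stored in
# Compressed Sparse Row (CSR) format. Illustrative only; production
# sparse BLAS kernels add blocking, vectorization, and load balancing.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Multiply a CSR matrix (row_ptr, col_idx, vals) by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[j] * x[col_idx[j]]
        y[i] = acc
    return y

# A 3x3 example with 5 nonzeros:
#   [10  0  2]
#   [ 0  3  0]
#   [ 1  0  4]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [10.0, 2.0, 3.0, 1.0, 4.0]
x = [1.0, 1.0, 1.0]
print(spmv_csr(row_ptr, col_idx, vals, x))  # [12.0, 3.0, 5.0]
```

The irregular inner-loop trip counts in this sketch are precisely what makes load balancing and vectorization of sparse kernels difficult on GPUs and many-core processors, which is the central theme of the survey.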
