Youth Review
Liu Weifeng

Liu Weifeng. HIGHLY-SCALABLE, HIGHLY-PERFORMANT AND HIGHLY-PRACTICAL SPARSE MATRIX COMPUTATIONS: PROGRESS AND CHALLENGES[J]. Journal on Numerical Methods and Computer Applications, 2020, 41(4): 259-281.
Related articles in this journal:
[1] 于天禹, 赵永华, 赵莲. Research on a parallel LOBPCG algorithm for the Sunway TaihuLight architecture[J]. 数值计算与计算机应用, 2019, 40(4): 291-309.
[2] 徐小文. Parallel algebraic multigrid algorithms: status and challenges of large-scale computing applications[J]. 数值计算与计算机应用, 2019, 40(4): 243-260.
[3] 谢力, 王武, 冯仰德. Fast algorithms based on hierarchically semiseparable matrices and their parallel implementation[J]. 数值计算与计算机应用, 2017, 38(1): 37-48.
[4] 王天一, 姜金荣, 张贺, 何卷雄, 迟学斌. Design and implementation of the CAS-ESM compilation and execution script system[J]. 数值计算与计算机应用, 2016, 37(4): 287-298.
[5] 郑汉垣, 宋安平, 张武. A MIC-based parallel GaBP algorithm[J]. 数值计算与计算机应用, 2015, 36(1): 31-41.
[6] 胡伟. Fully third-order parallel block algorithms for initial value problems of ordinary differential equations and a study of their experimental order[J]. 数值计算与计算机应用, 2014, 35(3): 163-170.
[7] 王玉柱, 姜金荣, 蔡长青, 迟学斌, 岳天祥. Design and implementation of parallel algorithms for a three-dimensional variational data assimilation system[J]. 数值计算与计算机应用, 2013, 34(3): 231-240.
[8] 吴洋, 赵永华, 纪国良. A parallel algorithm for solving a class of large-scale sparse matrix eigenproblems[J]. 数值计算与计算机应用, 2013, 34(2): 136-146.
[9] 张学波, 李晓梅. A distributed parallel algorithm for fast solution of a class of Toeplitz cyclic tridiagonal linear systems[J]. 数值计算与计算机应用, 2009, 30(3): 161-169.
[10] 金君, 乔楠, 梁德旺. Parallel optimization of the NAPA software[J]. 数值计算与计算机应用, 2008, 29(1): 65-72.
[11] 肖曼玉, 吕全义, 汪保, 欧阳洁. A parallel algorithm for block tridiagonal linear systems[J]. 数值计算与计算机应用, 2007, 28(4): 241-249.
[12] 朱君, 赵宁. Distributed parallel computation of the Euler equations on unstructured grids[J]. 数值计算与计算机应用, 2007, 28(3): 161-166.
[13] 郑芳英, 韩丛英, 贺国平. Asynchronous execution of a parallel algorithm for unconstrained optimization problems[J]. 数值计算与计算机应用, 2007, 28(1): 63-70.
[14] 曹建文, 刘洋, 孙家昶, 姚继锋, 潘峰. Parallel computing techniques in large-scale reservoir simulation software and progress of their application on Beowulf systems[J]. 数值计算与计算机应用, 2006, 27(2): 86-95.
[15] 张学波, 李晓梅. An efficient distributed parallel algorithm for a class of Toeplitz tridiagonal linear systems[J]. 数值计算与计算机应用, 2005, 26(2): 101-109.