"A highly optimized multi-GPU level-3 BLAS. We adopt the concepts of algorithms-by-tiles treating a matrix tile as the basic data unit and operations on tiles as the basic task. Tasks are guided with a dynamic asynchronous runtime, which is cache and locality aware. The communication cost under BLASX becomes trivial as it perfectly overlaps communication and computation across multiple streams during asynchronous task progression. It also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache, taking advantage of inter-GPU P2P communication. As a result, linear speedup is observable with BLASX under multi-GPU configurations; and the extensive benchmarks demonstrate that BLASX consistently outperforms the related leading industrial and academic projects such as cuBLAS-XT, SuperMatrix, MAGMA and PaRSEC."
https://github.com/linnanwang/BLASX
BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing - http://hgpu.org/?p=14743
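
The "algorithms-by-tiles" idea in the abstract can be illustrated with a small sketch: the output matrix of a GEMM (C = alpha*A*B + beta*C) is cut into tiles, and each output tile becomes one independent task. This is only a sequential, pure-Python illustration under that assumption. It is not BLASX's API, scheduler, or tile cache, and the names `tile_tasks`, `run_task`, and `tiled_gemm` are hypothetical.

```python
def tile_tasks(m, n, tile):
    """Yield one task per output tile of an m x n matrix.

    Each task is (row_start, row_end, col_start, col_end), end-exclusive.
    Edge tiles are smaller when tile does not divide m or n.
    """
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            yield (i, min(i + tile, m), j, min(j + tile, n))


def run_task(task, A, B, C, tile, alpha=1.0, beta=1.0):
    """Compute one output tile of C = alpha*A*B + beta*C in place."""
    i0, i1, j0, j1 = task
    k_total = len(B)  # shared inner dimension

    # Scale the output tile by beta exactly once.
    for i in range(i0, i1):
        for j in range(j0, j1):
            C[i][j] *= beta

    # Accumulate the products of the matching A and B tiles along k.
    for k0 in range(0, k_total, tile):
        k1 = min(k0 + tile, k_total)
        for i in range(i0, i1):
            for j in range(j0, j1):
                acc = 0.0
                for k in range(k0, k1):
                    acc += A[i][k] * B[k][j]
                C[i][j] += alpha * acc


def tiled_gemm(A, B, C, tile=2, alpha=1.0, beta=1.0):
    """Run every tile task in order; tasks touch disjoint tiles of C."""
    m, n = len(C), len(C[0])
    for task in tile_tasks(m, n, tile):
        run_task(task, A, B, C, tile, alpha, beta)
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    C = [[0.0, 0.0], [0.0, 0.0]]
    print(tiled_gemm(A, B, C, tile=2))
```

Because each task writes a disjoint tile of C, the tasks have no write conflicts with one another. That independence is what would allow a runtime to distribute them across devices and to overlap tile transfers with computation, as the abstract describes.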