Robert A. van de Geijn

RuQing G. Xu

Devin Matthews

CoRR, 2023

Formal Derivation of LU Factorization with Pivoting.

CoRR, 2023

Cascading GEMM: High Precision from Low Precision.

Greg M. Henry

CoRR, 2023

GEMMFIP: Unifying GEMM in BLIS.

RuQing G. Xu

CoRR, 2023

Towards a Unified Implementation of GEMM in BLIS.

RuQing G. Xu

Proceedings of the 37th International Conference on Supercomputing, 2023

2022

Applying Dijkstra's Vision to Numerical Software.

Proceedings of the Edsger Wybe Dijkstra: His Life, Work, and Legacy, 2022

2021

Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework.

ACM Trans. Math. Softw., 2021

2020

Strassen's Algorithm Reloaded on GPUs.

Chenhan D. Yu

ACM Trans. Math. Softw., 2020

2019

The MOMMS Family of Matrix Multiplication Algorithms.

CoRR, 2019

Supporting mixed-datatype matrix multiplication within the BLIS framework.

CoRR, 2019

A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting.

Sandra Catalán

José R. Herrero

Rafael Rodríguez-Sánchez

IEEE Access, 2019

2018

Strassen's Algorithm for Tensor Contraction.

Devin A. Matthews

SIAM J. Sci. Comput., 2018

Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs.

Chenhan D. Yu

CoRR, 2018

A Simple Methodology for Computing Families of Algorithms.

Margaret E. Myers

Richard W. Vuduc

CoRR, 2018

Learning from Optimizing Matrix-Matrix Multiplication.

Margaret E. Myers

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

2017

Householder QR Factorization With Randomization for Column Pivoting (HQRRP).

Per-Gunnar Martinsson

Nathan Heavner

SIAM J. Sci. Comput., 2017

Deriving Correct High-Performance Algorithms.

CoRR, 2017

Pushing the Bounds for Matrix-Matrix Multiplication.

Tyler Michael Smith

CoRR, 2017

Generating Families of Practical Fast Matrix Multiplication Algorithms.

Leslie Rice

Devin A. Matthews

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

2016

The BLIS Framework: Experiments in Portability.

ACM Trans. Math. Softw., 2016

Parallel Matrix Multiplication: A Systematic Journey.

Martin D. Schatz

SIAM J. Sci. Comput., 2016

Automating the Last-Mile for High Performance Dense Linear Algebra.

Richard Michael Veras

Tze Meng Low

Tyler Michael Smith

Franz Franchetti

CoRR, 2016

Implementing Strassen's Algorithm with BLIS.

Greg M. Henry

CoRR, 2016

BLISlab: A Sandbox for Optimizing GEMM.

CoRR, 2016

Strassen's algorithm reloaded.

Greg M. Henry

Proceedings of the International Conference for High Performance Computing, 2016

2015

BLIS: A Framework for Rapidly Instantiating BLAS Functionality.

ACM Trans. Math. Softw., 2015

Householder QR Factorization: Adding Randomization for Column Pivoting. FLAME Working Note #78.

Per-Gunnar Martinsson

Nathan Heavner

CoRR, 2015

2014

Restructuring the Tridiagonal and Bidiagonal QR Algorithms for Performance.

ACM Trans. Math. Softw., 2014

Algorithm, Architecture, and Floating-Point Unit Codesign of a Matrix Factorization Accelerator.

IEEE Trans. Computers, 2014

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors.

Martin D. Schatz

Tze Meng Low

Tamara G. Kolda

SIAM J. Sci. Comput., 2014

Understanding performance stairs: elucidating heuristics.

Proceedings of the ACM/IEEE International Conference on Automated Software Engineering, 2014

Anatomy of High-Performance Many-Threaded Matrix Multiplication.

Mikhail Smelyanskiy

Jeff R. Hammond

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013

Elemental: A New Framework for Distributed Memory Dense Matrix Computations.

Jeff R. Hammond

Nichols A. Romero

ACM Trans. Math. Softw., 2013

A case study in mechanically deriving dense linear algebra code.

Int. J. High Perform. Comput. Appl., 2013

Deriving dense linear algebra libraries.

Paolo Bientinesi

John A. Gunnels

Margaret E. Myers

Tyler Rhodes

Formal Aspects Comput., 2013

Exploiting Symmetry in Tensors for High Performance.

Martin D. Schatz

Tze Meng Low

Tamara G. Kolda

CoRR, 2013

Scheduling algorithms-by-blocks on small clusters.

Concurr. Comput. Pract. Exp., 2013

Interfaces are key.

Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, 2013

DSLs, DLA, DxT, and MDE in CSE.

Proceedings of the 5th International Workshop on Software Engineering for Computational Science and Engineering, 2013

Code Generation and Optimization of Distributed-Memory Dense Linear Algebra Kernels.

Proceedings of the International Conference on Computational Science, 2013

Floating Point Architecture Extensions for Optimized Matrix Factorization.

Proceedings of the 21st IEEE Symposium on Computer Arithmetic, 2013

2012

Families of Algorithms for Reducing a Matrix to Condensed Form.

G. Joseph Elizondo

ACM Trans. Math. Softw., 2012

A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures.

Mercedes Marqués

ACM Trans. Math. Softw., 2012

Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures.

IEEE Trans. Computers, 2012

The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations.

Ernie Chan

J. Parallel Distributed Comput., 2012

Programming many-core architectures - a case study: dense matrix computations on the Intel single-chip cloud computer processor.

Ernie Chan

Rob F. Van der Wijngaart

Timothy G. Mattson

Theodore E. Kubaska

Concurr. Comput. Pract. Exp., 2012

Designing Linear Algebra Algorithms by Transformation: Mechanizing the Expert Developer.

Proceedings of the High Performance Computing for Computational Science, 2012

Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators.

Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Level-3 BLAS on the TI C6678 Multi-core DSP.

Murtaza Ali

Eric Stotzer

Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Mechanizing the expert dense linear algebra developer.

Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

A Linear Algebra Core Design for Efficient Level-3 BLAS.

Syed Zohaib Gilani

Nam Sung Kim

Michael J. Schulte

Proceedings of the 23rd IEEE International Conference on Application-Specific Systems, 2012

The Spike Factorization as Domain Decomposition Method; Equivalent and Variant Approaches.

Victor Eijkhout

Proceedings of the High-Performance Scientific Computing - Algorithms and Applications, 2012

2011

libflame.

Ernie Chan