Piotr Luszczek

Julie Mullen

Andrew Prout

ACM Trans. Math. Softw., December, 2024

Numerical eigen-spectrum slicing, accurate orthogonal eigen-basis, and mixed-precision eigenvalue refinement using OpenMP data-dependent tasks and accelerator offload.

Int. J. High Perform. Comput. Appl., 2024

Batched sparse and mixed-precision linear algebra interface for efficient use of GPU hardware accelerators in scientific applications.

Future Gener. Comput. Syst., 2024

Interface for Sparse Linear Algebra Operations.

CoRR, 2024

GPU Sharing with Triples Mode.

CoRR, 2024

LLload: An Easy-to-Use HPC Utilization Tool.

CoRR, 2024

Supercomputer 3D Digital Twin for User Focused Real-Time Monitoring.

Antonio Rosa

Charles Yee

Jeremy Kepner

CoRR, 2024

HPC with Enhanced User Separation.

CoRR, 2024

Anonymized Network Sensing Graph Challenge.

CoRR, 2024

What is Normal? A Big Data Observational Science Model of Anonymized Internet Traffic.

CoRR, 2024

Towards Scalable and Efficient Spiking Reinforcement Learning for Continuous Control Tasks.

Proceedings of the International Conference on Neuromorphic Systems, 2024

2023

Combining multitask and transfer learning with deep Gaussian processes for autotuning-based performance engineering.

Wissam M. Sid-Lakhdar

Int. J. High Perform. Comput. Appl., July, 2023

CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT).

CoRR, 2023

Randomized Numerical Linear Algebra: A Perspective on the Field With an Eye to Software.

CoRR, 2023

GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure.

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

PAQR: Pivoting Avoiding QR factorization.

Wissam M. Sid-Lakhdar

David B. Williams-Young

Timothy A. Davis

Hartwig Anzt

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Using Additive Modifications in LU Factorization Instead of Pivoting.

Proceedings of the 37th International Conference on Supercomputing, 2023

Towards the FAIR Asset Tracking Across Models, Datasets, and Performance Evaluation Scenarios.

Tokey Tahmid

Proceedings of the IEEE High Performance Extreme Computing Conference, 2023

2022

Software for "Threshold Pivoting in LU Factorizations".

Dataset, May, 2022

Software for "Threshold Pivoting for dense LU Factorization".

Dataset, May, 2022

Accelerating Restarted GMRES With Mixed Precision Arithmetic.

IEEE Trans. Parallel Distributed Syst., 2022

OpenMP application experiences: Porting to accelerated nodes.

Parallel Comput., 2022

Aggregation of clans to speed-up solving linear systems on parallel architectures.

Dmitry A. Zaitsev

Tatiana R. Shmeleva

Int. J. Parallel Emergent Distributed Syst., 2022

Challenges of and Opportunities for a Large Diverse Software Team.

Comput. Sci. Eng., 2022

AI Benchmarking for Science: Efforts from the MLCommons Science Working Group.

Christine R. Kirkpatrick

Proceedings of the High Performance Computing. ISC High Performance 2022 International Workshops - Hamburg, Germany, May 29, 2022

Mixed-Precision Algorithm for Finding Selected Eigenvalues and Eigenvectors of Symmetric and Hermitian Matrices.

Yaohung M. Tsai

Sivasankaran Rajamanickam

Proceedings of the IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems, 2022

Threshold Pivoting for Dense LU Factorization.

Proceedings of the IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems, 2022

High-Performance GMRES Multi-Precision Benchmark: Design, Performance, and Challenges.

Proceedings of the IEEE/ACM International Workshop on Performance Modeling, 2022

Deep Gaussian process with multitask and transfer learning for performance optimization.

Wissam M. Sid-Lakhdar

Mohsen Aznaveh

Proceedings of the IEEE High Performance Extreme Computing Conference, 2022

Surrogate ML/AI Model Benchmarking for FAIR Principles' Conformance.

Cade Brown

Proceedings of the IEEE High Performance Extreme Computing Conference, 2022

Proposed Consistent Exception Handling for the BLAS and LAPACK.

Proceedings of the Sixth IEEE/ACM International Workshop on Software Correctness for HPC Applications, 2022

2021

A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines.

ACM Trans. Math. Softw., 2021

Translational process: Mathematical software perspective.

J. Comput. Sci., 2021

A survey of numerical linear algebra methods utilizing mixed-precision arithmetic.

Int. J. High Perform. Comput. Appl., 2021

Materials Fingerprinting Classification.

Cassie Putman Micucci

Peter K. Liaw

Louis Joseph Santodonato

David J. Keffer

Vasileios Maroulas

Comput. Phys. Commun., 2021

Task-graph scheduling extensions for efficient synchronization and communication.

Proceedings of the ICS '21: 2021 International Conference on Supercomputing, 2021

2020

Software for Linear Algebra Targeting Exascale (SLATE) with a Recursive Butterfly Transform based solver.

Dataset, August, 2020

A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic.

CoRR, 2020

Improving the Performance of the GMRES Method Using Mixed-Precision Techniques.

Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

Replacing Pivoting in Distributed Gaussian Elimination with Randomized Techniques.

Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2020

Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Scalable Data Generation for Evaluating Mixed-Precision Solvers.

Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020

Docker container based PaaS cloud computing comprehensive benchmarks using LAPACK.

Dmitry Zaitsev

Proceedings of The Third International Workshop on Computer Modeling and Intelligent Systems (CMIS-2020), 2020

2019

PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP.

ACM Trans. Math. Softw., 2019

Software-Defined Events through PAPI.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019

Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators.

Ichitaro Yamazaki

Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

2018

Autotuning Techniques for Performance-Portable Point Set Registration in 3D.

Supercomput. Front. Innov., 2018

The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale.

SIAM Rev., 2018

Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators.

Proc. IEEE, 2018

Task based Cholesky decomposition on Xeon Phi architectures using OpenMP.

Int. J. Comput. Sci. Eng., 2018

2017

Design and Implementation of the PULSAR Programming System for Large Scale Computing.

Supercomput. Front. Innov., 2017

Porting the PLASMA Numerical Library to the OpenMP Standard.

Int. J. Parallel Program., 2017

With Extreme Computing, the Rules Have Changed.

Comput. Sci. Eng., 2017

Interoperable Convergence of Storage, Networking and Computation.

Micah Beck

Terry Moore

CoRR, 2017

Improving Performance of GMRES by Reducing Communication and Pipelining Global Collectives.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Autotuning Batch Cholesky Factorization in CUDA with Interleaved Layout of Matrices.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Towards numerical benchmark for half-precision floating point arithmetic.

Proceedings of the 2017 IEEE High Performance Extreme Computing Conference, 2017

Scaling point set registration in 3D across thread counts on multicore and hardware accelerator platforms through autotuning for large scale analysis of scientific point clouds.

Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

Bringing High Performance Computing to Big Data Algorithms.

Proceedings of the Handbook of Big Data Technologies, 2017

2016

High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems.

Michael A. Heroux

Int. J. High Perform. Comput. Appl., 2016

Linear algebra software for large-scale accelerated multicore computing.

Acta Numer., 2016

Task-Based Cholesky Decomposition on Knights Corner Using OpenMP.

Proceedings of the High Performance Computing, 2016

Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks.

Proceedings of the 2nd Workshop on Machine Learning in HPC Environments, 2016

Heterogeneous Streaming.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Search Space Generation and Pruning System for Autotuners.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures.

Yulu Jia

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

2015

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems.

Supercomput. Front. Innov., 2015

HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.

Sci. Program., 2015

Acceleration of GPU-based Krylov solvers via data transfer reduction.

Int. J. High Perform. Comput. Appl., 2015

A survey of recent developments in parallel implementations of Gaussian elimination.

Concurr. Comput. Pract. Exp., 2015

Experiences in autotuning matrix multiplication for energy minimization on GPUs.

Concurr. Comput. Pract. Exp., 2015

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations.

Proceedings of the High Performance Computing - 30th International Conference, 2015

Randomized algorithms to update partial singular value decomposition on a hybrid CPU/GPU cluster.

Proceedings of the International Conference for High Performance Computing, 2015

Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs.

Proceedings of the International Conference for High Performance Computing, 2015

Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators.

Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2015

Optimization for performance and energy for batched matrix computations on GPUs.

Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015

Towards batched linear solvers on accelerated hardware platforms.

Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing.

Proceedings of the 2015 IEEE High Performance Extreme Computing Conference, 2015

Flexible Linear Algebra Development and Scheduling with Cholesky Factorization.

Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

2014

Model-Driven One-Sided Factorizations on Multicore Accelerated Systems.

Supercomput. Front. Innov., 2014

Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime.

Parallel Process. Lett., 2014

Looking back at dense linear algebra software.

Jakub Kurzak

J. Parallel Distributed Comput., 2014

Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting.

Concurr. Comput. Pract. Exp., 2014

BlackjackBench: Portable Hardware Characterization with Automated Results' Analysis.

Comput. J., 2014

Heterogenous Acceleration for Linear Algebra in Multi-coprocessor Environments.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

Performance and portability with OpenCL for throughput-oriented HPC workloads across accelerators, coprocessors, and multicore processors.

Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014

clMAGMA: high performance dense linear algebra with OpenCL.

Proceedings of the International Workshop on OpenCL, 2014

New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem.

Azzam Haidar

Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Optimizing Krylov Subspace Solvers on Graphics Processing Units.

Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Parallel Simulation of Superscalar Scheduling.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU.

Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

Accelerating Numerical Dense Linear Algebra Calculations with GPUs.

Proceedings of the Numerical Computations with GPUs, 2014

2013

LU Factorization with Partial Pivoting for a Multicore System with Accelerators.

IEEE Trans. Parallel Distributed Syst., 2013

High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures.

Hatem Ltaief

ACM Trans. Math. Softw., 2013

Soft error resilient QR factorization for hybrid system with GPGPU.

J. Comput. Sci., 2013

CPU-GPU hybrid bidiagonal reduction with soft error resilience.

Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2013

Parallel reduction to Hessenberg form with algorithm-based fault tolerance.

Proceedings of the International Conference for High Performance Computing, 2013

An improved parallel singular value algorithm and its implementation for multicore hardware.

Azzam Haidar

Jakub Kurzak

Proceedings of the International Conference for High Performance Computing, 2013

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.

Proceedings of the Parallel Processing and Applied Mathematics, 2013

Virtual Systolic Array for QR Decomposition.

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC.

Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

2012

BlackjackBench: portable hardware characterization.

SIGMETRICS Perform. Evaluation Rev., 2012

Multi-GPU Implementation of LU Factorization.

Yulu Jia

Proceedings of the International Conference on Computational Science, 2012

High Performance Dense Linear System Solver with Resilience to Multiple Soft Errors.

Peng Du

Proceedings of the International Conference on Computational Science, 2012

From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming.

Parallel Comput., 2012

Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency.

Hatem Ltaief

Comput. Sci. Res. Dev., 2012

Programming the LU Factorization for a Multicore System with Accelerators.

Proceedings of the High Performance Computing for Computational Science, 2012

A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Measuring Energy and Power with PAPI.

Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

Anatomy of a globally recursive embedded LINPACK benchmark.

Proceedings of the IEEE Conference on High Performance Extreme Computing, 2012

Scalable Dense Linear Algebra on Heterogeneous Hardware.

Proceedings of the Transition of HPC Towards Exascale Computing, 2012

GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement.

Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

Energy Footprint of Advanced Dense Numerical Linear Algebra Using Tile Algorithms on Multicore Architectures.

Proceedings of the 2012 Second International Conference on Cloud and Green Computing, 2012

Dense Linear Algebra on Accelerated Multicore Hardware.

Proceedings of the High-Performance Scientific Computing - Algorithms and Applications, 2012

2011

TOP500.
