Azzam Haidar

Orcid: 0000-0002-3177-2084

According to our database, Azzam Haidar authored at least 79 papers between 2008 and 2023.





cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science.
Proceedings of the IEEE International Conference on Quantum Computing and Engineering, 2023

Performance Analysis of Parallel FFT on Large Multi-GPU Systems.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines.
ACM Trans. Math. Softw., 2021

Accelerating Multi-Process Communication for Parallel 3-D FFT.
Proceedings of the Workshop on Exascale MPI, 2021

MAGMA templates for scalable linear algebra on emerging architectures.
Int. J. High Perform. Comput. Appl., 2020

heFFTe: Highly Efficient FFT for Exascale.
Proceedings of the Computational Science - ICCS 2020, 2020

PLASMA: Parallel Linear Algebra Software for Multicore Using OpenMP.
ACM Trans. Math. Softw., 2019

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices.
Parallel Comput., 2019

Evaluation of directive-based performance portable programming models.
Int. J. High Perform. Comput. Netw., 2019

Investigating power capping toward energy-efficient scientific applications.
Concurr. Comput. Pract. Exp., 2019

A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations.
IEEE Trans. Parallel Distributed Syst., 2018

Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs.
IEEE Trans. Parallel Distributed Syst., 2018

The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale.
SIAM Rev., 2018

Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs.
J. Comput. Sci., 2018

Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures.
J. Comput. Sci., 2018

Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers.
Proceedings of the International Conference for High Performance Computing, 2018

The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques.
Proceedings of the Computational Science - ICCS 2018, 2018

Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization.
Proceedings of the 2018 IEEE High Performance Extreme Computing Conference, 2018

Fast Cholesky factorization on GPUs for batch and native modes in MAGMA.
J. Comput. Sci., 2017

With Extreme Computing, the Rules Have Changed.
Comput. Sci. Eng., 2017

A Framework for Out of Memory SVD Algorithms.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

Investigating half precision arithmetic to accelerate dense linear system solvers.
Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

High-performance Cholesky factorization for GPU-only execution.
Proceedings of the General Purpose GPUs, 2017

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs.
Proceedings of the International Conference on Supercomputing, 2017

Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices.
Proceedings of the International Conference on Computational Science, 2017

Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures.
Proceedings of the International Conference on Computational Science, 2017

Out of memory SVD solver for big data.
Proceedings of the 2017 IEEE High Performance Extreme Computing Conference, 2017

Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi.
Proceedings of the 2017 IEEE High Performance Extreme Computing Conference, 2017

Linear algebra software for large-scale accelerated multicore computing.
Acta Numer., 2016

Performance, Design, and Autotuning of Batched GEMM for GPUs.
Proceedings of the High Performance Computing - 31st International Conference, 2016

Towards Achieving Performance Portability Using Directives for Accelerators.
Proceedings of the Third Workshop on Accelerator Programming Using Directives, 2016

Heterogeneous Streaming.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs.
Proceedings of the International Conference on Computational Science 2016, 2016

High-Performance Tensor Contractions for GPUs.
Proceedings of the International Conference on Computational Science 2016, 2016

LU, QR, and Cholesky factorizations: Programming model, performance analysis and optimization techniques for the Intel Knights Landing Xeon Phi.
Proceedings of the 2016 IEEE High Performance Extreme Computing Conference, 2016

Performance analysis and acceleration of explicit integration for large kinetic networks using batched GPU computations.
Proceedings of the 2016 IEEE High Performance Extreme Computing Conference, 2016

High-Performance Matrix-Matrix Multiplications of Very Small Matrices.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems.
Supercomput. Front. Innov., 2015

HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.
Sci. Program., 2015

On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors.
Proceedings of the High Performance Computing - 30th International Conference, 2015

A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations.
Proceedings of the High Performance Computing - 30th International Conference, 2015

Performance analysis and design of a Hessenberg reduction using stabilized blocked elementary transformations for new architectures.
Proceedings of the Symposium on High Performance Computing, 2015

Efficient implementation of quantum materials simulations on distributed CPU-GPU systems.
Proceedings of the International Conference for High Performance Computing, 2015

Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators.
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2015

Optimization for performance and energy for batched matrix computations on GPUs.
Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015

Towards batched linear solvers on accelerated hardware platforms.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Performance Analysis and Optimisation of Two-sided Factorization Algorithms for Heterogeneous Platform.
Proceedings of the International Conference on Computational Science, 2015

MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing.
Proceedings of the 2015 IEEE High Performance Extreme Computing Conference, 2015

Flexible Linear Algebra Development and Scheduling with Cholesky Factorization.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Model-Driven One-Sided Factorizations on Multicore Accelerated Systems.
Supercomput. Front. Innov., 2014

A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks.
Int. J. High Perform. Comput. Appl., 2014

Heterogenous Acceleration for Linear Algebra in Multi-coprocessor Environments.
Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

Accelerating Computation of Eigenvectors in the Dense Nonsymmetric Eigenvalue Problem.
Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

Performance and portability with OpenCL for throughput-oriented HPC workloads across accelerators, coprocessors, and multicore processors.
Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014

New Algorithm for Computing Eigenvectors of the Symmetric Eigenvalue Problem.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

A Fast Batched Cholesky Factorization on a GPU.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

LU Factorization of Small Matrices: Accelerating Batched DGETRF on the GPU.
Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

Accelerating Numerical Dense Linear Algebra Calculations with GPUs.
Proceedings of the Numerical Computations with GPUs, 2014

Parallel algebraic domain decomposition solver for the solution of augmented systems.
Adv. Eng. Softw., 2013

Leading Edge Hybrid Multi-GPU Algorithms for Generalized Eigenproblems in Electronic Structure Calculations.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

An improved parallel singular value algorithm and its implementation for multicore hardware.
Proceedings of the International Conference for High Performance Computing, 2013

Portable HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communication.
Proceedings of the International Conference on Supercomputing, 2013

Toward a High Performance Tile Divide and Conquer Algorithm for the Dense Symmetric Eigenvalue Problem.
SIAM J. Sci. Comput., 2012

A hybrid Hermitian general eigenvalue solver.
CoRR, 2012

Poster: A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: A Novel Hybrid CPU-GPU Generalized Eigensolver for Electronic Structure Calculations Based on Fine Grained Memory Aware Tasks.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures.
Concurr. Comput. Pract. Exp., 2011

Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels.
Proceedings of the Conference on High Performance Computing Networking, 2011

Solving the Generalized Symmetric Eigenvalue Problem using Tile Algorithms on Multicore Architectures.
Proceedings of the Applications, Tools and Techniques on the Road to Exascale Computing, Proceedings of the conference ParCo 2011, 31 August, 2011

Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Using multiple levels of parallelism to enhance the performance of domain decomposition solvers.
Parallel Comput., 2010

Parallel algebraic hybrid solvers for large 3D convection-diffusion problems.
Numer. Algorithms, 2009

On the parallel scalability of hybrid linear solvers for large 3D problems. (Sur l'extensibilité parallèle de solveurs linéaires hybrides pour des problèmes tridimensionels de grandes tailles).
PhD thesis, 2008

Parallel scalability study of hybrid preconditioners in three dimensions.
Parallel Comput., 2008
