Scaling point set registration in 3D across thread counts on multicore and hardware accelerator platforms through autotuning for large scale analysis of scientific point clouds.

Piotr Luszczek

Jakub Kurzak

Ichitaro Yamazaki

David J. Keffer

Jack J. Dongarra

Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

Bringing High Performance Computing to Big Data Algorithms.

Proceedings of the Handbook of Big Data Technologies, 2017

2016

Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs.

IEEE Trans. Parallel Distributed Syst., 2016

Linear algebra software for large-scale accelerated multicore computing.

Acta Numer., 2016

Task-Based Cholesky Decomposition on Knights Corner Using OpenMP.

Proceedings of the High Performance Computing, 2016

Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks.

Proceedings of the 2nd Workshop on Machine Learning in HPC Environments, 2016

Search Space Generation and Pruning System for Autotuners.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

2015

Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems.

Supercomput. Front. Innov., 2015

A survey of recent developments in parallel implementations of Gaussian elimination.

Concurr. Comput. Pract. Exp., 2015

Experiences in autotuning matrix multiplication for energy minimization on GPUs.

Concurr. Comput. Pract. Exp., 2015

Mixed-precision block Gram Schmidt orthogonalization.

Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2015

Randomized algorithms to update partial singular value decomposition on a hybrid CPU/GPU cluster.

Proceedings of the International Conference for High Performance Computing, 2015

Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs.

Proceedings of the International Conference for High Performance Computing, 2015

Visualizing execution traces with task dependencies.

Proceedings of the 2nd Workshop on Visual Performance Analysis, 2015

Divide and Conquer Symmetric Tridiagonal Eigensolver for Multicore Architectures.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Accelerating collaborative filtering using concepts from high performance computing.

Proceedings of the 2015 IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, October 29, 2015

2014

Model-Driven One-Sided Factorizations on Multicore Accelerated Systems.

Supercomput. Front. Innov., 2014

Design and Implementation of a Large Scale Tree-Based QR Decomposition Using a 3D Virtual Systolic Array and a Lightweight Runtime.

Parallel Process. Lett., 2014

Looking back at dense linear algebra software.

Piotr Luszczek

Jakub Kurzak

Jack J. Dongarra

J. Parallel Distributed Comput., 2014

Search Space Pruning Constraints Visualization.

Blake Haugen

Jakub Kurzak

Proceedings of the Second IEEE Working Conference on Software Visualization, 2014

Parallel Simulation of Superscalar Scheduling.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

Access-averse framework for computing low-rank matrix approximations.

Proceedings of the 2014 IEEE International Conference on Big Data (IEEE BigData 2014), 2014

Accelerating Numerical Dense Linear Algebra Calculations with GPUs.

Proceedings of the Numerical Computations with GPUs, 2014

2013

LU Factorization with Partial Pivoting for a Multicore System with Accelerators.

IEEE Trans. Parallel Distributed Syst., 2013

An improved parallel singular value algorithm and its implementation for multicore hardware.

Azzam Haidar

Jakub Kurzak

Piotr Luszczek

Proceedings of the International Conference for High Performance Computing, 2013

Virtual Systolic Array for QR Decomposition.

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Implementing a Systolic Algorithm for QR Factorization on Multicore Clusters with PaRSEC.

Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

2012

Autotuning GEMM Kernels for the Fermi GPU.

Jakub Kurzak

Stanimire Tomov

Jack J. Dongarra

IEEE Trans. Parallel Distributed Syst., 2012

Programming the LU Factorization for a Multicore System with Accelerators.

Proceedings of the High Performance Computing for Computational Science, 2012

Scalable Dense Linear Algebra on Heterogeneous Hardware.

Proceedings of the Transition of HPC Towards Exascale Computing, 2012

Dense Linear Algebra on Accelerated Multicore Hardware.

Proceedings of the High-Performance Scientific Computing - Algorithms and Applications., 2012

2011

Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

2010

Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures.

Hatem Ltaief

Jakub Kurzak

Jack J. Dongarra

IEEE Trans. Parallel Distributed Syst., 2010

Scheduling two-sided transformations using tile algorithms on multicore architectures.

Sci. Program., 2010

Scheduling dense linear algebra operations on multicore processors.

Concurr. Comput. Pract. Exp., 2010

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2010, 2010

An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs.

Proceedings of the Applied Parallel and Scientific Computing, 2010

Multicore and Manycore Programming.

Proceedings of the Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31, 2010

Implementing Matrix Factorizations on the Cell B. E.

Jakub Kurzak

Jack J. Dongarra

Proceedings of the Scientific Computing with Multicore and Accelerators., 2010

Implementing Matrix Multiplication on the Cell B. E.

Wesley Alvaro

Jakub Kurzak

Jack J. Dongarra

Proceedings of the Scientific Computing with Multicore and Accelerators., 2010

2009

QR factorization for the Cell Broadband Engine.

Jakub Kurzak

Jack J. Dongarra

Sci. Program., 2009

Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor.

Jakub Kurzak

Wesley Alvaro

Jack J. Dongarra

Parallel Comput., 2009

A class of parallel tiled linear algebra algorithms for multicore architectures.

Parallel Comput., 2009

Accelerating scientific computations with mixed precision algorithms.

Comput. Phys. Commun., 2009

2008

Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization.

Jakub Kurzak

Alfredo Buttari

Jack J. Dongarra

IEEE Trans. Parallel Distributed Syst., 2008

Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy.

ACM Trans. Math. Softw., 2008

Automatic Generation of FFT for Translations of Multipole Expansions in Spherical Harmonics.

Jakub Kurzak

Dragan Mirkovic

B. Montgomery Pettitt

S. Lennart Johnsson

Int. J. High Perform. Comput. Appl., 2008

The PlayStation 3 for High-Performance Scientific Computing.

Comput. Sci. Eng., 2008

Parallel tiled QR factorization for multicore architectures.

Concurr. Comput. Pract. Exp., 2008

Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the Synergistic Processing Element of the CELL Processor.

Wesley Alvaro

Jakub Kurzak

Jack J. Dongarra

Proceedings of the Computational Science, 2008

Scheduling for Numerical Linear Algebra Library at Scale.

Proceedings of the High Speed and Large Scale Scientific Computing - Selected Papers from the High Performance Computing Workshop, Cetraro, Italy, June 30, 2008

2007

Prospectus for a Dense Linear Algebra Software Library.

Proceedings of the Handbook of Parallel Computing - Models, Algorithms and Applications., 2007

Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems.

Int. J. High Perform. Comput. Appl., 2007

Implementation of mixed precision in solving systems of linear equations on the Cell processor.

Jakub Kurzak

Jack J. Dongarra

Concurr. Comput. Pract. Exp., 2007

Introduction to Programming High Performance Applications on the CELL Broadband Engine.

Jakub Kurzak

Alfredo Buttari

Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, 2007

2006

Tools and techniques for performance - Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems).

Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Poster reception - Targeting multi-core architectures for linear algebra applications.
