Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities.
IEEE Trans. Parallel Distributed Syst., September, 2024

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

DPU Offloading Programming with the OpenMP API.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

OpenMP Offloading to DPU.
Proceedings of the IEEE International Conference on Cluster Computing, 2023

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code.
Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, 2023

Guest Editorial.
IEEE Trans. Parallel Distributed Syst., 2022

Enabling Homomorphically Encrypted Inference for Large DNN Models.
IEEE Trans. Computers, 2022

cuConv: CUDA implementation of convolution for CNN inference.
Clust. Comput., 2022

OmpSs-2 and OpenACC Interoperation.
Proceedings of the 9th Workshop on Accelerator Programming Using Directives, 2022

Towards OmpSs-2 and OpenACC interoperation.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

ecoHMEM: Improving Object Placement Methodology for Hybrid Memory Systems in HPC.
Proceedings of the IEEE International Conference on Cluster Computing, 2022

DMRlib: Easy-Coding and Efficient Resource Management for Job Malleability.
IEEE Trans. Computers, 2021

Static Graphs for Coding Productivity in OpenACC.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Particle-In-Cell Simulation Using Asynchronous Tasking.
Proceedings of Euro-Par 2021: Parallel Processing, 2021

Analysis of Threading Libraries for High Performance Computing.
IEEE Trans. Computers, 2020

Guest editorial: Special Issue on Applications and System Software for Hybrid Exascale Systems.
Parallel Comput., 2020

MPI+OpenMP tasking scalability for multi-morphology simulations of the human brain.
Parallel Comput., 2019

Integrating blocking and non-blocking MPI primitives with task-based programming models.
Parallel Comput., 2019

Dynamic reconfiguration of noniterative scientific applications: A case study with HPG aligner.
Int. J. High Perform. Comput. Appl., 2019

Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs.
IEEE Access, 2019

Tasking in Accelerators: Performance Evaluation.
Proceedings of the 20th International Conference on Parallel and Distributed Computing, 2019

Introduction to AsHES 2019.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019

Dynamic Adaptable Asynchronous Progress Model for MPI RMA Multiphase Applications.
IEEE Trans. Parallel Distributed Syst., 2018

Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models.
J. Supercomput., 2018

Understanding memory access patterns using the BSC performance tools.
Parallel Comput., 2018

DMR API: Improving cluster productivity by turning applications into malleable.
Parallel Comput., 2018

Special issue on applications for the heterogeneous computing era 2017.
Parallel Comput., 2018

On the adequacy of lightweight thread approaches for high-level parallel programming models.
Future Gener. Comput. Syst., 2018

cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs.
Concurr. Comput. Pract. Exp., 2018

MPI+OpenMP Tasking Scalability for the Simulation of the Human Brain: Human Brain Project.
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Improving the Interoperability between MPI and Task-Based Programming Models.
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Exploring the Vision Processing Unit as Co-Processor for Inference.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Introduction to AsHES 2018.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Special Issue on Topics on Heterogeneous Computing.
Parallel Comput., 2017

NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch.
Proceedings of Parallel Processing and Applied Mathematics, 2017

Chai: Collaborative heterogeneous applications for integrated-architectures.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques.
Proceedings of the International Conference on Supercomputing, 2017

Integrating Memory Perspective into the BSC Performance Tools.
Proceedings of the 46th International Conference on Parallel Processing Workshops, 2017

Efficient Scalable Computing through Flexible Applications and Adaptive Workloads.
Proceedings of the 46th International Conference on Parallel Processing Workshops, 2017

Efficient Data Sharing on Heterogeneous Systems.
Proceedings of the 46th International Conference on Parallel Processing, 2017

GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations.
Proceedings of the 46th International Conference on Parallel Processing, 2017

cuHinesBatch: Solving Multiple Hines systems on GPUs Human Brain Project*.
Proceedings of the International Conference on Computational Science, 2017

GLT: A Unified API for Lightweight Thread Libraries.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

Automating the Application Data Placement in Hybrid Memory Systems.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

A data-oriented profiler to assist in data partitioning and distribution for heterogeneous memory in HPC.
Parallel Comput., 2016

MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL.
Parallel Comput., 2016

Evaluating the effect of last-level cache sharing on integrated GPU-CPU systems with heterogeneous applications.
Proceedings of the 2016 IEEE International Symposium on Workload Characterization, 2016

One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental.
Proceedings of the 45th International Conference on Parallel Processing, 2016

A Review of Lightweight Thread Approaches for High Performance Computing.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Improving the user experience of the rCUDA remote GPU virtualization framework.
Concurr. Comput. Pract. Exp., 2015

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery.
Proceedings of the International Conference for High Performance Computing, 2015

Casper: An Asynchronous Progress Model for MPI RMA on Many-Core Architectures.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

Toward Implementing Robust Support for Portals 4 Networks in MPICH.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

Understanding Data Access Patterns Using Object-Differentiated Memory Profiling.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

A complete and efficient CUDA-sharing solution for HPC clusters.
Parallel Comput., 2014

MT-MPI: multithreaded MPI for many-core environments.
Proceedings of the 2014 International Conference on Supercomputing, 2014

A Framework for Tracking Memory Accesses in Scientific Applications.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

Boosting the performance of remote GPU virtualization using InfiniBand connect-IB and PCIe 3.0.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

Toward the efficient use of multiple explicitly managed memory subsystems.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

Analysis of topology-dependent MPI performance on Gemini networks.
Proceedings of the 20th European MPI Users' Group Meeting, 2013

Influence of InfiniBand FDR on the performance of remote GPU virtualization.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

Evaluation of Inter- and Intra-node Data Transfer Efficiencies between GPU Devices and their Impact on Scalable Applications.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

CU2rCU: Towards the complete rCUDA remote GPU virtualization and sharing solution.
Proceedings of the 19th International Conference on High Performance Computing, 2012

Performance of CUDA Virtualized Remote GPUs in High Performance Clusters.
Proceedings of the International Conference on Parallel Processing, 2011

Enabling CUDA acceleration within virtual machines using rCUDA.
Proceedings of the 18th International Conference on High Performance Computing, 2011

rCUDA: Reducing the number of GPU-based accelerators in high performance clusters.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010

An Efficient Implementation of GPU Virtualization in High Performance Clusters.
Proceedings of Euro-Par 2009, 2009