2024
Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities.
IEEE Trans. Parallel Distributed Syst., September, 2024
ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024
2023
DPU Offloading Programming with the OpenMP API.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023
OpenMP Offloading to DPU.
Proceedings of the IEEE International Conference on Cluster Computing, 2023
A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code.
Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, 2023
2022
IEEE Trans. Parallel Distributed Syst., 2022
Enabling Homomorphically Encrypted Inference for Large DNN Models.
IEEE Trans. Computers, 2022
cuConv: CUDA implementation of convolution for CNN inference.
Clust. Comput., 2022
OmpSs-2 and OpenACC Interoperation.
Proceedings of the 9th Workshop on Accelerator Programming Using Directives, 2022
Towards OmpSs-2 and OpenACC interoperation.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022
ecoHMEM: Improving Object Placement Methodology for Hybrid Memory Systems in HPC.
Proceedings of the IEEE International Conference on Cluster Computing, 2022
2021
DMRlib: Easy-Coding and Efficient Resource Management for Job Malleability.
IEEE Trans. Computers, 2021
Static Graphs for Coding Productivity in OpenACC.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021
JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021
Particle-In-Cell Simulation Using Asynchronous Tasking.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021
2020
Analysis of Threading Libraries for High Performance Computing.
IEEE Trans. Computers, 2020
Guest editorial: Special Issue on Applications and System Software for Hybrid Exascale Systems.
Parallel Comput., 2020
2019
MPI+OpenMP tasking scalability for multi-morphology simulations of the human brain.
Parallel Comput., 2019
Integrating blocking and non-blocking MPI primitives with task-based programming models.
Parallel Comput., 2019
Dynamic reconfiguration of noniterative scientific applications: A case study with HPG aligner.
Int. J. High Perform. Comput. Appl., 2019
Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs.
IEEE Access, 2019
Tasking in Accelerators: Performance Evaluation.
Proceedings of the 20th International Conference on Parallel and Distributed Computing, 2019
Introduction to AsHES 2019.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019
2018
Dynamic Adaptable Asynchronous Progress Model for MPI RMA Multiphase Applications.
IEEE Trans. Parallel Distributed Syst., 2018
Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models.
J. Supercomput., 2018
Understanding memory access patterns using the BSC performance tools.
Parallel Comput., 2018
DMR API: Improving cluster productivity by turning applications into malleable.
Parallel Comput., 2018
Special issue on applications for the heterogeneous computing era 2017.
Parallel Comput., 2018
On the adequacy of lightweight thread approaches for high-level parallel programming models.
Future Gener. Comput. Syst., 2018
cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs.
Concurr. Comput. Pract. Exp., 2018
MPI+OpenMP Tasking Scalability for the Simulation of the Human Brain: Human Brain Project.
Proceedings of the 25th European MPI Users' Group Meeting, 2018
Improving the Interoperability between MPI and Task-Based Programming Models.
Proceedings of the 25th European MPI Users' Group Meeting, 2018
Exploring the Vision Processing Unit as Co-Processor for Inference.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018
Introduction to AsHES 2018.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018
2017
Special Issue on Topics on Heterogeneous Computing.
Parallel Comput., 2017
NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch.
Proceedings of the Parallel Processing and Applied Mathematics, 2017
Chai: Collaborative heterogeneous applications for integrated-architectures.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017
Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques.
Proceedings of the International Conference on Supercomputing, 2017
Integrating Memory Perspective into the BSC Performance Tools.
Proceedings of the 46th International Conference on Parallel Processing Workshops, 2017
Efficient Scalable Computing through Flexible Applications and Adaptive Workloads.
Proceedings of the 46th International Conference on Parallel Processing Workshops, 2017
Efficient Data Sharing on Heterogeneous Systems.
Proceedings of the 46th International Conference on Parallel Processing, 2017
GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations.
Proceedings of the 46th International Conference on Parallel Processing, 2017
cuHinesBatch: Solving Multiple Hines systems on GPUs Human Brain Project*.
Proceedings of the International Conference on Computational Science, 2017
GLT: A Unified API for Lightweight Thread Libraries.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017
Automating the Application Data Placement in Hybrid Memory Systems.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017
2016
A data-oriented profiler to assist in data partitioning and distribution for heterogeneous memory in HPC.
Parallel Comput., 2016
MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL.
Parallel Comput., 2016
Evaluating the effect of last-level cache sharing on integrated GPU-CPU systems with heterogeneous applications.
Proceedings of the 2016 IEEE International Symposium on Workload Characterization, 2016
One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental.
Proceedings of the 45th International Conference on Parallel Processing, 2016
A Review of Lightweight Thread Approaches for High Performance Computing.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016
2015
Improving the user experience of the rCUDA remote GPU virtualization framework.
Concurr. Comput. Pract. Exp., 2015
VOCL-FT: introducing techniques for efficient soft error coprocessor recovery.
Proceedings of the International Conference for High Performance Computing, 2015
Casper: An Asynchronous Progress Model for MPI RMA on Many-Core Architectures.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015
Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Toward Implementing Robust Support for Portals 4 Networks in MPICH.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Understanding Data Access Patterns Using Object-Differentiated Memory Profiling.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
2014
A complete and efficient CUDA-sharing solution for HPC clusters.
Parallel Comput., 2014
MT-MPI: multithreaded MPI for many-core environments.
Proceedings of the 2014 International Conference on Supercomputing, 2014
A Framework for Tracking Memory Accesses in Scientific Applications.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014
Boosting the performance of remote GPU virtualization using InfiniBand connect-IB and PCIe 3.0.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014
Toward the efficient use of multiple explicitly managed memory subsystems.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014
2013
Analysis of topology-dependent MPI performance on Gemini networks.
Proceedings of the 20th European MPI Users' Group Meeting, 2013
Influence of InfiniBand FDR on the performance of remote GPU virtualization.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013
Evaluation of Inter- and Intra-node Data Transfer Efficiencies between GPU Devices and their Impact on Scalable Applications.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013
2012
CU2rCU: Towards the complete rCUDA remote GPU virtualization and sharing solution.
Proceedings of the 19th International Conference on High Performance Computing, 2012
2011
Performance of CUDA Virtualized Remote GPUs in High Performance Clusters.
Proceedings of the International Conference on Parallel Processing, 2011
Enabling CUDA acceleration within virtual machines using rCUDA.
Proceedings of the 18th International Conference on High Performance Computing, 2011
2010
rCUDA: Reducing the number of GPU-based accelerators in high performance clusters.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010
2009
An Efficient Implementation of GPU Virtualization in High Performance Clusters.
Proceedings of the Euro-Par 2009, 2009