2025
Unified Collective Communication: A Unified Library for CPU, GPU, and DPU Collectives.
IEEE Micro, 2025

2024
Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI.
Proceedings of the International Conference for High Performance Computing, 2024

Offloaded MPI message matching: an optimistic approach.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Protocol Buffer Deserialization DPU Offloading in the RPC Datapath.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

2021
NVIDIA's Cloud Native Supercomputing.
Proceedings of the Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation, 2021

2020
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation.
Proceedings of the High Performance Computing - 35th International Conference, 2020

2019
Accelerating OpenSHMEM Collectives Using In-Network Computing Approach.
Proceedings of the 31st International Symposium on Computer Architecture and High Performance Computing, 2019

2017
Towards A Data Centric System Architecture: SHARP.
Supercomput. Front. Innov., 2017

2016
Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction.
Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

2010
Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

2007
Investigations on InfiniBand: Efficient Network Buffer Utilization at Scale.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007