Hari Subramoni

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures.

Bharath Ramesh

Dhabaleswar K. Panda

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR.

Dhabaleswar K. Panda

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.

Dhabaleswar K. Panda

Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications.

Aamir Shafi

Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters.

Qinghua Zhou

Proceedings of the IEEE International Conference on Cluster Computing, 2020

Design and Characterization of InfiniBand Hardware Tag Matching in MPI.

Seyedeh Mahdieh Ghazimirsaeed

Proceedings of the 20th IEEE/ACM International Symposium on Cluster, 2020

2019

Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.

IEEE Trans. Parallel Distributed Syst., 2019

Efficient design for MPI asynchronous progress without dedicated resources.

Amit Ruhela

Parallel Comput., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?

Parallel Comput., 2019

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow.

CoRR, 2019

Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences.

Proceedings of the High Performance Computing, 2019

Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2.

Proceedings of the IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2019

Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast.

Proceedings of the IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2019

OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.

Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera.

Proceedings of the Third IEEE/ACM Workshop on Deep Learning on Supercomputers, 2019

High performance distributed deep learning: a beginner's guide.

Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures.

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Fabric Adapter.

Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects, 2019

Designing a Profiling and Visualization Tool for Scalable and In-depth Analysis of High-Performance GPU Clusters.

Bharath Ramesh

Kaushik Kandadi Suresh

Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems.

Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters.

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures.

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures.

Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 2019

2018

MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU.

Parallel Comput., 2018

Networking and communication challenges for post-exascale systems.

Xiaoyi Lu

Frontiers Inf. Technol. Electron. Eng., 2018

Cooperative rendezvous protocols for improved performance and overlap.

Proceedings of the International Conference for High Performance Computing, 2018

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources.

Amit Ruhela

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.

Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

2017

Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication.

Proceedings of the High Performance Computing - 32nd International Conference, 2017

Scalable reduction collectives with data partitioning-based multi-leader design.

Proceedings of the International Conference for High Performance Computing, 2017

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures.

Proceedings of the Machine Learning on HPC Environments, 2017

MPI performance engineering with the MPI tool interface: the integration of MVAPICH and TAU.

Proceedings of the 24th European MPI Users' Group Meeting, 2017

Exploiting and Evaluating OpenSHMEM on KNL Architecture.

Mingzhe Li

Proceedings of the OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.

Bracy Elton

Proceedings of the 46th International Conference on Parallel Processing, 2017

Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand.

Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

Kernel-Assisted Communication Engine for MPI on Emerging Manycore Processors.

Khaled Hamidouche

Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

A Scalable Network-Based Performance Analysis Tool for MPI on Large-Scale HPC Systems.

Xiaoyi Lu

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Contention-Aware Kernel-Assisted MPI Collectives for Multi-/Many-Core Systems.

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.

Parallel Comput., 2016

INAM2: InfiniBand Network Analysis and Monitoring with MPI.

Albert Mathews Augustine

Proceedings of the High Performance Computing - 31st International Conference, 2016

Designing MPI library with on-demand paging (ODP) of infiniband: challenges and benefits.

Proceedings of the International Conference for High Performance Computing, 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.

Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.

Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

System-Level Scalable Checkpoint-Restart for Petascale Computing.

Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

Adaptive and Dynamic Design for MPI Tag Matching.

Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase.

Proceedings of the 2016 IEEE International Conference on Cloud Computing Technology and Science, 2016

SHMEMPMI - Shared Memory Based PMI for Improved Performance and Scalability.

Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015

Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.

Proceedings of the High Performance Computing - 30th International Conference, 2015

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.

Proceedings of the 22nd European MPI Users' Group Meeting, 2015

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-All Collective Algorithms.

Proceedings of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015

Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on InfiniBand Clusters.

Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Non-Blocking PMI Extensions for Fast MPI Startup.

Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014

Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences.

Proceedings of the Supercomputing - 29th International Conference, 2014

PMI Extensions for Scalable MPI Startup.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models.

Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

Wide-area overlay networking to manage science DMZ accelerated flows.

Proceedings of the International Conference on Computing, Networking and Communications, 2014

A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on Infiniband clusters.

Proceedings of the 21st International Conference on High Performance Computing, 2014

2013

MVAPICH-PRISM: a proxy-based communication framework using InfiniBand and SCIF for intel MIC clusters.

Proceedings of the International Conference for High Performance Computing, 2013

High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand.

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Extending OpenSHMEM for GPU Computing.

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

MIC-RO: enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand.

Khaled Hamidouche

Proceedings of the International Conference on Supercomputing, 2013

High-Performance Design of Hadoop RPC with RDMA over InfiniBand.

Proceedings of the 42nd International Conference on Parallel Processing, 2013

A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-blocking Alltoallv Collective on Multi-core Systems.

Proceedings of the 42nd International Conference on Parallel Processing, 2013

Design of network topology aware scheduling services for large InfiniBand clusters.

Devendar Bureddy

Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2012

Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes.

Raghunath Rajachandrasekar

Proceedings of the SC Conference on High Performance Computing Networking, 2012

High performance RDMA-based design of HDFS over InfiniBand.

Nusrat S. Islam

Md. Wasi-ur-Rahman

Jithin Jose

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Understanding the communication characteristics in HBase: What are the fundamental bottlenecks?

Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters.

S. Pai Raikar

Jérôme Vienne

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers.

Bronis R. de Supinski

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

High-Performance Design of HBase with RDMA over InfiniBand.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems.

Proceedings of the IEEE 20th Annual Symposium on High-Performance Interconnects, 2012

A Scalable InfiniBand Network Topology-Aware Performance Analysis Tool for MPI.

Jérôme Vienne

Raghunath Rajachandrasekar

Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework.

Jai Jaswani

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Can Network-Offload Based Non-blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?

Proceedings of the 2012 IEEE International Conference on Cluster Computing Workshops, 2012

Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports.

Jithin Jose

Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

2011

Collective Communication, Network Support For.

Proceedings of the Encyclopedia of Parallel Computing, 2011

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT.

Comput. Sci. Res. Dev., 2011

Codesign for InfiniBand Clusters.

Karen Tomko

Computer, 2011

Memcached Design on High Performance RDMA Capable Interconnects.

Proceedings of the International Conference on Parallel Processing, 2011

Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL.

Proceedings of the IEEE 19th Annual Symposium on High Performance Interconnects, 2011

INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool.

N. Dandapanthula

Jérôme Vienne

Ron Brightwell

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Design and Evaluation of Network Topology-/Speed-Aware Broadcast Algorithms for InfiniBand Clusters.

Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

2010

Intra-Socket and Inter-Socket Communication in Multi-core Systems.

IEEE Comput. Archit. Lett., 2010

Streaming, low-latency communication in on-line trading systems.

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather.

Abhinav Vishnu

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2.

Miao Luo

Ping Lai

Emilio Pasquale Mancini

Proceedings of the 39th International Conference on Parallel Processing, 2010

Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-core InfiniBand Clusters.

Proceedings of the 39th International Conference on Parallel Processing, 2010

Design and Evaluation of Generalized Collective Communication Primitives with Overlap Using ConnectX-2 Offload Engine.

Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010

High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand.

Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

High Performance Topology-Aware Communication in Multicore Processors.

Proceedings of the Scientific Computing with Multicore and Accelerators., 2010

2009

Designing multi-leader-based Allgather algorithms for multi-core clusters.

Gopalakrishnan Santhanaraman

Matthew J. Koop

Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand.

Proceedings of the ICPP 2009, 2009

Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms.

Matthew J. Koop