Dhabaleswar K. Panda

Orcid: 0000-0002-0356-1781

Affiliations:
  • Ohio State University, Columbus, USA


According to our database, Dhabaleswar K. Panda authored at least 580 papers between 1988 and 2024.

Bibliography

2024
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.
CoRR, 2024

Accelerating communication with multi-HCA aware collectives in MPI.
Concurr. Comput. Pract. Exp., 2024

Creating intelligent cyberinfrastructure for democratizing AI.
AI Mag., 2024

OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices.
Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Infer-HiRes: Accelerating Inference for High-Resolution Images with Quantization and Distributed Deep Learning.
Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

OMB-FPGA: A Microbenchmark Suite for FPGA-aware MPIs using OpenCL and SYCL.
Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs.
Proceedings of the Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024

Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters.
Proceedings of the ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 2024

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

PML-MPI: A Pre-Trained ML Framework for Efficient Collective Algorithm Selection in MPI.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Towards Accelerating k-NN with MPI and Near-Memory Processing.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

HINT: Designing Cache-Efficient MPI_Alltoall using Hybrid Memory Copy Ordering and Non-Temporal Instructions.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Message from the HCW 2024 Technical Program Committee Co-Chairs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

The Case for Co-Designing Model Architectures with Hardware.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

OHIO: Improving RDMA Network Scalability in MPI_Alltoall Through Optimized Hierarchical and Intra/Inter-Node Communication Overlap Design.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Demystifying the Communication Characteristics for Distributed Transformer Models.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2024

Accelerating Large Language Model Training with Hybrid GPU-based Compression.
Proceedings of the 24th IEEE International Symposium on Cluster, 2024

2023
High Performance MPI over the Slingshot Interconnect.
J. Comput. Sci. Technol., February, 2023

Network-Assisted Noncontiguous Transfers for GPU-Aware MPI Libraries.
IEEE Micro, 2023

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version.
CoRR, 2023

DPU-Bench: A Micro-Benchmark Suite to Measure Offload Efficiency Of SmartNICs.
Proceedings of the Practice and Experience in Advanced Research Computing, 2023

Optimizing Amber for Device-to-Device GPU Communication.
Proceedings of the Practice and Experience in Advanced Research Computing, 2023

SAI: AI-Enabled Speech Assistant Interface for Science Gateways in HPC.
Proceedings of the High Performance Computing - 38th International Conference, 2023

Democratizing HPC Access and Use with Knowledge Graphs.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

A Novel Framework for Efficient Offloading of Communication Operations to Bluefield SmartNICs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

In-Depth Evaluation of a Lower-Level Direct-Verbs API on InfiniBand-based Clusters: Early Experiences.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication.
Proceedings of the 37th International Conference on Supercomputing, 2023

Performance Characterization of Using Quantization for DNN Inference on Edge Devices.
Proceedings of the 7th IEEE International Conference on Fog and Edge Computing, 2023

Designing In-network Computing Aware Reduction Collectives in MPI.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2023

Battle of the BlueFields: An In-Depth Comparison of the BlueField-2 and BlueField-3 SmartNICs.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2023

Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Optimized All-to-All Connection Establishment for High-Performance MPI Libraries Over InfiniBand.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

How to Educate HPC-Enabled AI and Data Science to Students and Professionals in a Holistic Manner?
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Implementing and Optimizing a GPU-aware MPI Library for Intel GPUs: Early Experiences.
Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

ScaMP: Scalable Meta-Parallelism for Deep Learning Search.
Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training.
Proceedings of the IEEE International Conference on Big Data, 2023

MPI4Spark Meets YARN: Enhancing MPI4Spark through YARN support for HPC.
Proceedings of the IEEE International Conference on Big Data, 2023

Benchmarking Modern Databases for Storing and Profiling Very Large Scale HPC Communication Data.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2023

2022
Optimizing Distributed DNN Training Using CPUs and BlueField-2 DPUs.
IEEE Micro, 2022

High Performance MPI over the Slingshot Interconnect: Early Experiences.
Proceedings of the PEARC '22: Practice and Experience in Advanced Research Computing, Boston, MA, USA, July 10, 2022

Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters.
Proceedings of the High Performance Computing - 37th International Conference, 2022

"Hey CAI" - Conversational AI Enabled User Interface for HPC Tools.
Proceedings of the High Performance Computing - 37th International Conference, 2022

Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters.
Proceedings of the High Performance Computing - 37th International Conference, 2022

Arm meets Cloud: A Case Study of MPI Library Performance on AWS Arm-based HPC Cloud with Elastic Fabric Adapter.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Challenges and Opportunities in Designing High-Performance and Scalable Middleware for HPC and AI: Past, Present, and Future.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Highly Efficient Alltoall and Alltoallv Communication Algorithms for GPU Systems.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Towards Java-based HPC using the MVAPICH2 Library: Early Experiences.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Designing Hierarchical Multi-HCA Aware Allgather in MPI.
Proceedings of the Workshop Proceedings of the 51st International Conference on Parallel Processing, 2022

Network Assisted Non-Contiguous Transfers for GPU-Aware MPI Libraries.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2022

Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Designing Efficient Pipelined Communication Schemes using Compression in MPI Libraries.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022


Spark Meets MPI: Towards High-Performance Communication Framework for Spark using MPI.
Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021
The MVAPICH project: Transforming research into high-performance MPI library for HPC community.
J. Comput. Sci., 2021

Cross-layer Visualization and Profiling of Network and I/O Communication for HPC Clusters.
CoRR, 2021

INAM: Cross-stack Profiling and Analysis of Communication in MPI-based Applications.
Proceedings of the PEARC '21: Practice and Experience in Advanced Research Computing, 2021

Designing a ROCm-Aware MPI Library for AMD GPUs: Early Experiences.
Proceedings of the High Performance Computing - 36th International Conference, 2021

BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs.
Proceedings of the High Performance Computing - 36th International Conference, 2021

Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

SUPER: SUb-Graph Parallelism for TransformERs.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs.
Proceedings of the IEEE Symposium on High-Performance Interconnects, 2021

Layout-aware Hardware-assisted Designs for Derived Data Types in MPI.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Large-Message Nonblocking MPI_Iallgather and MPI_Ibcast Offload via BlueField-2 DPU.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

DistMILE: A Distributed Multi-Level Framework for Scalable Graph Embedding.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Efficient MPI-based Communication for GPU-Accelerated Dask Applications.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021

2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020

FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures.
J. Parallel Distributed Comput., 2020

Future Directions of the Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Program.
CoRR, 2020

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications.
Concurr. Comput. Pract. Exp., 2020

Frontera: The Evolution of Leadership Computing at the National Science Foundation.
Proceedings of the PEARC '20: Practice and Experience in Advanced Research Computing, 2020

Accelerated Real-time Network Monitoring and Profiling at Scale using OSU INAM.
Proceedings of the PEARC '20: Practice and Experience in Advanced Research Computing, 2020

Communication-Aware Hardware-Assisted MPI Overlap Engine.
Proceedings of the High Performance Computing - 35th International Conference, 2020

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow.
Proceedings of the High Performance Computing - 35th International Conference, 2020

MPI Meets Cloud: Case Study with Amazon EC2 and Microsoft Azure.
Proceedings of the Fourth IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2020

Exploring Hybrid MPI+Kokkos Tasks Programming Model.
Proceedings of the 3rd IEEE/ACM Annual Parallel Applications Workshop: Alternatives To MPI+X, 2020

GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training.
Proceedings of the International Conference for High Performance Computing, 2020

Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR.
Proceedings of the 6th IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, 2020

Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System.
Proceedings of the Workshop on Exascale MPI, 2020

Performance Characterization of Network Mechanisms for Non-Contiguous Data Transfers in MPI.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Analyzing and Understanding the Impact of Interconnect Performance on HPC, Big Data, and Deep Learning Applications: A Case Study with InfiniBand EDR and HDR.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications.
Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

Design and Characterization of InfiniBand Hardware Tag Matching in MPI.
Proceedings of the 20th IEEE/ACM International Symposium on Cluster, 2020

2019
Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast.
IEEE Trans. Parallel Distributed Syst., 2019

Efficient design for MPI asynchronous progress without dedicated resources.
Parallel Comput., 2019

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow.
CoRR, 2019

CCF THPC inaugural issue editorial.
CCF Trans. High Perform. Comput., 2019

Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences.
Proceedings of the High Performance Computing, 2019

Design and Evaluation of Shared Memory Communication Benchmarks on Emerging Architectures using MVAPICH2.
Proceedings of the IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2019

Leveraging Network-level parallelism with Multiple Process-Endpoints for MPI Broadcast.
Proceedings of the IEEE/ACM Third Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2019

Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera.
Proceedings of the Third IEEE/ACM Workshop on Deep Learning on Supercomputers, 2019

High performance distributed deep learning: a beginner's guide.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Introduction to HPBDC 2019.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019

FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

SimdHT-Bench: Characterizing SIMD-Aware Hash Table Designs on Emerging CPU Architectures.
Proceedings of the IEEE International Symposium on Workload Characterization, 2019

UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems.
Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

Designing Scalable and High-Performance MPI Libraries on Amazon Elastic Fabric Adapter.
Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects, 2019

SCOR-KV: SIMD-Aware Client-Centric and Optimistic RDMA-Based Key-Value Store for Emerging CPU Architectures.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

Designing a Profiling and Visualization Tool for Scalable and In-depth Analysis of High-Performance GPU Clusters.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

Design and Characterization of Shared Address Space MPI Collectives on Modern Architectures.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures.
Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 2019

2018
DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters.
IEEE Trans. Multi Scale Comput. Syst., 2018

MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU.
Parallel Comput., 2018

Networking and communication challenges for post-exascale systems.
Frontiers Inf. Technol. Electron. Eng., 2018

MR-Advisor: A comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters.
J. Parallel Distributed Comput., 2018

Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences.
CoRR, 2018

Analyzing, Modeling, and Provisioning QoS for NVMe SSDs.
Proceedings of the 11th IEEE/ACM International Conference on Utility and Cloud Computing, 2018

Cooperative rendezvous protocols for improved performance and overlap.
Proceedings of the International Conference for High Performance Computing, 2018

Efficient Asynchronous Communication Progress for MPI without Dedicated Resources.
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Multi-Threading and Lock-Free MPI RMA Based Graph Processing on KNL and POWER Architectures.
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018

Introduction to HPBDC 2018.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Accelerating TensorFlow with Adaptive RDMA-Based gRPC.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

SALaR: Scalable and Adaptive Designs for Large Message Reduction Collectives.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

High-Performance Multi-Rail Erasure Coding Library over Modern Data Center Architectures: Early Experiences.
Proceedings of the ACM Symposium on Cloud Computing, 2018

Spark-uDAPL: Cost-Saving Big Data Analytics on Microsoft Azure Cloud with RDMA Networks.
Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2018

2017
A Comprehensive Study of MapReduce Over Lustre for Intermediate Data Placement and Shuffle Strategies on HPC Clusters.
IEEE Trans. Parallel Distributed Syst., 2017

Scalable and Distributed Key-Value Store-based Data Management Using RDMA-Memcached.
IEEE Data Eng. Bull., 2017

Stampede 2: The Evolution of an XSEDE Supercomputer.
Proceedings of the Practice and Experience in Advanced Research Computing 2017: Sustainability, 2017

Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand.
Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2017

Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds?
Proceedings of the 10th International Conference on Utility and Cloud Computing, 2017

HPC Meets Cloud: Building Efficient Clouds for HPC, Big Data, and Deep Learning Middleware and Applications.
Proceedings of the 10th International Conference on Utility and Cloud Computing, 2017

Designing Dynamic and Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation and Communication.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

Scalable reduction collectives with data partitioning-based multi-leader design.
Proceedings of the International Conference for High Performance Computing, 2017

An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures.
Proceedings of the Machine Learning on HPC Environments, 2017

MPI performance engineering with the MPI tool interface: the integration of MVAPICH and TAU.
Proceedings of the 24th European MPI Users' Group Meeting, 2017

S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Exploiting and Evaluating OpenSHMEM on KNL Architecture.
Proceedings of the OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, 2017

High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Introduction to HPBDC Workshop.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017

High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads.
Proceedings of the 37th IEEE International Conference on Distributed Computing Systems, 2017

Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-Capable Networks.
Proceedings of the 25th IEEE Annual Symposium on High-Performance Interconnects, 2017

Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand.
Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

Kernel-Assisted Communication Engine for MPI on Emerging Manycore Processors.
Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

MPI-LiFE: Designing High-Performance Linear Fascicle Evaluation of Brain Connectome with MPI.
Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

A Scalable Network-Based Performance Analysis Tool for MPI on Large-Scale HPC Systems.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Contention-Aware Kernel-Assisted MPI Collectives for Multi-/Many-Core Systems.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Swift-X: Accelerating OpenStack Swift with RDMA for Building an Efficient HPC Cloud.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017

NVMD: Non-volatile memory assisted design for accelerating MapReduce and DAG execution frameworks on HPC systems.
Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

Performance characterization and acceleration of big data workloads on OpenPOWER system.
Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

Characterizing and accelerating indexing techniques on distributed ordered tables.
Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka.
Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, 2017

Building Efficient HPC Cloud with SR-IOV-Enabled InfiniBand: The MVAPICH2 Approach.
Proceedings of the Research Advances in Cloud Computing, 2017

2016
Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters.
J. Supercomput., 2016

CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016

Experiences and Benefits of Running RDMA Hadoop and Spark on SDSC Comet.
Proceedings of the XSEDE16 Conference on Diversity, 2016

INAM2: InfiniBand Network Analysis and Monitoring with MPI.
Proceedings of the High Performance Computing - 31st International Conference, 2016

Can Non-volatile Memory Benefit MapReduce Applications on HPC Clusters?
Proceedings of the 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems, 2016

Designing MPI library with on-demand paging (ODP) of infiniband: challenges and benefits.
Proceedings of the International Conference for High Performance Computing, 2016

OpenSHMEM Non-blocking Data Movement Operations with MVAPICH2-X: Early Experiences.
Proceedings of the 2016 PGAS Applications Workshop, 2016

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.
Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016

MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning.
Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

Designing high performance communication runtime for GPU managed memory: early experiences.
Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2016

Performance Characterization of Hypervisor- and Container-Based Virtualization for HPC on SR-IOV Enabled InfiniBand Clusters.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

HPBDC Introduction and Committees.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

High Performance Design for HDFS with Byte-Addressability of NVM and RDMA.
Proceedings of the 2016 International Conference on Supercomputing, 2016

High Performance MPI Library for Container-Based HPC Cloud on InfiniBand Clusters.
Proceedings of the 45th International Conference on Parallel Processing, 2016

System-Level Scalable Checkpoint-Restart for Petascale Computing.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

Enabling Performance Efficient Runtime Support for Hybrid MPI+UPC++ Programming Models.
Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, 2016

Mizan-RMA: Accelerating Mizan Graph Processing Framework with MPI RMA.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

Slurm-V: Extending Slurm for Building Efficient HPC Cloud with SR-IOV and IVShmem.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

Adaptive and Dynamic Design for MPI Tag Matching.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Impact of HPC Cloud Networking Technologies on Accelerating Hadoop RPC and HBase.
Proceedings of the 2016 IEEE International Conference on Cloud Computing Technology and Science, 2016

Designing Virtualization-Aware and Automatic Topology Detection Schemes for Accelerating Hadoop on SR-IOV-Enabled Clouds.
Proceedings of the 2016 IEEE International Conference on Cloud Computing Technology and Science, 2016

Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters.
Proceedings of the 2016 IEEE International Conference on Cloud Computing Technology and Science, 2016

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

SHMEMPMI - Shared Memory Based PMI for Improved Performance and Scalability.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

Boldio: A hybrid and resilient burst-buffer over Lustre for accelerating big data I/O.
Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), 2016

High-performance design of Apache Spark with RDMA and its benefits on various workloads.
Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), 2016

Efficient data access strategies for Hadoop and Spark on HPC cluster with heterogeneous storage.
Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), 2016

Performance characterization of Hadoop workloads on SR-IOV-enabled virtualized InfiniBand clusters.
Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, 2016

2015
Accelerating Big Data Processing on Modern Clusters.
Proceedings of the 1st Workshop on Performance Analysis of Big Data Systems, 2015

Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.
Proceedings of the High Performance Computing - 30th International Conference, 2015

A case for application-oblivious energy-efficient MPI runtime.
Proceedings of the International Conference for High Performance Computing, 2015

GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015

Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

Scalable Out-of-core OpenSHMEM Library for HPC.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015

Can RDMA benefit online data processing workloads on memcached and MySQL?
Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software, 2015

High-Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Accelerating I/O Performance of Big Data Analytics on HPC Clusters through RDMA-Based Key-Value Store.
Proceedings of the 44th International Conference on Parallel Processing, 2015

Impact of InfiniBand DC Transport Protocol on Energy Consumption of All-to-All Collective Algorithms.
Proceedings of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015

Offloaded GPU Collectives Using CORE-Direct and CUDA Capabilities on InfiniBand Clusters.
Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

High Performance OpenSHMEM Strided Communication Support with InfiniBand UMR.
Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

High-Performance and Scalable Design of MPI-3 RMA on Xeon Phi Clusters.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015

High Performance MPI Datatype Support with User-Mode Memory Registration: Challenges, Designs, and Benefits.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

Non-Blocking PMI Extensions for Fast MPI Startup.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS.
Proceedings of the Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2015

Benchmarking key-value stores on high-performance storage and interconnects for web-scale workloads.
Proceedings of the 2015 IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, October 29, 2015

Performance characterization and acceleration of in-memory file systems for Hadoop and Spark applications on HPC clusters.
Proceedings of the 2015 IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, October 29, 2015

2014
GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation.
IEEE Trans. Parallel Distributed Syst., 2014

A Micro-benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks.
Proceedings of the Big Data Benchmarks, Performance Optimization, and Emerging Hardware, 2014

Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences.
Proceedings of the Supercomputing - 29th International Conference, 2014

Understanding the Memory-Utilization of MPI Libraries: Challenges and Designs in Implementing the MPI_T Interface.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

PMI Extensions for Scalable MPI Startup.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

Initial study of multi-endpoint runtime for MPI+OpenMP hybrid programming model on multi-core systems.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Scalable MiniMD Design with Hybrid MPI and OpenSHMEM.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, 2014

High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Optimizing Collective Communication in UPC.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects.
Proceedings of the 2014 International Conference on Supercomputing, 2014

Performance Modeling for RDMA-Enhanced Hadoop MapReduce.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Designing Topology-Aware Communication Schedules for Alltoall Operations in Large InfiniBand Clusters.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Message from the general co-chairs IEEE ICPADS 2014.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

Wide-area overlay networking to manage science DMZ accelerated flows.
Proceedings of the International Conference on Computing, Networking and Communications, 2014

MIC-Check: a distributed check pointing framework for the Intel many integrated cores architecture.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

SOR-HDFS: a SEDA-based approach to maximize overlapping in RDMA-enhanced HDFS.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

Accelerating Spark with RDMA for Big Data Processing: Early Experiences.
Proceedings of the 22nd IEEE Annual Symposium on High-Performance Interconnects, 2014

High performance MPI library over SR-IOV enabled InfiniBand clusters.
Proceedings of the 21st International Conference on High Performance Computing, 2014

A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on InfiniBand clusters.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Can Inter-VM Shmem Benefit MPI Applications on SR-IOV Based Virtualized InfiniBand Clusters?
Proceedings of the Euro-Par 2014 Parallel Processing, 2014

MapReduce over Lustre: Can RDMA-Based Approach Benefit?
Proceedings of the Euro-Par 2014 Parallel Processing, 2014

Scalable Graph500 design with MPI-3 RMA.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

High performance OpenSHMEM for Xeon Phi clusters: Extensions, runtime designs and application co-design.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

In-memory I/O and replication for HDFS with Memcached: Early experiences.
Proceedings of the 2014 IEEE International Conference on Big Data (IEEE BigData 2014), 2014

2013
Redesigning MPI shared memory communication for large multi-core architecture.
Comput. Sci. Res. Dev., 2013

A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks.
Proceedings of the Advancing Big Data Benchmarks, 2013

Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

MetaData persistence using storage class memory: experiences with flash-backed DRAM.
Proceedings of the 1st Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads, 2013

MVAPICH-PRISM: a proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters.
Proceedings of the International Conference for High Performance Computing, 2013

Efficient and truly passive MPI-3 RMA using InfiniBand atomics.
Proceedings of the 20th European MPI Users' Group Meeting, 2013

High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Evaluation of Energy Characteristics of MPI Communication Primitives with RAPL.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Extending OpenSHMEM for GPU Computing.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

MIC-RO: enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand.
Proceedings of the International Conference on Supercomputing, 2013

Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

High-Performance Design of Hadoop RPC with RDMA over InfiniBand.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

A Novel Functional Partitioning Approach to Design High-Performance MPI-3 Non-blocking Alltoallv Collective on Multi-core Systems.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

A 1 PB/s file system to checkpoint three million MPI tasks.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters.
Proceedings of the IEEE 21st Annual Symposium on High-Performance Interconnects, 2013

Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Proceedings of the IEEE 21st Annual Symposium on High-Performance Interconnects, 2013

Tutorials.
Proceedings of the IEEE 21st Annual Symposium on High-Performance Interconnects, 2013

Design of network topology aware scheduling services for large InfiniBand clusters.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

Does RDMA-based enhanced Hadoop MapReduce need a new performance model?
Proceedings of the ACM Symposium on Cloud Computing, SOCC '13, 2013

Efficient Intra-node Communication on Intel-MIC Clusters.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

2012
A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters.
Proceedings of the Specifying Big Data Benchmarks, 2012

Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

High performance RDMA-based design of HDFS over InfiniBand.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

OMB-GPU: A Micro-Benchmark Suite for Evaluating MPI Libraries on GPU Clusters.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Understanding the communication characteristics in HBase: What are the fundamental bottlenecks?
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

High-Performance Design of HBase with RDMA over InfiniBand.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Congestion avoidance on manycore high performance computing systems.
Proceedings of the International Conference on Supercomputing, 2012

SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks.
Proceedings of the 41st International Conference on Parallel Processing, 2012

Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation.
Proceedings of the 41st International Conference on Parallel Processing, 2012

Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems.
Proceedings of the IEEE 20th Annual Symposium on High-Performance Interconnects, 2012

A Scalable InfiniBand Network Topology-Aware Performance Analysis Tool for MPI.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Minimizing Network Contention in InfiniBand Clusters with a QoS-Aware Data-Staging Framework.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Can Network-Offload Based Non-blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?
Proceedings of the 2012 IEEE International Conference on Cluster Computing Workshops, 2012

Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

2011
Collective Communication, Network Support For.
Encyclopedia of Parallel Computing, 2011

InfiniBand.
Encyclopedia of Parallel Computing, 2011

MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters.
Comput. Sci. Res. Dev., 2011

High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT.
Comput. Sci. Res. Dev., 2011

Codesign for InfiniBand Clusters.
Computer, 2011

Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters Using Shared Memory Backed Windows.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart.
Proceedings of the International Conference on Parallel Processing, 2011

Memcached Design on High Performance RDMA Capable Interconnects.
Proceedings of the International Conference on Parallel Processing, 2011

Beyond block I/O: Rethinking traditional storage primitives.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL.
Proceedings of the IEEE 19th Annual Symposium on High Performance Interconnects, 2011

Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures.
Proceedings of the 18th International Conference on High Performance Computing, 2011

Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefit.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

Can a Decentralized Metadata Service Layer Benefit Parallel Filesystems?
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

High Performance Pipelined Process Migration with RDMA.
Proceedings of the 11th IEEE/ACM International Symposium on Cluster, 2011

2010
Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems.
Comput. Sci. Res. Dev., 2010

Scalable Earthquake Simulation on Petascale Supercomputers.
Proceedings of the Conference on High Performance Computing Networking, 2010

Unifying UPC and MPI runtimes: experience with MVAPICH.
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, 2010

Designing high-performance and resilient message passing on InfiniBand.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: Case studies with Scatter and Gather.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application.
Proceedings of the 24th International Conference on Supercomputing, 2010

High Performance Design and Implementation of Nemesis Communication Layer for Two-Sided and One-Sided MPI Semantics in MVAPICH2.
Proceedings of the 39th International Conference on Parallel Processing, 2010

Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-core InfiniBand Clusters.
Proceedings of the 39th International Conference on Parallel Processing, 2010

Designing Power-Aware Collective Communication Algorithms for InfiniBand Clusters.
Proceedings of the 39th International Conference on Parallel Processing, 2010

Design and Evaluation of Generalized Collective Communication Primitives with Overlap Using ConnectX-2 Offload Engine.
Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010

Designing High-End Computing Systems with InfiniBand and High-Speed Ethernet.
Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010

RDMA-Based Job Migration Framework for MPI over InfiniBand.
Proceedings of the 2010 IEEE International Conference on Cluster Computing, 2010

High Performance Data Transfer in Grid Environment Using GridFTP over InfiniBand.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

An MPI-Stream Hybrid Programming Model for Computational Clusters.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

2009
IPDPS 2007: Comments from the Guest Editor.
J. Parallel Distributed Comput., 2009

ProOnE: a general-purpose protocol onload engine for multi- and many-core architectures.
Comput. Sci. Res. Dev., 2009

Topology agnostic hot-spot avoidance with InfiniBand.
Concurr. Comput. Pract. Exp., 2009

Impact of Node Level Caching in MPI Job Launch Mechanisms.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2009

TupleQ: Fully-asynchronous and zero-copy MPI over InfiniBand.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Designing multi-leader-based Allgather algorithms for multi-core clusters.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand.
Proceedings of the ICPPW 2009, 2009

Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems.
Proceedings of the ICPP 2009, 2009

Designing Efficient FTP Mechanisms for High Performance Data-Transfer over InfiniBand.
Proceedings of the ICPP 2009, 2009

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems.
Proceedings of the ICPP 2009, 2009

Designing Next Generation Clusters: Evaluation of InfiniBand DDR/QDR on Intel Computing Platforms.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009

Tutorial: Designing High-End Computing Systems with InfiniBand and 10-Gigabit Ethernet.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009

Tutorial: InfiniBand and 10-Gigabit Ethernet for Dummies.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009

Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture.
Proceedings of the 16th International Conference on High Performance Computing, 2009

An efficient hardware-software approach to network fault tolerance with InfiniBand.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

RDMA over Ethernet - A preliminary study.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Design alternatives for implementing fence synchronization in MPI-2 one-sided communication for InfiniBand clusters.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Reducing network contention with mixed workloads on modern multicore clusters.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Natively Supporting True One-Sided Communication in.
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009

2008
Lock-Free Asynchronous Rendezvous Design for MPI Point-to-Point Communication.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Designing passive synchronization for MPI-2 one-sided communication to maximize overlap.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Scaling alltoall collective on multi-core systems.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

MVAPICH-Aptus: Scalable high-performance multi-transport MPI over InfiniBand.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Can software reliability outperform hardware reliability on high performance interconnects?: a case study with MPI over InfiniBand.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

IMCa: A High Performance Caching Front-End for GlusterFS on InfiniBand.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Performance of HPC Middleware over InfiniBand WAN.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Designing an Efficient Kernel-Level and User-Level Hybrid Approach for MPI Intra-Node Communication on Multi-Core Systems.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Performance Analysis and Evaluation of PCIe 2.0 and Quad-Data Rate InfiniBand.
Proceedings of the 16th Annual IEEE Symposium on High Performance Interconnects (HOTI 2008), 2008

ScELA: Scalable and Extensible Launching Architecture for Clusters.
Proceedings of the High Performance Computing, 2008

Designing a High-Performance Clustered NAS: A Case Study with pNFS over RDMA on InfiniBand.
Proceedings of the High Performance Computing, 2008

Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet.
Proceedings of the High Performance Computing, 2008

Designing next generation clusters with InfiniBand and 10GE/iWARP: Opportunities and challenges.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Scalable MPI design over InfiniBand using eXtended Reliable Connection.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Efficient one-copy MPI shared memory communication in Virtual Machines.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Optimized Distributed Data Sharing Substrate in Multi-core Commodity Clusters: A Comprehensive Study with Applications.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

MPI Collectives on Modern Multicore Clusters: Performance Optimizations and Communication Characteristics.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

Advanced RDMA-Based Admission Control for Modern Data-Centers.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

2007
Nomad: migrating OS-bypass networks in virtual machines.
Proceedings of the 3rd International Conference on Virtual Execution Environments, 2007

Virtual machine aware communication libraries for high performance computing.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

pNFS/PVFS2 over InfiniBand: early experiences.
Proceedings of the 2nd International Petascale Data Storage Workshop (PDSW '07), 2007

Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

MPI-2 One-Sided Usage and Implementation for Read Modify Write Operations: A Case Study with HPCC.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

On using connection-oriented vs. connection-less transport for performance and scalability of collective and one-sided operations: trade-offs and impact.
Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007

Benefits of I/O Acceleration Technology (I/OAT) in Clusters.
Proceedings of the 2007 IEEE International Symposium on Performance Analysis of Systems and Software, 2007

Automatic Path Migration over InfiniBand: Early Experiences.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

High Performance MPI on IBM 12x InfiniBand Architecture.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study with I/OAT.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters.
Proceedings of the 21st Annual International Conference on Supercomputing, 2007

Designing NFS with RDMA for Security, Performance and Scalability.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

High Performance MPI over iWARP: Early Experiences.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

Performance Analysis and Evaluation of Mellanox ConnectX InfiniBand Architecture with Multi-Core Platforms.
Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, 2007

Efficient asynchronous memory copy operations on multi-core systems and I/OAT.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Designing high-end computing systems with InfiniBand and 10-Gigabit Ethernet iWARP.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Zero-copy protocol for MPI using InfiniBand unreliable datagram.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Lightweight kernel-level primitives for high-performance MPI intra-node communication over multi-core systems.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

High performance virtual machine migration with RDMA over modern interconnects.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

2006
Bridging the Ethernet-Ethernot Performance Gap.
IEEE Micro, 2006

NIC-based reduction algorithms for large-scale clusters.
Int. J. High Perform. Comput. Netw., 2006

High Performance Remote Memory Access Communication: The ARMCI Approach.
Int. J. High Perform. Comput. Appl., 2006

High Performance VMM-Bypass I/O in Virtual Machines.
Proceedings of the 2006 USENIX Annual Technical Conference, 2006

Scalable systems software - A software based approach for providing network fault tolerance in clusters with uDAPL interface: MPI level design and performance evaluation.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

MPI and communication - High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Panel: Data intensive computing.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Efficient Shared Memory and RDMA Based Design for MPI_Allgather over InfiniBand.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2006

RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006

Benefits of high speed interconnects to cluster file systems: a case study with Lustre.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Adaptive connection management for scalable MPI over InfiniBand.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Shared receive queue based scalable MPI design for InfiniBand clusters.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Efficient SMP-aware MPI-level broadcast over InfiniBand's hardware multicast.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Designing next generation data-centers with advanced communication protocols and systems services.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Asynchronous zero-copy communication for synchronous sockets in the sockets direct protocol (SDP) over InfiniBand.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

A case for high performance computing with virtual machines.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

High Performance Block I/O for Global File System (GFS) with InfiniBand RDMA.
Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), 2006

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand.
Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), 2006

NemC: A Network Emulator for Cluster-of-Clusters.
Proceedings of the 15th International Conference On Computer Communications and Networks, 2006

Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand.
Proceedings of the 14th IEEE Symposium on High-Performance Interconnects, 2006

DDSS: A Low-Overhead Distributed Data Sharing Substrate for Cluster-Based Data-Centers over Modern Interconnects.
Proceedings of the High Performance Computing, 2006

Exploiting RDMA operations for Providing Efficient Fine-Grained Resource Monitoring in Cluster-based Servers.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Designing High Performance and Scalable MPI Intra-node Communication Support for Clusters.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Designing Efficient Cooperative Caching Schemes for Multi-Tier Data-Centers over RDMA-enabled Networks.
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

Design of High Performance MVAPICH2: MPI2 over InfiniBand.
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

MPI over uDAPL: Can High Performance and Portability Exist Across Architectures?
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

2005
Evaluating InfiniBand Performance with PCI Express.
IEEE Micro, 2005

Exploiting NIC architectural support for enhancing IP-based protocols on high-performance networks.
J. Parallel Distributed Comput., 2005

Selective preemption strategies for parallel job scheduling.
Int. J. High Perform. Comput. Netw., 2005

High Performance Broadcast Support in LA-MPI Over Quadrics.
Int. J. High Perform. Comput. Appl., 2005

Designing Zero-Copy Message Passing Interface Derived Datatype Communication Over InfiniBand: Alternative Approaches and Performance Evaluation.
Int. J. High Perform. Comput. Appl., 2005

Efficient Hardware Multicast Group Management for Multiple MPI Communicators over InfiniBand.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2005

Design Alternatives and Performance Trade-Offs for Implementing MPI-2 over InfiniBand.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2005

Designing a Portable MPI-2 over Modern Interconnects Using uDAPL Interface.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2005

On the provision of prioritization and soft QoS in dynamically reconfigurable shared data-centers over InfiniBand.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005

Design and Implementation of Open MPI over Quadrics/Elan4.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Scheduling of MPI-2 One Sided Operations over InfiniBand.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Analysis of Design Considerations for Optimizing Multi-Channel MPI over InfiniBand.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

High performance support of parallel virtual file system (PVFS2) over Quadrics.
Proceedings of the 19th Annual International Conference on Supercomputing, 2005

LiMIC: Support for High-Performance MPI Intra-node Communication on Linux Cluster.
Proceedings of the 34th International Conference on Parallel Processing (ICPP 2005), 2005

Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?
Proceedings of the 13th Annual IEEE Symposium on High Performance Interconnects (HOTIC 2005), 2005

Performance Characterization of a 10-Gigabit Ethernet TOE.
Proceedings of the 13th Annual IEEE Symposium on High Performance Interconnects (HOTIC 2005), 2005

Supporting MPI-2 One Sided Communication on Multi-rail InfiniBand Clusters: Design Challenges and Performance Benefits.
Proceedings of the High Performance Computing, 2005

High Performance RDMA Based All-to-All Broadcast for InfiniBand Clusters.
Proceedings of the High Performance Computing, 2005

Performance Evaluation of MM5 on Clusters with Modern Interconnects: Scalability and Impact.
Proceedings of the Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30, 2005

Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

Supporting iWARP Compatibility and Features for Regular Network Adapters.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

Can high performance software DSM systems designed with InfiniBand features benefit from PCI-Express?
Proceedings of the 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 2005

Architecture for caching responses with multiple dynamic dependencies in multi-tier data-centers over InfiniBand.
Proceedings of the 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 2005

2004
Microbenchmark Performance Comparison of High-Speed Cluster Interconnects.
IEEE Micro, 2004

High Performance RDMA-Based MPI Implementation over InfiniBand.
Int. J. Parallel Program., 2004

Application-bypass reduction for large-scale clusters.
Int. J. High Perform. Comput. Netw., 2004

Optimisation and performance evaluation of mechanisms for latency tolerance in remote memory access communication on clusters.
Int. J. High Perform. Comput. Netw., 2004

Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation.
Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 2004

Zero-Copy MPI Derived Datatype Communication over InfiniBand.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2004

Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2004

Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?
Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software, 2004

Efficient and Scalable Barrier over Quadrics and Myrinet with a New NIC-Based Collective Message Passing Protocol.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

High Performance Implementation of MPI Derived Datatype Communication over InfiniBand.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

Fast and Scalable MPI-Level Broadcast Using InfiniBand's Hardware Multicast Support.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

Design and Implementation of MPICH2 over InfiniBand with RDMA Support.
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), 2004

Applying MPI Derived Datatypes to the NAS Benchmarks: A Case Study.
Proceedings of the 33rd International Conference on Parallel Processing Workshops (ICPP 2004 Workshops), 2004

Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-Based Clusters.
Proceedings of the 33rd International Conference on Parallel Processing (ICPP 2004), 2004

Performance evaluation of InfiniBand with PCI Express.
Proceedings of the 12th Annual IEEE Symposium on High Performance Interconnects, 2004

Fast and Scalable Startup of MPI Programs in InfiniBand Clusters.
Proceedings of the High Performance Computing, 2004

Scalable, high-performance NIC-based all-to-all broadcast over Myrinet/GM.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

NIC-based offload of dynamic user-defined modules for Myrinet clusters.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

State of InfiniBand in designing HPC clusters, storage/file systems, and datacenters [datacenters read as data centers].
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

Efficient Barrier and Allreduce on InfiniBand clusters using multicast and adaptive algorithms.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

Towards provision of quality of service guarantees in job scheduling.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

Unifier: unifying cache management and communication buffer management for PVFS over InfiniBand.
Proceedings of the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004

Designing high performance DSM systems using InfiniBand features.
Proceedings of the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004

High performance MPI-2 one-sided communication over InfiniBand.
Proceedings of the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004

2003
Demotion-based exclusive caching through demote buffering: design and evaluations over different networks.
Proceedings of the International Workshop on Storage Network Architecture and Parallel I/Os, 2003

Scalable NIC-based Reduction on Large-scale Clusters.
Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 2003

Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics.
Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 2003

Fast and Scalable Barrier Using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 10th European PVM/MPI Users' Group Meeting, Venice, Italy, September 29, 2003

Towards NIC-based intrusion detection.
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24, 2003

QoPS: A QoS Based Scheme for Parallel Job Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2003

Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Implementing TreadMarks over GM on Myrinet: Challenges, Design Experience, and Performance Evaluation.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Efficient Collective Operations Using Remote Memory Operations on VIA-Based Clusters.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Optimizing Synchronization Operations for Remote Memory Communication Systems.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

High performance RDMA-based MPI implementation over InfiniBand.
Proceedings of the 17th Annual International Conference on Supercomputing, 2003

High Performance and Reliable NIC-Based Multicast over Myrinet/GM-2.
Proceedings of the 32nd International Conference on Parallel Processing (ICPP 2003), 2003

PVFS over InfiniBand: Design and Performance Evaluation.
Proceedings of the 32nd International Conference on Parallel Processing (ICPP 2003), 2003

QoS-Aware Middleware for Cluster-Based Servers to support Interactive and Resource-Adaptive Applications.
Proceedings of the 12th International Symposium on High-Performance Distributed Computing (HPDC-12 2003), 2003

Impact of High Performance Sockets on Data Intensive Applications.
Proceedings of the 12th International Symposium on High-Performance Distributed Computing (HPDC-12 2003), 2003

Micro-benchmark level performance comparison of high-speed cluster interconnects.
Proceedings of the 11th Annual IEEE Symposium on High Performance Interconnects, 2003

Exploiting Non-blocking Remote Memory Access Communication in Scientific Benchmarks.
Proceedings of the High Performance Computing - HiPC 2003, 10th International Conference, 2003

MIBA: A Micro-Benchmark Suite for Evaluating InfiniBand Architecture Implementations.
Proceedings of the Computer Performance Evaluations, 2003

Supporting Efficient Noncontiguous Access in PVFS over InfiniBand.
Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

Designing Next Generation Clusters with InfiniBand: Opportunities and Challenges.
Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters.
Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

Application-Bypass Broadcast in MPICH over GM.
Proceedings of the 3rd IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003

2002
HIPIQS: A High-Performance Switch Architecture Using Input Queuing.
IEEE Trans. Parallel Distributed Syst., 2002

Feature estimation for efficient streaming.
Proceedings of the IEEE/SIGGRAPH Symposium on Volume Visualization and Graphics, 2002

Active Network Interface: Opportunities and Challenges.
Proceedings of the 27th Annual IEEE Conference on Local Computer Networks (LCN 2002), 2002

MPI/IO on DAFS over VIA: Implementation and Performance Evaluation.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Can User-Level Protocols Take Advantage of Multi-CPU NICs?
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Workshop Introduction.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Protocols and Strategies for Optimizing Performance of Remote Memory Operations on Clusters.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

A Reliable Multicast Algorithm for Mobile Ad Hoc Networks.
Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02), 2002

Tutorial 2: InfiniBand Architecture and Where it is Headed.
Proceedings of the 10th Annual IEEE Symposium on High Performance Interconnects (HOTIC 2002), August 21, 2002

Impact of On-Demand Connection Management in MPI over VIA.
Proceedings of the 2002 IEEE International Conference on Cluster Computing (CLUSTER 2002), 2002

Efficient Barrier Using Remote Memory Operations on VIA-Based Clusters.
Proceedings of the 2002 IEEE International Conference on Cluster Computing (CLUSTER 2002), 2002

High Performance User Level Sockets over Gigabit Ethernet.
Proceedings of the 2002 IEEE International Conference on Cluster Computing (CLUSTER 2002), 2002

2001
Hybrid Algorithms for Complete Exchange in 2D Meshes.
IEEE Trans. Parallel Distributed Syst., 2001

Architectural Support for Efficient Multicasting in Irregular Networks.
IEEE Trans. Parallel Distributed Syst., 2001

Efficient Multicast on Irregular Switch-Based Cut-Through Networks with Up-Down Routing.
IEEE Trans. Parallel Distributed Syst., 2001

MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems.
IEEE Trans. Parallel Distributed Syst., 2001

Design Alternatives for Virtual Interface Architecture and an Implementation on IBM Netfinity NT Cluster.
J. Parallel Distributed Comput., 2001

Adaptive Routing on the New Switch Chip for IBM SP Systems.
J. Parallel Distributed Comput., 2001

EMP: zero-copy OS-bypass NIC-driven gigabit ethernet message passing.
Proceedings of the 2001 ACM/IEEE conference on Supercomputing, 2001

Efficient Multicast Algorithms for Heterogeneous Switch-based Irregular Networks of Workstations.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Performance Benefits of NIC-Based Barrier on Myrinet/GM.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Fast NIC-Based Barrier over Myrinet/GM.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

VIBe: A Micro-benchmark Suite for Evaluating Virtual Interface Architecture (VIA) Implementations.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

NIC-Based Rate Control for Proportional Bandwidth Allocation in Myrinet Clusters.
Proceedings of the 2001 International Conference on Parallel Processing, 2001

Implementing TreadMarks over VIA on Myrinet and Gigabit Ethernet: Challenges, Design Experience, and Performance Evaluation.
Proceedings of the 2001 International Conference on Parallel Processing, 2001

2000
Implementing Multidestination Worms in Switch-Based Parallel Systems: Architectural Alternatives and Their Impact.
IEEE Trans. Parallel Distributed Syst., 2000

Adaptive Routing in RS/6000 SP-Like Bidirectional Multistage Interconnection Networks.
Proceedings of the 14th International Parallel & Distributed Processing Symposium (IPDPS'00), 2000

Efficient Virtual Interface Architecture (VIA) Support for the IBM SP Switch-Connected NT Clusters.
Proceedings of the 14th International Parallel & Distributed Processing Symposium (IPDPS'00), 2000

Characterization and Enhancement of Dynamic Mapping Heuristics for Heterogeneous Systems.
Proceedings of the 2000 International Workshop on Parallel Processing, 2000

Balancing Web Server Load for Adaptable Video Distribution.
Proceedings of the 2000 International Workshop on Parallel Processing, 2000

Characterization and enhancement of Static Mapping Heuristics for Heterogeneous Systems.
Proceedings of the High Performance Computing, 2000

Can Scatter Communication Take Advantage of Multidestination Message Passing?
Proceedings of the High Performance Computing, 2000

Fast Collective Communication Algorithms for Reflective Memory Network Clusters.
Proceedings of the Network-Based Parallel Computing: Communication, 2000

Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages.
Proceedings of the Network-Based Parallel Computing: Communication, 2000

Comparison and Evaluation of Design Choices for Implementing the Virtual Interface Architecture (VIA).
Proceedings of the Network-Based Parallel Computing: Communication, 2000

1999
Multidestination Message Passing in Wormhole k-ary n-cube Networks with Base Routing Conformed Paths.
IEEE Trans. Parallel Distributed Syst., 1999

Multiple Multicast with Minimized Node Contention on Wormhole k-ary n-cube Networks.
IEEE Trans. Parallel Distributed Syst., 1999

Exploiting the Benefits of Multiple-Path Network DSM Systems: Architectural Alternatives and Performance Evaluation.
IEEE Trans. Computers, 1999

Low-Latency Message Passing on Workstation Clusters using SCRAMNet.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

All-to-All Broadcast on Switch-Based Clusters of Workstations.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

Implementing Efficient MPI on LAPI for IBM RS/6000 SP Systems: Experiences and Performance Evaluation.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations.
Proceedings of the 8th Heterogeneous Computing Workshop, 1999

Low Latency Message-Passing for Reflective Memory Networks.
Proceedings of the Network-Based Parallel Computing: Communication, 1999

1998
Efficient Broadcast and Multicast on Multistage Interconnection Networks Using Multiport Encoding.
IEEE Trans. Parallel Distributed Syst., 1998

Alleviating Consumption Channel Bottleneck in Wormhole-Routed k-ary n-Cube Systems.
IEEE Trans. Parallel Distributed Syst., 1998

Designing communication strategies for heterogeneous parallel systems.
Parallel Comput., 1998

Experiences with Software MPEG-2 Video Decompression on an SMP PC.
Proceedings of the 1998 International Conference on Parallel Processing Workshops, 1998

Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch?
Proceedings of the 1998 International Conference on Parallel Processing (ICPP '98), 1998

Impact of Adaptivity on the Behaviour of Networks of Workstations under Bursty Traffic.
Proceedings of the 1998 International Conference on Parallel Processing (ICPP '98), 1998

Efficient Collective Communication on Heterogeneous Networks of Workstations.
Proceedings of the 1998 International Conference on Parallel Processing (ICPP '98), 1998

1997
Bandwidth-Optimal Complete Exchange on Wormhole-Routed 2D/3D Torus Networks: A Diagonal-Propagation Approach.
IEEE Trans. Parallel Distributed Syst., 1997

Special Issue on Workstation Clusters and Network-Based Computing: Guest Editors' Introduction.
J. Parallel Distributed Comput., 1997

Simulation of Modern Parallel Systems: A CSIM-based Approach.
Proceedings of the 29th conference on Winter simulation, 1997

Multicasting in Irregular Networks with Cut-Through Switches Using Tree-Based Multidestination Worms.
Proceedings of the Parallel Computer Routing and Communication, 1997

Designing High-Performance Communication Subsystems: Top Five Problems to Solve and Five Problems Not to Solve During the Next Five Years (Panel).
Proceedings of the Parallel Computer Routing and Communication, 1997

Multicasting on Switch-Based Irregular Networks Using Multi-drop Path-Based Multidestination Worms.
Proceedings of the Parallel Computer Routing and Communication, 1997

How Can We Design Better Networks for DSM Systems?
Proceedings of the Parallel Computer Routing and Communication, 1997

A Reliable Hardware Barrier Synchronization Scheme.
Proceedings of the 11th International Parallel Processing Symposium (IPPS '97), 1997

Optimal Multicast with Packetization and Network Interface Support.
Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), 1997

How Much Does Network Contention Affect Distributed Shared Memory Performance?
Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), 1997

Multicast on Irregular Switch-Based Networks with Wormhole Routing.
Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA '97), 1997

Prioritized demand multiplexing (PDM): a low-latency virtual channel flow control framework for prioritized traffic.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

1996
A Trip-Based Multicasting Model in Wormhole-Routed Networks with Virtual Channels.
IEEE Trans. Parallel Distributed Syst., 1996

Designing Clustered Multiprocessor Systems under Packaging and Technological Advancements.
IEEE Trans. Parallel Distributed Syst., 1996

Benefits of Processor Clustering in Designing Large Parallel Systems: When and How?
Proceedings of IPPS '96, 1996

Hybrid Algorithms for Complete Exchange in 2D Meshes.
Proceedings of the 10th international conference on Supercomputing, 1996

Minimizing Node Contention in Multiple Multicast on Wormhole k-ary N-Cube Networks.
Proceedings of the 1996 International Conference on Parallel Processing, 1996

Reducing Cache Invalidation Overheads in Wormhole Routed DSMs Using Multidestination Message Passing.
Proceedings of the 1996 International Conference on Parallel Processing, 1996

Designing Processor-Cluster Based Systems: Interplay Between Organizations and Broadcasting Algorithms.
Proceedings of the 1996 International Conference on Parallel Processing, 1996

1995
Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms.
Future Gener. Comput. Syst., 1995

An efficient scheme for complete exchange in 2D tori.
Proceedings of IPPS '95, 1995

Global reduction in wormhole k-ary n-cube networks with multidestination exchange worms.
Proceedings of IPPS '95, 1995

1994
Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme.
Proceedings of the Parallel Computer Routing and Communication, 1994

Architectural issues in designing heterogeneous parallel systems with passive star-coupled optical interconnection.
Proceedings of the International Symposium on Parallel Architectures, 1994

Clustering and Intra-Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems.
Proceedings of the 8th International Symposium on Parallel Processing, 1994

Designing Large Hierarchical Multiprocessor Systems under Processor, Interconnection, and Packaging Advancements.
Proceedings of the 1994 International Conference on Parallel Processing, 1994

1993
Task Assignment on Distributed-Memory Systems with Adaptive Wormhole Routing.
Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, 1993

Scalable Architectures with k-ary n-Cube Cluster-c organization.
Proceedings of the Fifth IEEE Symposium on Parallel and Distributed Processing, 1993

A Trip-Based Multicasting Model for Wormhole-Routed Networks with Virtual Channels.
Proceedings of the Seventh International Parallel Processing Symposium, 1993

Barrier Synchronization in Distributed-Memory Multiprocessing Using Rendezvous Primitives.
Proceedings of the Seventh International Parallel Processing Symposium, 1993

Impact of Multiple Consumption Channels on Wormhole Routed k-ary n-cube Networks.
Proceedings of the Seventh International Parallel Processing Symposium, 1993

1991
Fast Data Manipulation in Multiprocessors Using Parallel Pipelined Memories.
J. Parallel Distributed Comput., 1991

Architectural Design of Orthogonal Multiprocessor for Multidimensional Information Processing.
J. Inf. Sci. Eng., 1991

Message Vectorization for Converting Multicomputer Programs to Shared-Memory Multiprocessors.
Proceedings of the International Conference on Parallel Processing, 1991

1990
OMP: a RISC-based multiprocessor using orthogonal-access memories and multiple spanning buses.
Proceedings of the 4th international conference on Supercomputing, 1990

Algorithm-Driven Simulation and Performance Projection of a RISC-based Orthogonal Multiprocessor.
Proceedings of the 1990 International Conference on Parallel Processing, 1990

Reconfigurable vector register windows for fast matrix computation on the orthogonal multiprocessor.
Proceedings of the Application Specific Array Processors, 1990

1989
Optical arithmetic using high-radix symbolic substitution rules.
Proceedings of the 9th Symposium on Computer Arithmetic, 1989

1988
A Parallel-Serial Binary Arbitration Scheme for Collision-Free Multi-Access Techniques.
Comput. Networks, 1988

