2024
Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression.
Proceedings of the International Conference for High Performance Computing, 2024
UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture.
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024
2023
Near-Lossless MPI Tracing and Proxy Application Autogeneration.
IEEE Trans. Parallel Distributed Syst., 2023
2021
Logically Parallel Communication for Fast MPI+Threads Applications.
IEEE Trans. Parallel Distributed Syst., 2021
IEEE Trans. Parallel Distributed Syst., 2021
Translational research in the MPICH project.
J. Comput. Sci., 2021
Pilgrim: scalable and (near) lossless MPI tracing.
Proceedings of the International Conference for High Performance Computing, 2021
Lightweight preemptive user-level threads.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021
OpenSHMEM over MPI as a Performance Contender: Thorough Analysis and Optimizations.
Proceedings of the OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks, 2021
Daps: A Dynamic Asynchronous Progress Stealing Model for MPI Communication.
Proceedings of the IEEE International Conference on Cluster Computing, 2021
RMACXX: An Efficient High-Level C++ Interface over MPI-3 RMA.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021
2020
Analyzing the Performance Trade-Off in Implementing User-Level Threads.
IEEE Trans. Parallel Distributed Syst., 2020
Memory-Efficient and Skew-Tolerant MapReduce Over MPI for Supercomputing Systems.
IEEE Trans. Parallel Distributed Syst., 2020
Analysis of Threading Libraries for High Performance Computing.
IEEE Trans. Computers, 2020
CAB-MPI: exploring interprocess work-stealing towards balanced MPI communication.
Proceedings of the International Conference for High Performance Computing, 2020
Implementing Flexible Threading Support in Open MPI.
Proceedings of the Workshop on Exascale MPI, 2020
How I learned to stop worrying about user-visible endpoints and love MPI.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
Probing the Underlying Implementation Mechanisms of SW26010.
Proceedings of the 22nd IEEE International Conference on High Performance Computing and Communications; 18th IEEE International Conference on Smart City; 6th IEEE International Conference on Data Science and Systems, 2020
2019
Scalable Deep Learning via I/O Analysis and Optimization.
ACM Trans. Parallel Comput., 2019
Guest Editor's Introduction: P2S2: SI 2016.
Parallel Comput., 2019
International workshop on programming models and applications for multicores and manycores (PMAM 2018).
Parallel Comput., 2019
Foreword to the special issue for the Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2 2017).
Parallel Comput., 2019
Special issue on the message passing interface.
Parallel Comput., 2019
Characterization of Power Usage and Performance in Data-Intensive Applications Using MapReduce over MPI.
Proceedings of the Parallel Computing: Technology Trends, 2019
Software combining to mitigate multithreaded MPI contention.
Proceedings of the ACM International Conference on Supercomputing, 2019
Optimized Execution of Parallel Loops via User-Defined Scheduling Policies.
Proceedings of the 48th International Conference on Parallel Processing, 2019
An Auto Code Generator for Stencil on SW26010.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019
BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019
2018
Dynamic Adaptable Asynchronous Progress Model for MPI RMA Multiphase Applications.
IEEE Trans. Parallel Distributed Syst., 2018
Argobots: A Lightweight Low-Level Threading and Tasking Framework.
IEEE Trans. Parallel Distributed Syst., 2018
Lock Contention Management in Multithreaded MPI.
ACM Trans. Parallel Comput., 2018
Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models.
J. Supercomput., 2018
Int. J. High Perform. Comput. Appl., 2018
On the adequacy of lightweight thread approaches for high-level parallel programming models.
Future Gener. Comput. Syst., 2018
Lessons learned from analyzing dynamic promotion for user-level threading.
Proceedings of the International Conference for High Performance Computing, 2018
Characterization of MPI usage on a production supercomputer.
Proceedings of the International Conference for High Performance Computing, 2018
Scalable Communication Endpoints for MPI+Threads Applications.
Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems, 2018
On the Power of Combiner Optimizations in MapReduce Over MPI Workflows.
Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems, 2018
Process-in-process: techniques for practical address-space sharing.
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018
K-mer Counting for Genomic Big Data.
Proceedings of the Big Data - BigData 2018, 2018
2017
Enabling scalable and accurate clustering of distributed ligand geometries on supercomputers.
Parallel Comput., 2017
Exploring versioned distributed arrays for resilience in scientific applications.
Int. J. High Perform. Comput. Appl., 2017
Special issue on programming models and applications for multicores and manycores.
Int. J. High Perform. Comput. Appl., 2017
Foreword to the Special Issue of the workshop on the seventh international workshop on programming models and applications for multicores and manycores (PMAM 2016).
Concurr. Comput. Pract. Exp., 2017
Why is MPI so slow?: analyzing the fundamental limits in implementing MPI-3.1.
Proceedings of the International Conference for High Performance Computing, 2017
Memory Compression Techniques for Network Address Management in MPI.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017
Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017
GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations.
Proceedings of the 46th International Conference on Parallel Processing, 2017
Parallel I/O Optimizations for Scalable Deep Learning.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017
Hexe: A Toolkit for Heterogeneous Memory Management.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017
Portable Topology-Aware MPI-I/O.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017
Bloomfish: A Highly Scalable Distributed K-mer Counting Framework.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017
Process-Based Asynchronous Progress Model for MPI Point-to-Point Communication.
Proceedings of the 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, 2017
Towards Scalable Deep Learning via I/O Analysis and Optimization.
Proceedings of the 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, 2017
Exploiting Common Neighborhoods to Optimize MPI Neighborhood Collectives.
Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017
GLT: A Unified API for Lightweight Thread Libraries.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017
S-Aligner: Ultrascalable Read Mapping on Sunway Taihu Light.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017
A Performance Study of UCX over InfiniBand.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017
Scalable Assembly for Massive Genomic Graphs.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017
Advanced Thread Synchronization for Multithreaded MPI Implementations.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017
2016
MPI-ACC: Accelerator-Aware MPI for Scientific Applications.
IEEE Trans. Parallel Distributed Syst., 2016
Survey of Techniques and Architectures for Designing Energy-Efficient Data Centers.
IEEE Syst. J., 2016
Special Issue on Cluster Computing.
Parallel Comput., 2016
A data-oriented profiler to assist in data partitioning and distribution for heterogeneous memory in HPC.
Parallel Comput., 2016
Special Issue on Parallel Programming Models and Systems Software for High-End Computing.
Parallel Comput., 2016
MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL.
Parallel Comput., 2016
Performance analysis of data intensive cloud systems based on data management and replication: a survey.
Distributed Parallel Databases, 2016
An implementation and evaluation of the MPI 3.0 one-sided communication interface.
Concurr. Comput. Pract. Exp., 2016
Programming models and applications for multicores and manycores.
Concurr. Comput. Pract. Exp., 2016
Work stealing for GPU-accelerated parallel programs in a global address space framework.
Concurr. Comput. Pract. Exp., 2016
A survey and taxonomy on energy efficient resource allocation techniques for cloud computing systems.
Computing, 2016
Scaling FMM with Data-Driven OpenMP Tasks on Multicore Architectures.
Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016
Scalability Challenges in Current MPI One-Sided Implementations.
Proceedings of the 15th International Symposium on Parallel and Distributed Computing, 2016
SWAP-Assembler 2: Optimization of De Novo Genome Assembler at Extreme Scale.
Proceedings of the 45th International Conference on Parallel Processing, 2016
One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental.
Proceedings of the 45th International Conference on Parallel Processing, 2016
Compiler-Assisted Overlapping of Communication and Computation in MPI Applications.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016
A Review of Lightweight Thread Approaches for High Performance Computing.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016
2015
Scalable Network Communication Using Unreliable RDMA.
Proceedings of the Handbook on Data Centers, 2015
Remote Memory Access Programming in MPI-3.
ACM Trans. Parallel Comput., 2015
Scalable connectionless RDMA over unreliable datagrams.
Parallel Comput., 2015
Introduction Special Section of ICCCN 2014 Conference.
Comput. Commun., 2015
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading.
Proceedings of the International Conference for High Performance Computing, 2015
VOCL-FT: introducing techniques for efficient soft error coprocessor recovery.
Proceedings of the International Conference for High Performance Computing, 2015
Fault tolerant MapReduce-MPI for HPC clusters.
Proceedings of the International Conference for High Performance Computing, 2015
MPI+Threads: runtime contention and remedies.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015
Casper: An Asynchronous Progress Model for MPI RMA on Many-Core Architectures.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015
HiCOMB 2015 Keynote and Invited Talks.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015
AsHES Introduction and Committees.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015
Versioning Architectures for Local and Global Memory.
Proceedings of the 21st IEEE International Conference on Parallel and Distributed Systems, 2015
Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience.
Proceedings of the International Conference on Computational Science, 2015
MPI+ULT: Overlapping Communication and Computation with User-Level Threads.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015
Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
Empirical Comparison of Three Versioning Architectures.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
Flexible Error Recovery Using Versions in Global View Resilience.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Runtime Support for Irregular Computation in MPI-Based Applications.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Accurate Scoring of Drug Conformations at the Extreme Scale.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Scaling NWChem with Efficient and Portable Asynchronous Communication in MPI RMA.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Techniques for Enabling Highly Efficient Message Passing on Many-Core Architectures.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Implementation and Evaluation of MPI Nonblocking Collective I/O.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Toward Implementing Robust Support for Portals 4 Networks in MPICH.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Understanding Data Access Patterns Using Object-Differentiated Memory Profiling.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
SWAP-Assembler 2: Scalable Genome Assembler towards Millions of Cores - Practice and Experience.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Lessons Learned Implementing User-Level Failure Mitigation in MPICH.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
2014
Processing MPI Derived Datatypes on Noncontiguous GPU-Resident Data.
IEEE Trans. Parallel Distributed Syst., 2014
Special issue on programming models and applications for multicores and manycores - Guest Editors' Introduction.
Parallel Comput., 2014
Addressing failures in exascale computing.
Int. J. High Perform. Comput. Appl., 2014
Enabling communication concurrency through flexible MPI endpoints.
Int. J. High Perform. Comput. Appl., 2014
SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores.
BMC Bioinform., 2014
Nonblocking Epochs in MPI One-Sided Communication.
Proceedings of the International Conference for High Performance Computing, 2014
MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications.
Proceedings of the International Conference for High Performance Computing, 2014
Simplifying the recovery model of user-level failure mitigation.
Proceedings of the 2014 Workshop on Exascale MPI, 2014
Implementing the MPI-3.0 Fortran 2008 Binding.
Proceedings of the 21st European MPI Users' Group Meeting, 2014
Portable, MPI-interoperable Coarray Fortran.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014
MT-MPI: multithreaded MPI for many-core environments.
Proceedings of the 2014 International Conference on Supercomputing, 2014
A Framework for Tracking Memory Accesses in Scientific Applications.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014
WorkQ: A many-core producer/consumer execution model applied to PGAS computations.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014
Toward the efficient use of multiple explicitly managed memory subsystems.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014
2013
Designing energy efficient communication runtime systems: a view from PGAS models.
J. Supercomput., 2013
Guest Editors' introduction.
J. Supercomput., 2013
A survey on resource allocation in high performance distributed computing systems.
Parallel Comput., 2013
Special issue on programming models, systems software, and tools for High-End Computing.
Parallel Comput., 2013
Guest Editors' Introduction: Special Issue on Applications for the Heterogeneous Computing Era.
Int. J. High Perform. Comput. Appl., 2013
Guest editors' introduction: Special issue on Cluster, Grid, and Cloud Computing.
Future Gener. Comput. Syst., 2013
MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory.
Computing, 2013
An overview of energy efficiency techniques in cluster computing systems.
Clust. Comput., 2013
Container-Based Job Management for Fair Resource Sharing.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013
Analysis of topology-dependent MPI performance on Gemini networks.
Proceedings of the 20th European MPI Users' Group Meeting, 2013
Enabling MPI interoperability through flexible communication endpoints.
Proceedings of the 20th European MPI Users' Group Meeting, 2013
Synchronization and Ordering Semantics in Hybrid MPI+GPU Programming.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013
Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions.
Proceedings of the 42nd International Conference on Parallel Processing, 2013
Enhancing Performance Portability of MPI Applications through Annotation-Based Transformations.
Proceedings of the 42nd International Conference on Parallel Processing, 2013
MPI-Interoperable Generalized Active Messages.
Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems, 2013
Online Performance Projection for Clusters with Heterogeneous GPUs.
Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems, 2013
pVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments.
Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems, 2013
On the efficacy of GPU-integrated MPI for scientific applications.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013
On the Reproducibility of MPI Reduction Operations.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013
Topic 15: GPU and Accelerator Computing - (Introduction).
Proceedings of the Euro-Par 2013 Parallel Processing, 2013
Optimization Strategies for MPI-Interoperable Active Messages.
Proceedings of the IEEE 11th International Conference on Dependable, 2013
Toward Asynchronous and MPI-Interoperable Active Messages.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013
Optimizing Burrows-Wheeler Transform-Based Sequence Alignment on Multicore Architectures.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013
2012
Applications for the Heterogeneous Computing Era.
Int. J. High Perform. Comput. Appl., 2012
Leveraging MPI's One-Sided Communication Interface for Shared-Memory Programming.
Proceedings of the Recent Advances in the Message Passing Interface, 2012
Efficient Multithreaded Context ID Allocation in MPI.
Proceedings of the Recent Advances in the Message Passing Interface, 2012
Efficient Intranode Communication in GPU-Accelerated Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
DMA-Assisted, Intranode Communication in GPU Accelerated Systems.
Proceedings of the 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, 2012
MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-based Systems.
Proceedings of the 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, 2012
Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012
Transparent Accelerator Migration in a Virtualized GPU Environment.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012
2011
MPI on Millions of Cores.
Parallel Process. Lett., 2011
Special Issue on Programming Models, Software and Tools for High-End Computing.
Int. J. High Perform. Comput. Appl., 2011
Special Issue on Programming Models and Systems Software Support for High-End Computing Applications.
Int. J. High Perform. Comput. Appl., 2011
Mapping communication layouts to network hardware characteristics on massive-scale Blue Gene systems.
Comput. Sci. Res. Dev., 2011
Poster: High-level, one-sided programming models on MPI: a case study with global arrays and NWChem.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011
Multi-core and Network Aware MPI Topology Functions.
Proceedings of the Recent Advances in the Message Passing Interface, 2011
Noncollective Communicator Creation in MPI.
Proceedings of the Recent Advances in the Message Passing Interface, 2011
Dynamic Time-Variant Connection Management for PGAS Models on InfiniBand.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011
RDMA Capable iWARP over Datagrams.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011
Building algorithmically nonstop fault tolerant MPI programs.
Proceedings of the 18th International Conference on High Performance Computing, 2011
Energy-aware hierarchical scheduling of applications in large scale data centers.
Proceedings of the 2011 International Conference on Cloud and Service Computing, 2011
2010
A Pipelined Algorithm for Large, Irregular All-Gather Problems.
Int. J. High Perform. Comput. Appl., 2010
The Importance of Non-Data-Communication Overheads in MPI.
Int. J. High Perform. Comput. Appl., 2010
Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming.
Int. J. High Perform. Comput. Appl., 2010
Global-scale distributed I/O with ParaMEDIC.
Concurr. Comput. Pract. Exp., 2010
Implementing MPI on Windows: Comparison with Common Approaches on Unix.
Proceedings of the Recent Advances in the Message Passing Interface, 2010
Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems.
Proceedings of the Recent Advances in the Message Passing Interface, 2010
PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems.
Proceedings of the Recent Advances in the Message Passing Interface, 2010
A study of hardware assisted IP over InfiniBand and its impact on enterprise data center performance.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2010
Designing High-End Computing Systems with InfiniBand and High-Speed Ethernet.
Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010
Fault-tolerant communication runtime support for data-centric programming models.
Proceedings of the 2010 International Conference on High Performance Computing, 2010
iWARP redefined: Scalable connectionless communication over high-speed Ethernet.
Proceedings of the 2010 International Conference on High Performance Computing, 2010
Designing Energy Efficient Communication Runtime Systems for Data Centric Programming Models.
Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications, 2010
Power and Performance Characterization of Computational Kernels on the GPU.
Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications, 2010
Minimizing MPI Resource Contention in Multithreaded Multicore Environments.
Proceedings of the 2010 IEEE International Conference on Cluster Computing, 2010
Hybrid parallel programming with MPI and unified parallel C.
Proceedings of the 7th Conference on Computing Frontiers, 2010
2009
ProOnE: a general-purpose protocol onload engine for multi- and many-core architectures.
Comput. Sci. Res. Dev., 2009
Toward message passing for a million processes: characterizing MPI on a massive scale Blue Gene/P.
Comput. Sci. Res. Dev., 2009
Tools and Environments for Multicore and Many-Core Architectures.
Computer, 2009
MPI on a Million Processors.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2009
GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading.
Proceedings of the ICPP 2009, 2009
Improving Resource Availability by Relaxing Network Allocation Constraints on Blue Gene/P.
Proceedings of the ICPP 2009, 2009
Evaluation of ConnectX Virtual Protocol Interconnect for Data Centers.
Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009
Understanding Network Saturation Behavior on Large-Scale Blue Gene/P Systems.
Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009
Tutorial: Designing High-End Computing Systems with Infiniband and 10-Gigabit Ethernet.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009
Tutorial: Infiniband and 10-Gigabit Ethernet for Dummies.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009
Natively Supporting True One-Sided Communication in.
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009
2008
Asymmetric interactions in symmetric multi-core systems: analysis, enhancements and evaluation.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008
Massively parallel genomic sequence search on the Blue Gene/P architecture.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008
A Simple, Pipelined Algorithm for Large, Irregular All-gather Problems.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008
Non-data-communication Overheads in MPI: Analysis on Blue Gene/P.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008
Toward Efficient Support for Multithreaded MPI Communication.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008
Semantics-based distributed I/O for mpiBLAST.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008
Impact of Network Sharing in Multi-Core Architectures.
Proceedings of the 17th International Conference on Computer Communications and Networks, 2008
Semantic-based distributed I/O with the ParaMEDIC framework.
Proceedings of the 17th International Symposium on High-Performance Distributed Computing (HPDC-17 2008), 2008
Making a Case for Proactive Flow Control in Optical Circuit-Switched Networks.
Proceedings of the High Performance Computing, 2008
Communication Analysis of Parallel 3D FFT for Flat Cartesian Meshes on Large Blue Gene Systems.
Proceedings of the High Performance Computing, 2008
Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet.
Proceedings of the High Performance Computing, 2008
Are nonblocking networks really needed for high-end-computing workloads?
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008
2007
Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007
Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007
Nonuniformly Communicating Noncontiguous Data: A Case Study with PETSc and MPI.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007
Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007
Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007
An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multicore Environments.
Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects, 2007
Designing high-end computing systems with InfiniBand and 10-Gigabit Ethernet iWARP.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007
2006
Bridging the Ethernet-Ethernot Performance Gap.
IEEE Micro, 2006
Designing next generation data-centers with advanced communication protocols and systems services.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006
Asynchronous zero-copy communication for synchronous sockets in the sockets direct protocol (SDP) over InfiniBand.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006
2005
Exploiting NIC architectural support for enhancing IP-based protocols on high-performance networks.
J. Parallel Distributed Comput., 2005
On the provision of prioritization and soft QoS in dynamically reconfigurable shared data-centers over InfiniBand.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Performance Characterization of a 10-Gigabit Ethernet TOE.
Proceedings of the 13th Annual IEEE Symposium on High Performance Interconnects (HOTIC 2005), 2005
Supporting iWARP Compatibility and Features for Regular Network Adapters.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005
Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005
Architecture for caching responses with multiple dynamic dependencies in multi-tier data-centers over InfiniBand.
Proceedings of the 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 2005
2004
Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?
Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software, 2004
Towards provision of quality of service guarantees in job scheduling.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004
2003
QoPS: A QoS Based Scheme for Parallel Job Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2003
Efficient Collective Operations Using Remote Memory Operations on VIA-Based Clusters.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003
Impact of High Performance Sockets on Data Intensive Applications.
Proceedings of the 12th International Symposium on High-Performance Distributed Computing (HPDC-12 2003), 2003
2002
High Performance User Level Sockets over Gigabit Ethernet.
Proceedings of the 2002 IEEE International Conference on Cluster Computing (CLUSTER 2002), 2002