Scott Levy

Parallel Comput., 2021

Characterizing Memory Failures Using Benford's Law.

Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021

MiniMod: A Modular Miniapplication Benchmarking Framework for HPC.

W. Pepper Marts

Proceedings of the IEEE International Conference on Cluster Computing, 2021

pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Understanding the Effects of DRAM Correctable Error Logging at Scale.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020

The Program with a Personality: Analysis of Elk Cloner, the First Personal Computer Virus.

Jedidiah R. Crandall

CoRR, 2020

The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints.

Patrick M. Widener

Concurr. Comput. Pract. Exp., 2020

Hardware MPI message matching: Insights into MPI matching behavior to inform design.

Ryan E. Grant

Michael J. Levenhagen

Taylor L. Groves

Concurr. Comput. Pract. Exp., 2020

ALAMO: Autonomous Lightweight Allocation, Management, and Optimization.

Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

Message from the Workshop Chair.

Proceedings of the 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2020

RaDD Runtimes: Radical and Different Distributed Runtimes with SmartNICs.

Ryan E. Grant

Whit Schonbein

Proceedings of the Fourth IEEE/ACM Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, 2020

Evaluating MPI Message Size Summary Statistics.

Proceedings of the EuroMPI/USA '20: 27th European MPI Users' Group Meeting, 2020

The Case for Explicit Reuse Semantics for RDMA Communication.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Low-cost MPI Multithreaded Message Matching Benchmarking.

Whit Schonbein

W. Pepper Marts

Ryan E. Grant

Proceedings of the 22nd IEEE International Conference on High Performance Computing and Communications; 18th IEEE International Conference on Smart City; 6th IEEE International Conference on Data Science and Systems, 2020

2019

Using simulation to examine the effect of MPI message matching costs on application performance.

Parallel Comput., 2019

Mediating Data Center Storage Diversity in HPC Applications with FAODEL.

Proceedings of the High Performance Computing, 2019

Evaluating tradeoffs between MPI message matching offload hardware capacity and performance.

Proceedings of the 26th European MPI Users' Group Meeting, 2019

Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption.

Proceedings of the Euro-Par 2019: Parallel Processing Workshops, 2019

2018

Characterizing MPI matching via trace-based simulation.

Parallel Comput., 2018

Lessons learned from memory errors observed over the lifetime of Cielo.

Proceedings of the International Conference for High Performance Computing, 2018

Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Open Science on Trinity's Knights Landing Partition: An Analysis of User Job Data.

Kevin T. Pedretti

Proceedings of the 47th International Conference on Parallel Processing, 2018

Faodel: Data Management for Next-Generation Application Workflows.

Proceedings of the 9th Workshop on Scientific Cloud Computing, 2018

2017

Empress: extensible metadata provider for extreme-scale scientific simulations.

Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 2017

It's Not the Heat, It's the Humidity: Scheduling Resilience Activity at Scale.

Patrick M. Widener

Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Lifetime memory reliability data from the field.

Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 2017

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data.

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

On noise and the performance benefit of nonblocking collectives.

Int. J. High Perform. Comput. Appl., 2016

Understanding performance interference in next-generation HPC systems.

Proceedings of the International Conference for High Performance Computing, 2016

Improving application resilience to memory errors with lightweight compression.

Proceedings of the International Conference for High Performance Computing, 2016

How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms.

Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart.

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2016

Horseshoes and Hand Grenades: The Case for Approximate Coordination in Local Checkpointing Protocols.

Patrick M. Widener

Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

Improving DRAM Fault Characterization through Machine Learning.

Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

Scheduling In-Situ Analytics in Next-Generation Applications.

Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015

A study of the viability of exploiting memory content similarity to improve resilience to memory errors.

Int. J. High Perform. Comput. Appl., 2015

Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures.

Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

2014

Understanding the Effects of Communication and Coordination on Checkpointing at Scale.

Proceedings of the International Conference for High Performance Computing, 2014

Exploring the effect of noise on the performance benefit of nonblocking allreduce.

Proceedings of the 21st European MPI Users' Group Meeting, 2014

Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

2013

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale.

Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, 2013

Exploiting Content Similarity to Improve Memory Performance in Large-Scale High-Performance Computing Systems.

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Evaluating the feasibility of using memory content similarity to improve system resilience.

Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers, 2013

Using unreliable virtual hardware to inject errors in extreme-scale systems.

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale, 2013

Asking the Right Questions: Benchmarking Fault-Tolerant Extreme-Scale Systems.

Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

2011

Exploiting MISD Performance Opportunities in Multi-core Systems.

Donour Sizemore