Kathryn M. Mohror

Marc Snir

IEEE Trans. Parallel Distributed Syst., June, 2024

DFTracer: An Analysis-Friendly Data Flow Tracer for AI-Driven Workflows.

Proceedings of the International Conference for High Performance Computing, 2024

The Impact of Asynchronous I/O in Checkpoint-Restart Workloads.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Understanding Highly Configurable Storage for Diverse Workloads.

Proceedings of the IEEE International Conference on Cluster Computing, 2024

2023

ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems.

CoRR, 2023

IOMax: Maximizing Out-of-Core I/O Analysis Performance on HPC Systems.

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Mimir: Extending I/O Interfaces to Express User Intent for Complex Workloads in HPC.

Hariharan Devarajan

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

UnifyFS: A User-level Shared File System for Unified Access to Distributed Local Storage.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

I/O characterization and performance evaluation of large-scale storage architectures for heterogeneous workloads.

Proceedings of the IEEE International Conference on Cluster Computing, 2023

2022

The COVID-19 High-Performance Computing Consortium.

Comput. Sci. Eng., 2022

DFMan: A Graph-based Optimization of Dataflow Scheduling on High-Performance Computing Systems.

Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Extracting and characterizing I/O behavior of HPC workloads.

Hariharan Devarajan

Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021

SpotSDC: Revealing the Silent Data Corruption Propagation in High-Performance Computing Systems.

IEEE Trans. Vis. Comput. Graph., 2021

Mitigating Inter-Job Interference via Process-Level Quality-of-Service.

Lee Savoie

David K. Lowenthal

Nikhil Jain

ACM Trans. Parallel Comput., 2021

Understanding I/O Behavior in Scientific and Data-Intensive Computing (Dagstuhl Seminar 21332).

Dagstuhl Reports, 2021

Large-Scale Scientific Computing in the Fight Against COVID-19.

John West

John M. Shalf

Comput. Sci. Eng., 2021

Interactive Supercomputing With Jupyter.

Comput. Sci. Eng., 2021

It's Time to Talk About HPC Storage: Perspectives on the Past and Future.

Bradley W. Settlemyer

Comput. Sci. Eng., 2021

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale.

CoRR, 2021

Understanding the use of message passing interface in exascale proxy applications.

Nawrin Sultana

Martin Rüfenacht

Anthony Skjellum

Purushotham V. Bangalore

Ignacio Laguna

Concurr. Comput. Pract. Exp., 2021

Understanding a program's resiliency through error propagation.

Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

File System Semantics Requirements of HPC Applications.

Chen Wang

Marc Snir

Proceedings of the HPDC '21: The 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021

O(1) Communication for Distributed SGD through Two-Level Gradient Averaging.

Subhadeep Bhattacharya

Weikuan Yu

Fahim Tahmid Chowdhury

Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020

QMPI: A next generation MPI profiling interface for modern HPC platforms.

Parallel Comput., 2020

Ad Hoc File Systems for High-Performance Computing.

J. Comput. Sci. Technol., 2020

EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications.

Concurr. Comput. Pract. Exp., 2020

Extending the MPI Stages Model of Fault Tolerance.

Proceedings of the Workshop on Exascale MPI, 2020

Emulating I/O Behavior in Scientific Workflows on High Performance Computing Systems.

Proceedings of the Fifth IEEE/ACM International Parallel Data Systems Workshop, 2020

Recorder 2.0: Efficient Parallel I/O Tracing and Analysis.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

First IEEE International Workshop on High-Performance Storage (HPS).

Marc Snir

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Understanding HPC Application I/O Behavior Using System Level Statistics.

Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

2019

Failure recovery for bulk synchronous applications with MPI stages.

Parallel Comput., 2019

The MPI_T events interface: An early evaluation and overview of the interface.

Parallel Comput., 2019

A large-scale study of MPI usage in open-source HPC applications.

Proceedings of the International Conference for High Performance Computing, 2019

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale.

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning.

Proceedings of the 48th International Conference on Parallel Processing, 2019

Efficient User-Level Storage Disaggregation for Deep Learning.

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

ExaMPI: A Modern Design and Implementation to Accelerate Message Passing Interface Innovation.

Proceedings of the High Performance Computing - 6th Latin American Conference, 2019

2018

ADAPT: algorithmic differentiation applied to floating-point precision tuning.

Proceedings of the International Conference for High Performance Computing, 2018

MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

Enabling callback-driven runtime introspection via MPI_T.

Proceedings of the 25th European MPI Users' Group Meeting, 2018

DisCVar: discovering critical variables using algorithmic differentiation for transient faults.

Harshitha Menon

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems.

Proceedings of the 26th IEEE International Symposium on Modeling, 2018

A Study of Network Quality of Service in Many-Core MPI Applications.

Lee Savoie

David K. Lowenthal

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support.

Proceedings of the 8th International Workshop on Runtime and Operating Systems for Supercomputers, 2018

2017

Challenges and Opportunities of User-Level File Systems for HPC (Dagstuhl Seminar 17202).

André Brinkmann

Weikuan Yu

Dagstuhl Reports, 2017

MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

The Popper Convention: Making Reproducible Systems Evaluation Practical.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Demo abstract: PopperCI: Automated reproducibility validation.

Robert Ricci

Proceedings of the 2017 IEEE Conference on Computer Communications Workshops, 2017

PopperCI: Automated reproducibility validation.

Ivo Jimenez

Proceedings of the 2017 IEEE Conference on Computer Communications Workshops, 2017

Accelerating Big Data Infrastructure and Applications (Ongoing Collaboration).

Proceedings of the 37th IEEE International Conference on Distributed Computing Systems Workshops, 2017

2016

Standing on the Shoulders of Giants by Managing Scientific Experiments Like Software.

I Aver: Providing Declarative Experiment Specifications Facilitates the Evaluation of Computer Systems Research.

Adv. Math. Commun., 2016

Evaluating and extending user-level fault tolerance in MPI applications.

Howard Pritchard

Int. J. High Perform. Comput. Appl., 2016

Exploring the MPI tool information interface: features and capabilities.

Tanzima Z. Islam

Int. J. High Perform. Comput. Appl., 2016

An ephemeral burst-buffer file system for scientific applications.

Proceedings of the International Conference for High Performance Computing, 2016

Allowing MPI tools builders to forget about Fortran.

Søren Rasmussen

Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

MPI Sessions: Leveraging Runtime Infrastructure to Increase Scalability of Applications at Exascale.

Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, 2016

Structural Clustering: A New Approach to Support Performance Analysis at Scale.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

I/O Aware Power Shifting.

Lee Savoie

David K. Lowenthal

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Characterizing and Reducing Cross-Platform Performance Variability Using OS-Level Virtualization.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Managing I/O Interference in a Shared Burst Buffer System.

Sagar Thapaliya

Purushotham V. Bangalore

Jay F. Lofstead

Proceedings of the 45th International Conference on Parallel Processing, 2016

2015

Tackling the reproducibility problem in storage systems research with declarative experiment specifications.

Proceedings of the 10th Parallel Data Storage Workshop, 2015

The Role of Container Technology in Reproducible Computer Systems Research.

Proceedings of the 2015 IEEE International Conference on Cloud Engineering, 2015

2014

Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System.

Greg Bronevetsky

IEEE Trans. Parallel Distributed Syst., 2014

Exploring the Capabilities of the New MPI_T Interface.

Tanzima Z. Islam

Proceedings of the 21st European MPI Users' Group Meeting, 2014

FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery.

Naoya Maruyama

Satoshi Matsuoka

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

IO-Cop: Managing Concurrent Accesses to Shared Parallel File System.

Sagar Thapaliya

Purushotham V. Bangalore

Jay F. Lofstead

Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers.

Naoya Maruyama

Satoshi Matsuoka

Proceedings of the 14th IEEE/ACM International Symposium on Cluster, 2014

2013

McrEngine: A scalable checkpointing system using data-aware aggregation and compression.

Rudolf Eigenmann

Sci. Program., 2013

There goes the neighborhood: performance degradation due to nearby jobs.

Proceedings of the International Conference for High Performance Computing, 2013

HIPS Introduction.

Raghunath Rajachandrasekar

Stephen L. Olivier

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

A 1 PB/s file system to checkpoint three million MPI tasks.

Dhabaleswar K. Panda

Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Alignment-Based Metrics for Trace Comparison.

Matthias Weber

Holger Brunst

Wolfgang E. Nagel

Proceedings of the Euro-Par 2013 Parallel Processing, 2013

2012

Trace profiling: Scalable event tracing on high-end parallel systems.

Parallel Comput., 2012

Design and modeling of a non-blocking checkpointing system.

Satoshi Matsuoka

Proceedings of the SC Conference on High Performance Computing Networking, 2012

Integrated in-system storage architecture for high performance computing.

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers, 2012

Asynchronous checkpoint migration with MRNet in the Scalable Checkpoint / Restart Library.

Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2012

2010

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System.

Greg Bronevetsky

Proceedings of the Conference on High Performance Computing Networking, 2010

2009

Evaluating similarity-based trace reduction techniques for scalable performance analysis.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Scalable Event Trace Visualization.

Allan Snavely

Proceedings of the Euro-Par 2009, 2009

2007

Scalable event-based performance measurement in high-end environments.

SIGMETRICS Perform. Evaluation Rev., 2007

A study of tracing overhead on a high-performance linux cluster.

Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007

Towards Scalable Event Tracing for High End Systems.

Proceedings of the High Performance Computing and Communications, 2007

2005

Integrating Database Technology with Comparison-based Parallel Performance Diagnosis: The PerfTrack Performance Experiment Management Tool.

Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

PPerfGrid: A Grid Services-based Tool for the Exchange of Heterogeneous Parallel Performance Data.

Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

2004

Performance Tool Support for MPI-2 on Linux.
