Michela Taufer

Int. J. High Perform. Comput. Appl., 2022

Workflows Community Summit: Tightening the Integration between Computing Facilities and Scientific Workflows.

[DOI]

Rafael Ferreira da Silva

CoRR, 2022

Ubique: A New Model for Untangling Inter-task Data Dependence in Complex HPC Workflows.

[DOI]

Proceedings of the 18th IEEE International Conference on e-Science, 2022

Reproducing and Extending Analytical Performance Models of Generalized Hierarchical Scheduling.

[DOI]

Proceedings of the 18th IEEE International Conference on e-Science, 2022

Scalable Composition and Analysis Techniques for Massive Scientific Workflows.

[DOI]

Carlos Eduardo Arango Gutierrez

Brian Van Essen

Jonathan E. Allen

Felice C. Lightstone

Proceedings of the 18th IEEE International Conference on e-Science, 2022

One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC.

[DOI]

Yoonho Park

Proceedings of the 4th IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC, 2022

2021

Enabling rapid COVID-19 small molecule drug design through scalable deep learning of generative models.

[DOI]

Int. J. High Perform. Comput. Appl., 2021

A Dynamic, Hierarchical Resource Model for Converged Computing.

[DOI]

CoRR, 2021

ExaWorks: Workflows for Exascale.

[DOI]

CoRR, 2021

Workflows Community Summit: Advancing the State-of-the-art of Scientific Workflows Management Systems Research and Development.

[DOI]

Rafael Ferreira da Silva

Alvaro Vidal-Torreira

CoRR, 2021

Workflows Community Summit: Bringing the Scientific Workflows Community Together.

[DOI]

CoRR, 2021

Keeping science on keel when software moves.

[DOI]

Carlos Eduardo Arango Gutierrez

Commun. ACM, 2021

ExaWorks: Workflows for Exascale.

[DOI]

Proceedings of the 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 2021

Towards Standard Kubernetes Scheduling Interfaces for Converged Computing.

[DOI]

Claudia Misale

Daniel J. Milroy

Proceedings of the Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation, 2021

Generalizable coordination of large multiscale workflows: challenges and learnings at scale.

[DOI]

Christopher B. Stanley

Tomas Oppelstrup

Chris Neale

Sara Kokkila Schumacher

Stephen Herbein

Timothy S. Carpenter

Sandrasegaram Gnanakaran

Proceedings of the International Conference for High Performance Computing, 2021

Monitoring Large Scale Supercomputers: A Case Study with the Lassen Supercomputer.

[DOI]

Carlos Eduardo Arango Gutierrez

Nathan Besaw

Proceedings of the IEEE International Conference on Cluster Computing, 2021

It's a Scheduling Affair: GROMACS in the Cloud with the KubeFlux Scheduler.

[DOI]

Claudia Misale

Maurizio Drocco

Daniel J. Milroy

Stephen Herbein

Yoonho Park

Proceedings of the 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC, 2021

2020

Flux: Overcoming scheduling challenges for exascale workflows.

[DOI]

Thomas R. W. Scogland

Becky Springmeyer

Michela Taufer

Future Gener. Comput. Syst., 2020

ArcherGear: data race equivalencing for expeditious HPC debugging.

[DOI]

Samuel Thayer

Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

2019

Pruners.

[DOI]

Christopher M. Chambreau

Simone Atzeni

Michael Bentley

Int. J. High Perform. Comput. Appl., 2019

A three-phase workflow for general and expressive representations of nondeterminism in HPC applications.

[DOI]

Int. J. High Perform. Comput. Appl., 2019

Multi-Level Analysis of Compiler-Induced Variability and Performance Tradeoffs.

[DOI]

Michael Bentley

Ian Briggs

Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

2018

Record-and-Replay Techniques for HPC Systems: A Survey.

[DOI]

Supercomput. Front. Innov., 2018

SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs.

[DOI]

Simone Atzeni

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

PRIONN: Predicting Runtime and IO using Neural Networks.

[DOI]

Proceedings of the 47th International Conference on Parallel Processing, 2018

Thread-local concurrency: a technique to handle data race detection at programming model abstraction.

[DOI]

Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

2017

Noise Injection Techniques to Expose Subtle and Unintended Message Races.

[DOI]

Christopher M. Chambreau

Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

OpenMP Tools Interface: Synchronization Information for Data Race Detection.

[DOI]

Proceedings of the Scaling OpenMP for Exascale Performance and Portability, 2017

FLiT: Cross-platform floating-point result-consistency tester and workload.

[DOI]

Geoffrey Sawaya

Michael Bentley

Ian Briggs

Proceedings of the 2017 IEEE International Symposium on Workload Characterization, 2017

2016

Testing Infrastructure for OpenMP Debugging Interface Implementations.

[DOI]

Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016

ARCHER: Effectively Spotting Data Races in Large OpenMP Applications.

[DOI]

Simone Atzeni

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters.

[DOI]

Stephen Herbein

Don Lipari

Thomas R. W. Scogland

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

2015

Diagnosis of Performance Faults in Large Scale MPI Applications via Probabilistic Progress-Dependence Inference.

[DOI]

Saurabh Bagchi

Todd Gamblin

IEEE Trans. Parallel Distributed Syst., 2015

Debugging high-performance computing applications at massive scales.

[DOI]

Commun. ACM, 2015

Clock delta compression for scalable order-replay of non-deterministic parallel applications.

[DOI]

Proceedings of the International Conference for High Performance Computing, 2015

Lessons Learned from Implementing OMPD: A Debugging Interface for OpenMP.

[DOI]

Proceedings of the OpenMP: Heterogenous Execution and Data Movements, 2015

A Scalable Prescriptive Parallel Debugging Model.

[DOI]

Nicklas Bo Jensen

Niklas Quarfot Nielsen

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

2014

Towards providing low-overhead data race detection for large OpenMP applications.

[DOI]

Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, 2014

Accurate application progress analysis for large-scale parallel debugging.

[DOI]

Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014

Flux: A Next-Generation Resource Management Framework for Large HPC Centers.

[DOI]

Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

2013

LIBI: A framework for bootstrapping extreme scale software systems.

[DOI]

Matthew P. LeGendre

Parallel Comput., 2013

An Optimal Algorithm for Extreme Scale Job Launching.

[DOI]

Proceedings of the 12th IEEE International Conference on Trust, 2013

Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset.

[DOI]

Zvonimir Rakamaric

Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, 2013

Efficient and Scalable Retrieval Techniques for Global File Properties.

[DOI]

Michael J. Brim

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Massively parallel loading.

[DOI]

Felix Wolf

Proceedings of the International Conference on Supercomputing, 2013

2012

Beyond DVFS: A First Look at Performance under a Hardware-Enforced Power Bound.

[DOI]

Barry Rountree

David K. Lowenthal

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Probabilistic diagnosis of performance faults in large-scale parallel applications.

[DOI]

Saurabh Bagchi

Todd Gamblin

Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011

Large scale debugging of parallel tasks with AutomaDeD.

[DOI]

Todd Gamblin

Proceedings of the Conference on High Performance Computing Networking, 2011

Exascale Algorithms for Generalized MPI_Comm_split.

[DOI]

Adam Moody

Proceedings of the Recent Advances in the Message Passing Interface, 2011

2010

AutomaDeD: Automata-based debugging for dissimilar parallel tasks.

[DOI]

Greg Bronevetsky

Saurabh Bagchi

Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks, 2010

2009

Scalable temporal order analysis for large scale debugging.

[DOI]

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

2008

Lessons learned at 208K: towards debugging millions of cores.

[DOI]

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Overcoming Scalability Challenges for Tool Daemon Launching.

[DOI]

Proceedings of the 2008 International Conference on Parallel Processing, 2008

2007

Dynamic Binary Instrumentation and Data Aggregation on Large Scale Systems.

[DOI]

Steven Y. Ko

Barry Rountree

Int. J. Parallel Program., 2007

Benchmarking the Stack Trace Analysis Tool for BlueGene/L.

Proceedings of the Parallel Computing: Architectures, 2007

Stack Trace Analysis for Large Scale Debugging.

[DOI]

Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Pynamic: the Python Dynamic Benchmark.

[DOI]

John C. Gyllenhaal

Patrick Miller

Proceedings of the IEEE 10th International Symposium on Workload Characterization, 2007

2005

Scalable dynamic binary instrumentation for Blue Gene/L.

[DOI]

Andrew Bernat

Steven Y. Ko

Barry Rountree

SIGARCH Comput. Archit. News, 2005

2002

Scalable analysis techniques for microprocessor performance counter metrics.

[DOI]