2024
Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-Threaded Programs.
Proceedings of the IEEE International Conference on Cluster Computing, 2024
2023
Fluxion: A Scalable Graph-Based Resource Model for HPC Scheduling Challenges.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023
2022
An analytical performance model of generalized hierarchical scheduling.
Int. J. High Perform. Comput. Appl., 2022
Workflows Community Summit: Tightening the Integration between Computing Facilities and Scientific Workflows.
CoRR, 2022
Ubique: A New Model for Untangling Inter-task Data Dependence in Complex HPC Workflows.
Proceedings of the 18th IEEE International Conference on e-Science, 2022
Reproducing and Extending Analytical Performance Models of Generalized Hierarchical Scheduling.
Proceedings of the 18th IEEE International Conference on e-Science, 2022
Scalable Composition and Analysis Techniques for Massive Scientific Workflows.
Proceedings of the 18th IEEE International Conference on e-Science, 2022
One Step Closer to Converged Computing: Achieving Scalability with Cloud-Native HPC.
Proceedings of the 4th IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC, 2022
2021
Enabling rapid COVID-19 small molecule drug design through scalable deep learning of generative models.
Int. J. High Perform. Comput. Appl., 2021
A Dynamic, Hierarchical Resource Model for Converged Computing.
CoRR, 2021
ExaWorks: Workflows for Exascale.
CoRR, 2021
Workflows Community Summit: Advancing the State-of-the-art of Scientific Workflows Management Systems Research and Development.
CoRR, 2021
Workflows Community Summit: Bringing the Scientific Workflows Community Together.
CoRR, 2021
Keeping science on keel when software moves.
Commun. ACM, 2021
ExaWorks: Workflows for Exascale.
Proceedings of the 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 2021
Towards Standard Kubernetes Scheduling Interfaces for Converged Computing.
Proceedings of the Driving Scientific and Engineering Discoveries Through the Integration of Experiment, Big Data, and Modeling and Simulation, 2021
Generalizable coordination of large multiscale workflows: challenges and learnings at scale.
Proceedings of the International Conference for High Performance Computing, 2021
Monitoring Large Scale Supercomputers: A Case Study with the Lassen Supercomputer.
Proceedings of the IEEE International Conference on Cluster Computing, 2021
It's a Scheduling Affair: GROMACS in the Cloud with the KubeFlux Scheduler.
Proceedings of the 3rd International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC, 2021
2020
Flux: Overcoming scheduling challenges for exascale workflows.
Future Gener. Comput. Syst., 2020
ArcherGear: data race equivalencing for expeditious HPC debugging.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020
2019
Int. J. High Perform. Comput. Appl., 2019
A three-phase workflow for general and expressive representations of nondeterminism in HPC applications.
Int. J. High Perform. Comput. Appl., 2019
Multi-Level Analysis of Compiler-Induced Variability and Performance Tradeoffs.
Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019
2018
Record-and-Replay Techniques for HPC Systems: A Survey.
Supercomput. Front. Innov., 2018
SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018
PRIONN: Predicting Runtime and IO using Neural Networks.
Proceedings of the 47th International Conference on Parallel Processing, 2018
Thread-local concurrency: a technique to handle data race detection at programming model abstraction.
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018
2017
Noise Injection Techniques to Expose Subtle and Unintended Message Races.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017
OpenMP Tools Interface: Synchronization Information for Data Race Detection.
Proceedings of the Scaling OpenMP for Exascale Performance and Portability, 2017
FLiT: Cross-platform floating-point result-consistency tester and workload.
Proceedings of the 2017 IEEE International Symposium on Workload Characterization, 2017
2016
Testing Infrastructure for OpenMP Debugging Interface Implementations.
Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016
ARCHER: Effectively Spotting Data Races in Large OpenMP Applications.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016
Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016
2015
Diagnosis of Performance Faults in Large Scale MPI Applications via Probabilistic Progress-Dependence Inference.
IEEE Trans. Parallel Distributed Syst., 2015
Debugging high-performance computing applications at massive scales.
Commun. ACM, 2015
Clock delta compression for scalable order-replay of non-deterministic parallel applications.
Proceedings of the International Conference for High Performance Computing, 2015
Lessons Learned from Implementing OMPD: A Debugging Interface for OpenMP.
Proceedings of the OpenMP: Heterogenous Execution and Data Movements, 2015
A Scalable Prescriptive Parallel Debugging Model.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015
2014
Towards providing low-overhead data race detection for large OpenMP applications.
Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, 2014
Accurate application progress analysis for large-scale parallel debugging.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014
Flux: A Next-Generation Resource Management Framework for Large HPC Centers.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014
2013
LIBI: A framework for bootstrapping extreme scale software systems.
Parallel Comput., 2013
An Optimal Algorithm for Extreme Scale Job Launching.
Proceedings of the 12th IEEE International Conference on Trust, 2013
Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset.
Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering, 2013
Efficient and Scalable Retrieval Techniques for Global File Properties.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013
Massively parallel loading.
Proceedings of the International Conference on Supercomputing, 2013
2012
Beyond DVFS: A First Look at Performance under a Hardware-Enforced Power Bound.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
Probabilistic diagnosis of performance faults in large-scale parallel applications.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012
2011
Large scale debugging of parallel tasks with AutomaDeD.
Proceedings of the Conference on High Performance Computing Networking, 2011
Exascale Algorithms for Generalized MPI_Comm_split.
Proceedings of the Recent Advances in the Message Passing Interface, 2011
2010
AutomaDeD: Automata-based debugging for dissimilar parallel tasks.
Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems and Networks, 2010
2009
Scalable temporal order analysis for large scale debugging.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009
2008
Lessons learned at 208K: towards debugging millions of cores.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008
Overcoming Scalability Challenges for Tool Daemon Launching.
Proceedings of the 2008 International Conference on Parallel Processing, 2008
2007
Dynamic Binary Instrumentation and Data Aggregation on Large Scale Systems.
Int. J. Parallel Program., 2007
Benchmarking the Stack Trace Analysis Tool for BlueGene/L.
Proceedings of the Parallel Computing: Architectures, 2007
Stack Trace Analysis for Large Scale Debugging.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007
Pynamic: the Python Dynamic Benchmark.
Proceedings of the IEEE 10th International Symposium on Workload Characterization, 2007
2005
Scalable dynamic binary instrumentation for Blue Gene/L.
SIGARCH Comput. Archit. News, 2005
2002
Scalable analysis techniques for microprocessor performance counter metrics.
Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002