Christian Engelmann

Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021

2020

GPU lifetimes on Titan supercomputer: survival analysis and reliability.

Proceedings of the International Conference for High Performance Computing, 2020

Models for Resilience Design Patterns.

Mohit Kumar

Proceedings of the 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2020

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems.

Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing, 2020

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication.

Proceedings of the Euro-Par 2020: Parallel Processing, 2020

2019

Self-stabilizing Connected Components.

Proceedings of the 9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2019

Concepts for OpenMP Target Offload Resilience.

Geoffroy R. Vallée

Swaroop Pophale

Proceedings of the OpenMP: Conquering the Full Hardware Spectrum, 2019

2018

Epidemic failure detection and consensus for extreme parallelism.

Int. J. High Perform. Comput. Appl., 2018

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing.

Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, 2018

A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform.

Yawei Hui

Byung-Hoon Park

Proceedings of the IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2018

Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer.

Proceedings of the IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2018

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery.

Proceedings of the 26th Euromicro International Conference on Parallel, 2018

A Comprehensive Informative Metric for Summarizing HPC System Status.

Yawei Hui

Byung-Hoon Park

Proceedings of the 8th IEEE Symposium on Large Data Analysis and Visualization, 2018

Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms.

Proceedings of the Euro-Par 2018: Parallel Processing Workshops, 2018

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System.

Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System.

Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing.

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

2017

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale.

Supercomput. Front. Innov., 2017

Failures in large scale systems: long-term measurement, analysis, and implications.

Proceedings of the International Conference for High Performance Computing, 2017

Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities.

Proceedings of the 25th IEEE International Symposium on Modeling, 2017

Towards New Metrics for High-Performance Computing Resilience.

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2017

A Pattern Language for High-Performance Computing Resilience.

Proceedings of the 22nd European Conference on Pattern Languages of Programs, 2017

Pattern-Based Modeling of High-Performance Computing Resilience.

Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale.

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator.

Thomas J. Naughton

Concurr. Comput. Pract. Exp., 2016

Language Support for Reliable Memory Regions.

Leonardo Arturo Bautista-Gomez

Proceedings of the Languages and Compilers for Parallel Computing, 2016

Reducing Waste in Extreme Scale Systems through Introspective Analysis.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Mini-Ckpts: Surviving OS Failures in Persistent Memory.

Proceedings of the 2016 International Conference on Supercomputing, 2016

Havens: Explicit reliable memory regions for HPC applications.

Proceedings of the 2016 IEEE High Performance Extreme Computing Conference, 2016

Adding Fault Tolerance to NPB Benchmarks Using ULFM.

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2016

A Cooperative Approach to Virtual Machine Based Fault Injection.

Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

Benchmark Generation and Simulation at Extreme Scale.

Mahesh Lagadapati

Frank Mueller

Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2016

Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy.

Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016

2015

Scalable and Fault Tolerant Failure Detection and Consensus.

Proceedings of the 22nd European MPI Users' Group Meeting, 2015

2014

Addressing failures in exascale computing.

Int. J. High Perform. Comput. Appl., 2014

Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale.

Future Gener. Comput. Syst., 2014

Supporting the Development of Resilient Message Passing Applications Using Simulation.

Proceedings of the 22nd Euromicro International Conference on Parallel, 2014

What Is the Right Balance for Performance and Isolation with Virtualization in HPC?

Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

Improving the Performance of the Extreme-Scale Simulator.

Thomas J. Naughton

Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2014

2013

Tools for Simulation and Benchmark Generation at Exascale.

Mahesh Lagadapati

Frank Mueller

Proceedings of the Tools for High Performance Computing 2013, 2013

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems.

Thomas J. Naughton

Proceedings of the 42nd International Conference on Parallel Processing, 2013

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools.

Proceedings of the First International Symposium on Computing and Networking, 2013

Using Performance Tools to Support Experiments in HPC Resilience.

Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

2012

Proactive process-level live migration and back migration in HPC environments.

J. Parallel Distributed Comput., 2012

Detection and correction of silent data corruption for large-scale high-performance computing.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

File I/O for MPI Applications in Redundant Execution Scenarios.

Swen Böhm

Proceedings of the 20th Euromicro International Conference on Parallel, 2012

NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines.

Chao Wang

Sudharshan S. Vazhkudai

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Combining Partial Redundancy and Checkpointing for HPC.

Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012

2011

Poster: detection and correction of silent data corruption for large-scale high-performance computing.

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Poster: a tunable, software-based DRAM error detection and correction library for HPC.

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

xSim: The extreme-scale simulator.

Swen Böhm

Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

Simulation of Large-Scale HPC Architectures.

Ian S. Jones

Proceedings of the 2011 International Conference on Parallel Processing Workshops, 2011

A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC.

Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010

System-level virtualization research at Oak Ridge National Laboratory.

Future Gener. Comput. Syst., 2010

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures.

Min Li

Sudharshan S. Vazhkudai

Proceedings of the Conference on High Performance Computing Networking, 2010

Hybrid Checkpointing for MPI Jobs in HPC Environments.

Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems, 2010

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments.

Swen Böhm

Kulathep Charoenpornwattana

Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications, 2010

2009

Symmetric active/active metadata service for high availability parallel file systems.

J. Parallel Distributed Comput., 2009

A tunable holistic resiliency approach for high-performance computing systems.

Nichamon Naksinehaboon

Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

High Performance Computing with Harness over InfiniBand.

Proceedings of the 17th Euromicro International Conference on Parallel, 2009

Proactive Fault Tolerance Using Preemptive Migration.

Proceedings of the 17th Euromicro International Conference on Parallel, 2009

Performance comparison of two virtual machine scenarios using an HPC application: a case study using molecular dynamics simulations.

Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, 2009

Blue Gene/L Log Analysis and Time to Interrupt Estimation.

Narate Taerat

Nichamon Naksinehaboon

Proceedings of the Fourth International Conference on Availability, 2009

2008

Virtual System Environments.

Proceedings of the Systems and Virtualization Management. Standards and New Technologies, 2008

Proactive process-level live migration in HPC environments.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

System-Level Virtualization for High Performance Computing.

Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Virtualized Environments for the Harness High Performance Computing Workbench.

Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Effects of virtualization on a scientific application running a hyperspectral radiative transfer code on virtual machines.

Proceedings of the 2nd Workshop on System-Level Virtualization for High Performance Computing, 2008

An Analysis of HPC Benchmarks in Virtual Machine Environments.

Proceedings of the Euro-Par 2008 Workshops, 2008

Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations.

Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

A Framework for Proactive Fault Tolerance.

Geoffroy Vallée

Proceedings of the Third International Conference on Availability, 2008

Symmetric Active/Active Replication for Dependent Services.

Proceedings of the Third International Conference on Availability, 2008

2007

A unified multiple-level cache for high performance storage systems.

Int. J. High Perform. Comput. Netw., 2007

Distributed Real-Time Computing with Harness.

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance.

Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Proactive fault tolerance for HPC with Xen virtualization.

Proceedings of the 21st Annual International Conference on Supercomputing, 2007

A Fast Delivery Protocol for Total Order Broadcasting.

Proceedings of the 16th International Conference on Computer Communications and Networks, 2007

Middleware in Modern High Performance Computing System Architectures.

Hong Ong

Proceedings of the Computational Science - ICCS 2007, 7th International Conference, Beijing, China, May 27, 2007

Transparent Symmetric Active/Active Replication for Service-Level High Availability.

Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

On Programming Models for Service-Level High Availability.

Proceedings of the Second International Conference on Availability, 2007

2006

MOLAR: adaptive runtime support for high-end computing operating and runtime systems.

Narasimha Raju Gottumukkala

David E. Bernholdt

ACM SIGOPS Oper. Syst. Rev., 2006

Symmetric Active/Active High Availability for High-Performance Computing System Services.

J. Comput., 2006

Scalable, fault tolerant membership for MPI tasks on HPC systems.

Proceedings of the 20th Annual International Conference on Supercomputing, 2006

RMIX: A Dynamic, Heterogeneous, Reconfigurable Communication Framework.

Proceedings of the Computational Science, 2006

A Parallel Plug-In Programming Paradigm.

Ronald Baumann

Proceedings of the High Performance Computing and Communications, 2006

JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management.

Kai Uhlemann

Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Active/Active Replication for Highly Available HPC System Services.

Proceedings of the First International Conference on Availability, 2006

2005

UML-based Beowulf Cluster Availability Modeling.

Proceedings of the International Conference on Software Engineering Research and Practice, 2005

A Lightweight Kernel for the Harness Metacomputing Framework.

Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Super-Scalable Algorithms for Computing on 100,000 Processors.

Proceedings of the Computational Science, 2005

Job-Site Level Fault Tolerance for Cluster and Grid environments.

Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

2003

A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform.

Proceedings of the 1st International Workshop on Challenges of Large Applications in Distributed Environments, 2003

2002

Distributed Peer-to-Peer Control in Harness.
