Christian Engelmann

Orcid: 0000-0003-4365-6416

According to our database, Christian Engelmann authored at least 100 papers between 2002 and 2024.

Bibliography

2024
Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

2023
Science Use Case Design Patterns for Autonomous Experiments.
Proceedings of the 28th European Conference on Pattern Languages of Programs, 2023

2022
Resiliency in numerical algorithm design for extreme scale simulations.
Int. J. High Perform. Comput. Appl., 2022

The INTERSECT Open Federated Architecture for the Laboratory of the Future.
Proceedings of the Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation, 2022

2021
Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system.
J. Parallel Distributed Comput., 2021

RDPM: An Extensible Tool for Resilience Design Patterns Modelling.
Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021

2020
GPU lifetimes on titan supercomputer: survival analysis and reliability.
Proceedings of the International Conference for High Performance Computing, 2020

Models for Resilience Design Patterns.
Proceedings of the 10th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2020

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems.
Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing, 2020

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication.
Proceedings of the Euro-Par 2020: Parallel Processing, 2020

2019
Self-stabilizing Connected Components.
Proceedings of the 9th IEEE/ACM Workshop on Fault Tolerance for HPC at eXtreme Scale, 2019

Concepts for OpenMP Target Offload Resilience.
Proceedings of the OpenMP: Conquering the Full Hardware Spectrum, 2019

2018
Epidemic failure detection and consensus for extreme parallelism.
Int. J. High Perform. Comput. Appl., 2018

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing.
Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering, 2018

A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform.
Proceedings of the IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2018

Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer.
Proceedings of the IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale, 2018

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery.
Proceedings of the 26th Euromicro International Conference on Parallel, 2018

A Comprehensive Informative Metric for Summarizing HPC System Status.
Proceedings of the 8th IEEE Symposium on Large Data Analysis and Visualization, 2018

Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms.
Proceedings of the Euro-Par 2018: Parallel Processing Workshops, 2018

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System.
Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System.
Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing.
Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

2017
Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale.
Supercomput. Front. Innov., 2017

Failures in large scale systems: long-term measurement, analysis, and implications.
Proceedings of the International Conference for High Performance Computing, 2017

Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities.
Proceedings of the 25th IEEE International Symposium on Modeling, 2017

Towards New Metrics for High-Performance Computing Resilience.
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2017

A Pattern Language for High-Performance Computing Resilience.
Proceedings of the 22nd European Conference on Pattern Languages of Programs, 2017

Pattern-Based Modeling of High-Performance Computing Resilience.
Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016
A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator.
Concurr. Comput. Pract. Exp., 2016

Language Support for Reliable Memory Regions.
Proceedings of the Languages and Compilers for Parallel Computing, 2016

Reducing Waste in Extreme Scale Systems through Introspective Analysis.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Mini-Ckpts: Surviving OS Failures in Persistent Memory.
Proceedings of the 2016 International Conference on Supercomputing, 2016

Havens: Explicit reliable memory regions for HPC applications.
Proceedings of the 2016 IEEE High Performance Extreme Computing Conference, 2016

Adding Fault Tolerance to NPB Benchmarks Using ULFM.
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2016

A Cooperative Approach to Virtual Machine Based Fault Injection.
Proceedings of the Euro-Par 2016: Parallel Processing Workshops, 2016

Benchmark Generation and Simulation at Extreme Scale.
Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2016

Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy.
Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016

2015
Scalable and Fault Tolerant Failure Detection and Consensus.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015

2014
Addressing failures in exascale computing.
Int. J. High Perform. Comput. Appl., 2014

Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale.
Future Gener. Comput. Syst., 2014

Supporting the Development of Resilient Message Passing Applications Using Simulation.
Proceedings of the 22nd Euromicro International Conference on Parallel, 2014

What Is the Right Balance for Performance and Isolation with Virtualization in HPC?
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

Improving the Performance of the Extreme-Scale Simulator.
Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, 2014

2013
Tools for Simulation and Benchmark Generation at Exascale.
Proceedings of the Tools for High Performance Computing 2013, 2013

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools.
Proceedings of the First International Symposium on Computing and Networking, 2013

Using Performance Tools to Support Experiments in HPC Resilience.
Proceedings of the Euro-Par 2013: Parallel Processing Workshops, 2013

2012
Proactive process-level live migration and back migration in HPC environments.
J. Parallel Distributed Comput., 2012

Detection and correction of silent data corruption for large-scale high-performance computing.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

File I/O for MPI Applications in Redundant Execution Scenarios.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Combining Partial Redundancy and Checkpointing for HPC.
Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012

2011
Poster: detection and correction of silent data corruption for large-scale high-performance computing.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Poster: a tunable, software-based DRAM error detection and correction library for HPC.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

xSim: The extreme-scale simulator.
Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

Simulation of Large-Scale HPC Architectures.
Proceedings of the 2011 International Conference on Parallel Processing Workshops, 2011

A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010
System-level virtualization research at Oak Ridge National Laboratory.
Future Gener. Comput. Syst., 2010

Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures.
Proceedings of the Conference on High Performance Computing Networking, 2010

Hybrid Checkpointing for MPI Jobs in HPC Environments.
Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems, 2010

Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments.
Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications, 2010

2009
Symmetric active/active metadata service for high availability parallel file systems.
J. Parallel Distributed Comput., 2009

A tunable holistic resiliency approach for high-performance computing systems.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

High Performance Computing with Harness over InfiniBand.
Proceedings of the 17th Euromicro International Conference on Parallel, 2009

Proactive Fault Tolerance Using Preemptive Migration.
Proceedings of the 17th Euromicro International Conference on Parallel, 2009

Performance comparison of two virtual machine scenarios using an HPC application: a case study using molecular dynamics simulations.
Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, 2009

Blue Gene/L Log Analysis and Time to Interrupt Estimation.
Proceedings of The Forth International Conference on Availability, 2009

2008
Virtual System Environments.
Proceedings of the Systems and Virtualization Management. Standards and New Technologies, 2008

Proactive process-level live migration in HPC environments.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

System-Level Virtualization for High Performance Computing.
Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Virtualized Environments for the Harness High Performance Computing Workbench.
Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Effects of virtualization on a scientific application running a hyperspectral radiative transfer code on virtual machines.
Proceedings of the 2nd Workshop on System-Level Virtualization for High Performance Computing, 2008

An Analysis of HPC Benchmarks in Virtual Machine Environments.
Proceedings of the Euro-Par 2008 Workshops, 2008

Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

A Framework for Proactive Fault Tolerance.
Proceedings of The Third International Conference on Availability, 2008

Symmetric Active/Active Replication for Dependent Services.
Proceedings of The Third International Conference on Availability, 2008

2007
A unified multiple-level cache for high performance storage systems.
Int. J. High Perform. Comput. Netw., 2007

Distributed Real-Time Computing with Harness.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance.
Proceedings of the 21th International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Proactive fault tolerance for HPC with Xen virtualization.
Proceedings of the 21th Annual International Conference on Supercomputing, 2007

A Fast Delivery Protocol for Total Order Broadcasting.
Proceedings of the 16th International Conference on Computer Communications and Networks, 2007

Middleware in Modern High Performance Computing System Architectures.
Proceedings of the Computational Science - ICCS 2007, 7th International Conference, Beijing, China, May 27, 2007

Transparent Symmetric Active/Active Replication for Service-Level High Availability.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

On Programming Models for Service-Level High Availability.
Proceedings of The Second International Conference on Availability, 2007

2006
MOLAR: adaptive runtime support for high-end computing operating and runtime systems.
ACM SIGOPS Oper. Syst. Rev., 2006

Symmetric Active/Active High Availability for High-Performance Computing System Services.
J. Comput., 2006

Scalable, fault tolerant membership for MPI tasks on HPC systems.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

RMIX: A Dynamic, Heterogeneous, Reconfigurable Communication Framework.
Proceedings of the Computational Science, 2006

A Parallel Plug-In Programming Paradigm.
Proceedings of the High Performance Computing and Communications, 2006

JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Active/Active Replication for Highly Available HPC System Services.
Proceedings of The First International Conference on Availability, 2006

2005
UML-based Beowulf Cluster Availability Modeling.
Proceedings of the International Conference on Software Engineering Research and Practice, 2005

A Lightweight Kernel for the Harness Metacomputing Framework.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Super-Scalable Algorithms for Computing on 100,000 Processors.
Proceedings of the Computational Science, 2005

Job-Site Level Fault Tolerance for Cluster and Grid environments.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

2003
A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform.
Proceedings of the 1st International Workshop on Challenges of Large Applications in Distributed Environments, 2003

2002
Distributed Peer-to-Peer Control in Harness.
Proceedings of the Computational Science - ICCS 2002, 2002

