Franck Cappello

Orcid: 0000-0002-7890-3934

Affiliations:
  • University of Illinois, INRIA-Illinois Joint Laboratory on PetaScale Computing


According to our database, Franck Cappello authored at least 316 papers between 1990 and 2025.

Awards

IEEE Fellow 2017, "For contributions to high-performance computing, fault tolerance, and grid-based computing".

Bibliography

2025
Multifacets of lossy compression for scientific data in the Joint-Laboratory of Extreme Scale Computing.
Future Gener. Comput. Syst., 2025

2024
High-performance Effective Scientific Error-bounded Lossy Compression with Auto-tuned Multi-component Interpolation.
Proc. ACM Manag. Data, February, 2024

LCP: Enhancing Scientific Data Management with Lossy Compression for Particles.
CoRR, 2024

To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/O.
CoRR, 2024

DGRO: Diameter-Guided Ring Optimization for Integrated Research Infrastructure Membership.
CoRR, 2024

FRSZ2 for In-Register Block Compression Inside GMRES on GPUs.
CoRR, 2024

HoSZp: An Efficient Homomorphic Error-bounded Lossy Compressor for Scientific Data.
CoRR, 2024

TurboFFT: A High-Performance Fast Fourier Transform with Fault Tolerance on GPU.
CoRR, 2024

A Survey on Error-Bounded Lossy Compression for Scientific Datasets.
CoRR, 2024

Understanding The Effectiveness of Lossy Compression in Machine Learning Training Sets.
CoRR, 2024

POSTER: Optimizing Collective Communications with Error-bounded Lossy Compression for GPU Clusters.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

Towards Affordable Reproducibility Using Scalable Capture and Comparison of Intermediate Multi-Run Results.
Proceedings of the 25th International Middleware Conference, 2024

Deep Optimizer States: Towards Scalable Training of Transformer Models using Interleaved Offloading.
Proceedings of the 25th International Middleware Conference, 2024

CliZ: Optimizing Lossy Compression for Climate Datasets with Adaptive Fine-tuned Data Prediction.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

Preserving Topological Feature with Sign-of-Determinant Predicates in Lossy Compression: A Case Study of Vector Field Critical Points.
Proceedings of the 40th IEEE International Conference on Data Engineering, 2024

FedSZ: Leveraging Error-Bounded Lossy Compression for Federated Learning Communications.
Proceedings of the 44th IEEE International Conference on Distributed Computing Systems, 2024

CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

A Portable, Fast, DCT-based Compressor for AI Accelerators.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers.
Proceedings of the 14th Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures, 2024

Concealing Compression-accelerated I/O for HPC Applications through In Situ Task Scheduling.
Proceedings of the Nineteenth European Conference on Computer Systems, 2024

FT K-Means: A High-Performance K-Means on GPU with Fault Tolerance.
Proceedings of the IEEE International Conference on Cluster Computing, 2024

2023
Toward Feature-Preserving Vector Field Compression.
IEEE Trans. Vis. Comput. Graph., December, 2023

Black-box statistical prediction of lossy compression ratios for scientific data.
Int. J. High Perform. Comput. Appl., July, 2023

SZ3: A Modular Framework for Composing Prediction-Based Error-Bounded Lossy Compressors.
IEEE Trans. Big Data, April, 2023

Efficient Communication in Federated Learning Using Floating-Point Lossy Compression.
CoRR, 2023

cuSZ-I: High-Fidelity Error-Bounded Lossy Compression for Scientific Data on GPUs.
CoRR, 2023

SRN-SZ: Deep Leaning-Based Scientific Error-bounded Lossy Compression with Super-resolution Neural Networks.
CoRR, 2023

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters.
CoRR, 2023

C-Coll: Introducing Error-bounded Lossy Compression into MPI Collectives.
CoRR, 2023

Streaming Hardware Compressor Generator Framework.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

AMRIC: A Novel In Situ Lossy Compression Framework for Efficient I/O in Adaptive Mesh Refinement Applications.
Proceedings of the International Conference for High Performance Computing, 2023

LibPressio-Predict: Flexible and Fast Infrastructure For Inferring Compression Performance.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

cuSZp: An Ultra-fast GPU Error-bounded Lossy Compression Framework with Optimized End-to-End Performance.
Proceedings of the International Conference for High Performance Computing, 2023

GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs.
Proceedings of the 37th International Conference on Supercomputing, 2023

Lightweight Huffman Coding for Efficient GPU Compression.
Proceedings of the 37th International Conference on Supercomputing, 2023

FAZ: A flexible auto-tuned modular error-bounded compression framework for scientific data.
Proceedings of the 37th International Conference on Supercomputing, 2023

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication.
Proceedings of the 52nd International Conference on Parallel Processing, 2023

A Feature-Driven Fixed-Ratio Lossy Compression Framework for Real-World Scientific Datasets.
Proceedings of the 39th IEEE International Conference on Data Engineering, 2023

Optimizing Scientific Data Transfer on Globus with Error-Bounded Lossy Compression.
Proceedings of the 43rd IEEE International Conference on Distributed Computing Systems, 2023

FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs.
Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023

GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching.
Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023

Towards Efficient I/O Pipelines Using Accumulated Compression.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

SECRE: Surrogate-Based Error-Controlled Lossy Compression Ratio Estimation Framework.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Characterization and Detection of Artifacts for Error-Controlled Lossy Compressors.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Time Machine: Generative Real-Time Model for Failure (and Lead Time) Prediction in HPC Systems.
Proceedings of the 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Network, 2023

An Efficient and Accurate Compression Ratio Estimation Model for SZx.
Proceedings of the IEEE International Conference on Cluster Computing, 2023

A Lightweight, Effective Compressibility Estimation Method for Error-bounded Lossy Compression.
Proceedings of the IEEE International Conference on Cluster Computing, 2023

Towards Improving Reverse Time Migration Performance by High-speed Lossy Compression.
Proceedings of the 23rd IEEE/ACM International Symposium on Cluster, 2023

Scientific Error-bounded Lossy Compression with Super-resolution Neural Networks.
Proceedings of the IEEE International Conference on Big Data, 2023

Exploring Wavelet Transform Usages for Error-bounded Scientific Data Compression.
Proceedings of the IEEE International Conference on Big Data, 2023

2022
OptZConfig: Efficient Parallel Optimization of Lossy Compression Configuration.
IEEE Trans. Parallel Distributed Syst., 2022

Optimizing Error-Bounded Lossy Compression for Scientific Data With Diverse Constraints.
IEEE Trans. Parallel Distributed Syst., 2022

Toward Quantity-of-Interest Preserving Lossy Compression for Scientific Data.
Proc. VLDB Endow., 2022

ROIBIN-SZ: Fast and Science-Preserving Compression for Serial Crystallography.
CoRR, 2022

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets.
CoRR, 2022

SIMD Lossy Compression for Scientific Data.
CoRR, 2022

Understanding the Effects of Modern Compressors on the Community Earth Science Model.
Proceedings of the 8th IEEE/ACM International Workshop on Data Analysis and Reduction for Big Scientific Data, 2022

Understanding Impact of Lossy Compression on Derivative-related Metrics in Scientific Datasets.
Proceedings of the 8th IEEE/ACM International Workshop on Data Analysis and Reduction for Big Scientific Data, 2022

Dynamic Quality Metric Oriented Error Bounded Lossy Compression for Scientific Datasets.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Accelerating Parallel Write via Deeply Integrating Predictive Lossy Compression with HDF5.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Hardening selective protection across multiple program inputs for HPC applications.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

Efficient Error-Bounded Lossy Compression for CPU Architectures.
Proceedings of the 30th International Symposium on Modeling, 2022

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

MDZ: An Efficient Error-bounded Lossy Compressor for Molecular Dynamics.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

Improving Prediction-Based Lossy Compression Dramatically via Ratio-Quality Modeling.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

Ultrafast Error-bounded Lossy Compression for Scientific Datasets.
Proceedings of the HPDC '22: The 31st International Symposium on High-Performance Parallel and Distributed Computing, Minneapolis, MN, USA, 27 June 2022, 2022

A Reflection on Methodologies, Algorithms and Software for HPDC.
Proceedings of the HPDC '22: The 31st International Symposium on High-Performance Parallel and Distributed Computing, Minneapolis, MN, USA, 27 June 2022, 2022

Towards Efficient Cache Allocation for High-Frequency Checkpointing.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Exploring Light-weight Cryptography for Efficient and Secure Lossy Data Compression.
Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021
FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks.
IEEE Trans. Parallel Distributed Syst., 2021

Demystifying asynchronous I/O Interference in HPC applications.
Int. J. High Perform. Comput. Appl., 2021

Online data analysis and reduction: An important Co-design motif for extreme-scale computers.
Int. J. High Perform. Comput. Appl., 2021

Towards Aggregated Asynchronous Checkpointing.
CoRR, 2021

cuSZ(x): Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs.
CoRR, 2021

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale.
CoRR, 2021

Productive and Performant Generic Lossy Data Compression with LibPressio.
Proceedings of the 2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data, 2021

Understanding Effectiveness of Multi-Error-Bounded Lossy Compression for Preserving Ranges of Interest in Scientific Analysis.
Proceedings of the 2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data, 2021

Resilient error-bounded lossy compressor for data transfer.
Proceedings of the International Conference for High Performance Computing, 2021

Exploring Lossy Compressibility through Statistical Correlations of Scientific Datasets.
Proceedings of the 2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data, 2021

Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing.
Proceedings of the 29th International Symposium on Modeling, 2021

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation.
Proceedings of the 37th IEEE International Conference on Data Engineering, 2021

Towards Combining Error-bounded Lossy Compression and Cryptography for Scientific Data.
Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets.
Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Towards High Performance Resilience Using Performance Portable Abstractions.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021

Sentiment Analysis based Error Detection for Large-Scale Systems.
Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2021

cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

Accelerating DNN Architecture Search at Scale Using Selective Weight Transfer.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

Exploring Autoencoder-based Error-bounded Compression for Scientific Data.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

Improving Lossy Compression for SZ by Exploring the Best-Fit Lossless Compression Techniques.
Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021

2020
Performance Optimization for Relative-Error-Bounded Lossy Compression on Scientific Data.
IEEE Trans. Parallel Distributed Syst., 2020

SDC Resilient Error-bounded Lossy Compressor.
CoRR, 2020

Algorithm-Based Fault Tolerance for Convolutional Neural Networks.
CoRR, 2020

Fulfilling the Promises of Lossy Compression for Scientific Applications.
Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization.
Proceedings of the HPDC '20: The 29th International Symposium on High-Performance Parallel and Distributed Computing, 2020

DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models.
Proceedings of the 20th IEEE/ACM International Symposium on Cluster, 2020

SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors.
Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020

Toward Feature-Preserving 2D and 3D Vector Field Compression.
Proceedings of the 2020 IEEE Pacific Visualization Symposium, 2020

cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data.
Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

2019
Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP.
IEEE Trans. Parallel Distributed Syst., 2019

Efficient Lossy Compression for Scientific Data Based on Pointwise Relative Error Bound.
IEEE Trans. Parallel Distributed Syst., 2019

Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System.
IEEE Trans. Parallel Distributed Syst., 2019

Z-checker: A framework for assessing lossy compression of scientific data.
Int. J. High Perform. Comput. Appl., 2019

Use cases of lossy compression for floating-point data in scientific data sets.
Int. J. High Perform. Comput. Appl., 2019

Exploring the feasibility of lossy compression for PDE simulations.
Int. J. High Perform. Comput. Appl., 2019

Full-state quantum circuit simulation by using data compression.
Proceedings of the International Conference for High Performance Computing, 2019

Significantly improving lossy compression quality based on an optimized hybrid prediction model.
Proceedings of the International Conference for High Performance Computing, 2019

FT-iSort: efficient fault tolerance for introsort.
Proceedings of the International Conference for High Performance Computing, 2019

Accelerating Relative-error Bounded Lossy Compression for HPC datasets with Precomputation-Based Mechanisms.
Proceedings of the 35th Symposium on Mass Storage Systems and Technologies, 2019

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression.
Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring.
Proceedings of the Euro-Par 2019: Parallel Processing, 2019

Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System.
Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2019

Improving Performance of Data Dumping with Lossy Compression for Scientific Simulation.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

2018
Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data.
IEEE Trans. Parallel Distributed Syst., 2018

Exploring the capabilities of support vector machines in detecting silent data corruptions.
Sustain. Comput. Informatics Syst., 2018

Coping with silent and fail-stop errors at scale by combining replication and checkpointing.
J. Parallel Distributed Comput., 2018

Unified fault-tolerance framework for hybrid task-parallel message-passing applications.
Int. J. High Perform. Comput. Appl., 2018

Big data and extreme-scale computing.
Int. J. High Perform. Comput. Appl., 2018

Transferring a petabyte in a day.
Future Gener. Comput. Syst., 2018

Memory-Efficient Quantum Circuit Simulation by Using Lossy Data Compression.
CoRR, 2018

Amplitude-Aware Lossy Compression for Quantum Circuit Simulation.
CoRR, 2018

Parallel Partial Reduction for Large-Scale Data Analysis and Visualization.
Proceedings of the 8th IEEE Symposium on Large Data Analysis and Visualization, 2018

CEBDA 2018 Keynote.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Improving performance of iterative methods by lossy checkponting.
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers.
Proceedings of the 18th Eurographics Symposium on Parallel Graphics and Visualization, 2018


Neural Network Based Silent Error Detector.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Fixed-PSNR Lossy Compression for Scientific Data.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

An Efficient Transformation Scheme for Lossy Data Compression with Point-Wise Relative Error Bound.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

PaSTRI: Error-Bounded Lossy Compression for Two-Electron Integrals in Quantum Chemistry.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets.
Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

Optimizing Lossy Compression with Adjacent Snapshots for N-body Simulation Data.
Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

2017
Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model.
IEEE Trans. Parallel Distributed Syst., 2017

Toward General Software Level Silent Data Corruption Detection for Parallel Applications.
IEEE Trans. Parallel Distributed Syst., 2017

Exploration of Pattern-Matching Techniques for Lossy Compression on Cosmology Simulation Data Sets.
Proceedings of the High Performance Computing, 2017

Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.
Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2017

Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench.
Proceedings of the 27th International Conference on Field Programmable Logic and Applications, 2017

Evaluation of a Floating-Point Intensive Kernel on FPGA - A Case Study of Geodesic Distance Kernel.
Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

Understanding and Improving the Trust in Results of Numerical Simulations and Scientific Data Analytics.
Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Detection of Silent Data Corruption in Adaptive Numerical Integration Solvers.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

LogAider: A tool for mining potential correlations of HPC log events.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017

In-depth exploration of single-snapshot lossy compression techniques for N-body simulations.
Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

2016
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications.
IEEE Trans. Parallel Distributed Syst., 2016

Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations.
ACM Trans. Parallel Comput., 2016

Self-Adaptive Density Estimation of Particle Data.
SIAM J. Sci. Comput., 2016

Preface: Visualization and data analytics for scientific discovery.
Parallel Comput., 2016

Fast Error-Bounded Lossy HPC Data Compression with SZ.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Reducing Waste in Extreme Scale Systems through Introspective Analysis.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

DSN 2016 Tutorial: Resilience for Scientific Computing: From Theory to Practice.
Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015
GloudSim: Google trace based cloud simulator with virtual machines.
Softw. Pract. Exp., 2015

Detecting Silent Data Corruption for Extreme-Scale MPI Applications.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015

Scheduling the I/O of HPC Applications Under Congestion.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Addressing the Last Roadblock for Message Logging in HPC: Alleviating the Memory Requirement Using Dedicated Resources.
Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

Distributed Monitoring and Management of Exascale Systems in the Argo Project.
Proceedings of the Distributed Applications and Interoperable Systems, 2015

Fault-Tolerant Protocol for Hybrid Task-Parallel Message-Passing Applications.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014
Characterizing and modeling cloud applications/jobs on a Google data center.
J. Supercomput., 2014

Adaptive Algorithm for Minimizing Cloud Task Length with Prediction Errors.
IEEE Trans. Cloud Comput., 2014

Toward Exascale Resilience: 2014 update.
Supercomput. Front. Innov., 2014

Addressing failures in exascale computing.
Int. J. High Perform. Comput. Appl., 2014

Unified model for assessing checkpointing protocols at extreme-scale.
Concurr. Comput. Pract. Exp., 2014

Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales.
Proceedings of the International Conference for High Performance Computing, 2014

Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

Detecting silent data corruption through data dynamic monitoring for scientific applications.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

GPGPUs: How to combine high computational power with high reliability.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

Energy-performance tradeoffs in multilevel checkpoint strategies.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

2013
BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds.
J. Parallel Distributed Comput., 2013

Failure prediction for HPC systems and applications: Current situation and open issues.
Int. J. High Perform. Comput. Appl., 2013

Scalable data management for map-reduce-based data-intensive applications: a view for cloud and hybrid infrastructures.
Int. J. Cloud Comput., 2013

SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing.
Proceedings of the International Conference for High Performance Computing, 2013

Optimization of cloud task processing with checkpoint-restart mechanism.
Proceedings of the International Conference for High Performance Computing, 2013

Towards an energy estimator for fault tolerance protocols.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

HPCS 2013 panel: The era of exascale sciences: Challenges, needs and requirements.
Proceedings of the International Conference on High Performance Computing & Simulation, 2013

Characterizing Cloud Applications on a Google Data Center.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization.
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

Improving floating point compression through binary masks.
Proceedings of the 2013 IEEE International Conference on Big Data (IEEE BigData 2013), 2013

2012
HydEE, vers un protocole de recouvrement arrière hiérarchique pour les machines exascales. De l'exploitation du déterminisme des émissions dans les protocoles de recouvrement arrière.
Tech. Sci. Informatiques, 2012

Fault prediction under the microscope: a closer look into HPC systems.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

A hybrid local storage transfer scheme for live migration of I/O intensive workloads.
Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, 2012

Scalable Reed-Solomon-Based Reliable Local Storage for HPC Applications on IaaS Clouds.
Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

Energy considerations in checkpointing and fault tolerance protocols.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2012

Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011
Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers.
Parallel Process. Lett., 2011

The International Exascale Software Project roadmap.
Int. J. High Perform. Comput. Appl., 2011

QCG-OMPI: MPI applications on grids.
Future Gener. Comput. Syst., 2011

BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots.
Proceedings of the Conference on High Performance Computing Networking, 2011

Modeling and tolerating heterogeneous failures in large parallel systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

FTI: high performance fault tolerance interface for hybrid systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

DPDNS Keynote.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Comparing archival policies for Blue Waters.
Proceedings of the 18th International Conference on High Performance Computing, 2011

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Optimizing Multi-deployment on Clouds by Means of Self-adaptive Prefetching.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Event Log Mining Tool for Large Scale HPC Systems.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

2010
Special section: Peer-to-peer grid technologies.
Future Gener. Comput. Syst., 2010

Checkpointing vs. Migration for Post-Petascale Supercomputers.
Proceedings of the 39th International Conference on Parallel Processing, 2010

On Communication Determinism in Parallel HPC Applications.
Proceedings of the 19th International Conference on Computer Communications and Networks, 2010

Low-overhead diskless checkpoint for hybrid computing systems.
Proceedings of the 2010 International Conference on High Performance Computing, 2010

Distributed Diskless Checkpoint for Large Scale Systems.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

Planning Large Data Transfers in Institutional Grids.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

2009
Foreword.
Parallel Comput., 2009

Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment.
J. Interconnect. Networks, 2009

BitDew: A data management and distribution service with multi-protocol file transfer and metadata abstraction.
J. Netw. Comput. Appl., 2009

The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community.
Int. J. High Perform. Comput. Appl., 2009

Toward Exascale Resilience.
Int. J. High Perform. Comput. Appl., 2009

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities.
Int. J. High Perform. Comput. Appl., 2009

Checkpointing vs. Migration for Post-Petascale Machines.
CoRR, 2009

An Information Brokering Service Provider (IBSP) for Virtual Clusters.
Proceedings of the On the Move to Meaningful Internet Systems: OTM 2009, 2009

Cost-benefit analysis of Cloud Computing versus desktop grids.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

MPI Applications on Grids: A Topology Aware Approach.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

09191 Abstracts Collection - Fault Tolerance in High-Performance Computing and Grids.
Proceedings of the Fault Tolerance in High-Performance Computing and Grids, 03.05., 2009

High accuracy failure injection in parallel and distributed systems using virtualization.
Proceedings of the 6th Conference on Computing Frontiers, 2009

BLAST Application with Data-Aware Desktop Grid Middleware.
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009

2008
Integrating Computing Resources on Multiple Grid-Enabled Job Scheduling Systems Through a Grid RPC System.
J. Grid Comput., 2008

Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols.
Future Gener. Comput. Syst., 2008

BitDew: a programmable environment for large-scale data management and distribution.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Fault Tolerance for PetaScale Systems: Current Knowledge, Challenges and Opportunities.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Distributing and managing data on desktop grids with BitDew.
Proceedings of the 3rd Workshop on the Use of P2P, 2008

Emulation platform for high accuracy failure injection in grids.
Proceedings of the High Speed and Large Scale Scientific Computing - Selected Papers from the High Performance Computing Workshop, Cetraro, Italy, June 30, 2008

OpenWP: Combining annotation language and workflow environments for porting existing applications on grids.
Proceedings of the 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, September 29, 2008

A File Transfer Service with Client/Server, P2P and Wide Area Storage Protocols.
Proceedings of the Data Management in Grid and Peer-to-Peer Systems, 2008

Grid Services for MPI.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

2007
Scalability Comparison of Four Host Virtualization Tools.
J. Grid Comput., 2007

Towards efficient data distribution on computational desktop grids with BitTorrent.
Future Gener. Comput. Syst., 2007

Characterizing resource availability in enterprise desktop grids.
Future Gener. Comput. Syst., 2007

Towards an International Computer Science Grid.
Proceedings of the 16th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE 2007), 2007

Virtual Parallel Machines Through Virtualization: Impact on MPI Executions.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

Grid Services for MPI.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

Characterizing Result Errors in Internet Desktop Grids.
Proceedings of the Euro-Par 2007, 2007

A Distributed and Replicated Service for Checkpoint Storage.
Proceedings of the Making Grids Work: Proceedings of the CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, 2007

Toward an International "Computer Science Grid".
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

2006
MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI.
Int. J. High Perform. Comput. Appl., 2006

Hybrid Preemptive Scheduling of Message Passing Interface Applications on Grids.
Int. J. High Perform. Comput. Appl., 2006

Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed.
Int. J. High Perform. Comput. Appl., 2006

Editorial: Special Issue on Global and Peer-to-Peer Computing.
J. Grid Comput., 2006

Performance comparison of MPI and OpenMP on shared memory multiprocessors.
Concurr. Comput. Pract. Exp., 2006

MPI tools and performance studies - Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Computer Science Grids.
Proceedings of the High Performance Computing and Grids in Action, 2006

Availability Traces of Enterprise Desktop Grids.
Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (GRID 2006), 2006

Private Virtual Cluster: Infrastructure and Protocol for Instant Grids.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

On Resource Volatility in Enterprise Desktop Grids.
Proceedings of the Second International Conference on e-Science and Grid Technologies (e-Science 2006), 2006

FAIL-MPI: How Fault-Tolerant Is Fault-Tolerant MPI?
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Towards Soft Real-Time Applications on Enterprise Desktop Grids.
Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

2005
An algorithmic model for heterogeneous hyper-clusters: rationale and experience.
Int. J. Found. Comput. Sci., 2005

Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid.
Future Gener. Comput. Syst., 2005

Collaborative Data Distribution with BitTorrent for Computational Desktop Grids.
Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC 2005), 2005

Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Scheduling independent tasks sharing large data distributed with BitTorrent.
Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

Grid'5000: a large scale and highly reconfigurable grid experimental testbed.
Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

2004
Coordinated checkpoint versus message log for fault tolerant MPI.
Int. J. High Perform. Comput. Netw., 2004

RPC-V: Toward Fault-Tolerant RPC for Internet Connected Desktop Grids with Volatile Nodes.
Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 2004

Hybrid Preemptive Scheduling of MPI Applications on the Grids.
Proceedings of the 5th International Workshop on Grid Computing (GRID 2004), 2004

Improved message logging versus improved coordinated checkpointing for fault tolerant MPI.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

2003
Augernome & XtremWeb: Monte Carlos computation on a global computing platform.
CoRR, 2003

MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging.
Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 2003

Topic Introduction.
Proceedings of the Euro-Par 2003. Parallel Processing, 2003

XtremWeb & Condor sharing resources between Internet connected Condor pools.
Proceedings of the 3rd IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003

2002
MPI ou MPI+OpenMP sur grappes de multiprocesseurs?
Tech. Sci. Informatiques, 2002

OVM: Out-of-order execution parallel virtual machine.
Future Gener. Comput. Syst., 2002

MPICH-V: toward a scalable fault tolerant MPI for volatile nodes.
Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002

MPICH-CM: A Communication Library Design for a P2P MPI Implementation.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29, 2002

SPMD OpenMP versus MPI on a IBM SMP for 3 Kernels of the NAS Benchmarks.
Proceedings of the High Performance Computing, 4th International Symposium, 2002

2001
Understanding performance of SMP clusters running MPI programs.
Future Gener. Comput. Syst., 2001

Global Computing Systems.
Proceedings of the Large-Scale Scientific Computing, Third International Conference, 2001

HiHCoHP: Toward a Realistic Communication Model for Hierarchical HyperClusters of Heterogeneous Processors.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

XtremWeb: A Generic Global Computing System.
Proceedings of the First IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2001), 2001

2000
MPI versus MPI+OpenMP on IBM SP for the NAS Benchmarks.
Proceedings of Supercomputing 2000, 2000

Investigating the Performance of Two Programming Models for Clusters of SMP PCs.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

XtremWeb: Building an Experimental Platform for Global Computing.
Proceedings of the Grid Computing, 2000

1999
Performance Evaluation of Two Programming Models for a Cluster of PC Biprocessors.
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1999

Performance of the NAS Benchmarks on a Cluster of SMP PCs Using a Parallelization of the MPI Programs with OpenMP.
Proceedings of the Parallel Computing Technologies, 1999

A Client/Broker/Server Substrate with µs Round-Trip Overhead.
Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

Performance Characteristics of a Network of Commodity Multiprocessors for the NAS Benchmarks Using a Hybrid Memory Model.
Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999

1998
On the Self-Similar Nature of Workstations and WWW Servers Workload.
Proceedings of the Euro-Par '98 Parallel Processing, 1998

1997
Communications in Parallel Architectures and Networks of Workstations: From Standardisation to New Standards.
Proceedings of the Parallel Computing Technologies, 1997

1995
The Static Network: A High Performance Reconfigurable Communication Network.
Parallel Process. Lett., 1995

Toward High Communication Performance through Compiled Communications on a Circuit Switched Interconnection Network.
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

1993
Hardware features of the static communication network of a parallel architecture.
Microprocess. Microprogramming, 1993

Static computation of standard linear algebra subroutines for PTAH.
Proceedings of the 1993 Euromicro Workshop on Parallel and Distributed Processing, 1993

A Parallel Architecture Based on Compiled Communication Schemes.
Proceedings of the Parallel Computing: Trends and Applications, 1993

Balanced Distributed Memory Parallel Computers.
Proceedings of the 1993 International Conference on Parallel Processing, 1993

1992
Data layouts impacts on the compilation of the communications for a synchronous MSIMD machine.
Microprocess. Microprogramming, 1992

Design of the processing node of the PTAH 64 parallel computer.
Microprocess. Microprogramming, 1992

PTAH: Introduction to a New Parallel Architecture for Highly Numeric Processing.
Proceedings of the PARLE '92: Parallel Architectures and Languages Europe, 1992

1991
3D hardware packages for parallel architectures.
Microprocessing and Microprogramming, 1991

1990
A RISC central processing unit for a massively parallel architecture.
Microprocessing and Microprogramming, 1990

