Franck Cappello

Proceedings of the HPDC '22: The 31st International Symposium on High-Performance Parallel and Distributed Computing, Minneapolis, MN, USA, 27 June 2022, 2022

Towards Efficient Cache Allocation for High-Frequency Checkpointing.

Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Exploring Light-weight Cryptography for Efficient and Secure Lossy Data Compression.

Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021

FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks.

IEEE Trans. Parallel Distributed Syst., 2021

Demystifying asynchronous I/O Interference in HPC applications.

Shu-Mei Tseng

Aparna Chandramowlishwaran

Int. J. High Perform. Comput. Appl., 2021

Online data analysis and reduction: An important Co-design motif for extreme-scale computers.

Int. J. High Perform. Comput. Appl., 2021

Towards Aggregated Asynchronous Checkpointing.

CoRR, 2021

cuSZ(x): Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs.

CoRR, 2021

VELOC: VEry Low Overhead Checkpointing in the Age of Exascale.

CoRR, 2021

Productive and Performant Generic Lossy Data Compression with LibPressio.

Proceedings of the 2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data, 2021

Understanding Effectiveness of Multi-Error-Bounded Lossy Compression for Preserving Ranges of Interest in Scientific Analysis.

Proceedings of the 2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data, 2021

Resilient error-bounded lossy compressor for data transfer.

Proceedings of the International Conference for High Performance Computing, 2021

Exploring Lossy Compressibility through Statistical Correlations of Scientific Datasets.

Proceedings of the 2021 7th International Workshop on Data Analysis and Reduction for Big Scientific Data, 2021

Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing.

Proceedings of the 29th International Symposium on Modeling, 2021

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures.

Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Optimizing Error-Bounded Lossy Compression for Scientific Data by Dynamic Spline Interpolation.

Kai Zhao

Thierry-Laurent D. Tonellot

Maxim Dmitriev

Zizhong Chen

Proceedings of the 37th IEEE International Conference on Data Engineering, 2021

Towards Combining Error-bounded Lossy Compression and Cryptography for Scientific Data.

Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

Optimizing Multi-Range based Error-Bounded Lossy Compression for Scientific Datasets.

Proceedings of the 28th IEEE International Conference on High Performance Computing, 2021

Towards High Performance Resilience Using Performance Portable Abstractions.

Proceedings of the Euro-Par 2021: Parallel Processing, 2021

Sentiment Analysis based Error Detection for Large-Scale Systems.

Khalid Ayedh Alharthi

Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2021

cuZ-Checker: A GPU-Based Ultra-Fast Assessment System for Lossy Compressions.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Accelerating DNN Architecture Search at Scale Using Selective Weight Transfer.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Exploring Autoencoder-based Error-bounded Compression for Scientific Data.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

Improving Lossy Compression for SZ by Exploring the Best-Fit Lossless Compression Techniques.

Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021

2020

Performance Optimization for Relative-Error-Bounded Lossy Compression on Scientific Data.

IEEE Trans. Parallel Distributed Syst., 2020

SDC Resilient Error-bounded Lossy Compressor.

CoRR, 2020

Algorithm-Based Fault Tolerance for Convolutional Neural Networks.

CoRR, 2020

Fulfilling the Promises of Lossy Compression for Scientific Applications.

Ali Murat Gok

Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data.

Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Significantly Improving Lossy Compression for HPC Datasets with Second-Order Prediction and Parameter Optimization.

Proceedings of the HPDC '20: The 29th International Symposium on High-Performance Parallel and Distributed Computing, 2020

DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models.

Proceedings of the 20th IEEE/ACM International Symposium on Cluster, 2020

SDRBench: Scientific Data Reduction Benchmark for Lossy Compressors.

Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), 2020

Toward Feature-Preserving 2D and 3D Vector Field Compression.

Proceedings of the 2020 IEEE Pacific Visualization Symposium, 2020

cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data.

Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

2019

Optimizing Lossy Compression Rate-Distortion from Automatic Online Selection between SZ and ZFP.

IEEE Trans. Parallel Distributed Syst., 2019

Efficient Lossy Compression for Scientific Data Based on Pointwise Relative Error Bound.

IEEE Trans. Parallel Distributed Syst., 2019

Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System.

IEEE Trans. Parallel Distributed Syst., 2019

Z-checker: A framework for assessing lossy compression of scientific data.

Int. J. High Perform. Comput. Appl., 2019

Use cases of lossy compression for floating-point data in scientific data sets.

Int. J. High Perform. Comput. Appl., 2019

Exploring the feasibility of lossy compression for PDE simulations.

Int. J. High Perform. Comput. Appl., 2019

Full-state quantum circuit simulation by using data compression.

Xin-Chuan Wu

Aparna Chandramowlishwaran

Emma Maitreyee Dasgupta

Proceedings of the International Conference for High Performance Computing, 2019

Significantly improving lossy compression quality based on an optimized hybrid prediction model.

Proceedings of the International Conference for High Performance Computing, 2019

FT-iSort: efficient fault tolerance for introsort.

Proceedings of the International Conference for High Performance Computing, 2019

Accelerating Relative-error Bounded Lossy Compression for HPC datasets with Precomputation-Based Mechanisms.

Proceedings of the 35th Symposium on Mass Storage Systems and Technologies, 2019

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale.

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

DeepSZ: A Novel Framework to Compress Deep Neural Networks by Using Error-Bounded Lossy Compression.

Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

Towards Portable Online Prediction of Network Utilization Using MPI-Level Monitoring.

Proceedings of the Euro-Par 2019: Parallel Processing, 2019

Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System.

Proceedings of the 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2019

Improving Performance of Data Dumping with Lossy Compression for Scientific Simulation.

Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

2018

Optimization of Error-Bounded Lossy Compression for Hard-to-Compress HPC Data.

IEEE Trans. Parallel Distributed Syst., 2018

Exploring the capabilities of support vector machines in detecting silent data corruptions.

Omer Subasi

Sriram Krishnamoorthy

Sustain. Comput. Informatics Syst., 2018

Coping with silent and fail-stop errors at scale by combining replication and checkpointing.

J. Parallel Distributed Comput., 2018

Unified fault-tolerance framework for hybrid task-parallel message-passing applications.

Omer Subasi

Int. J. High Perform. Comput. Appl., 2018

Big data and extreme-scale computing.

Int. J. High Perform. Comput. Appl., 2018

Transferring a petabyte in a day.

Future Gener. Comput. Syst., 2018

Memory-Efficient Quantum Circuit Simulation by Using Lossy Data Compression.

CoRR, 2018

Amplitude-Aware Lossy Compression for Quantum Circuit Simulation.

CoRR, 2018

Parallel Partial Reduction for Large-Scale Data Analysis and Visualization.

Proceedings of the 8th IEEE Symposium on Large Data Analysis and Visualization, 2018

CEBDA 2018 Keynote.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Improving performance of iterative methods by lossy checkponting.

Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers.

Proceedings of the 18th Eurographics Symposium on Parallel Graphics and Visualization, 2018

Coupling Exascale Multiphysics Applications: Methods and Lessons Learned.

Proceedings of the 14th IEEE International Conference on e-Science, 2018

Neural Network Based Silent Error Detector.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

Fixed-PSNR Lossy Compression for Scientific Data.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

An Efficient Transformation Scheme for Lossy Data Compression with Point-Wise Relative Error Bound.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

PaSTRI: Error-Bounded Lossy Compression for Two-Electron Integrals in Quantum Chemistry.

Proceedings of the IEEE International Conference on Cluster Computing, 2018

Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets.

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

Optimizing Lossy Compression with Adjacent Snapshots for N-body Simulation Data.

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

2017

Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model.

IEEE Trans. Parallel Distributed Syst., 2017

Toward General Software Level Silent Data Corruption Detection for Parallel Applications.

Zhiling Lan

IEEE Trans. Parallel Distributed Syst., 2017

Exploration of Pattern-Matching Techniques for Lossy Compression on Cosmology Simulation Data Sets.

Proceedings of the High Performance Computing, 2017

Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.

Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 2017

Evaluating irregular memory access on OpenCL FPGA platforms: A case study with XSBench.

Proceedings of the 27th International Conference on Field Programmable Logic and Applications, 2017

Evaluation of a Floating-Point Intensive Kernel on FPGA - A Case Study of Geodesic Distance Kernel.

Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales.

Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

Understanding and Improving the Trust in Results of Numerical Simulations and Scientific Data Analytics.

Rinku Gupta

Emil M. Constantinescu

Thomas Peterka

Stefan M. Wild

Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017

MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection.

Sriram Krishnamoorthy

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

Detection of Silent Data Corruption in Adaptive Numerical Integration Solvers.

Pierre-Louis Guhur

Emil M. Constantinescu

Debojyoti Ghosh

Tom Peterka

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

LogAider: A tool for mining potential correlations of HPC log events.

Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017

In-depth exploration of single-snapshot lossy compression techniques for N-body simulations.

Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

2016

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications.

IEEE Trans. Parallel Distributed Syst., 2016

Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations.

ACM Trans. Parallel Comput., 2016

Self-Adaptive Density Estimation of Particle Data.

SIAM J. Sci. Comput., 2016

Preface: Visualization and data analytics for scientific discovery.

Hank Childs

Parallel Comput., 2016

Fast Error-Bounded Lossy HPC Data Compression with SZ.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Reducing Waste in Extreme Scale Systems through Introspective Analysis.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Lightweight and Accurate Silent Data Corruption Detection in Ordinary Differential Equation Solvers.

Pierre-Louis Guhur

Hong Zhang

Tom Peterka

Emil M. Constantinescu

Proceedings of the Euro-Par 2016: Parallel Processing, 2016

Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications.

Zhiling Lan

Proceedings of the Euro-Par 2016: Parallel Processing, 2016

DSN 2016 Tutorial: Resilience for Scientific Computing: From Theory to Practice.

George Bosilca

Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2016

Spatial Support Vector Regression to Detect Silent Errors in the Exascale Era.

Omer Subasi

Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

2015

GloudSim: Google trace based cloud simulator with virtual machines.

Softw. Pract. Exp., 2015

Detecting Silent Data Corruption for Extreme-Scale MPI Applications.

Proceedings of the 22nd European MPI Users' Group Meeting, 2015

Scheduling the I/O of HPC Applications Under Congestion.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications.

Zhiling Lan

Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Exploiting Spatial Smoothness in HPC Applications to Detect Silent Data Corruption.

Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Addressing the Last Roadblock for Message Logging in HPC: Alleviating the Memory Requirement Using Dedicated Resources.

Thomas Ropars

Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

Distributed Monitoring and Management of Exascale Systems in the Argo Project.

Proceedings of the Distributed Applications and Interoperable Systems, 2015

Fault-Tolerant Protocol for Hybrid Task-Parallel Message-Passing Applications.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

An Efficient Silent Data Corruption Detection Method with Error-Feedback Control and Even Sampling for HPC Applications.

Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014

Characterizing and modeling cloud applications/jobs on a Google data center.

Derrick Kondo

J. Supercomput., 2014

Adaptive Algorithm for Minimizing Cloud Task Length with Prediction Errors.

Cho-Li Wang

IEEE Trans. Cloud Comput., 2014

Toward Exascale Resilience: 2014 update.

Supercomput. Front. Innov., 2014

Addressing failures in exascale computing.

Int. J. High Perform. Comput. Appl., 2014

Unified model for assessing checkpointing protocols at extreme-scale.

Concurr. Comput. Pract. Exp., 2014

Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales.

Proceedings of the International Conference for High Performance Computing, 2014

Analysis of the Tradeoffs Between Energy and Run Time for Multilevel Checkpointing.

Prasanna Balaprakash

Stefan M. Wild

Paul D. Hovland

Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

Detecting silent data corruption through data dynamic monitoring for scientific applications.

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

GPGPUs: How to combine high computational power with high reliability.

Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

Energy-performance tradeoffs in multilevel checkpoint strategies.

Prasanna Balaprakash

Stefan M. Wild

Paul D. Hovland

Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

2013

BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds.

J. Parallel Distributed Comput., 2013

Failure prediction for HPC systems and applications: Current situation and open issues.

Int. J. High Perform. Comput. Appl., 2013

Scalable data management for map-reduce-based data-intensive applications: a view for cloud and hybrid infrastructures.

Int. J. Cloud Comput., 2013

SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing.

Thomas Ropars

Amina Guermouche

André Schiper

Proceedings of the International Conference for High Performance Computing, 2013

Optimization of cloud task processing with checkpoint-restart mechanism.

Proceedings of the International Conference for High Performance Computing, 2013

Towards an energy estimator for fault tolerance protocols.

Mohammed el Mehdi Diouri

Olivier Glück

Laurent Lefèvre

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing.

Ana Gainaru

Satoshi Matsuoka

Naoya Maruyama

Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

HPCS 2013 panel: The era of exascale sciences: Challenges, needs and requirements.

Proceedings of the International Conference on High Performance Computing & Simulation, 2013

Characterizing Cloud Applications on a Google Data Center.

Derrick Kondo

Proceedings of the 42nd International Conference on Parallel Processing, 2013

AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing.

Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization.

Proceedings of the Euro-Par 2013 Parallel Processing, 2013

ECOFIT: A Framework to Estimate Energy Consumption of Fault Tolerance Protocols for HPC Applications.

Mohammed el Mehdi Diouri

Olivier Glück

Laurent Lefèvre

Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

Improving floating point compression through binary masks.

Proceedings of the 2013 IEEE International Conference on Big Data (IEEE BigData 2013), 2013

2012

HydEE, vers un protocole de recouvrement arrière hiérarchique pour les machines exascales. De l'exploitation du déterminisme des émissions dans les protocoles de recouvrement arrière.

Amina Guermouche

Thomas Ropars

Tech. Sci. Informatiques, 2012

Fault prediction under the microscope: a closer look into HPC systems.

Proceedings of the SC Conference on High Performance Computing Networking, 2012

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems.

Ana Gainaru

William Kramer

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

A hybrid local storage transfer scheme for live migration of I/O intensive workloads.

Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, 2012

Scalable Reed-Solomon-Based Reliable Local Storage for HPC Applications on IaaS Clouds.

Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

Energy considerations in checkpointing and fault tolerance protocols.

Mohammed el Mehdi Diouri

Olivier Glück

Laurent Lefèvre

Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2012

Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O.

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems.

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011

Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers.

Henri Casanova

Yves Robert

Parallel Process. Lett., 2011

The International Exascale Software Project roadmap.

Bertrand Braunschweig

Int. J. High Perform. Comput. Appl., 2011

QCG-OMPI: MPI applications on grids.

Future Gener. Comput. Syst., 2011

BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots.

Proceedings of the Conference on High Performance Computing Networking, 2011

Modeling and tolerating heterogeneous failures in large parallel systems.

Proceedings of the Conference on High Performance Computing Networking, 2011

FTI: high performance fault tolerance interface for hybrid systems.

Proceedings of the Conference on High Performance Computing Networking, 2011

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

DPDNS Keynote.

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Comparing archival policies for Blue Waters.

Proceedings of the 18th International Conference on High Performance Computing, 2011

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications.

Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Optimizing Multi-deployment on Clouds by Means of Self-adaptive Prefetching.

Gabriel Antoniu

Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Event Log Mining Tool for Large Scale HPC Systems.

Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

2010

Special section: Peer-to-peer grid technologies.

Ching-Hsien Hsu

Hai Jin

Future Gener. Comput. Syst., 2010

Checkpointing vs. Migration for Post-Petascale Supercomputers.

Henri Casanova

Yves Robert

Proceedings of the 39th International Conference on Parallel Processing, 2010

On Communication Determinism in Parallel HPC Applications.

Amina Guermouche

Marc Snir

Proceedings of the 19th International Conference on Computer Communications and Networks, 2010

Low-overhead diskless checkpoint for hybrid computing systems.

Proceedings of the 2010 International Conference on High Performance Computing, 2010

Distributed Diskless Checkpoint for Large Scale Systems.

Naoya Maruyama

Satoshi Matsuoka

Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

Planning Large Data Transfers in Institutional Grids.

Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

2009

Foreword.

Thomas Hérault

Jack J. Dongarra

Parallel Comput., 2009

Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment.

J. Interconnect. Networks, 2009

BitDew: A data management and distribution service with multi-protocol file transfer and metadata abstraction.

J. Netw. Comput. Appl., 2009

The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community.

Int. J. High Perform. Comput. Appl., 2009

Toward Exascale Resilience.

Int. J. High Perform. Comput. Appl., 2009

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities.

Int. J. High Perform. Comput. Appl., 2009

Checkpointing vs. Migration for Post-Petascale Machines.

Henri Casanova

Yves Robert

CoRR, 2009

An Information Brokering Service Provider (IBSP) for Virtual Clusters.

Proceedings of the On the Move to Meaningful Internet Systems: OTM 2009, 2009

Cost-benefit analysis of Cloud Computing versus desktop grids.

Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

MPI Applications on Grids: A Topology Aware Approach.

Camille Coti

Thomas Hérault

Proceedings of the Euro-Par 2009 Parallel Processing, 2009

09191 Abstracts Collection - Fault Tolerance in High-Performance Computing and Grids.

Proceedings of the Fault Tolerance in High-Performance Computing and Grids, 03.05., 2009

High accuracy failure injection in parallel and distributed systems using virtualization.

Proceedings of the 6th Conference on Computing Frontiers, 2009

BLAST Application with Data-Aware Desktop Grid Middleware.

Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009

2008

Integrating Computing Resources on Multiple Grid-Enabled Job Scheduling Systems Through a Grid RPC System.

J. Grid Comput., 2008

Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols.

Future Gener. Comput. Syst., 2008

BitDew: a programmable environment for large-scale data management and distribution.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Fault Tolerance for PetaScale Systems: Current Knowledge, Challenges and Opportunities.

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Distributing and managing data on desktop grids with BitDew.

Proceedings of the 3rd Workshop on the Use of P2P, 2008

Emulation platform for high accuracy failure injection in grids.

Proceedings of the High Speed and Large Scale Scientific Computing - Selected Papers from the High Performance Computing Workshop, Cetraro, Italy, June 30, 2008

OpenWP: Combining annotation language and workflow environments for porting existing applications on grids.

Matthieu Cargnelli

Guillaume Alléon

Proceedings of the 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, September 29, 2008

A File Transfer Service with Client/Server, P2P and Wide Area Storage Protocols.

Proceedings of the Data Management in Grid and Peer-to-Peer Systems, 2008

Grid Services for MPI.

[BibT_eX]

[DOI]

Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

2007

Scalability Comparison of Four Host Virtualization Tools.

[BibT_eX]

[DOI]

Benjamin Quétier

Vincent Néri

J. Grid Comput., 2007

Towards efficient data distribution on computational desktop grids with BitTorrent.

[BibT_eX]

[DOI]

Baohua Wei

Future Gener. Comput. Syst., 2007

Characterizing resource availability in enterprise desktop grids.

[BibT_eX]

[DOI]

Future Gener. Comput. Syst., 2007

Towards an International Computer Science Grid.

[BibT_eX]

[DOI]

Proceedings of the 16th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WETICE 2007), 2007

Virtual Parallel Machines Through Virtualization: Impact on MPI Executions.

[BibT_eX]

[DOI]

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

Grid Services for MPI.

[BibT_eX]

[DOI]

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

Characterizing Result Errors in Internet Desktop Grids.

[BibT_eX]

[DOI]

Proceedings of the Euro-Par 2007, 2007

A Distributed and Replicated Service for Checkpoint Storage.

[BibT_eX]

[DOI]

Proceedings of the Making Grids Work: Proceedings of the CoreGRID Workshop on Programming Models Grid and P2P System Architecture Grid Systems, 2007

Toward an International "Computer Science Grid".

[BibT_eX]

[DOI]

Henri E. Bal

Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

2006

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI.

[BibT_eX]

[DOI]

Int. J. High Perform. Comput. Appl., 2006

Hybrid Preemptive Scheduling of Message Passing Interface Applications on Grids.

[BibT_eX]

[DOI]

Int. J. High Perform. Comput. Appl., 2006

Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed.

[BibT_eX]

[DOI]

Int. J. High Perform. Comput. Appl., 2006

Editorial: Special Issue on Global and Peer-to-Peer Computing.

[BibT_eX]

[DOI]

Adriana Iamnitchi

Mitsuhisa Sato

J. Grid Comput., 2006

Performance comparison of MPI and OpenMP on shared memory multiprocessors.

[BibT_eX]

[DOI]

Géraud Krawezik

Concurr. Comput. Pract. Exp., 2006

MPI tools and performance studies - Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI.

[BibT_eX]

[DOI]

Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Computer Science Grids.

[BibT_eX]

Henri E. Bal

Proceedings of the High Performance Computing and Grids in Action, 2006

Availability Traces of Enterprise Desktop Grids.

[BibT_eX]

[DOI]

Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (GRID 2006), 2006

Private Virtual Cluster: Infrastructure and Protocol for Instant Grids.

[BibT_eX]

[DOI]

Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

On Resource Volatility in Enterprise Desktop Grids.

[BibT_eX]

[DOI]

Proceedings of the Second International Conference on e-Science and Grid Technologies (e-Science 2006), 2006

FAIL-MPI: How Fault-Tolerant Is Fault-Tolerant MPI?

[BibT_eX]

[DOI]

Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Towards Soft Real-Time Applications on Enterprise Desktop Grids.

[BibT_eX]

[DOI]

Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2006), 2006

2005

An algorithmic model for heterogeneous hyper-clusters: rationale and experience.

[BibT_eX]

[DOI]

Int. J. Found. Comput. Sci., 2005

Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid.

[BibT_eX]

[DOI]

Future Gener. Comput. Syst., 2005

Collaborative Data Distribution with BitTorrent for Computational Desktop Grids.

[BibT_eX]

[DOI]

Baohua Wei

Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC 2005), 2005

Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI.

[BibT_eX]

[DOI]

Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Scheduling independent tasks sharing large data distributed with BitTorrent.

[BibT_eX]

[DOI]

Baohua Wei

Pascale Vicat-Blanc Primet

Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

Grid'5000: a large scale and highly reconfigurable grid experimental testbed.

[BibT_eX]

[DOI]

Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

2004

Coordinated checkpoint versus message log for fault tolerant MPI.

[BibT_eX]

[DOI]

Int. J. High Perform. Comput. Netw., 2004

RPC-V: Toward Fault-Tolerant RPC for Internet Connected Desktop Grids with Volatile Nodes.

[BibT_eX]

[DOI]

Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 2004

Hybrid Preemptive Scheduling of MPI Applications on the Grids.

[BibT_eX]

[DOI]

Proceedings of the 5th International Workshop on Grid Computing (GRID 2004), 2004

Improved message logging versus improved coordinated checkpointing for fault tolerant MPI.

[BibT_eX]

[DOI]

Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

2003

Augernome & XtremWeb: Monte Carlos computation on a global computing platform.

[BibT_eX]

[DOI]

CoRR, 2003

MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging.

[BibT_eX]

[DOI]

Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 2003

Topic Introduction.

[BibT_eX]

[DOI]

Proceedings of the Euro-Par 2003. Parallel Processing, 2003

XtremWeb & Condor sharing resources between Internet connected Condor pools.

[BibT_eX]

[DOI]

Proceedings of the 3rd IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2003), 2003

2002

MPI ou MPI+OpenMP sur grappes de multiprocesseurs?

[BibT_eX]

[DOI]

Tech. Sci. Informatiques, 2002

OVM: Out-of-order execution parallel virtual machine.

[BibT_eX]

[DOI]

George Bosilca

Future Gener. Comput. Syst., 2002

MPICH-V: toward a scalable fault tolerant MPI for volatile nodes.

[BibT_eX]

[DOI]

Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002

MPICH-CM: A Communication Library Design for a P2P MPI Implementation.

[BibT_eX]

[DOI]

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29, 2002

SPMD OpenMP versus MPI on a IBM SMP for 3 Kernels of the NAS Benchmarks.

[BibT_eX]

[DOI]

Géraud Krawezik

Guillaume Alléon

Proceedings of the High Performance Computing, 4th International Symposium, 2002

2001

Understanding performance of SMP clusters running MPI programs.

[BibT_eX]

[DOI]

Future Gener. Comput. Syst., 2001

Global Computing Systems.

[BibT_eX]

[DOI]

Proceedings of the Large-Scale Scientific Computing, Third International Conference, 2001

HiHCoHP: Toward a Realistic Communication Model for Hierarchical HyperClusters of Heterogeneous Processors.

[BibT_eX]

[DOI]

Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

XtremWeb: A Generic Global Computing System.

[BibT_eX]

[DOI]

Proceedings of the First IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2001), 2001

2000

MPI versus MPI+OpenMP on IBM SP for the NAS Benchmarks.

[BibT_eX]

[DOI]

Proceedings of Supercomputing 2000, 2000

Investigating the Performance of Two Programming Models for Clusters of SMP PCs.

[BibT_eX]

[DOI]

Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

XtremWeb: Building an Experimental Platform for Global Computing.

[BibT_eX]

[DOI]

Proceedings of the Grid Computing, 2000

1999

Performance Evaluation of Two Programming Models for a Cluster of PC Biprocessors.

[BibT_eX]

Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 1999

Performance of the NAS Benchmarks on a Cluster of SMP PCs Using a Parallelization of the MPI Programs with OpenMP.

[BibT_eX]

[DOI]

Proceedings of the Parallel Computing Technologies, 1999

A Client/Broker/Server Substrate with µs Round-Trip Overhead.

[BibT_eX]

[DOI]

Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

Performance Characteristics of a Network of Commodity Multiprocessors for the NAS Benchmarks Using a Hybrid Memory Model.

[BibT_eX]

[DOI]

Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999

1998

On the Self-Similar Nature of Workstations and WWW Servers Workload.

[BibT_eX]

[DOI]

Proceedings of the Euro-Par '98 Parallel Processing, 1998

1997

Communications in Parallel Architectures and Networks of Workstations: From Standardisation to New Standards.

[BibT_eX]

[DOI]

Proceedings of the Parallel Computing Technologies, 1997

1995

The Static Network: A High Performance Reconfigurable Communication Network.

[BibT_eX]

[DOI]

Cécile Germain

Parallel Process. Lett., 1995

Toward High Communication Performance through Compiled Communications on a Circuit Switched Interconnection Network.

[BibT_eX]

[DOI]

Cécile Germain

Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

1993

Hardware features of the static communication network of a parallel architecture.

[BibT_eX]

[DOI]

Microprocess. Microprogramming, 1993

Static computation of standard linear algebra subroutines for PTAH.

[BibT_eX]

[DOI]

E. Daugeras

Proceedings of the 1993 Euromicro Workshop on Parallel and Distributed Processing, 1993

A Parallel Architecture Based on Compiled Communication Schemes.

[BibT_eX]

Franck Delaplace

Damien Gautier de Lahaut

Proceedings of the Parallel Computing: Trends and Applications, 1993

Balanced Distributed Memory Parallel Computers.

[BibT_eX]

[DOI]

Proceedings of the 1993 International Conference on Parallel Processing, 1993

1992

Data layouts impacts on the compilation of the communications for a synchronous MSIMD machine.

[BibT_eX]

[DOI]

Franck Delaplace

Microprocess. Microprogramming, 1992

Design of the processing node of the PTAH 64 parallel computer.

[BibT_eX]

[DOI]

J.-L. Giavitto

Microprocess. Microprogramming, 1992

PTAH: Introduction to a New Parallel Architecture for Highly Numeric Processing.

[BibT_eX]

[DOI]

Jean-Louis Giavitto

Proceedings of the PARLE '92: Parallel Architectures and Languages Europe, 1992

1991

3D hardware packages for parallel architectures.

[BibT_eX]

[DOI]

Microprocessing and Microprogramming, 1991

1990

A RISC central processing unit for a massively parallel architecture.

[BibT_eX]

[DOI]