Prolego: Time-Series Analysis for Predicting Failures in Complex Systems.
Proceedings of the IEEE International Conference on Autonomic Computing and Self-Organizing Systems, 2023
Performance Variability and Causality in Complex Systems.
Proceedings of the IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion, 2022
Systemic Assessment of Node Failures in HPC Production Platforms.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021
Aarohi: Making Real-Time Node Failure Prediction Feasible.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020
KeyValueServe†: Design and Performance Analysis of a Multi-tenant Data Grid as a Cloud Service.
Concurr. Comput. Pract. Exp., 2018
Doomsday: Predicting Which Node Will Fail When on Supercomputers.
Proceedings of the International Conference for High Performance Computing, 2018
Desh: Deep Learning for System Health Prediction of Lead Times to Failure in HPC.
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018
Performance Analysis of a Multi-tenant In-Memory Data Grid.
Proceedings of the 9th IEEE International Conference on Cloud Computing, 2016
Dynamic Resource Management Using Virtual Machine Migrations.
IEEE Commun. Mag., 2012