Devesh Tiwari

Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022

QUILT: Effective Multi-Class Classification on Quantum Computers Using an Ensemble of Diverse Quantum Classifiers.

Daniel Silver

Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021

Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system.

J. Parallel Distributed Comput., 2021

Robust and Resource-Efficient Quantum Circuit Approximation.

CoRR, 2021

RIBBON: cost-effective and qos-aware deep learning model inference using a diverse pool of cloud computing instances.

Proceedings of the International Conference for High Performance Computing, 2021

Systematically inferring I/O performance variability by examining repetitive job behavior.

Proceedings of the International Conference for High Performance Computing, 2021

Bliss: auto-tuning complex applications using a pool of diverse lightweight learning models.

Proceedings of the PLDI '21: 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021

SATORI: Efficient and Fair Resource Partitioning by Sacrificing Short-Term Benefits for Long-Term Gains*.

Rohan Basu Roy

Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Characterizing and Mitigating the I/O Scalability Challenges for Serverless Applications.

Rohan Basu Roy

Proceedings of the IEEE International Symposium on Workload Characterization, 2021

The MIT Supercloud Dataset.

Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

Serving Machine Learning Inference Using Heterogeneous Hardware.

Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

Operating Liquid-Cooled Large-Scale Systems: Long-Term Monitoring, Reliability Analysis, and Efficiency Measures.

Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes.

Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2021

Qraft: reverse your Quantum circuit and know the correct program output.

Proceedings of the ASPLOS '21: 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

2020

Resilience and coevolution of preferential interdependent networks.

Soc. Netw. Anal. Min., 2020

Comparing Performances of Five Distinct Automatic Classifiers for Fin Whale Vocalizations in Beamformed Spectrograms of Coherent Hydrophone Array.

Remote. Sens., 2020

UREQA: Leveraging Operation-Aware Error Rates for Effective Quantum Circuit Mapping on NISQ-Era Quantum Computers.

Proceedings of the 2020 USENIX Annual Technical Conference, 2020

Exploring the Potential of using Power as a First Class Parameter for Resource Allocation in Apache Mesos Managed Clouds.

Pradyumna Kaushik

Srinidhi Raghavendra

Proceedings of the 13th IEEE/ACM International Conference on Utility and Cloud Computing, 2020

Veritas: accurately estimating the correct output on noisy intermediate-scale quantum computers.

Proceedings of the International Conference for High Performance Computing, 2020

Experimental evaluation of NISQ quantum computers: error measurement, characterization, and implications.

Proceedings of the International Conference for High Performance Computing, 2020

Job characteristics on large-scale systems: long-term analysis, quantification, and implications.

Proceedings of the International Conference for High Performance Computing, 2020

What does Power Consumption Behavior of HPC Jobs Reveal? : Demystifying, Quantifying, and Predicting Power Consumption Characteristics.

Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Message from the Program Chairs: IISWC 2020.

David R. Kaeli

Proceedings of the IEEE International Symposium on Workload Characterization, 2020

DisQ: A Novel Quantum Output State Classification Method on IBM Quantum Computers using OpenPulse.

Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2020

CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers.

Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

Uncovering Access, Reuse, and Sharing Characteristics of I/O-Intensive Files on Large-Scale Production HPC Systems.

Proceedings of the 18th USENIX Conference on File and Storage Technologies, 2020

GIFT: A Coupon Based Throttle-and-Reward Mechanism for Fair and Efficient I/O Bandwidth Management on Parallel Storage Systems.

Rohan Garg

Proceedings of the 18th USENIX Conference on File and Storage Technologies, 2020

Making Disk Failure Predictions SMARTer!

Proceedings of the 18th USENIX Conference on File and Storage Technologies, 2020

2019

An Analysis Workflow-Aware Storage System for Multi-Core Active Flash Arrays.

Hyogi Sim

Geoffroy Vallée

Youngjae Kim

Ali Raza Butt

IEEE Trans. Parallel Distributed Syst., 2019

Two stage cluster for resource optimization with Apache Mesos.

Gourav Rattihalli

Pankaj Saha

CoRR, 2019

Revisiting I/O behavior in large-scale storage systems: the expected and the unexpected.

Proceedings of the International Conference for High Performance Computing, 2019

Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems.

Proceedings of the 2019 IEEE International Conference on Autonomic Computing, 2019

PERQ: Fair and Efficient Power Management of Power-Constrained Large-Scale Computing Systems.

Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

PCFI: Program Counter Guided Fault Injection for Accelerating GPU Reliability Assessment.

Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

What does Vibration do to Your SSD?

Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Towards Enabling Dynamic Resource Estimation and Correction for Improving Utilization in an Apache Mesos Cloud Environment.

Gourav Rattihalli

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

Exploring Potential for Non-Disruptive Vertical Auto Scaling and Resource Estimation in Kubernetes.

Gourav Rattihalli

Hui Lu

Proceedings of the 12th IEEE International Conference on Cloud Computing, 2019

2018

Exploring the Optimal Platform Configuration for Power-Constrained HPC Workflows.

Kun Tang

Xubin He

Saurabh Gupta

Proceedings of the 27th International Conference on Computer Communication and Networks, 2018

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System.

Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System.

Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput.

Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2018

Reliability Characterization of Solid State Drives in a Scalable Production Datacenter.

Bradley W. Settlemyer

David Richard Montoya

Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2018), 2018

Resilience and the Coevolution of Interdependent Multiplex Networks.

Proceedings of the IEEE/ACM 2018 International Conference on Advances in Social Networks Analysis and Mining, 2018

2017

Obtaining and Managing Answer Quality for Online Data-Intensive Services.

ACM Trans. Model. Perform. Evaluation Comput. Syst., 2017

Compiler-Directed Soft Error Detection and Recovery to Avoid DUE and SDC via Tail-DMR.

ACM Trans. Embed. Comput. Syst., 2017

GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility.

Proceedings of the International Conference for High Performance Computing, 2017

Failures in large scale systems: long-term measurement, analysis, and implications.

Proceedings of the International Conference for High Performance Computing, 2017

Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience.

Proceedings of the IEEE 60th International Midwest Symposium on Circuits and Systems, 2017

Toward Managing HPC Burst Buffers Effectively: Draining Strategy to Regulate Bursty I/O Behavior.

Proceedings of the 25th IEEE International Symposium on Modeling, 2017

Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities.

Proceedings of the 25th IEEE International Symposium on Modeling, 2017

Effective Running of End-to-End HPC Workflows on Emerging Heterogeneous Architectures.

Kun Tang

Saurabh Gupta

Leonardo Arturo Bautista-Gomez

Xubin He

Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016

Application configuration selection for energy-efficient execution on multicore systems.

J. Parallel Distributed Comput., 2016

Compiler-directed lightweight checkpointing for fine-grained guaranteed soft error recovery.

Proceedings of the International Conference for High Performance Computing, 2016

Granularity and the cost of error recovery in resilient AMR scientific applications.

Proceedings of the International Conference for High Performance Computing, 2016

Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection.

Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Reducing Waste in Extreme Scale Systems through Introspective Analysis.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Adaptive Power Profiling for Many-Core HPC Architectures.

Proceedings of the 2016 IEEE International Conference on Autonomic Computing, 2016

A large-scale study of soft-errors on GPUs in the field.

Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy.

Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2016

2015

A practical approach to reconciling availability, performance, and capacity in provisioning extreme-scale storage systems.

Qing Cao

Proceedings of the International Conference for High Performance Computing, 2015

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility.

Proceedings of the International Conference for High Performance Computing, 2015

AnalyzeThis: an analysis workflow-aware storage system.

Hyogi Sim

Youngjae Kim

Proceedings of the International Conference for High Performance Computing, 2015

Clover: Compiler Directed Lightweight Soft Error Resilience.

Proceedings of the 16th ACM SIGPLAN/SIGBED Conference on Languages, 2015

Measuring and Managing Answer Quality for Online Data-Intensive Services.

Proceedings of the 2015 IEEE International Conference on Autonomic Computing, 2015

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation.

Philippe Olivier Alexandre Navaux

Daniel Oliveira

Dave Londo

Nathan DeBardeleben

Luigi Carro

Arthur S. Bland

Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems.

Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2015

Low Power Job Scheduler for Supercomputers: A Rule-Based Power-Aware Scheduler.

Ruijun Wang

Jun Wang

Proceedings of the IEEE International Conference on Data Science and Data Intensive Systems, 2015

2014

Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems.

Proceedings of the International Conference for High Performance Computing, 2014

MapReuse: Reusing Computation in an In-Memory MapReduce System.

Yan Solihin

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Improving large-scale storage system performance via topology-aware and balanced data placement.

Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

Lazy Checkpointing: Exploiting Temporal Locality in Failures to Mitigate Checkpointing Overheads on Extreme-Scale Systems.

Saurabh Gupta

Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2014

2013

Active flash: towards energy-efficient, in-situ data analytics on extreme-scale machines.

Simona Boboila

Proceedings of the 11th USENIX conference on File and Storage Technologies, 2013

2012

Reducing Data Movement Costs Using Energy-Efficient, Active Computation on SSD.

Proceedings of the 2012 Workshop on Power-Aware Computing Systems, HotPower'12, 2012

Architectural characterization and similarity analysis of sunspider and Google's V8 Javascript benchmarks.

Yan Solihin

Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

Modeling and Analyzing Key Performance Factors of Shared Memory MapReduce.
