Jim M. Brandt

Orcid: 0000-0002-8605-5795

Affiliations:
  • Sandia National Laboratories


According to our database, Jim M. Brandt authored at least 54 papers between 2005 and 2024.

Bibliography

2024
Runtime Performance Anomaly Diagnosis in Production HPC Systems Using Active Learning.
IEEE Trans. Parallel Distributed Syst., April, 2024

Toward Sustainable HPC: In-Production Deployment of Incentive-Based Power Efficiency Mechanism on the Fugaku Supercomputer.
Proceedings of the International Conference for High Performance Computing, 2024

Job Scheduling for HPC Clusters: Constraint Programming vs. Backfilling Approaches.
Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems, 2024

Evolving Large Scale HPC Monitoring & Analysis to Track Modern Dynamic Environments.
Proceedings of the IEEE International Conference on Cluster Computing, 2024

2023
Driving HPC Operations With Holistic Monitoring and Operational Data Analytics (Dagstuhl Seminar 23171).
Dagstuhl Reports, 2023

Prodigy: Towards Unsupervised Anomaly Detection in Production HPC Systems.
Proceedings of the International Conference for High Performance Computing, 2023

Evaluating HPC Job Run Time Predictions Using Application Input Parameters.
Proceedings of the 17th ACM International Conference on Distributed and Event-based Systems, 2023


2022
Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling.
Proceedings of the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2022

ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems.
Proceedings of the IEEE International Conference on Cluster Computing, 2022

2021
Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems.
Proceedings of the High Performance Computing - 36th International Conference, 2021

Systematically inferring I/O performance variability by examining repetitive job behavior.
Proceedings of the International Conference for High Performance Computing, 2021

Delay sensitivity-driven congestion mitigation for HPC systems.
Proceedings of the ICS '21: 2021 International Conference on Supercomputing, 2021

Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation.
Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021

Backfilling HPC Jobs with a Multimodal-Aware Predictor.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020
Application-aware Congestion Mitigation for High-Performance Computing Systems.
CoRR, 2020

ALAMO: Autonomous Lightweight Allocation, Management, and Optimization.
Proceedings of the Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, 2020

Measuring Congestion in High-Performance Datacenter Interconnects.
Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, 2020

HPC System Data Pipeline to Enable Meaningful Insights through Analysis-Driven Visualizations.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

Towards workload-adaptive scheduling for HPC clusters.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

2019
Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning.
IEEE Trans. Parallel Distributed Syst., 2019

Production Application Performance Data Streaming for System Monitoring.
ACM Trans. Model. Perform. Evaluation Comput. Syst., 2019

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo.
CoRR, 2019

HPAS: An HPC Performance Anomaly Suite for Reproducing Performance Variations.
Proceedings of the 48th International Conference on Parallel Processing, 2019

A Study of Network Congestion in Two Supercomputing High-Speed Interconnects.
Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects, 2019

2018
An Efficient Latch-free Database Index Based on Multi-dimensional Lists.
Proceedings of the 37th IEEE International Performance Computing and Communications Conference, 2018

Integrating Low-latency Analysis into HPC System Monitoring.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Taxonomist: Application Detection Through Rich Monitoring Data.
Proceedings of the Euro-Par 2018: Parallel Processing, 2018

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis.
Proceedings of the IEEE International Conference on Cluster Computing, 2018


2017
Diagnosing Performance Variations in HPC Applications Using Machine Learning.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

Holistic Measurement-Driven System Assessment.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

2016
Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems.
Parallel Comput., 2016

Design and Implementation of a Scalable HPC Monitoring System.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Large-Scale Persistent Numerical Data Source Monitoring System Experiences.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

HPCMASPA Introduction and Committees.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

2015
Infrastructure for In Situ System Monitoring and Application Data Analysis.
Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, 2015

Extending LDMS to Enable Performance Monitoring in Multi-core Applications.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

New Systems, New Behaviors, New Patterns: Monitoring Insights from System Standup.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Toward Rapid Understanding of Production HPC Applications and Systems.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2014
The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications.
Proceedings of the International Conference for High Performance Computing, 2014

Demonstrating improved application performance using dynamic monitoring and task mapping.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

2012
Filtering log data: Finding the needles in the Haystack.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, 2012

2011
Baler: deterministic, lossless log message clustering tool.
Comput. Sci. Res. Dev., 2011

Framework for Enabling System Understanding.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

2010
Combining Virtualization, resource characterization, and Resource management to enable efficient high performance compute platforms through intelligent dynamic resource allocation.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example.
Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W 2010), Chicago, Illinois, USA, June 28, 2010

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

2009
Resource monitoring and management with OVIS to enable HPC in cloud computing environments.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

2008
Ovis-2: A robust distributed architecture for scalable RAS.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

2006
OVIS: a tool for intelligent, real-time monitoring of computational clusters.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

2005
Meaningful Automated Statistical Analysis of Large Computational Clusters.
Proceedings of the 2005 IEEE International Conference on Cluster Computing (CLUSTER 2005), September 26, 2005

