Gennady Pekhimenko

ORCID: 0000-0002-3839-0919

Affiliations:
  • University of Toronto
  • Microsoft Research
  • Carnegie Mellon University (former)


According to our database, Gennady Pekhimenko authored at least 100 papers between 2010 and 2024.

Bibliography

2024
Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads.
CoRR, 2024

APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model Prompts.
CoRR, 2024

Accelerating Graph Neural Networks on Real Processing-In-Memory Systems.
CoRR, 2024

Proteus: Preserving Model Confidentiality during Graph Optimizations.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

Sylva: Sparse Embedded Adapters via Hierarchical Approximate Second-Order Information.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

Guaranteed Approximation Bounds for Mixed-Precision Neural Operators.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Minuet: Accelerating 3D Sparse Convolutions on GPUs.
Proceedings of the Nineteenth European Conference on Computer Systems, 2024

BOOM: Use your Desktop to Accurately Predict the Performance of Large Deep Neural Networks.
Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 2024

2023
Federated benchmarking of medical artificial intelligence with MedPerf.
Nat. Mach. Intell., July 2023

Lightweight Frequency-Based Tiering for CXL Memory Systems.
CoRR, 2023

The Synergy of Speculative Decoding and Batching in Serving Large Language Models.
CoRR, 2023

Speeding up Fourier Neural Operators via Mixed Precision.
CoRR, 2023

Arbitor: A Numerically Accurate Hardware Emulation Tool for DNN Accelerators.
Proceedings of the 2023 USENIX Annual Technical Conference, 2023

Hotline Profiler: Automatic Annotation and A Multi-Scale Timeline for Visualizing Time-Use in DNN Training.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

Grape: Practical and Efficient Graphed Execution for Dynamic Deep Neural Networks on GPUs.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

TiLT: A Time-Centric Approach for Stream Query Optimization and Parallelization.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

TorchProbe: Fuzzing Dynamic Deep Learning Compilers.
Proceedings of the 21st Asian Symposium on Programming Languages and Systems, 2023

2022
Optimizing Data Collection in Deep Reinforcement Learning.
CoRR, 2022

ROLLER: Fast and Efficient Tensor Compilation for Deep Learning.
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, 2022

Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction.
Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022

DietCode: Automatic Optimization for Dynamic Tensor Programs.
Proceedings of the Fifth Conference on Machine Learning and Systems, 2022

Keynote Talk 1: Efficient DNN Training at Scale: from Algorithms to Hardware.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

How to validate Machine Learning Models Prior to Deployment: Silent trial protocol for evaluation of real-time models at ICU.
Proceedings of the Conference on Health, Inference, and Learning, 2022

Automatic Horizontal Fusion for GPU Kernels.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2022

GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022

Pavise: Integrating Fault Tolerance Support for Persistent Memory Applications.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022

2021
Gretch: A Hardware Prefetcher for Graph Analytics.
ACM Trans. Archit. Code Optim., 2021

MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation.
CoRR, 2021

Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach.
CoRR, 2021

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices.
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021

Distributed Deep Learning In Open Collaborations.
Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021

Boveda: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

IOS: Inter-Operator Scheduler for CNN Acceleration.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

RL-Scope: Cross-stack Profiling for Deep Reinforcement Learning Workloads.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

FPRaker: A Processing Element For Accelerating Neural Network Training.
Proceedings of the 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

NVOverlay: Enabling Efficient and Scalable High-Frequency Snapshotting to NVM.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

LifeStream: a high-performance stream processing engine for periodic streams.
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

2020
LifeStream: A High-performance Stream Processing Engine for Waveform Data.
CoRR, 2020

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference.
CoRR, 2020

Multi-node Bert-pretraining: Cost-efficient Approach.
CoRR, 2020

Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training.
Proceedings of the 2020 USENIX Annual Technical Conference, 2020

Skyline: Interactive In-Editor Computational Performance Profiling for Deep Neural Network Training.
Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, 2020


BPPSA: Scaling Back-propagation by Parallel Scan Algorithm.
Proceedings of the Third Conference on Machine Learning and Systems, 2020

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020


2019
MLPerf Training Benchmark.
CoRR, 2019

Scaling Back-propagation by Parallel Scan Algorithm.
CoRR, 2019

SysML: The New Frontier of Machine Learning Systems.
CoRR, 2019

Priority-based Parameter Propagation for Distributed DNN Training.
Proceedings of the Second Conference on Machine Learning and Systems (SysML), 2019

Janus: optimizing memory and storage support for non-volatile memory systems.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

StreamBox-HBM: Stream Analytics on High Bandwidth Hybrid Memory.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

Towards Breaking the Memory Bandwidth Wall Using Approximate Value Prediction.
Approximate Circuits: Methodologies and CAD, 2019

2018
EcoRNN: Fused LSTM RNN Implementation with Data Layout Optimization.
CoRR, 2018

Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency.
CoRR, 2018

RowClone: Accelerating Data Movement and Initialization Using DRAM.
CoRR, 2018

SoftMC: Practical DRAM Characterization Using an FPGA-Based Infrastructure.
CoRR, 2018

Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips.
CoRR, 2018

Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing Margins.
CoRR, 2018

Decoupling GPU Programming Models from Resource Management for Enhanced Programming Ease, Portability, and Performance.
CoRR, 2018

TBD: Benchmarking and Analyzing Deep Neural Network Training.
CoRR, 2018

Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management.
CoRR, 2018

TerseCades: Efficient Data Compression in Stream Processing.
Proceedings of the 2018 USENIX Annual Technical Conference, 2018

A Case for Richer Cross-Layer Abstractions: Bridging the Semantic Gap with Expressive Memory.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Gist: Efficient Data Encoding for Deep Neural Network Training.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Benchmarking and Analyzing Deep Neural Network Training.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

Compiler-driven performance workshop.
Proceedings of the 28th Annual International Conference on Computer Science and Software Engineering, 2018

2017
Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms.
Proc. ACM Meas. Anal. Comput. Syst., 2017

StreamBox: Modern Stream Processing on a Multicore Machine.
Proceedings of the 2017 USENIX Annual Technical Conference, 2017

SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

2016
RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads.
ACM Trans. Archit. Code Optim., 2016

Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost.
ACM Trans. Archit. Code Optim., 2016

Mitigating the Memory Bottleneck With Approximate Load Value Prediction.
IEEE Des. Test, 2016

A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps.
CoRR, 2016

Practical Data Compression for Modern Memory Hierarchies.
CoRR, 2016

Reducing DRAM Latency by Exploiting Design-Induced Latency Variation in Modern DRAM Chips.
CoRR, 2016

Adaptive-Latency DRAM (AL-DRAM).
CoRR, 2016

Optimal seed solver: optimizing seed selection in read mapping.
Bioinform., 2016

Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization.
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2016

Zorua: A holistic approach to resource virtualization in GPUs.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

A case for toggle-aware compression for GPU systems.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

ChargeCache: Reducing DRAM latency by exploiting row access locality.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

2015
Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface.
CoRR, 2015

Toggle-Aware Compression for GPUs.
IEEE Comput. Archit. Lett., 2015

Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.
Bioinform., 2015

PocketTrend: Timely Identification and Delivery of Trending Search Content to Mobile Users.
Proceedings of the 24th International Conference on World Wide Web, 2015

A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Exploiting compressed block size as an indicator of future reuse.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

Adaptive-latency DRAM: Optimizing DRAM timing for the common-case.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

2014
Rollback-free value prediction with approximate loads.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Linearly compressed pages: a low-complexity, low-latency main memory compression framework.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

2012
Base-delta-immediate compression: practical data compression for on-chip caches.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

Linearly compressed pages: a main memory compression framework with low complexity and low latency.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2010
Efficient Program Compilation Through Machine Learning Techniques.
Software Automatic Tuning: From Concepts to State-of-the-Art Results, 2010

