Gabriel H. Loh

Orcid: 0000-0002-4616-0144

According to our database, Gabriel H. Loh authored at least 126 papers between 1999 and 2024.

Awards

ACM Fellow

ACM Fellow 2017, "For contributions to die-stacking technologies in computer architecture".

IEEE Fellow

IEEE Fellow 2017, "For contributions to high-performance die-stacked computer architectures".

Bibliography

2024
AMD Instinct™ MI300X Accelerator: Packaging and Architecture Co-Optimization.
Proceedings of the IEEE Symposium on VLSI Technology and Circuits 2024, 2024

Realizing the AMD Exascale Heterogeneous Processor Vision: Industry Product.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

2023
AMD Instinct™ MI250X Accelerator enabled by Elevated Fanout Bridge Advanced Packaging Architecture.
Proceedings of the 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 2023


The Next Era for Chiplet Innovation.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023

2021
Accelerating Variational Quantum Algorithms Using Circuit Concurrency.
CoRR, 2021

A New Era of Tailored Computing.
Proceedings of the 2021 Symposium on VLSI Circuits, Kyoto, Japan, June 13-19, 2021, 2021

Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Pioneering Chiplet Technology and Design for the AMD EPYC™ and Ryzen™ Processor Families: Industrial Product.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Analyzing and Leveraging Decoupled L1 Caches in GPUs.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

Understanding Chiplets Today to Anticipate Future Integration Opportunities and Limits.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2021

2020
Experiences with ML-Driven Design: A NoC Case Study.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

Analyzing and Leveraging Shared L1 Caches in GPUs.
Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

2019
Efficient System Architecture in the Era of Monolithic 3D: Dynamic Inter-tier Interconnect and Processing-in-Memory.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

2018
CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems.
ACM Trans. Archit. Code Optim., 2018

High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems.
CoRR, 2018

Holistic Management of the GPGPU Memory Hierarchy to Manage Warp-level Latency Tolerance.
CoRR, 2018

Challenges of High-Capacity DRAM Stacks and Potential Directions.
Proceedings of the Workshop on Memory Centric High Performance Computing, 2018

Modular Routing Design for Chiplet-Based Systems.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Generic System Calls for GPUs.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Scheduling Page Table Walks for Irregular GPU Applications.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Machine learning for performance and power modeling of heterogeneous systems.
Proceedings of the International Conference on Computer-Aided Design, 2018

2017
CODA: Enabling Co-location of Computation and Data for Near-Data Processing.
CoRR, 2017

GPU System Calls.
CoRR, 2017

Leveraging near data processing for high-performance checkpoint/restart.
Proceedings of the International Conference for High Performance Computing, 2017

There and Back Again: Optimizing the Interconnect in Networks of Memory Cubes.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Cost-effective design of scalable high-performance systems using active and passive interposers.
Proceedings of the 2017 IEEE/ACM International Conference on Computer-Aided Design, 2017


MemPod: A Clustered Architecture for Efficient and Scalable Migration in Flat Address Space Multi-level Memories.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

Avoiding TLB Shootdowns Through Self-Invalidating TLB Entries.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

2016
A case for hierarchical rings with deflection routing: An energy-efficient on-chip communication substrate.
Parallel Comput., 2016

Exploiting Interposer Technologies to Disintegrate and Reintegrate Multicore Processors.
IEEE Micro, 2016

Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism.
CoRR, 2016

Achieving both High Energy Efficiency and High Performance in On-Chip Communication using Hierarchical Rings with Deflection Routing.
CoRR, 2016

Building a Low Latency, Highly Associative DRAM Cache with the Buffered Way Predictor.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016

OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Observations and opportunities in architecting shared virtual memory for heterogeneous systems.
Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, 2016

Efficient synthetic traffic models for large, complex SoCs.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

μC-States: Fine-grained GPU Datapath Power Management.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015
Design and Analysis of 3D-MAPS (3D Massively Parallel Processor with Stacked Memory).
IEEE Trans. Computers, 2015

Achieving Exascale Capabilities through Heterogeneous Computing.
IEEE Micro, 2015

Large pages and lightweight memory management in virtualized environments: can you have it both ways?
Proceedings of the 48th International Symposium on Microarchitecture, 2015

Enabling interposer-based disintegration of multi-core processors.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

HpMC: An Energy-aware Management System of Multi-level Memory Architectures.
Proceedings of the 2015 International Symposium on Memory Systems, 2015

Interconnect-Memory Challenges for Multi-chip, Silicon Interposer Systems.
Proceedings of the 2015 International Symposium on Memory Systems, 2015

Heterogeneous memory architectures: A HW/SW approach for mixing die-stacked and off-package memories.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

A Software-Managed Approach to Die-Stacked DRAM.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014
A Configurable and Strong RAS Solution for Die-Stacked DRAM Caches.
IEEE Micro, 2014

Toward efficient programmer-managed two-level memory hierarchies in exascale computers.
Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, 2014

Managing DRAM Latency Divergence in Irregular GPGPU Applications.
Proceedings of the International Conference for High Performance Computing, 2014

Design and Evaluation of Hierarchical Rings with Deflection Routing.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Managing GPU Concurrency in Heterogeneous Architectures.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free?
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Efficient RAS support for die-stacked DRAM.
Proceedings of the 2014 International Test Conference, 2014

Last-level cache deduplication.
Proceedings of the 2014 International Conference on Supercomputing, 2014

Increasing TLB reach by exploiting clustering in page translations.
Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture, 2014

2013
Guest Editorial.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2013

Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface.
ACM Trans. Archit. Code Optim., 2013

Top Picks from the 2012 Computer Architecture Conferences.
IEEE Micro, 2013

Resilient die-stacked DRAM caches.
Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013

2012
Supporting Very Large DRAM Caches with Compound-Access Scheduling and MissMap.
IEEE Micro, 2012

Exploiting New Interconnect Technologies in On-Chip Communication.
IEEE J. Emerg. Sel. Topics Circuits Syst., 2012

Guest Editorial New Interconnect Technologies in On-Chip Communication.
IEEE J. Emerg. Sel. Topics Circuits Syst., 2012

Computer architecture for die stacking.
Proceedings of Technical Program of 2012 VLSI Design, Automation and Test, 2012

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012


Energy-efficient GPU design with reconfigurable in-package graphics memory.
Proceedings of the International Symposium on Low Power Electronics and Design, 2012

Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems.
Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

2011
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

A register-file approach for row buffer caches in die-stacked DRAMs.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Preventing PCM banks from seizing too much power.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Thread-aware dynamic shared cache compression in multi-core processors.
Proceedings of the IEEE 29th International Conference on Computer Design, 2011

2010
3D Stacked Microprocessor: Are We There Yet?
IEEE Micro, 2010

Use ECP, not ECC, for hard failures in resistive memories.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

Scalable Shared-Cache Management by Containing Thrashing Workloads.
Proceedings of the High Performance Embedded Architectures and Compilers, 2010

Quantifying and coping with parametric variations in 3D-stacked microarchitectures.
Proceedings of the 47th Design Automation Conference, 2010

Design and analysis of 3D-MAPS: A many-core 3D processor with stacked memory.
Proceedings of the IEEE Custom Integrated Circuits Conference, 2010

2009
3D-Integrated SRAM Components for High-Performance Microprocessors.
IEEE Trans. Computers, 2009

Design and optimization of the store vectors memory dependence predictor.
ACM Trans. Archit. Code Optim., 2009

Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Zesto: A cycle-level simulator for highly detailed microarchitecture exploration.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009

PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

Criticality-based optimizations for efficient load processing.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

Thermal optimization in multi-granularity multi-core floorplanning.
Proceedings of the 14th Asia South Pacific Design Automation Conference, 2009

2008
Modulo Path History for the Reduction of Pipeline Overheads in Path-based Neural Branch Predictors.
Int. J. Parallel Program., 2008

A Segmented Bloom Filter Algorithm for Efficient Predictors.
Proceedings of the 20th International Symposium on Computer Architecture and High Performance Computing, 2008

3D-Stacked Memory Architectures for Multi-core Processors.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

PEEP: Exploiting predictability of memory dependences in SMT processors.
Proceedings of the 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 2008

A modular 3D processor for flexible product design and technology migration.
Proceedings of the 5th Conference on Computing Frontiers, 2008

2007
Static strands: Safely exposing dependence chains for increasing embedded power efficiency.
ACM Trans. Embed. Comput. Syst., 2007

Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2007

Processor Design in 3D Die-Stacking Technologies.
IEEE Micro, 2007

Matrix scheduler reloaded.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors.
Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Scalability of 3D-Integrated Arithmetic Units in High-Performance Microprocessors.
Proceedings of the 44th Design Automation Conference, 2007

2006
Design space exploration for 3D architectures.
ACM J. Emerg. Technol. Comput. Syst., 2006

Controlling the Power and Area of Neural Branch Predictors for Practical Implementation in High-Performance Processors.
Proceedings of the 18th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2006), 2006

Adaptive Caches: Effective Shaping of Cache Behavior to Workloads.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Fire-and-Forget: Load/Store Scheduling with No Store Queue at All.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Die Stacking (3D) Microarchitecture.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Implementing Register Files for High-Performance Microprocessors in a Die-Stacked (3D) Technology.
Proceedings of the 2006 IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2006), 2006

Revisiting the performance impact of branch predictor latencies.
Proceedings of the 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006

The impact of 3-dimensional integration on the design of arithmetic units.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2006), 2006

Store vectors for scalable memory dependence prediction and scheduling.
Proceedings of the 12th International Symposium on High-Performance Computer Architecture, 2006

Dynamic instruction schedulers in a 3-dimensional integration technology.
Proceedings of the 16th ACM Great Lakes Symposium on VLSI 2006, Philadelphia, PA, USA, April 30, 2006

Thermal analysis of a 3D die-stacked high-performance microprocessor.
Proceedings of the 16th ACM Great Lakes Symposium on VLSI 2006, Philadelphia, PA, USA, April 30, 2006

Microarchitectural floorplanning under performance and thermal tradeoff.
Proceedings of the Conference on Design, Automation and Test in Europe, 2006

Entropy-based low power data TLB design.
Proceedings of the 2006 International Conference on Compilers, 2006

2005
Deconstructing the Frankenpredictor for Implementable Branch Predictors.
J. Instr. Level Parallelism, 2005

Static strands: safely collapsing dependence chains for increasing embedded power efficiency.
Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, 2005

Simulation Differences Between Academia and Industry: A Branch Prediction Case Study.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005

Implementing Caches in a 3D Technology for High Performance Processors.
Proceedings of the 23rd International Conference on Computer Design (ICCD 2005), 2005

A Simple Divide-and-Conquer Approach for Neural-Class Branch Prediction.
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005

2003
Exploiting Bias in the Hysteresis Bit of 2-bit Saturating Counters in Branch Predictors.
J. Instr. Level Parallelism, 2003

Width-Partitioned Load Value Predictors.
J. Instr. Level Parallelism, 2003

2002
A Comparison of Asymptotically Scalable Superscalar Processors.
Theory Comput. Syst., 2002

Exploiting data-width locality to increase superscalar execution bandwidth.
Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002

Speculative Clustered Caches for Clustered Processors.
Proceedings of the High Performance Computing, 4th International Symposium, 2002

Applying Machine Learning for Ensemble Branch Predictors.
Proceedings of the Developments in Applied Artificial Intelligence, 2002

Predicting Conditional Branches With Fusion-Based Hybrid Predictors.
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT 2002), 2002

2001
A time-stamping algorithm for efficient performance estimation of superscalar processors.
Proceedings of the Joint International Conference on Measurements and Modeling of Computer Systems, 2001

2000
Circuits for wide-window superscalar processors.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

1999
A Comparison of Scalable Superscalar Processors.
Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, 1999

