Joel S. Emer

Orcid: 0000-0002-3459-5466

According to our database, Joel S. Emer authored at least 150 papers between 1979 and 2024.

Awards

ACM Fellow

ACM Fellow 2004, "For contributions to computer architecture and performance analysis."

IEEE Fellow

IEEE Fellow 2004, "For contributions to computer architecture and quantitative analysis of processor performance."

Bibliography

2024
LoopTree: Exploring the Fused-layer Dataflow Accelerator Design Space.
CoRR, 2024

The Continuous Tensor Abstraction: Where Indices are Real.
CoRR, 2024

The EDGE Language: Extended General Einsums for Graph Algorithms.
CoRR, 2024

Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design.
CoRR, 2024

Onyx: A 12nm 756 GOPS/W Coarse-Grained Reconfigurable Array for Accelerating Dense and Sparse Applications.
Proceedings of the IEEE Symposium on VLSI Technology and Circuits 2024, 2024

FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

Azul: An Accelerator for Sparse Iterative Solvers Leveraging Distributed On-Chip Memory.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

DelayAVF: Calculating Architectural Vulnerability Factors for Delay Faults.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2024

Architecture-Level Modeling of Photonic Deep Neural Network Accelerators.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2024

Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

Mind the Gap: Attainable Data Movement and Operational Intensity Bounds for Tensor Algorithms.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024


TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators (Abstract).
Proceedings of the 2024 ACM Workshop on Highlights of Parallel Computing, 2024

2023
Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing.
ACM Trans. Comput. Syst., 2023

Penetrating Shields: A Systematic Analysis of Memory Corruption Mitigations in the Spectre Era.
CoRR, 2023

Unified Convolution Framework: A compiler-based approach to support sparse convolutions.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

SecureLoop: Design Space Exploration of Secure DNN Accelerators.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

Accelerating RTL Simulation with Hardware-Software Co-Design.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

LoopTree: Enabling Exploration of Fused-layer Dataflow Accelerators.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2023

Metior: A Comprehensive Model to Evaluate Obfuscating Side-Channel Defense Schemes.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

ISOSceles: Accelerating Sparse CNNs through Inter-Layer Pipelining.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract).
Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing, 2023

Optimizing Compression Schemes for Parallel Sparse Tensor Algebra.
Proceedings of the Data Compression Conference, 2023

WACO: Learning Workload-Aware Co-optimization of the Format and Schedule of a Sparse Tensor Program.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

The Sparse Abstract Machine.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022

Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization.
Proceedings of the International IEEE Symposium on Performance Analysis of Systems and Software, 2022

DAGguise: mitigating memory timing side channels.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022

2021
Simba: scaling deep-learning inference with chiplet-based architecture.
Commun. ACM, 2021

Mentoring Opportunities in Computer Architecture: Analyzing the Past to Develop the Future.
Proceedings of the ACM/IEEE Workshop on Computer Architecture Education, 2021

Sparseloop: An Analytical, Energy-Focused Design Space Exploration Methodology for Sparse Tensor Accelerators.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

Architecture-Level Energy Estimation for Heterogeneous Computing Systems.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

SpZip: Architectural Support for Effective Data Compression In Irregular Applications.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Gamma: leveraging Gustavson's algorithm to accelerate sparse matrix multiplication.
Proceedings of the ASPLOS '21: 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

2020
Efficient Processing of Deep Neural Networks
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01766-7, 2020

A 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm.
IEEE J. Solid State Circuits, 2020

Freely scalable and reconfigurable optical hardware for deep learning.
CoRR, 2020

Estimating Silent Data Corruption Rates Using a Two-Level Model.
CoRR, 2020

CaSA: End-to-end Quantitative Security Analysis of Randomly Mapped Caches.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

An Architecture-Level Energy and Area Estimator for Processing-In-Memory Accelerator Designs.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2020

2019
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices.
IEEE J. Emerg. Sel. Topics Circuits Syst., 2019

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm.
Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, June 9-14, 2019, 2019

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

ExTensor: An Accelerator for Sparse Tensor Algebra.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Timeloop: A Systematic Approach to DNN Accelerator Evaluation.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs.
Proceedings of the International Conference on Computer-Aided Design, 2019

MAGNet: A Modular Accelerator Generator for Neural Networks.
Proceedings of the International Conference on Computer-Aided Design, 2019

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology.
Proceedings of the 2019 IEEE Hot Chips 31 Symposium (HCS), 2019

Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

2018
DAWG: A Defense Against Cache Timing Attacks in Speculative Execution Processors.
IACR Cryptol. ePrint Arch., 2018

Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks.
CoRR, 2018

Harmonizing Speculative and Non-Speculative Execution in Architectures for Ordered Parallelism.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018


Hardware for machine learning: Challenges and opportunities.
Proceedings of the 2018 IEEE Custom Integrated Circuits Conference, 2018

2017
(FPL 2015) Scavenger: Automating the Construction of Application-Optimized Memory Hierarchies.
ACM Trans. Reconfigurable Technol. Syst., 2017

Efficient Processing of Deep Neural Networks: A Tutorial and Survey.
Proc. IEEE, 2017

Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators.
IEEE Micro, 2017

Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.
IEEE J. Solid State Circuits, 2017

Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision.
CoRR, 2017

Understanding error propagation in deep learning neural network (DNN) accelerators and applications.
Proceedings of the International Conference for High Performance Computing, 2017

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

Towards closing the energy gap between HOG and CNN features for embedded vision (Invited paper).
Proceedings of the IEEE International Symposium on Circuits and Systems, 2017

Fractal: An Execution Model for Fine-Grain Nested Speculative Parallelism.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Automatic Construction of Program-Optimized FPGA Memory Networks.
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017

A method to estimate the energy consumption of deep neural networks.
Proceedings of the 51st Asilomar Conference on Signals, Systems, and Computers, 2017

SAM: Optimizing Multithreaded Cores for Speculative Parallelism.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

2016
Unlocking Ordered Parallelism with the Swarm Architecture.
IEEE Micro, 2016

Data-centric execution of speculative parallel programs.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

CLARA: Circular Linked-List Auto and Self Refresh Architecture.
Proceedings of the Second International Symposium on Memory Systems, 2016

14.5 Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks.
Proceedings of the 2016 IEEE International Solid-State Circuits Conference, 2016

Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks.
Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, 2016

LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning.
Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016

2015
Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures.
ACM Trans. Comput. Syst., 2015

A fast and accurate analytical technique to compute the AVF of sequential bits in a processor.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

A scalable architecture for ordered parallelism.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

Scavenger: Automating the construction of application-optimized memory hierarchies.
Proceedings of the 25th International Conference on Field Programmable Logic and Applications, 2015

2014
Efficient Spatial Processing Element Control via Triggered Instructions.
IEEE Micro, 2014

Exploiting spatial architectures for edit distance algorithms.
Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software, 2014

The LEAP FPGA operating system.
Proceedings of the 24th International Conference on Field Programmable Logic and Applications, 2014

LEAP Shared Memories: Automating the Construction of FPGA Coherent Memories.
Proceedings of the 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2014

2013
Using in-flight chains to build a scalable cache coherence protocol.
ACM Trans. Archit. Code Optim., 2013

Triggered instructions: a control paradigm for spatially-programmed architectures.
Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013

A Hierarchical Architectural Framework for Reconfigurable Logic Computing.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Optimizing under abstraction: Using prefetching to improve FPGA performance.
Proceedings of the 23rd International Conference on Field programmable Logic and Applications, 2013

2012
The gradient-based cache partitioning algorithm.
ACM Trans. Archit. Code Optim., 2012

Scheduling heterogeneous multi-cores through performance impact estimation (PIE).
Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

ZIP-IO: Architecture for application-specific compression of Big Data.
Proceedings of the 2012 International Conference on Field-Programmable Technology, 2012

Leveraging latency-insensitivity to ease multiple FPGA design.
Proceedings of the ACM/SIGDA 20th International Symposium on Field Programmable Gate Arrays, 2012

CRUISE: cache replacement and utility-aware scheduling.
Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, 2012

2011
DEC Alpha.
Proceedings of the Encyclopedia of Parallel Computing, 2011

PACMan: prefetch-aware cache management for high performance caching.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

SHiP: signature-based hit predictor for high performance caching.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Leap scratchpads: automatic memory and cache management for reconfigurable logic.
Proceedings of the ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, 2011

2010
The Future of Architectural Simulation.
IEEE Micro, 2010

Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

Design contest overview: Combined architecture for network stream categorization and intrusion detection (CANSCID).
Proceedings of the 8th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2010), 2010

High performance cache replacement using re-reference interval prediction (RRIP).
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

2009
A-Port Networks: Preserving the Timed Behavior of Synchronous Systems for Modeling on FPGAs.
ACM Trans. Reconfigurable Technol. Syst., 2009

Guest Editors' Introduction: Top Picks from the 2008 Computer Architecture Conferences.
IEEE Micro, 2009

Accelerating architecture research.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009

CAMP: A technique to estimate per-structure power at run-time using a few simple parameters.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

Soft connections: addressing the hardware-design modularity problem.
Proceedings of the 46th Design Automation Conference, 2009

2008
Set-Dueling-Controlled Adaptive Insertion for High-Performance Caching.
IEEE Micro, 2008

Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal.
IEEE Comput. Archit. Lett., 2008

Quick Performance Models Quickly: Closely-Coupled Partitioned Simulation on FPGAs.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2008

A-Ports: an efficient abstraction for cycle-accurate performance models on FPGAs.
Proceedings of the ACM/SIGDA 16th International Symposium on Field Programmable Gate Arrays, 2008

Adaptive insertion policies for managing shared caches.
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008

2007
Single-Threaded vs. Multithreaded: Where Should We Focus?
IEEE Micro, 2007

Late-binding: enabling unordered load-store queues.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

Adaptive insertion policies for high performance caching.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

2005
Computing Architectural Vulnerability Factors for Address-Based Structures.
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), 2005

The Soft Error Problem: An Architectural Perspective.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

2004
Reducing the Soft-Error Rate of a High-Performance Microprocessor.
IEEE Micro, 2004

Cache Scrubbing in Microprocessors: Myth or Necessity?
Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2004), 2004

Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

2003
Measuring Architectural Vulnerability Factors.
IEEE Micro, 2003

A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

2002
Performance Simulation Tools.
Computer, 2002

Asim: A Performance Model Framework.
Computer, 2002

Tarantula: A Vector Extension to the Alpha Architecture.
Proceedings of the 29th International Symposium on Computer Architecture (ISCA 2002), 2002

Loose Loops Sink Chips.
Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA'02), 2002

A comparative study of arbitration algorithms for the Alpha 21364 pipelined router.
Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002

2000
Combining Static and Dynamic Branch Prediction to Reduce Destructive Aliasing.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

1999
The Use of Multithreading for Exception Handling.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

Reducing cache misses using hardware and software page placement.
Proceedings of the 13th international conference on Supercomputing, 1999

1998
A Characterization of Processor Performance in the VAX-11/780.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998

Retrospective: Characterization of Processor Performance in the VAX-11/780.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers), 1998

Memory Dependence Prediction Using Store Sets.
Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998

1997
Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading.
ACM Trans. Comput. Syst., 1997

Simultaneous multithreading: a platform for next-generation processors.
IEEE Micro, 1997

A Language for Describing Predictors and Its Application to Automatic Synthesis.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

1996
Incremental Versus Revolutionary Research.
ACM Comput. Surv., 1996

Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor.
Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996

Predictive Sequential Associative Cache.
Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996

1995
A system level perspective on branch architecture performance.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

Instruction Fetching: Coping with Code Bloat.
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995

1989
Performance Analysis of Mass Storage Service Alternatives for Distributed Systems.
IEEE Trans. Software Eng., 1989

1988
Performance Considerations for Distributed Services: A Case Study: Mass Storage.
Proceedings of the 8th International Conference on Distributed Computing Systems, 1988

1986
Design analysis of a heterogeneous distributed system.
Proceedings of the 2nd ACM SIGOPS European Workshop, 1986

1985
Performance of the VAX-11/780 Translation Buffer: Simulation and Measurement
ACM Trans. Comput. Syst., 1985

1979
Shared Resources for Multiple Instruction Stream Pipelined Processors
PhD thesis, 1979

