Andreas Moshovos

Orcid: 0000-0001-7768-367X

Affiliations:
  • University of Toronto, Canada


According to our database, Andreas Moshovos authored at least 135 papers between 1997 and 2024.

Awards

ACM Fellow

ACM Fellow 2017, "For contributions to high-performance architecture including memory dependence prediction and snooping coherence".

Bibliography

2024
Low-Bitwidth Floating Point Quantization for Efficient High-Quality Diffusion Models.
CoRR, 2024

Schrödinger's FP: Training Neural Networks with Dynamic Floating-Point Containers.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2024

Marple: Scalable Spike Sorting for Untethered Brain-Machine Interfacing.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

Atalanta: A Bit is Worth a "Thousand" Tensor Values.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
39 000-Subexposures/s Dual-ADC CMOS Image Sensor With Dual-Tap Coded-Exposure Pixels for Single-Shot HDR and 3-D Computational Imaging.
IEEE J. Solid State Circuits, November, 2023

cuSCNN: An Efficient CUDA Implementation of Sparse CNNs.
Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, 2023

2022
Schrödinger's FP: Dynamic Adaptation of Floating-Point Containers for Deep Learning Training.
CoRR, 2022

APack: Off-Chip, Lossless Data Compression for Efficient Deep Learning Inference.
CoRR, 2022

A 39,000 Subexposures/s CMOS Image Sensor with Dual-tap Coded-exposure Data-memory Pixel for Adaptive Single-shot Computational Imaging.
Proceedings of the IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits 2022), 2022

Mokey: enabling narrow fixed-point inference for out-of-the-box floating-point transformer models.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022

A Massive-Scale Brain Activity Decoding Chip.
Proceedings of the 2022 IEEE Hot Chips 34 Symposium, 2022

2021
Boveda: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

FPRaker: A Processing Element For Accelerating Neural Network Training.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Noema: Hardware-Efficient Template Matching for Neural Population Pattern Detection.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

2020
TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training and Inference.
CoRR, 2020

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference.
CoRR, 2020

BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization.
CoRR, 2020

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

Late Breaking Results: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

2019
Accelerating Image-Sensor-Based Deep Learning Applications.
IEEE Micro, 2019

Training CNNs faster with Dynamic Input and Kernel Downsampling.
CoRR, 2019

ShapeShifter: Enabling Fine-Grain Data Width Adaptation in Deep Learning.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Characterizing Sources of Ineffectual Computations in Deep Learning Networks.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

Laconic deep learning inference acceleration.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

Deep Learning Language Modeling Workloads: Where Time Goes on Graphics Processors.
Proceedings of the IEEE International Symposium on Workload Characterization, 2019

SW+: On Accelerating Smith-Waterman Execution of GATK HaplotypeCaller.
Proceedings of the Computational Intelligence Methods for Bioinformatics and Biostatistics, 2019

MemAlign: A Memory Structure to Accelerate Gene Sequencing.
Proceedings of the 19th IEEE International Conference on Bioinformatics and Bioengineering, 2019

BWA-MEM Performance: Suffix Array Storage Size.
Proceedings of the 2019 IEEE EMBS International Conference on Biomedical & Health Informatics, 2019

Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

2018
Proteus: Exploiting precision variability in deep neural networks.
Parallel Comput., 2018

Value-Based Deep-Learning Acceleration.
IEEE Micro, 2018

Laconic Deep Learning Computing.
CoRR, 2018

DPRed: Making Typical Activation Values Matter In Deep Learning Computing.
CoRR, 2018

Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How.
CoRR, 2018

Exploiting Typical Values to Accelerate Deep Learning.
Computer, 2018

Identifying and Exploiting Ineffectual Computations to Enable Hardware Acceleration of Deep Learning.
Proceedings of the 16th IEEE International New Circuits and Systems Conference, 2018

Value-Based Deep Learning Hardware Acceleration.
Proceedings of the 11th International Workshop on Network on Chip Architectures, 2018

Diffy: a Déjà vu-Free Differential Deep Neural Network Accelerator.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

Memory Requirements for Convolutional Neural Network Hardware Accelerators.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

Gene Sequencing: Where Time Goes.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

Characterizing Sources of Ineffectual Computations in Deep Learning Networks.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

Loom: exploiting weight and activation precisions to accelerate convolutional neural networks.
Proceedings of the 55th Annual Design Automation Conference, 2018

2017
Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks.
CoRR, 2017

Cnvlutin2: Ineffectual-Activation-and-Weight-Free Deep Neural Network Computing.
CoRR, 2017

Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability.
CoRR, 2017

Dynamic Stripes: Exploiting the Dynamic Precision Requirements of Activation Values in Neural Networks.
CoRR, 2017

Stripes: Bit-Serial Deep Neural Network Computing.
IEEE Comput. Archit. Lett., 2017

IDEAL: image denoising accelerator.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Bit-pragmatic deep neural network computing.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Bit-Pragmatic Deep Neural Network Computing.
Proceedings of the 5th International Conference on Learning Representations, 2017

2016
Stripes: Bit-serial deep neural network computing.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Message from the program chair.
Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, 2016

Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing.
Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, 2016

Memory controller design under cloud workloads.
Proceedings of the 2016 IEEE International Symposium on Workload Characterization, 2016

Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks.
Proceedings of the 2016 International Conference on Supercomputing, 2016

2015
Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets.
CoRR, 2015

Doppelgänger: a cache for approximate computing.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

Self-contained, accurate precomputation prefetching.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

QTrace: a framework for customizable full system instrumentation.
Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software, 2015

Prediction-based superpage-friendly TLB designs.
Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture, 2015

2014
Optimizing Memory Translation Emulation in Full System Emulators.
ACM Trans. Archit. Code Optim., 2014

ADDICT: Advanced Instruction Chasing for Transactions.
Proc. VLDB Endow., 2014

Evaluating the memory system behavior of smartphone workloads.
Proceedings of the XIVth International Conference on Embedded Computer Systems: Architectures, 2014

Advanced branch predictors for soft processors.
Proceedings of the 2014 International Conference on ReConFigurable Computing and FPGAs, 2014

What limits the operating frequency of a soft processor design.
Proceedings of the 2014 International Conference on ReConFigurable Computing and FPGAs, 2014

Wormhole: Wisely Predicting Multidimensional Branches.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

BarTLB: Barren page resistant TLB for managed runtime languages.
Proceedings of the 32nd IEEE International Conference on Computer Design, 2014

Image Signal Processors on FPGAs.
Proceedings of the 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2014

An Architectural Approach to Characterizing and Eliminating Sources of Inefficiency in a Soft Processor Design.
Proceedings of the 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2014

2013
Multi-grain coherence directories.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

QTrace: An interface for customizable full system instrumentation.
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

STREX: boosting instruction cache reuse in OLTP workloads through stratified transaction execution.
Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013

RECAP: A region-based cure for the common cold (cache).
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

Low-cost, high-performance branch predictors for soft processors.
Proceedings of the 23rd International Conference on Field programmable Logic and Applications, 2013

Characterizing the performance benefits of fused CPU/GPU systems using FusionSim.
Proceedings of the Design, Automation and Test in Europe, 2013

A dual grain hit-miss detector for large die-stacked DRAM caches.
Proceedings of the Design, Automation and Test in Europe, 2013

2012
NCOR: An FPGA-Friendly Nonblocking Data Cache for Soft Processors with Runahead Execution.
Int. J. Reconfigurable Comput., 2012

SPREX: A soft processor with Runahead execution.
Proceedings of the 2012 International Conference on Reconfigurable Computing and FPGAs, 2012

SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Toward virtualizing branch direction prediction.
Proceedings of the 2012 Design, Automation & Test in Europe Conference & Exhibition, 2012

Reducing OLTP instruction misses with thread migration.
Proceedings of the Eighth International Workshop on Data Management on New Hardware, 2012

ReCaP: a region-based cure for the common cold cache.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

Pointy: a hybrid pointer prefetcher for managed runtime systems.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
Two-Stage, Pipelined Register Renaming.
IEEE Trans. Very Large Scale Integr. Syst., 2011

2010
On the Latency and Energy of Checkpointed Superscalar Register Alias Tables.
IEEE Trans. Very Large Scale Integr. Syst., 2010

Making Address-Correlated Prefetching Practical.
IEEE Micro, 2010

An Efficient Non-blocking Data Cache for Soft Processors.
Proceedings of the ReConFig'10: 2010 International Conference on Reconfigurable Computing and FPGAs, 2010

Demystifying GPU microarchitecture through microbenchmarking.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2010

Design space exploration of instruction schedulers for out-of-order soft processors.
Proceedings of the International Conference on Field-Programmable Technology, 2010

2009
A physical-level study of the compacted matrix instruction scheduler for dynamically-scheduled superscalar processors.
Proceedings of the 2009 International Conference on Embedded Computer Systems: Architectures, 2009

A tagless coherence directory.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Practical off-chip meta-data for temporal memory streaming.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

Towards a viable out-of-order soft core: Copy-Free, checkpointed register renaming.
Proceedings of the 19th International Conference on Field Programmable Logic and Applications, 2009

Phantom-BTB: a virtualized branch target buffer design.
Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009

2008
L-CBF: A Low-Power, Fast Counting Bloom Filter Architecture.
IEEE Trans. Very Large Scale Integr. Syst., 2008

Temporal instruction fetch streaming.
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), 2008

A physical level study and optimization of CAM-based checkpointed register alias table.
Proceedings of the 2008 International Symposium on Low Power Electronics and Design, 2008

Temporal streams in commercial server applications.
Proceedings of the 4th International Symposium on Workload Characterization (IISWC 2008), 2008

Turbo-ROB: A Low Cost Checkpoint/Restore Accelerator.
Proceedings of the High Performance Embedded Architectures and Compilers, 2008

Predictor virtualization.
Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008

2007
A Building Block for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy.
IEEE Comput. Archit. Lett., 2007

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

On the latency, energy and area of checkpointed, superscalar register alias tables.
Proceedings of the 2007 International Symposium on Low Power Electronics and Design, 2007

Mechanisms for store-wait-free multiprocessors.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

2006
Coarse-Grain Coherence Tracking: RegionScout and Region Coherence Arrays.
IEEE Micro, 2006

Spatial Memory Streaming.
Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006), 2006

BranchTap: improving performance with very few checkpoints through adaptive speculation control.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

2005
A Case for Asymmetric-Cell Cache Memories.
IEEE Trans. Very Large Scale Integr. Syst., 2005

RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence.
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), 2005

RECAST: Boosting Tag Line Buffer Coverage in Low-Power High-Level Caches "for Free".
Proceedings of the 23rd International Conference on Computer Design (ICCD 2005), 2005

Memory State Compressors for Giga-Scale Checkpoint/Restore.
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005

2004
SEPAS: a highly accurate energy-efficient branch predictor.
Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004

Accurate and Complexity-Effective Spatial Pattern Prediction.
Proceedings of the 10th International Conference on High-Performance Computer Architecture (HPCA-10 2004), 2004

2003
Low-leakage asymmetric-cell SRAM.
IEEE Trans. Very Large Scale Integr. Syst., 2003

Behavior and Performance of Interactive Multi-Player Game Servers.
Clust. Comput., 2003

Checkpointing alternatives for high performance, power-aware processors.
Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003

2002
Reducing Memory Latency via Read-after-Read Memory Dependence Prediction.
IEEE Trans. Computers, 2002

Asymmetric-frequency clustering: a power-aware back-end for high-performance processors.
Proceedings of the 2002 International Symposium on Low Power Electronics and Design, 2002

Branch Predictor Prediction: A Power-Aware Branch Predictor for High-Performance Processors.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

2001
Microarchitectural innovations: boosting microprocessor performance beyond semiconductor technology scaling.
Proc. IEEE, 2001

Instruction flow-based front-end throttling for power-aware high-performance processors.
Proceedings of the 2001 International Symposium on Low Power Electronics and Design, 2001

Slice-processors: an implementation of operation-based prediction.
Proceedings of the 15th international conference on Supercomputing, 2001

JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers.
Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

2000
Memory Dependence Prediction in Multimedia Applications.
J. Instr. Level Parallelism, 2000

Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Memory Dependence Speculation Tradeoffs in Centralized, Continuous-Window Superscalar Processors.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

1999
Speculative Memory Cloaking and Bypassing.
Int. J. Parallel Program., 1999

Read-After-Read Memory Dependence Prediction.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

Improving virtual function call target prediction via dependence-based pre-computation.
Proceedings of the 13th international conference on Supercomputing, 1999

1998
Dependence Based Prefetching for Linked Data Structures.
Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), 1998

1997
Streamlining Inter-Operation Memory Communication via Data Dependence Prediction.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Dynamic Speculation and Synchronization of Data Dependences.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

