Antonio González

ORCID: 0000-0002-0009-0996

  • Polytechnic University of Catalonia (UPC), Department of Computer Architecture
  • Intel Labs, Intel Barcelona Research Center

According to our database, Antonio González authored at least 338 papers between 1988 and 2025.

An energy-efficient near-data processing accelerator for DNNs to optimize memory accesses.
J. Syst. Archit., 2025

On the Development of Vehicle Dynamics Active Systems: The Handling Stability Ratio as a Strategic Indicator for Integrating Multiple Actuators.
IEEE Access, 2025

Exploiting beam search confidence for energy-efficient speech recognition.
J. Supercomput., November, 2024

Mixture-of-Rookies: Saving DNN computations by predicting ReLU outputs.
Microprocess. Microsystems, 2024

SIMIL: SIMple Issue Logic for GPUs.
Microprocess. Microsystems, 2024

ARAS: An Adaptive Low-Cost ReRAM-Based Accelerator for DNNs.
CoRR, 2024

Control Flow Management in Modern GPUs.
CoRR, 2024

WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads.
CoRR, 2024

Analyzing and Improving Hardware Modeling of Accel-Sim.
CoRR, 2024

LIBRA: Memory Bandwidth- and Locality-Aware Parallel Tile Rendering.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

SLIDEX: A Novel Architecture for Sliding Window Processing.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNNs.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

LOCATOR: Low-power ORB accelerator for autonomous cars.
J. Parallel Distributed Comput., April, 2023

SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Networks.
ACM Trans. Embed. Comput. Syst., March, 2023

Irregular accesses reorder unit: improving GPGPU memory coalescing for graph-based workloads.
J. Supercomput., 2023

ReuseSense: With Great Reuse Comes Greater Efficiency; Effectively Employing Computation Reuse on General-Purpose CPUs.
CoRR, 2023

An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses.
CoRR, 2023

A Lightweight, Compiler-Assisted Register File Cache for GPGPU.
CoRR, 2023

ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNN Inference.
CoRR, 2023

δLTA: Decoupling Camera Sampling from Processing to Avoid Redundant Computations in the Vision Pipeline.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

K-D Bonsai: ISA-Extensions to Compress K-D Trees for Autonomous Driving Tasks.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

DNA-TEQ: An Adaptive Exponential Quantization of Tensors for DNN Inference.
Proceedings of the 30th IEEE International Conference on High Performance Computing, 2023

Lightweight Register File Caching in Collector Units for GPUs.
Proceedings of the 15th Workshop on General Purpose Processing Using GPU, 2023

Simple Out of Order Core for GPGPUs.
Proceedings of the 15th Workshop on General Purpose Processing Using GPU, 2023

Exploiting Kernel Compression on BNNs.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023

SLIDEX: Sliding Window Extension for Image Processing.
Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 2023

QeiHaN: An Energy-Efficient DNN Accelerator that Leverages Log Quantization in NDP Architectures.
Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 2023

Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs.
Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 2023

Omega-Test: A Predictive Early-Z Culling to Improve the Graphics Pipeline Energy-Efficiency.
IEEE Trans. Vis. Comput. Graph., 2022

Dynamic sampling rate: harnessing frame coherence in graphics applications for energy-efficient GPUs.
J. Supercomput., 2022

Energy-Efficient Stream Compaction Through Filtering and Coalescing Accesses in GPGPU Memory Partitions.
IEEE Trans. Computers, 2022

E-BATCH: Energy-Efficient and High-Throughput RNN Batching.
ACM Trans. Archit. Code Optim., 2022

Triangle Dropping: An Occluded-geometry Predictor for Energy-efficient Mobile GPUs.
ACM Trans. Archit. Code Optim., 2022

A Survey of Near-Data Processing Architectures for Neural Networks.
Mach. Learn. Knowl. Extr., 2022

CREW: Computation reuse and efficient weight storage for hardware-accelerated MLPs and RNNs.
J. Syst. Archit., 2022

DNN pruning with principal component analysis and connection importance estimation.
J. Syst. Archit., 2022

Saving RNN Computations with a Neuron-Level Fuzzy Memoization Scheme.
CoRR, 2022

Mixture-of-Rookies: Saving DNN Computations by Predicting ReLU Outputs.
CoRR, 2022

ASRPU: A Programmable Accelerator for Low-Power Automatic Speech Recognition.
CoRR, 2022

DTM-NUCA: Dynamic Texture Mapping-NUCA for Energy-Efficient Graphics Rendering.
Proceedings of the 30th Euromicro International Conference on Parallel, 2022

DTexL: Decoupled Raster Pipeline for Texture Locality.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022

MEGsim: A Novel Methodology for Efficient Simulation of Graphics Workloads in GPUs.
Proceedings of the International IEEE Symposium on Performance Analysis of Systems and Software, 2022

XFeatur: Hardware Feature Extraction for DNN Auto-tuning.
Proceedings of the International IEEE Symposium on Performance Analysis of Systems and Software, 2022

TCOR: A Tile Cache with Optimal Replacement.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022

Fast and Accurate SER Estimation for Large Combinational Blocks in Early Stages of the Design.
IEEE Trans. Sustain. Comput., 2021

Exploiting Beam Search Confidence for Energy-Efficient Speech Recognition.
CoRR, 2021

A Low-Power Hardware Accelerator for ORB Feature Extraction in Self-Driving Cars.
Proceedings of the 33rd IEEE International Symposium on Computer Architecture and High Performance Computing, 2021

LAWS: Locality-AWare Scheme for Automatic Speech Recognition.
IEEE Trans. Computers, 2020

Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System.
ACM Trans. Archit. Code Optim., 2020

Demystifying Power and Performance Bottlenecks in Autonomous Driving Systems.
Proceedings of the IEEE International Symposium on Workload Characterization, 2020

Boosting LSTM Performance Through Dynamic Precision Selection.
Proceedings of the 27th IEEE International Conference on High Performance Computing, 2020

Visibility Rendering Order: Improving Energy Efficiency on Mobile GPUs through Frame Coherence.
IEEE Trans. Parallel Distributed Syst., 2019

A Low-Power, High-Performance Speech Recognition Accelerator.
IEEE Trans. Computers, 2019

SyRA: Early System Reliability Analysis for Cross-Layer Soft Errors Resilience in Memory Arrays of Microprocessor Systems.
IEEE Trans. Computers, 2019

CGPA: Coarse-Grained Pruning of Activations for Energy-Efficient RNN Inference.
IEEE Micro, 2019

LSTM-Sharp: An Adaptable, Energy-Efficient Hardware Accelerator for Long Short-Term Memory.
CoRR, 2019

(Pen-) Ultimate DNN Pruning.
CoRR, 2019

Neuron-Level Fuzzy Memoization in RNNs.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

SCU: a GPU stream compaction unit for graph processing.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

Rendering Elimination: Early Discard of Redundant Tiles in the Graphics Pipeline.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

POSTER: Leveraging Run-Time Feedback for Efficient ASR Acceleration.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

Performance Analysis and Optimization of Automatic Speech Recognition.
IEEE Trans. Multi Scale Comput. Syst., 2018

2018 International Symposium on Computer Architecture Influential Paper Award.
IEEE Micro, 2018

Trends in Processor Architecture.
CoRR, 2018

The Dark Side of DNN Pruning.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

Computation Reuse in DNNs by Exploiting Input Similarity.
Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018

A Novel Register Renaming Technique for Out-of-Order Processors.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

E-PUR: an energy-efficient processing unit for recurrent neural networks.
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018

Low-Power Automatic Speech Recognition Through a Mobile GPU and a Viterbi Accelerator.
IEEE Micro, 2017

Shared resource aware scheduling on power-constrained tiled many-core processors.
J. Parallel Distributed Comput., 2017

UNFOLD: a memory-efficient speech recognizer using on-the-fly WFST composition.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

MeRLiN: Exploiting Dynamic Instruction Behavior for Fast and Accurate Microarchitecture Level Reliability Assessment.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Removing checks in dynamically typed languages through efficient profiling.
Proceedings of the 2017 International Symposium on Code Generation and Optimization, 2017

An Ultra Low-Power Hardware Accelerator for Acoustic Scoring in Speech Recognition.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

Scalability of Broadcast Performance in Wireless Network-on-Chip.
IEEE Trans. Parallel Distributed Syst., 2016

Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment.
ACM Trans. Comput. Syst., 2016

A Case for Acoustic Wave Detectors for Soft-Errors.
IEEE Trans. Computers, 2016

An Energy-Efficient Memory Unit for Clustered Microarchitectures.
IEEE Trans. Computers, 2016

An ultra low-power hardware accelerator for automatic speech recognition.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Cross-layer system reliability assessment framework for hardware faults.
Proceedings of the 2016 IEEE International Test Conference, 2016

Quantitative characterization of the software layer of a HW/SW co-designed processor.
Proceedings of the 2016 IEEE International Symposium on Workload Characterization, 2016

MASkIt: Soft error rate estimation for combinational circuits.
Proceedings of the 34th IEEE International Conference on Computer Design, 2016

ERICO: Effective Removal of Inline Caching Overhead in Dynamic Typed Languages.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

A detailed methodology to compute Soft Error Rates in advanced technologies.
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition, 2016

Analysis and Optimization of Engines for Dynamically Typed Languages.
Proceedings of the 27th International Symposium on Computer Architecture and High Performance Computing, 2015

Ultra-low power render-based collision detection for CPU/GPU systems.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

Chrysso: an integrated power manager for constrained many-core processors.
Proceedings of the 12th ACM International Conference on Computing Frontiers, 2015

Efficient Power Gating of SIMD Accelerators Through Dynamic Selective Devectorization in an HW/SW Codesigned Environment.
ACM Trans. Archit. Code Optim., 2014

Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Framework for economical error recovery in embedded cores.
Proceedings of the 2014 IEEE 20th International On-Line Testing Symposium, 2014

Cross-layer early reliability evaluation: Challenges and promises.
Proceedings of the 2014 IEEE 20th International On-Line Testing Symposium, 2014

Author retrospective for the dual data cache.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014

iRMW: A low-cost technique to reduce NBTI-dependent parametric failures in L1 data caches.
Proceedings of the 32nd IEEE International Conference on Computer Design, 2014

INFORMER: An integrated framework for early-stage memory robustness analysis.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

Warm-Up Simulation Methodology for HW/SW Co-Designed Processors.
Proceedings of the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014

Accurate off-line phase classification for HW/SW co-designed processors.
Proceedings of the Computing Frontiers Conference, CF'14, 2014

Replacement techniques for dynamic NUCA cache designs on CMPs.
J. Supercomput., 2013

Dynamic Selective Devectorization for Efficient Power Gating of SIMD Units in a HW/SW Co-Designed Environment.
Proceedings of the 25th International Symposium on Computer Architecture and High Performance Computing, 2013

Effectiveness of hybrid recovery techniques on parametric failures.
Proceedings of the International Symposium on Quality Electronic Design, 2013

Deconfigurable microprocessor architectures for silicon debug acceleration.
Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013

Reducing DUE-FIT of caches by exploiting acoustic wave detectors for error recovery.
Proceedings of the 2013 IEEE 19th International On-Line Testing Symposium (IOLTS), 2013

Vectorizing for Wider Vector Units in a HW/SW Co-designed Environment.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment.
Proceedings of the 20th Annual International Conference on High Performance Computing, 2013

Performance analysis and predictability of the software layer in dynamic binary translators/optimizers.
Proceedings of the Computing Frontiers Conference, 2013

The migration prefetcher: Anticipating data promotion in dynamic NUCA caches.
ACM Trans. Archit. Code Optim., 2012

Impact of positive bias temperature instability (PBTI) on 3T1D-DRAM cells.
Integr., 2012

A HW/SW Co-designed Programmable Functional Unit.
IEEE Comput. Archit. Lett., 2012

DDGacc: boosting dynamic DDG-based binary optimizations through specialized hardware support.
Proceedings of the 8th International Conference on Virtual Execution Environments, 2012

Improving the Resilience of an IDS against Performance Throttling Attacks.
Proceedings of the Security and Privacy in Communication Networks, 2012

Improving the Performance Efficiency of an IDS by Exploiting Temporal Locality in Network Traffic.
Proceedings of the 20th IEEE International Symposium on Modeling, 2012

Exploiting temporal locality in network traffic using commodity multi-cores.
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

Setting an error detection infrastructure with low cost acoustic wave detectors.
Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

Hardware/Software Mechanisms for Protecting an IDS against Algorithmic Complexity Attacks.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

A novel variation-tolerant 4T-DRAM cell with enhanced soft-error tolerance.
Proceedings of the 30th International IEEE Conference on Computer Design, 2012

Speculative dynamic vectorization for HW/SW co-designed processors.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

CROB: Implementing a Large Instruction Window through Compression.
Trans. High Perform. Embed. Archit. Compil., 2011

Compiler Directed Issue Queue Energy Reduction.
Trans. High Perform. Embed. Archit. Compil., 2011

Implementing End-to-End Register Data-Flow Continuous Self-Test.
IEEE Trans. Computers, 2011

TRAMS Project: Variability and Reliability of SRAM Memories in sub-22 nm Bulk-CMOS Technologies.
Proceedings of the 2nd European Future Technologies Conference and Exhibition, 2011

Design of complex circuits using the Via-Configurable transistor array regular layout fabric.
Proceedings of the IEEE 24th International SoC Conference, SOCC 2011, Taipei, Taiwan, 2011

A Power-Efficient Co-designed Out-of-Order Processor.
Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing, 2011

Accelerating microprocessor silicon validation by exposing ISA diversity.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Global productiveness propagation: a code optimization technique to speculatively prune useless narrow computations.
Proceedings of the ACM SIGPLAN/SIGBED 2011 conference on Languages, 2011

Thread shuffling: combining DVFS and thread migration to reduce energy consumptions for multi-core systems.
Proceedings of the 2011 International Symposium on Low Power Electronics and Design, 2011

A Performance and Area Efficient Architecture for Intrusion Detection Systems.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

HK-NUCA: Boosting Data Searches in Dynamic Non-Uniform Cache Architectures for Chip Multiprocessors.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

New reliability mechanisms in memory design for sub-22nm technologies.
Proceedings of the 17th IEEE International On-Line Testing Symposium (IOLTS 2011), 2011

Dynamic fine-grain body biasing of caches with latency and leakage 3T1D-based monitors.
Proceedings of the IEEE 29th International Conference on Computer Design, 2011

Fg-STP: Fine-Grain Single Thread Partitioning on Multicores.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Hardware/software-based diagnosis of load-store queues using expandable activity logs.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Moore's law implications on energy reduction.
Proceedings of the High Performance Embedded Architectures and Compilers, 2011

Implementing a hybrid SRAM / eDRAM NUCA architecture.
Proceedings of the 18th International Conference on High Performance Computing, 2011

SoftHV: a HW/SW co-designed processor with horizontal and vertical fusion.
Proceedings of the 8th Conference on Computing Frontiers, 2011

Beforehand Migration on D-NUCA Caches.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

A Co-designed HW/SW Approach to General Purpose Program Acceleration Using a Programmable Functional Unit.
Proceedings of the 15th Workshop on Interaction between Compilers and Computer Architectures, 2011

Processor Microarchitecture: An Implementation Perspective.
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01729-2, 2010

Leveraging Register Windows to Reduce Physical Registers to the Bare Minimum.
IEEE Trans. Computers, 2010

Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors.
ACM Trans. Archit. Code Optim., 2010

Energy efficiency via thread fusion and value reuse.
IET Comput. Digit. Tech., 2010

VCTA: A Via-Configurable Transistor Array regular fabric.
Proceedings of the 18th IEEE/IFIP VLSI-SoC 2010, 2010

A Dynamically Adaptable Hardware Transactional Memory.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

MT-SBST: Self-test optimization in multithreaded multicore architectures.
Proceedings of the 2011 IEEE International Test Conference, 2010

MODEST: a model for energy estimation under spatio-temporal variability.
Proceedings of the 2010 International Symposium on Low Power Electronics and Design, 2010

The auction: optimizing banks usage in Non-Uniform Cache Architectures.
Proceedings of the 24th International Conference on Supercomputing, 2010

High-Performance low-vcc in-order core.
Proceedings of the 16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), 2010

Circuit propagation delay estimation through multivariate regression-based modeling under spatio-temporal variability.
Proceedings of the Design, Automation and Test in Europe, 2010

Selective replication: A lightweight technique for soft errors.
ACM Trans. Comput. Syst., 2009

Reducing Soft Errors through Operand Width Aware Policies.
IEEE Trans. Dependable Secur. Comput., 2009

AGAMOS: A Graph-Based Approach to Modulo Scheduling for Clustered Microarchitectures.
IEEE Trans. Computers, 2009

Energy-efficient register caching with compiler assistance.
ACM Trans. Archit. Code Optim., 2009

Exploring the limits of early register release: Exploiting compiler analysis.
ACM Trans. Archit. Code Optim., 2009

Low Vccmin fault-tolerant cache with highly predictable performance.
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

Boosting single-thread performance in multi-core systems through fine-grain multi-threading.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

End-to-end register data-flow continuous self-test.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

Online error detection and correction of erratic bits in register files.
Proceedings of the 15th IEEE International On-Line Testing Symposium (IOLTS 2009), 2009

Using Coherence Information and Decay Techniques to Optimize L2 Cache Leakage in CMPs.
Proceedings of the ICPP 2009, 2009

LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors.
Proceedings of the 27th International Conference on Computer Design, 2009

P-slice based efficient speculative multithreading.
Proceedings of the 16th International Conference on High Performance Computing, 2009

Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Key Microarchitectural Innovations for Future Microprocessors.
Proceedings of the Architecture of Computing Systems, 2009

Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading.
Proceedings of the PACT 2009, 2009

FASTM: A Log-based Hardware Transactional Memory with Fast Abort Recovery.
Proceedings of the PACT 2009, 2009

Power/Performance/Thermal Design-Space Exploration for Multicore Architectures.
IEEE Trans. Parallel Distributed Syst., 2008

Mitosis: A Speculative Multithreaded Processor Based on Precomputation Slices.
IEEE Trans. Parallel Distributed Syst., 2008

Refueling: Preventing Wire Degradation due to Electromigration.
IEEE Micro, 2008

Version management alternatives for hardware transactional memory.
Proceedings of the 9th workshop on MEmory performance, 2008

Thread fusion.
Proceedings of the 2008 International Symposium on Low Power Electronics and Design, 2008

Efficient resources assignment schemes for clustered multithreaded processors.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

A software-hardware hybrid steering mechanism for clustered microarchitectures.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

On-Line Failure Detection and Confinement in Caches.
Proceedings of the 14th IEEE International On-Line Testing Symposium (IOLTS 2008), 2008

Meeting points: using thread criticality to adapt multicore hardware to parallel regions.
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008

Understanding the Thermal Implications of Multi-Core Architectures.
IEEE Trans. Parallel Distributed Syst., 2007

Guest Editors' Introduction: Micro's Top Picks from the Microarchitecture Conferences.
IEEE Micro, 2007

Reliability: Fallacy or Reality?
IEEE Micro, 2007

Penelope: The NBTI-Aware Processor.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Building a large instruction window through ROB compression.
Proceedings of the 2007 workshop on MEmory performance, 2007

Fuse: A Technique to Anticipate Failures due to Degradation in ALUs.
Proceedings of the 13th IEEE International On-Line Testing Symposium (IOLTS 2007), 2007

Improving Branch Prediction and Predicated Execution in Out-of-Order Processors.
Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Virtual Cluster Scheduling Through the Scheduling Graph.
Proceedings of the Fifth International Symposium on Code Generation and Optimization (CGO 2007), 2007

Heterogeneous Clustered VLIW Microarchitectures.
Proceedings of the Fifth International Symposium on Code Generation and Optimization (CGO 2007), 2007

Early Register Release for Out-of-Order Processors with Register Windows.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

Control Speculation for Energy-Efficient Next-Generation Superscalar Processors.
IEEE Trans. Computers, 2006

Impact of Parameter Variations on Circuits and Microarchitecture.
IEEE Micro, 2006

A dynamically reconfigurable cache for multithreaded processors.
J. Embed. Comput., 2006

Instruction scheduling for a clustered VLIW processor with a word-interleaved cache.
Concurr. Comput. Pract. Exp., 2006

Exploiting Narrow Values for Soft Error Tolerance.
IEEE Comput. Archit. Lett., 2006

Independent front-end and back-end dynamic voltage scaling for a GALS microarchitecture.
Proceedings of the 2006 International Symposium on Low Power Electronics and Design, 2006

Empowering a helper cluster through data-width aware instruction selection policies.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

SAMIE-LSQ: set-associative multiple-instruction entry load/store queue.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Selective predicate prediction for out-of-order processors.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

Design space exploration for multicore architectures: a power/performance/thermal view.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

Heterogeneous way-size cache.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

On-Chip Interconnects and Instruction Steering Schemes for Clustered Microarchitectures.
IEEE Trans. Parallel Distributed Syst., 2005

An accurate cost model for guiding data locality transformations.
ACM Trans. Program. Lang. Syst., 2005

Distributed Data Cache Designs for Clustered VLIW Processors.
IEEE Trans. Computers, 2005

IATAC: a smart predictor to turn-off L2 cache lines.
ACM Trans. Archit. Code Optim., 2005

Speculative execution for hiding memory latency.
SIGARCH Comput. Archit. News, 2005

Hardware support for early register release.
Int. J. High Perform. Comput. Netw., 2005

Mitosis compiler: an infrastructure for speculative threading based on pre-computation slices.
Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, 2005

Demystifying on-the-fly spill code.
Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, 2005

The Mitosis Speculative Multithreaded Architectures.
Proceedings of the Parallel Computing: Current & Future Issues of High-End Computing, 2005

Reducing Misspeculation Penalty in Trace-Level Speculative Multithreaded Architectures.
Proceedings of the High-Performance Computing - 6th International Symposium, 2005

Control-Flow Independence Reuse via Dynamic Vectorization.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Inherently Workload-Balanced Clustered Microarchitecture.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Memory Bank Predictors.
Proceedings of the 23rd International Conference on Computer Design (ICCD 2005), 2005

Software Directed Issue Queue Power Reduction.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

Distributing the Frontend for Temperature Reduction.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

Value Compression for Efficient Computation.
Proceedings of the Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30, 2005

Compiler Directed Early Register Release.
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005

Variable-Based Multi-module Data Caches for Clustered VLIW Processors.
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005

Compiler analysis for trace-level speculative multithreaded architectures.
Proceedings of the 9th Annual Workshop on Interaction between Compilers and Computer Architectures, 2005

A fast and accurate framework to analyze and optimize cache memory behavior.
ACM Trans. Program. Lang. Syst., 2004

Late Allocation and Early Release of Physical Registers.
IEEE Trans. Computers, 2004

Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism.
IEEE Trans. Computers, 2004

Removing communications in clustered microarchitectures through instruction replication.
ACM Trans. Archit. Code Optim., 2004

Cache organizations for clustered microarchitectures.
Proceedings of the 3rd Workshop on Memory Performance Issues, 2004

Back-end assignment schemes for clustered multithreaded processors.
Proceedings of the 18th Annual International Conference on Supercomputing, 2004

Frontend Frequency-Voltage Adaptation for Optimal Energy-Delay^2.
Proceedings of the 22nd IEEE International Conference on Computer Design: VLSI in Computers & Processors (ICCD 2004), 2004

Thermal-Aware Clustered Microarchitectures.
Proceedings of the 22nd IEEE International Conference on Computer Design: VLSI in Computers & Processors (ICCD 2004), 2004

Low-Complexity Distributed Issue Queue.
Proceedings of the 10th International Conference on High-Performance Computer Architecture (HPCA-10 2004), 2004

Software-Controlled Operand-Gating.
Proceedings of the 2nd IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2004), 2004

Power- and Complexity-Aware Issue Queue Designs.
IEEE Micro, 2003

A framework for modeling and optimization of prescient instruction prefetch.
Proceedings of the International Conference on Measurements and Modeling of Computer Systems, 2003

Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

Instruction Replication for Clustered Microarchitectures.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

Non redundant data cache.
Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003

Dynamic Cluster Resizing.
Proceedings of the 21st International Conference on Computer Design (ICCD 2003), 2003

On Reducing Register Pressure and Energy in Multiple-Banked Register Files.
Proceedings of the 21st International Conference on Computer Design (ICCD 2003), 2003

Power Efficient Data Cache Designs.
Proceedings of the 21st International Conference on Computer Design (ICCD 2003), 2003

Power-Aware Control Speculation through Selective Throttling.
Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA'03), 2003

Power-Aware Adaptive Issue Queue and Register File.
Proceedings of the High Performance Computing - HiPC 2003, 10th International Conference, 2003

Value Compression to Reduce Power in Data Caches.
Proceedings of the Euro-Par 2003. Parallel Processing, 2003

Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache.
Proceedings of the 1st IEEE / ACM International Symposium on Code Generation and Optimization (CGO 2003), 2003

Optimizing Program Locality Through CMEs and GAs.
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques (PACT 2003), 27 September, 2003

Hypercube Algorithms on Mesh Connected Multicomputers.
IEEE Trans. Parallel Distributed Syst., 2002

Errata on "Measuring Experimental Error in Microprocessor Simulation".
SIGARCH Comput. Archit. News, 2002

Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor.
Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002

Near-Optimal Padding for Removing Conflict Misses.
Proceedings of the Languages and Compilers for Parallel Computing, 15th Workshop, 2002

Speculative Dynamic Vectorization.
Proceedings of the 29th International Symposium on Computer Architecture (ISCA 2002), 2002

An interleaved cache clustered VLIW processor.
Proceedings of the 16th international conference on Supercomputing, 2002

A comparative study of modulo scheduling techniques.
Proceedings of the 16th international conference on Supercomputing, 2002

Dual path instruction processing.
Proceedings of the 16th international conference on Supercomputing, 2002

Near-Optimal Loop Tiling by Means of Cache Miss Equations and Genetic Algorithms.
Proceedings of the 31st International Conference on Parallel Processing Workshops (ICPP 2002 Workshops), 2002

Hardware Schemes for Early Register Release.
Proceedings of the 31st International Conference on Parallel Processing (ICPP 2002), 2002

Trace-Level Speculative Multithreaded Architecture.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

Thread-Spawning Schemes for Speculative Multithreading.
Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA'02), 2002

Efficient Interconnects for Clustered Microarchitectures.
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT 2002), 2002

Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning.
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT 2002), 2002

Improving Latency Tolerance of Multithreading through Decoupling.
IEEE Trans. Computers, 2001

Lifetime-Sensitive Modulo Scheduling in a Production Environment.
IEEE Trans. Computers, 2001

Control-Flow Speculation through Value Prediction.
IEEE Trans. Computers, 2001

Implementing the one-sided Jacobi method on a 2D/3D mesh multicomputer.
Parallel Comput., 2001

Clustered Modulo Scheduling in a VLIW Architecture with Distributed Cache.
J. Instr. Level Parallelism, 2001

Dynamic Code Partitioning for Clustered Architectures.
Int. J. Parallel Program., 2001

CALMANT: A Systematic Method for the Execution of Hypercube Algorithms in Multiprocessor Systems.
Computación y Sistemas, 2001

CALMANT: Un Método Sistemático para la Ejecución de Algoritmos Hipercubo en Sistemas Multiprocesador.
Computación y Sistemas, 2001

Graph-partitioning based instruction scheduling for clustered processors.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

Energy-effective issue logic.
Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001

Reducing the complexity of the issue logic.
Proceedings of the 15th international conference on Supercomputing, 2001

Selective Branch Prediction Reversal By Correlating with Data Values and Control Flow.
Proceedings of the 19th International Conference on Computer Design (ICCD 2001), 2001

Confidence Estimation for Branch Prediction Reversal.
Proceedings of the High Performance Computing - HiPC 2001, 8th International Conference, 2001

A Unified Modulo Scheduling and Register Allocation Technique for Clustered Processors.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

Optimizing cache miss equations polyhedra.
SIGARCH Comput. Archit. News, 2000

Analyzing Data Locality in Numeric Applications.
IEEE Micro, 2000

Dynamic Register Renaming Through Virtual-Physical Registers.
J. Instr. Level Parallelism, 2000

Modulo scheduling for a fully-distributed clustered VLIW architecture.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Reducing wire delay penalty through value prediction.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Very low power pipelines using significance compression.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Instruction Scheduling for Clustered VLIW Architectures.
Proceedings of the 13th International Symposium on System Synthesis, 2000

An efficient solver for Cache Miss Equations.
Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software, 2000

Multiple-banked register file architectures.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

A Quantitative Assessment of Thread-Level Speculation Techniques.
Proceedings of the 14th International Parallel & Distributed Processing Symposium (IPDPS'00), 2000

A low-complexity issue logic.
Proceedings of the 14th international conference on Supercomputing, 2000

The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures.
Proceedings of the 2000 International Conference on Parallel Processing, 2000

Dynamic Cluster Assignment Mechanisms.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

A Fast and Accurate Approach to Analyze Cache Memory Behavior (Research Note).
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

Complete Exchange Algorithms for Meshes and Tori Using a Systematic Approach (Research Note).
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

Low Communication Overhead Jacobi Algorithms for Eigenvalues Computation on Hypercubes.
J. Supercomput., 1999

Randomized Cache Placement for Eliminating Conflicts.
IEEE Trans. Computers, 1999

Software Data Prefetching for Software Pipelined Loops.
J. Parallel Distributed Comput., 1999

Delaying Physical Register Allocation through Virtual-Physical Registers.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

Value Prediction for Speculative Multithreaded Architectures.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

A locality sensitive multi-module cache with explicit management.
Proceedings of the 13th international conference on Supercomputing, 1999

Dynamic removal of redundant computations.
Proceedings of the 13th international conference on Supercomputing, 1999

Clustered speculative multithreaded processors.
Proceedings of the 13th international conference on Supercomputing, 1999

Trace-Level Reuse.
Proceedings of the International Conference on Parallel Processing 1999, 1999

Reducing Memory Traffic Via Redundant Store Instructions.
Proceedings of the High-Performance Computing and Networking, 7th International Conference, 1999

Exploiting Speculative Thread-Level Parallelism on a SMT Processor.
Proceedings of the High-Performance Computing and Networking, 7th International Conference, 1999

The Synergy of Multithreading and Access/Execute Decoupling.
Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, 1999

Control-Flow Speculation through Value Prediction for Superscalar Processors.
Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999

A Cost-Effective Clustered Architecture.
Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999

Modulo Scheduling with Reduced Register Pressure.
IEEE Trans. Computers, 1998

A Method for Exploiting Communication/Computation Overlap in Hypercubes.
Parallel Comput., 1998

Data value speculation in superscalar processors.
Microprocess. Microsystems, 1998

Limits of Instruction Level Parallelism with Data Value Speculation.
Proceedings of the Vector and Parallel Processing, 1998

A Jacobi-based algorithm for computing symmetric eigenvalues and eigenvectors in a two-dimensional mesh.
Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing, 1998

Jacobi Orderings for Multi-Port Hypercubes.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

Speculative Multithreaded Processors.
Proceedings of the 12th international conference on Supercomputing, 1998

The Potential of Data Value Speculation to Boost ILP.
Proceedings of the 12th international conference on Supercomputing, 1998

Control Speculation in Multithreaded Processors through Dynamic Loop Detection.
Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, Las Vegas, Nevada, USA, January 31, 1998

Virtual-Physical Registers.
Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, Las Vegas, Nevada, USA, January 31, 1998

Software Prefetching for Software Pipelined Loops.
Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, 1998

Divide-and-Conquer Algorithms on Two-Dimensional Meshes.
Proceedings of the Euro-Par '98 Parallel Processing, 1998

The Latency Hiding Effectiveness of Decoupled Access/Execute Processors.
Proceedings of the 24th EUROMICRO '98 Conference, 1998

Data Speculative Multithreaded Architecture.
Proceedings of the 24th EUROMICRO '98 Conference, 1998

Fast, Accurate and Flexible Data Locality Analysis.
Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998

The Design and Performance of a Conflict-Avoiding Cache.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Cache Sensitive Modulo Scheduling.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Eliminating Cache Conflict Misses through XOR-Based Placement Functions.
Proceedings of the 11th international conference on Supercomputing, 1997

Speculative Execution via Address Prediction and Data Prefetching.
Proceedings of the 11th international conference on Supercomputing, 1997

Virtual registers.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

PARSAR: Parallelisation of a Chirp Scaling Algorithm SAR Processor.
Proceedings of the Euro-Par '97 Parallel Processing, 1997

Memory Address Prediction for Data Speculation.
Proceedings of the Euro-Par '97 Parallel Processing, 1997

A Methodology for User-Oriented Scalability Analysis.
Proceedings of the 1997 International Conference on Application-Specific Systems, 1997

Static Locality Analysis for Cache Management.
Proceedings of the 1997 Conference on Parallel Architectures and Compilation Techniques (PACT '97), 1997

Communication Pipelining in Hypercubes.
Parallel Process. Lett., 1996

The Multipath Architecture for Prolog Programs.
Comput. J., 1996

Overlapping Communication and Computation in Hypercubes.
Proceedings of the Euro-Par '96 Parallel Processing, 1996

Swing modulo scheduling: a lifetime-sensitive approach.
Proceedings of the Fifth International Conference on Parallel Architectures and Compilation Techniques, 1996

Executing Algorithms with Hypercube Topology on Torus Multicomputers.
IEEE Trans. Parallel Distributed Syst., 1995

Exploiting path parallelism in logic programming.
Proceedings of the 3rd Euromicro Workshop on Parallel and Distributed Processing (PDP '95), 1995

Load Balancing in a Network Flow Optimization Code.
Proceedings of the Applied Parallel Computing, 1995

Hypernode reduction modulo scheduling.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality.
Proceedings of the 9th international conference on Supercomputing, 1995

Design and Evaluation of an Instruction Cache for Reducing the Cost of Branches.
Perform. Evaluation, 1994

Parallel Numerical Algorithms.
Proceedings of the Second Euromicro Workshop on Parallel and Distributed Processing, 1994

The Multipath Parallel Execution Model for Prolog.
Proceedings of the First International Symposium on Parallel Symbolic Computation, 1994

A Partial Breadth-First Execution Model for Prolog.
Proceedings of the Sixth International Conference on Tools with Artificial Intelligence, 1994

Combining depth-first and breadth-first search in Prolog execution.
Proceedings of the 1994 Joint Conference on Declarative Programming, 1994

Reducing Branch Delay to Zero in Pipelined Processors.
IEEE Trans. Computers, 1993

Chairmen's introduction.
Microprocess. Microprogramming, 1993

MEM: A new execution model for Prolog.
Microprocess. Microprogramming, 1993

A survey of branch techniques in pipelined processors.
Microprocess. Microprogramming, 1993

The Xor embedding: An embedding of hypercubes onto rings and toruses.
Proceedings of the International Conference on Application-Specific Array Processors, 1993

Instruction fetch unit for parallel execution of branch instructions.
Proceedings of the 3rd international conference on Supercomputing, 1989

A mechanism for reducing the cost of branches in RISC architectures.
Microprocess. Microprogramming, 1988
