Babak Falsafi

ORCID: 0000-0001-5916-8068

Affiliations:
  • EPFL, EcoCloud research center, Lausanne, Switzerland
  • Carnegie Mellon University, Pittsburgh, PA, USA
  • University of Wisconsin-Madison, Madison, WI, USA (PhD)


According to our database, Babak Falsafi authored at least 161 papers between 1993 and 2024.

Awards

ACM Fellow

ACM Fellow 2015, "For contributions to multiprocessor and memory architecture design and evaluation."

IEEE Fellow

IEEE Fellow 2012, "For contributions to multiprocessor architecture and memory systems".

Bibliography

2024
Server Architecture From Enterprise to Post-Moore.
IEEE Micro, 2024

Effective Interplay between Sparsity and Quantization: From Theory to Practice.
CoRR, 2024

2023
What's Missing in Agile Hardware Design? Verification!
J. Comput. Sci. Technol., July, 2023

Scale-out Systolic Arrays.
ACM Trans. Archit. Code Optim., June, 2023

SecureCells: A Secure Compartmentalized Architecture.
Proceedings of the 44th IEEE Symposium on Security and Privacy, 2023

Imprecise Store Exceptions.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023

AstriFlash: A Flash-Based System for Online Services.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

Cooperative Concurrency Control for Write-Intensive Key-Value Workloads.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
Accuracy Boosters: Epoch-Driven Mixed-Mantissa Block Floating-Point for DNN Training.
CoRR, 2022

2021
Efficient Nearest-Neighbor Data Sharing in GPUs.
ACM Trans. Archit. Code Optim., 2021

Approximate Systems (Dagstuhl Seminar 21302).
Dagstuhl Reports, 2021

Exploiting Errors for Efficiency: A Survey from Circuits to Applications.
ACM Comput. Surv., 2021

Cerebros: Evading the RPC Tax in Datacenters.
Proceedings of MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Equinox: Training (for Free) on a Custom Inference Accelerator.
Proceedings of MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Rebooting Virtual Memory with Midgard.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

2020
Enabling High-Capacity, Latency-Tolerant, and Highly-Concurrent GPU Register Files via Software/Hardware Cooperation.
CoRR, 2020

SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators.
CoRR, 2020

The NEBULA RPC-Optimized Architecture.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020

Post-Moore server architecture.
Proceedings of ICS '20: 2020 International Conference on Supercomputing, 2020

Optimus Prime: Accelerating Data Transformation in Servers.
Proceedings of ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

2019
Highly Concurrent Latency-tolerant Register Files for GPUs.
ACM Trans. Comput. Syst., 2019

Analog Neural Networks With Deep-Submicrometer Nonlinear Synapses.
IEEE Micro, 2019

Distributed Logless Atomic Durability with Persistent Memory.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

SMoTherSpectre: Exploiting Speculative Execution through Port Contention.
Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 2019

RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

2018
Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling.
ACM Trans. Comput. Syst., 2018

Algorithm/Architecture Co-Design for Near-Memory Processing.
ACM SIGOPS Oper. Syst. Rev., 2018

Exploiting Errors for Efficiency: A Survey from Circuits to Algorithms.
CoRR, 2018

End-to-End DNN Training with Block Floating Point Arithmetic.
CoRR, 2018

Storage-Class Memory Hierarchies for Scale-Out Servers.
CoRR, 2018

Training DNNs with Hybrid Block Floating Point.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Design guidelines for high-performance SCM hierarchies.
Proceedings of the International Symposium on Memory Systems, 2018

LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

2017
Fat Caches for Scale-Out Servers.
IEEE Micro, 2017

FPGAs versus GPUs in Data centers.
IEEE Micro, 2017

The Mondrian Data Engine.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Warehouse-Scale Computing in the Post-Moore Era.
Proceedings of the Algorithmic Aspects of Cloud Computing - Third International Workshop, 2017

Near-Memory Address Translation.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

2016
Near-Memory Data Services.
IEEE Micro, 2016

Unlocking Energy.
Proceedings of the 2016 USENIX Annual Technical Conference, 2016

An Analysis of Load Imbalance in Scale-out Data Serving.
Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, 2016

SABRes: Atomic object reads for in-memory rack-scale computing.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Towards near-threshold server processors.
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition, 2016

The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems.
Proceedings of the Seventh ACM Symposium on Cloud Computing, 2016

2015
Asynchronous Memory Access Chaining.
Proc. VLDB Endow., 2015

Rack-scale Computing (Dagstuhl Seminar 15421).
Dagstuhl Reports, 2015

Confluence: unified instruction supply for scale-out servers.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

Contention detection by throttling: A black-box on-line approach.
Proceedings of the 23rd IEEE International Symposium on Quality of Service, 2015

Manycore network interfaces for in-memory rack-scale computing.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

2014
A Primer on Hardware Prefetching.
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01743-8, 2014

A Case for Specialized Processors for Scale-Out Workloads.
IEEE Micro, 2014

Big Data [Guest editors' introduction].
IEEE Micro, 2014

BuMP: Bulk Memory Access Prediction and Streaming.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

FADE: A programmable filtering accelerator for instruction-grain monitoring.
Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture, 2014

Scale-out NUMA.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2014


2013
Top Picks from the 2012 Computer Architecture Conferences.
IEEE Micro, 2013

DeSyRe: On-demand system reliability.
Microprocess. Microsystems, 2013

Building Fast, Dense, Low-Power Caches Using Erasure-Based Inline Multi-bit ECC.
Proceedings of the IEEE 19th Pacific Rim International Symposium on Dependable Computing, 2013

Multi-grain coherence directories.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Meet the walkers: accelerating index traversals for in-memory databases.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

SHIFT: shared history instruction fetch for lean-core server processors.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache.
Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013

From embedded multi-core SoCs to scale-out processors.
Proceedings of the Design, Automation and Test in Europe, 2013

2012
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors.
ACM Trans. Comput. Syst., 2012

Optimizing Data-Center TCO with Scale-Out Processors.
IEEE Micro, 2012

CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers.
Proceedings of the 2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), 2012

NOC-Out: Microarchitecting a Scale-Out Processor.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Scale-out processors.
Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

Thermal characterization of cloud workloads on a power-efficient server-on-chip.
Proceedings of the 30th International IEEE Conference on Computer Design, 2012

The DeSyRe Project: On-Demand System Reliability.
Proceedings of the 15th Euromicro Conference on Digital System Design, 2012

Clearing the clouds: a study of emerging scale-out workloads on modern hardware.
Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, 2012

2011
Data Centers.
Encyclopedia of Parallel Computing, 2011

Toward Dark Silicon in Servers.
IEEE Micro, 2011

Spatial Memory Streaming.
J. Instr. Level Parallelism, 2011

Proactive instruction fetch.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Cuckoo directory: A scalable directory for many-core systems.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

2010
Making Address-Correlated Prefetching Practical.
IEEE Micro, 2010

Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures.
IEEE Micro, 2010

TurboTag: lookup filtering to reduce coherence directory power.
Proceedings of the 2010 International Symposium on Low Power Electronics and Design, 2010

ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications.
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010

Using dead blocks as a virtual victim cache.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

2009
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs.
ACM Trans. Reconfigurable Technol. Syst., 2009

Flexible Hardware Acceleration for Instruction-Grain Lifeguards.
IEEE Micro, 2009

Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors.
Proceedings of the 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, 2009

Spatio-temporal memory streaming.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

Reactive NUCA: near-optimal block placement and replication in distributed caches.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

Practical off-chip meta-data for temporal memory streaming.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

Shore-MT: a scalable storage manager for the multicore era.
Proceedings of EDBT 2009, 2009

2008
Introduction.
ACM SIGPLAN Notices, 2008

Temporal instruction fetch streaming.
Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-41 2008), 2008

Flexible Hardware Acceleration for Instruction-Grain Program Monitoring.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

Temporal streams in commercial server applications.
Proceedings of the 4th International Symposium on Workload Characterization (IISWC 2008), 2008

A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs.
Proceedings of the ACM/SIGDA 16th International Symposium on Field Programmable Gate Arrays, 2008

Predictor virtualization.
Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008

2007
To Share or Not To Share?
Proceedings of the 33rd International Conference on Very Large Data Bases, 2007

Scheduling threads for constructive cache sharing on CMPs.
SPAA 2007: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2007

PAI: A Lightweight Mechanism for Single-Node Memory Recovery in DSM Servers.
Proceedings of the 13th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2007), 2007

Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007

Last-Touch Correlated Data Streaming.
Proceedings of the 2007 IEEE International Symposium on Performance Analysis of Systems and Software, 2007

Mechanisms for store-wait-free multiprocessors.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

PROToFLEX: FPGA-accelerated Hybrid Functional Simulator.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Database Servers on Chip Multiprocessors: Limitations and Opportunities.
Proceedings of the Third Biennial Conference on Innovative Data Systems Research, 2007

2006
Exploiting reference idempotency to reduce speculative storage overflow.
ACM Trans. Program. Lang. Syst., 2006

Statistical sampling of microarchitecture simulation.
ACM Trans. Model. Comput. Simul., 2006

SimFlex: Statistical Sampling of Computer System Simulation.
IEEE Micro, 2006

Coarse-Grain Coherence Tracking: RegionScout and Region Coherence Arrays.
IEEE Micro, 2006

Dynamic feature selection for hardware prediction.
J. Syst. Archit., 2006

Parallel depth first vs. work stealing schedulers on CMP architectures.
SPAA 2006: Proceedings of the 18th Annual ACM Symposium on Parallelism in Algorithms and Architectures, Cambridge, Massachusetts, USA, July 30, 2006

Reunion: Complexity-Effective Multicore Redundancy.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

Simulation sampling with live-points.
Proceedings of the 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006

Spatial Memory Streaming.
Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006), 2006

Log-based architectures for general-purpose monitoring of deployed code.
Proceedings of the 1st Workshop on Architectural and System Support for Improving Software Dependability, 2006

2005
A Case for Asymmetric-Cell Cache Memories.
IEEE Trans. Very Large Scale Integr. Syst., 2005

TRUSS: A Reliable, Scalable Server Architecture.
IEEE Micro, 2005

Evaluating scheduling policies for fine-grain communication protocols on a cluster of SMPs.
J. Parallel Distributed Comput., 2005

TurboSMARTS: accurate microarchitecture simulation sampling in minutes.
Proceedings of the International Conference on Measurements and Modeling of Computer Systems, 2005

Temporal Streaming of Shared Memory.
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), 2005

RECAST: Boosting Tag Line Buffer Coverage in Low-Power High-Level Caches "for Free".
Proceedings of the 23rd International Conference on Computer Design (ICCD 2005), 2005

Accelerating Database Operations Using a Network Processor.
Proceedings of the Workshop on Data Management on New Hardware, 2005

Architecture-Conscious Databases: sub-optimization or the next big leap?
Proceedings of the Workshop on Data Management on New Hardware, 2005

DBmbench: fast and accurate database workload representation on modern microarchitecture.
Proceedings of the 2005 conference of the Centre for Advanced Studies on Collaborative Research, 2005

Store-Ordered Streaming of Shared Memory.
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT 2005), 2005

2004
SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture.
SIGMETRICS Perform. Evaluation Rev., 2004

Fingerprinting: Bounding Soft-Error-Detection Latency and Bandwidth.
IEEE Micro, 2004

Memory coherence activity prediction in commercial workloads.
Proceedings of the 3rd Workshop on Memory Performance Issues, 2004

Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures.
Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO-37 2004), 2004

Accurate and Complexity-Effective Spatial Pattern Prediction.
Proceedings of the 10th International Conference on High-Performance Computer Architecture (HPCA-10 2004), 2004

2003
Speculative Sequential Consistency with Little Custom Storage.
J. Instr. Level Parallelism, 2003

Near-Optimal Precharging in High-Performance Nanoscale CMOS Caches.
Proceedings of the 36th Annual International Symposium on Microarchitecture, 2003

SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling.
Proceedings of the 30th International Symposium on Computer Architecture (ISCA 2003), 2003

Implicitly-Multithreaded Processors.
Proceedings of the 30th International Symposium on Computer Architecture (ISCA 2003), 2003

2002
Optimizing Traffic in DSM Clusters: Fine-Grain Memory Caching versus Page Migration/Replication.
Theory Comput. Syst., 2002

Exploiting Choice in Resizable Cache Design to Optimize Deep-Submicron Processor Energy-Delay.
Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA'02), 2002

2001
Reducing leakage in a high-performance deep-submicron instruction cache.
IEEE Trans. Very Large Scale Integr. Syst., 2001

Reference idempotency analysis: a framework for optimizing speculative execution.
Proceedings of the 2001 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'01), 2001

Dual use of superscalar datapath for transient-fault detection and recovery.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

Reducing set-associative cache energy via way-prediction and selective direct-mapping.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

Dead-block prediction & dead-block correlating prefetchers.
Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001

Multiplex: unifying conventional and speculative thread-level parallelism on a chip multiprocessor.
Proceedings of the 15th international conference on Supercomputing, 2001

An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches.
Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers.
Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

2000
Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator.
IEEE Concurr., 2000

Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters.
Proceedings of the Twelfth annual ACM Symposium on Parallel Algorithms and Architectures, 2000

Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories.
Proceedings of the 2000 International Symposium on Low Power Electronics and Design, 2000

Selective, accurate, and timely self-invalidation using last-touch prediction.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Address Partitioning in DSM Clusters with Parallel Coherence Controllers.
Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT'00), 2000

1999
Memory Sharing Predictor: The Key to a Speculative Coherent DSM.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

Is SC + ILP = RC?
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

Parallel Dispatch Queue: A Queue-Based Programming Abstraction to Parallelize Fine-Grain Communication Protocols.
Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, 1999

1998
Sirocco: Cost-Effective Fine-Grain Distributed Shared Memory.
Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998

1997
Modeling Cost/Performance of a Parallel Computer Simulator.
ACM Trans. Model. Comput. Simul., 1997

Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

Scheduling Communication on a SMP Node Parallel Machine.
Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA '97), 1997

1996
Coherent Network Interfaces for Fine-Grain Communication.
Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996

1994
Application-specific protocols for user-level shared memory.
Proceedings of Supercomputing '94, 1994

Cost/performance of a parallel computer simulator.
Proceedings of the Eighth Workshop on Parallel and Distributed Simulation, 1994

Fine-grain Access Control for Distributed Shared Memory.
Proceedings of ASPLOS-VI, 1994

1993
Kernel Support for the Wisconsin Wind Tunnel.
Proceedings of the USENIX Microkernels and Other Kernel Architectures Symposium, 1993

Mechanisms for Cooperative Shared Memory.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

