2022
Supporting Massive DLRM Inference through Software Defined Memory.
Proceedings of the 42nd IEEE International Conference on Distributed Computing Systems, 2022
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022
2021
Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale.
IEEE Micro, 2021
Supporting Massive DLRM Inference Through Software Defined Memory.
CoRR, 2021
First-Generation Inference Accelerator Deployment at Facebook.
CoRR, 2021
2020
Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems.
CoRR, 2020
2018
WSMeter: A Performance Evaluation Methodology for Google's Production Warehouse-Scale Computers.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018
2016
Using SSDs to scale up Google Fusion Tables, a database-in-the-cloud.
Proceedings of the 32nd IEEE International Conference on Data Engineering, 2016
2015
Can traditional programming bridge the ninja performance gap for parallel computing applications?
Commun. ACM, 2015
2014
Author retrospective for a NUCA substrate for flexible CMP cache sharing.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014
Joint interference and user association optimization in cellular wireless networks.
Proceedings of the 48th Asilomar Conference on Signals, Systems and Computers, 2014
2013
Joint Interference and User Association Optimization in Cellular Wireless Networks.
CoRR, 2013
Locality-aware task management for unstructured parallelism: a quantitative limit study.
Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures, 2013
Opportunistic third-party backhaul for cellular wireless networks.
Proceedings of the 2013 Asilomar Conference on Signals, Systems and Computers, 2013
2012
DySER: Unifying Functionality and Parallelism Specialization for Energy-Efficient Computing.
IEEE Micro, 2012
CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster.
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012
Large-scale energy-efficient graph traversal: a path to efficient data-intensive supercomputing.
Proceedings of the SC Conference on High Performance Computing Networking, 2012
Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems.
Proceedings of the SC Conference on High Performance Computing Networking, 2012
GPP-Grep: High-Speed Regular Expression Processing Engine on General Purpose Processors.
Proceedings of the Research in Attacks, Intrusions, and Defenses, 2012
Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012
2011
Designing fast architecture-sensitive tree search on modern multicore/many-core processors.
ACM Trans. Database Syst., 2011
PALM: Parallel Architecture-Friendly Latch-Free Modifications to B+ Trees on Many-Core Processors.
Proc. VLDB Endow., 2011
Fast Updates on Read-Optimized Databases Using Multi-Core CPUs.
Proc. VLDB Endow., 2011
Moguls: a model to explore the memory hierarchy for bandwidth improvements.
Proceedings of the 38th International Symposium on Computer Architecture (ISCA 2011), 2011
2010
Performance and Energy Implications of Many-Core Caches for Throughput Computing.
IEEE Micro, 2010
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort.
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010
FAST: fast architecture sensitive tree search on modern CPUs and GPUs.
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs.
Proceedings of the Conference on High Performance Computing Networking, 2010
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010
2009
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs.
Proc. VLDB Endow., 2009
ClearPath: highly parallel collision avoidance for multi-agent simulation.
Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2009
Interactive Modeling, Simulation and Control of Large-Scale Crowds and Traffic.
Proceedings of the Motion in Games, Second International Workshop, 2009
Efficient shared cache management through sharing-aware replacement and streaming-aware insertion policy.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009
2008
Multitasking workload scheduling on flexible core chip multiprocessors.
SIGARCH Comput. Archit. News, 2008
Second Life and the New Generation of Virtual Worlds.
Computer, 2008
Atomic Vector Operations on Chip Multiprocessors.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008
2007
A NUCA Substrate for Flexible CMP Cache Sharing.
IEEE Trans. Parallel Distributed Syst., 2007
On-Chip Interconnection Networks of the TRIPS Chip.
IEEE Micro, 2007
Composable Lightweight Processors.
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-40 2007), 2007
2006
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor.
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006
Implementation and Evaluation of On-Chip Network Architectures.
Proceedings of the 24th International Conference on Computer Design (ICCD 2006), 2006
2004
TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP.
ACM Trans. Archit. Code Optim., 2004
2003
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture.
IEEE Micro, 2003
Nonuniform Cache Architectures for Wire-Delay Dominated On-Chip Caches.
IEEE Micro, 2003
2002
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches.
Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002