William J. Dally

Orcid: 0000-0003-4632-2876

Affiliations:
  • Stanford University, USA
  • NVIDIA


According to our database, William J. Dally authored at least 264 papers between 1985 and 2024.

Awards

ACM Fellow

ACM Fellow 2002, "For contributions to the architecture and design of interconnections networks and parallel computing."

IEEE Fellow

IEEE Fellow 2002, "For contributions to parallel computing and interconnection networks".

Bibliography

2024
A 0.190-pJ/bit 25.2-Gb/s/wire Inverter-Based AC-Coupled Transceiver for Short-Reach Die-to-Die Interfaces in 5-nm CMOS.
IEEE J. Solid State Circuits, April, 2024

Leveraging Micro-Bump Pitch Scaling to Accelerate Interposer Link Bandwidths for Future High-Performance Compute Applications.
Proceedings of the IEEE Custom Integrated Circuits Conference, 2024

2023
A Novel High-Efficiency Three-Phase Multilevel PV Inverter With Reduced DC-Link Capacitance.
IEEE Trans. Ind. Electron., 2023

A 0.297-pJ/Bit 50.4-Gb/s/Wire Inverter-Based Short-Reach Simultaneous Bi-Directional Transceiver for Die-to-Die Interface in 5-nm CMOS.
IEEE J. Solid State Circuits, 2023

A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm.
IEEE J. Solid State Circuits, 2023

ChipNeMo: Domain-Adapted LLMs for Chip Design.
CoRR, 2023

Retrospective: EIE: Efficient Inference Engine on Sparse and Compressed Neural Network.
CoRR, 2023

SatIn: Hardware for Boolean Satisfiability Inference.
CoRR, 2023

Hardware for Deep Learning.
Proceedings of the 35th IEEE Hot Chips Symposium, 2023

GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture.
Dataset, October, 2022

LNS-Madam: Low-Precision Training in Logarithmic Number System Using Multiplicative Weight Update.
IEEE Trans. Computers, 2022

BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage.
CoRR, 2022

On the model of computation: point.
Commun. ACM, 2022

A 0.297-pJ/bit 50.4-Gb/s/wire Inverter-Based Short-Reach Simultaneous Bidirectional Transceiver for Die-to-Die Interface in 5nm CMOS.
Proceedings of the IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits 2022), 2022

A 17-95.6 TOPS/W Deep Learning Inference Accelerator with Per-Vector Scaled 4-bit Quantization for Transformers in 5nm.
Proceedings of the IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits 2022), 2022

Frontier vs the Exascale Report: Why so long? and Are We Really There Yet?
Proceedings of the IEEE/ACM International Workshop on Performance Modeling, 2022

Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training.
Proceedings of the International Conference on Machine Learning, 2022

2021
OP-VENT: A Low-Cost, Easily Assembled, Open-Source Medical Ventilator.
GetMobile Mob. Comput. Commun., 2021

Evolution of the Graphics Processing Unit (GPU).
IEEE Micro, 2021

Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update.
CoRR, 2021

PatchNet - Short-range Template Matching for Efficient Video Processing.
CoRR, 2021

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference.
CoRR, 2021

Simba: scaling deep-learning inference with chiplet-based architecture.
Commun. ACM, 2021

SPAA'21 Panel Paper: Architecture-Friendly Algorithms versus Algorithm-Friendly Architectures.
Proceedings of the SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures, 2021

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

2020
Energy Efficient On-Demand Dynamic Branch Prediction Models.
IEEE Trans. Computers, 2020

Accelerating Chip Design With Machine Learning.
IEEE Micro, 2020

A 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm.
IEEE J. Solid State Circuits, 2020

Domain-specific hardware accelerators.
Commun. ACM, 2020

SpArch: Efficient Architecture for Sparse Matrix Multiplication.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

Optimal Operation of a Plug-in Hybrid Vehicle with Battery Thermal and Degradation Model.
Proceedings of the 2020 American Control Conference, 2020

2019
Darwin: A Genomics Coprocessor.
IEEE Micro, 2019

A 1.17-pJ/b, 25-Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication Using a Process- and Temperature-Adaptive Voltage Regulator.
IEEE J. Solid State Circuits, 2019

SysML: The New Frontier of Machine Learning Systems.
CoRR, 2019

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm.
Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, June 9-14, 2019

CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video.
Proceedings of the Second Conference on Machine Learning and Systems, SysML 2019, 2019

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

A Delay Metric for Video Object Detection: What Average Precision Fails to Tell.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

MAGNet: A Modular Accelerator Generator for Neural Networks.
Proceedings of the International Conference on Computer-Aided Design, 2019

Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology.
Proceedings of the 2019 IEEE Hot Chips 31 Symposium (HCS), 2019

Analog/Mixed-Signal Hardware Error Modeling for Deep Learning Inference.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

A 2-to-20 GHz Multi-Phase Clock Generator with Phase Interpolators Using Injection-Locked Oscillation Buffers for High-Speed IOs in 16nm FinFET.
Proceedings of the IEEE Custom Integrated Circuits Conference, 2019

A Fine-Grained GALS SoC with Pausible Adaptive Clocking in 16 nm FinFET.
Proceedings of the 25th IEEE International Symposium on Asynchronous Circuits and Systems, 2019

2018
Optimal Operation of a Plug-In Hybrid Vehicle.
IEEE Trans. Veh. Technol., 2018

Hardware-Enabled Artificial Intelligence.
Proceedings of the 2018 IEEE Symposium on VLSI Circuits, 2018

A 1.17pJ/b 25Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication in 16nm CMOS using a process- and temperature-adaptive voltage regulator.
Proceedings of the 2018 IEEE International Solid-State Circuits Conference, 2018

Efficient Sparse-Winograd Convolutional Neural Networks.
Proceedings of the 6th International Conference on Learning Representations, 2018

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.
Proceedings of the 6th International Conference on Learning Representations, 2018

Bandwidth-efficient deep learning.
Proceedings of the 55th Annual Design Automation Conference, 2018

Ground-referenced signaling for intra-chip and short-reach chip-to-chip interconnects.
Proceedings of the 2018 IEEE Custom Integrated Circuits Conference, 2018

Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

2017
CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance.
ACM Trans. Archit. Code Optim., 2017

FPGAs versus GPUs in Data centers.
IEEE Micro, 2017

HoLiSwap: Reducing Wire Energy in L1 Caches.
CoRR, 2017

Deep Generative Adversarial Networks for Compressed Sensing Automates MRI.
CoRR, 2017

Exploring the Regularity of Sparse Structure in Convolutional Neural Networks.
CoRR, 2017

Fine-grained DRAM: energy-efficient DRAM for extreme bandwidth systems.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks.
Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017

Efficient methods and hardware for deep learning.
Proceedings of the Workshop on Trends in Machine-Learning (and impact on computer architecture), 2017

Trained Ternary Quantization.
Proceedings of the 5th International Conference on Learning Representations, 2017

Efficient Sparse-Winograd Convolutional Neural Networks.
Proceedings of the 5th International Conference on Learning Representations, 2017

DSD: Dense-Sparse-Dense Training for Deep Neural Networks.
Proceedings of the 5th International Conference on Learning Representations, 2017

Architecting an Energy-Efficient DRAM System for GPUs.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017

Exploring the Granularity of Sparsity in Convolutional Neural Networks.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017

2016
Reuse Distance-Based Probabilistic Cache Replacement.
ACM Trans. Archit. Code Optim., 2016

A 28 nm 2 Mbit 6 T SRAM With Highly Configurable Low-Voltage Write-Ability Assist Implementation and Capacitor-Based Sense-Amplifier Input Offset Compensation.
IEEE J. Solid State Circuits, 2016

CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution.
CoRR, 2016

SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size.
CoRR, 2016

DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow.
CoRR, 2016

Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding.
Proceedings of the 4th International Conference on Learning Representations, 2016

ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA.
CoRR, 2016

A 6.5-to-23.3fJ/b/mm balanced charge-recycling bus in 16nm FinFET CMOS at 1.7-to-2.6Gb/s/wire with clock forwarding and low-crosstalk contraflow wiring.
Proceedings of the 2016 IEEE International Solid-State Circuits Conference, 2016

EIE: Efficient Inference Engine on Compressed Deep Neural Network.
Proceedings of the 43rd ACM/IEEE Annual International Symposium on Computer Architecture, 2016

Deep compression and EIE: Efficient inference engine on compressed deep neural network.
Proceedings of the 2016 IEEE Hot Chips 28 Symposium (HCS), 2016

2015
On-Chip Active Messages for Speed, Scalability, and Efficiency.
IEEE Trans. Parallel Distributed Syst., 2015

Learning both Weights and Connections for Efficient Neural Networks.
CoRR, 2015

On-Demand Dynamic Branch Prediction.
IEEE Comput. Archit. Lett., 2015

Network endpoint congestion control for fine-grained communication.
Proceedings of the International Conference for High Performance Computing, 2015

Learning both Weights and Connections for Efficient Neural Network.
Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 2015

SLIP: reducing wire energy in the memory hierarchy.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

2014
Scaling the Power Wall: A Path to Exascale.
Proceedings of the International Conference for High Performance Computing, 2014

Author retrospective for design tradeoffs for tiled CMP on-chip networks.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014

2013
Elastic Buffer Flow Control for On-Chip Networks.
IEEE Trans. Computers, 2013

A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications.
IEEE J. Solid State Circuits, 2013

Channel reservation protocol for over-subscribed channels and destinations.
Proceedings of the International Conference for High Performance Computing, 2013

A 0.54pJ/b 20Gb/s ground-referenced single-ended short-haul serial link in 28nm CMOS for advanced packaging applications.
Proceedings of the 2013 IEEE International Solid-State Circuits Conference, 2013

A detailed and flexible cycle-accurate Network-on-Chip simulator.
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

21st century digital design tools.
Proceedings of the 50th Annual Design Automation Conference 2013, 2013

2012
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors.
ACM Trans. Comput. Syst., 2012

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor.
Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

Adaptive Backpressure: Efficient buffer management for on-chip networks.
Proceedings of the 30th International IEEE Conference on Computer Design, 2012

Network congestion avoidance through Speculative Reservation.
Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

2011
Evaluating Elastic Buffer and Wormhole Flow Control.
IEEE Trans. Computers, 2011

GPUs and the Future of Parallel Computing.
IEEE Micro, 2011

Packet Chaining: Efficient Single-Cycle Allocation for On-Chip Networks.
IEEE Comput. Archit. Lett., 2011

A compile-time managed multi-level register file hierarchy.
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011

Power, programmability, and granularity: The challenges of ExaScale computing.
Proceedings of the 2011 IEEE International Test Conference, 2011

Energy-efficient mechanisms for managing thread context in throughput processors.
Proceedings of the 38th International Symposium on Computer Architecture (ISCA 2011), 2011

Panel Statement.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

The utility of fast active messages on many-core chips: Efficient supercomputing project.
Proceedings of the 2011 IEEE Hot Chips 23 Symposium (HCS), 2011

2010
The GPU Computing Era.
IEEE Micro, 2010

Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures.
Proceedings of the SPAA 2010: Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2010

Evaluating Bufferless Flow Control for On-chip Networks.
Proceedings of the NOCS 2010, 2010

Moving the needle, computer architecture research in academe and industry.
Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

Throughput computing.
Proceedings of the 24th International Conference on Supercomputing, 2010

Block-Parallel Programming for Real-Time Embedded Applications.
Proceedings of the 39th International Conference on Parallel Processing, 2010

Fine-grain dynamic instruction placement for L0 scratch-pad memory.
Proceedings of the 2010 International Conference on Compilers, 2010

The Even/Odd Synchronizer: A Fast, All-Digital, Periodic Synchronizer.
Proceedings of the 16th IEEE International Symposium on Asynchronous Circuits and Systems, 2010

2009
Stream Processors.
Proceedings of the Multicore Processors and Systems, 2009

Cost-Efficient Dragonfly Topology for Large-Scale Systems.
IEEE Micro, 2009

Operand Registers and Explicit Operand Forwarding.
IEEE Comput. Archit. Lett., 2009

Router designs for elastic buffer on-chip networks.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Allocator implementations for network-on-chip routers.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Indirect adaptive routing on large scale interconnection networks.
Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

Elastic-buffer flow control for on-chip networks.
Proceedings of the 15th International Conference on High-Performance Computer Architecture (HPCA-15 2009), 2009

2008
A Programmable 512 GOPS Stream Processor for Signal, Image, and Video Processing.
IEEE J. Solid State Circuits, 2008

Efficient Embedded Computing.
Computer, 2008

Hierarchical Instruction Register Organization.
IEEE Comput. Archit. Lett., 2008

An Energy-Efficient Processor Architecture for Embedded Systems.
IEEE Comput. Archit. Lett., 2008

A portable runtime interface for multi-level memory hierarchies.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Technology-Driven, Highly-Scalable Dragonfly Topology.
Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

Stream Scheduling: A Framework to Manage Bulk Operations in Memory Hierarchies.
Proceedings of the Euro-Par 2008, 2008

A tuning framework for software-managed memory hierarchies.
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008

2007
Research Challenges for On-Chip Interconnection Networks.
IEEE Micro, 2007

A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS.
IEEE J. Solid State Circuits, 2007

Flattened Butterfly Topology for On-Chip Networks.
IEEE Comput. Archit. Lett., 2007

Compilation for explicitly managed memory hierarchies.
Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007

Enabling Technology for On-Chip Interconnection Networks.
Proceedings of the First International Symposium on Networks-on-Chips, 2007

A 14mW 6.25Gb/s Transceiver in 90nm CMOS for Serial Chip-to-Chip Communications.
Proceedings of the 2007 IEEE International Solid-State Circuits Conference, 2007

Future of on-chip interconnection architectures.
Proceedings of the 2007 International Symposium on Low Power Electronics and Design, 2007

Flattened butterfly: a cost-efficient topology for high-radix networks.
Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

Executing irregular scientific applications on stream architectures.
Proceedings of the 21st Annual International Conference on Supercomputing, 2007

Tradeoff between data-, instruction-, and thread-level parallelism in stream processors.
Proceedings of the 21st Annual International Conference on Supercomputing, 2007

Interconnect-Centric Computing.
Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Register pointer architecture for efficient embedded processors.
Proceedings of the 2007 Design, Automation and Test in Europe Conference and Exposition, 2007

Architectural Support for the Stream Execution Model on General-Purpose Processors.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

Stream Scheduling: A Framework to Manage Bulk Operations in a Memory Hierarchy.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

2006
Topology optimization of interconnection networks.
IEEE Comput. Archit. Lett., 2006

Data parallel address architecture.
IEEE Comput. Archit. Lett., 2006

Multi-core issues - Multi-Core for HPC: breakthrough or breakdown?
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Interconnect routing and scheduling - Adaptive routing in high-radix clos network.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Sequoia: programming the memory hierarchy.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Architecture - The design space of data-parallel memory systems.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

The BlackWidow High-Radix Clos Network.
Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006), 2006

Design tradeoffs for tiled CMP on-chip networks.
Proceedings of the 20th Annual International Conference on Supercomputing, 2006

Computer Architecture in the Many-Core Era.
Proceedings of the 24th International Conference on Computer Design (ICCD 2006), 2006

Pulsenet - A Parallel Flash Sampler and Digital Processor IC for Optical SETI.
Proceedings of the IEEE 2006 Custom Integrated Circuits Conference, 2006

Compiling for stream processing.
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), 2006

2005
Hot Chips 16: Power, Parallelism, and Memory Performance.
IEEE Micro, 2005

A 20-Gb/s 0.13-μm CMOS serial link transmitter using an LC-PLL to directly drive the output multiplexer.
IEEE J. Solid State Circuits, 2005

Fault Tolerance Techniques for the Merrimac Streaming Supercomputer.
Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

Microarchitecture of a High-Radix Router.
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), 2005

Scatter-Add in Data Parallel Architectures.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

Explaining the gap between ASIC and custom power: a custom perspective.
Proceedings of the 42nd Design Automation Conference, 2005

2004
Stream Processors: Programmability and Efficiency.
ACM Queue, 2004

A 33-mW 8-Gb/s CMOS clock multiplier and CDR for highly integrated I/Os.
IEEE J. Solid State Circuits, 2004

Globally Adaptive Load-Balanced Routing on Tori.
IEEE Comput. Archit. Lett., 2004

Buffer and Delay Bounds in High Radix Interconnection Networks.
IEEE Comput. Archit. Lett., 2004

The case for broader computer architecture education: keynote address.
Proceedings of the 2004 workshop on Computer architecture education, 2004

Adaptive channel queue routing on k-ary n-cubes.
Proceedings of the SPAA 2004: Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2004

Analysis and Performance Results of a Molecular Modeling Application on Merrimac.
Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 2004

Evaluating the Imagine Stream Architecture.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

Stream Register Files with Indexed Access.
Proceedings of the 10th International Conference on High-Performance Computer Architecture (HPCA-10 2004), 2004

2003
Guaranteed scheduling for switches with configuration overhead.
IEEE/ACM Trans. Netw., 2003

A second-order semidigital clock recovery circuit based on injection locking.
IEEE J. Solid State Circuits, 2003

Jitter transfer characteristics of delay-locked loops - theories and design techniques.
IEEE J. Solid State Circuits, 2003

Programmable Stream Processors.
Computer, 2003

Throughput-centric routing algorithm design.
Proceedings of the SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2003

Merrimac: Supercomputing with Streams.
Proceedings of the ACM/IEEE SC2003 Conference on High Performance Networking and Computing, 2003

GOAL: A Load-Balanced Adaptive Routing Algorithm for Torus Networks.
Proceedings of the 30th International Symposium on Computer Architecture (ISCA 2003), 2003

CMOS High-Speed I/Os - Present and Future.
Proceedings of the 21st International Conference on Computer Design (ICCD 2003), 2003

Exploring the VLSI Scalability of Stream Processors.
Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA'03), 2003

A 33mW 8Gb/s CMOS clock multiplier and CDR for highly integrated I/Os.
Proceedings of the IEEE Custom Integrated Circuits Conference, 2003

2002
A low-power multiplying DLL for low-jitter multigigahertz clock generation in highly integrated digital chips.
IEEE J. Solid State Circuits, 2002

Worst-case Traffic for Oblivious Routing Functions.
IEEE Comput. Archit. Lett., 2002

Migration in Single Chip Multiprocessors.
IEEE Comput. Archit. Lett., 2002

Locality-preserving randomized oblivious routing on torus networks.
Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2002

A Stream Processor Development Platform.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

Media Processing Applications on the Imagine Stream Processor.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

VLSI Design and Verification of the Imagine Processor.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

The Imagine Stream Processor.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

Scalable Opto-Electronic Network (SOENet).
Proceedings of the 10th Annual IEEE Symposium on High Performance Interconnects (HOTIC 2002), August 21, 2002

Comparing Reyes and OpenGL on a Stream Architecture.
Proceedings of the 2002 ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002

2001
A Delay Model for Router Microarchitectures.
IEEE Micro, 2001

Imagine: Media Processing with Streams.
IEEE Micro, 2001

Guest Editors' Introduction: Hot Chips 12.
IEEE Micro, 2001

Monolithic chaotic communications system.
Proceedings of the 2001 International Symposium on Circuits and Systems, 2001

A Delay Model and Speculative Architecture for Pipelined Routers.
Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA'01), 2001

Route Packets, Not Wires: On-Chip Interconnection Networks.
Proceedings of the 38th Design Automation Conference, 2001

Digital systems engineering.
Cambridge University Press, ISBN: 978-0-521-59292-5, 2001

2000
Low-power area-efficient high-speed I/O circuit techniques.
IEEE J. Solid State Circuits, 2000

Efficient conditional operations for data-parallel architectures.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Processor Mechanisms for Software Shared Memory.
Proceedings of the High Performance Computing, Third International Symposium, 2000

Memory access scheduling.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Smart Memories: a modular reconfigurable architecture.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Register Organization for Media Processing.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

Flit-Reservation Flow Control.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

Polygon Rendering on a Stream Architecture.
Proceedings of the 2000 ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 2000

The role of custom design in ASIC Chips.
Proceedings of the 37th Conference on Design Automation, 2000

Communication Scheduling.
ASPLOS-IX: Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000

1999
Concurrent Event Handling through Multithreading.
IEEE Trans. Computers, 1999

VLSI Architecture: Past, Present, and Future.
Proceedings of the 18th Conference on Advanced Research in VLSI (ARVLSI '99), 1999

1998
The bleeding edge.
IEEE Micro, 1998

A tracking clock recovery receiver for 4-Gbps signaling.
IEEE Micro, 1998

An Efficient, Protected Message Interface.
Computer, 1998

Point Sample Rendering.
Proceedings of the Rendering Techniques '98, Proceedings of the Eurographics Workshop in Vienna, Austria, June 29, 1998

A Bandwidth-efficient Architecture for Media Processing.
Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture, 1998

Exploiting Fine-grain Thread Level Parallelism on the MIT Multi-ALU Processor.
Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998

Retrospective: the J-machine.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

Architecture of a Message-Driven Processor.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

The effects of explicitly parallel mechanisms on the multi-ALU processor cluster pipeline.
Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998

1997
Extended Ephemeral Logging: Log Storage Management for Applications with Long Lived Transactions.
ACM Trans. Database Syst., 1997

Transmitter equalization for 4-Gbps signaling.
IEEE Micro, 1997

The M-machine multicomputer.
Int. J. Parallel Program., 1997

1995
Thread prioritization: A thread scheduling mechanism for multiple-context parallel processors.
Future Gener. Comput. Syst., 1995

Evaluating the Locality Benefits of Active Messages.
Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), 1995

The Named-State Register File: Implementation and Performance.
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

Low-latency plesiochronous data retiming.
Proceedings of the 16th Conference on Advanced Research in VLSI (ARVLSI '95), 1995

1994
Architectural and implementation issues for multithreading (panel session I).
SIGARCH Comput. Archit. News, 1994

The Reliable Router: A Reliable and High-Performance Communication Substrate for Parallel Computers.
Proceedings of the Parallel Computer Routing and Communication, 1994

Architecture and implementation of the reliable router.
Proceedings of the Hot Interconnects II, 1994

XEL: Extended Ephemeral Logging for Log Storage Management.
Proceedings of the Third International Conference on Information and Knowledge Management (CIKM'94), Gaithersburg, Maryland, USA, November 29, 1994

Hardware Support for Fast Capability-based Addressing.
ASPLOS-VI Proceedings, 1994

Named State and Efficient Context Switching.
Proceedings of the Multithreaded Computer Architecture, 1994

Subspace Optimizations.
Proceedings of the Automatic Parallelization: New Approaches to Code Generation, 1994

Issues in the Design and Implementation of Instruction Processors for Multicomputers (Position Statement).
Proceedings of the Multithreaded Computer Architecture, 1994

1993
Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels.
IEEE Trans. Parallel Distributed Syst., 1993

A Universal Parallel Computer Architecture.
New Gener. Comput., 1993

Performance Evaluation of Ephemeral Logging.
Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993

Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

The J-Machine Multicomputer: An Architectural Evaluation.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

1992
Virtual-Channel Flow Control.
IEEE Trans. Parallel Distributed Syst., 1992

A Fast Translation Method for Paging on top of Segmentation.
IEEE Trans. Computers, 1992

The message-driven processor: a multicomputer processing node with efficient mechanisms.
IEEE Micro, 1992

Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism.
Proceedings of the 19th Annual International Symposium on Computer Architecture. Gold Coast, 1992

The J-Machine Network.
Proceedings of the 1992 IEEE International Conference on Computer Design: VLSI in Computer & Processors, 1992

MDP Design Tools and Methods.
Proceedings of the 1992 IEEE International Conference on Computer Design: VLSI in Computer & Processors, 1992

The Message Driven Processor: An Integrated Multicomputer Processing Element.
Proceedings of the 1992 IEEE International Conference on Computer Design: VLSI in Computer & Processors, 1992

1991
Express Cubes: Improving the Performance of k-Ary n-Cube Interconnection Networks.
IEEE Trans. Computers, 1991

Experiences Implementing Dataflow on a General-Purpose Parallel Computer.
Proceedings of the International Conference on Parallel Processing, 1991

A Mechanism for Efficient Context Switching.
Proceedings of the 1991 IEEE International Conference on Computer Design: VLSI in Computer & Processors, 1991

1990
A hardware logic simulation system.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 1990

Performance Analysis of k-Ary n-Cube Interconnection Networks.
IEEE Trans. Computers, 1990

Concurrent Aggregates (CA).
Proceedings of the Second ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), 1990

Simultaneous bidirectional signalling for IC systems.
Proceedings of the 1990 IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1990

1989
Experience with CST: Programming and Implementation.
Proceedings of the ACM SIGPLAN'89 Conference on Programming Language Design and Implementation (PLDI), 1989

Universal Mechanisms for Concurrency.
Proceedings of the PARLE '89: Parallel Architectures and Languages Europe, 1989

The J-Machine: A Fine-Grain Concurrent Computer.
Proceedings of the Information Processing 89, Proceedings of the IFIP 11th World Computer Congress, San Francisco, USA, August 28, 1989

Algorithms for Accuracy Enhancement in a Hardware Logic Simulator.
Proceedings of the 26th ACM/IEEE Design Automation Conference, 1989

Micro-Optimization of Floating Point Operations.
ASPLOS-III Proceedings, 1989

1988
Object-oriented concurrent programming in CST.
Proceedings of the 1988 ACM SIGPLAN Workshop on Object-based Concurrent Programming, 1988

The Reconfigurable Arithmetic Processor.
Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988

Mechanisms for Concurrent Computing.
Proceedings of the International Conference on Fifth Generation Computer Systems, 1988

Fine-grain message passing concurrent computers.
Proceedings of the Third Conference on Hypercube Concurrent Computers and Applications, 1988

1987
Deadlock-Free Message Routing in Multiprocessor Interconnection Networks.
IEEE Trans. Computers, 1987

MARS: A Multiprocessor-Based Programmable Accelerator.
IEEE Des. Test, 1987

Architecture and Design of the MARS Hardware Accelerator.
Proceedings of the 24th ACM/IEEE Design Automation Conference. Miami Beach, FL, USA, June 28, 1987

1986
A VLSI Architecture for Concurrent Data Structures.
PhD thesis, 1986

The Torus Routing Chip.
Distributed Comput., 1986

1985
A Hardware Architecture for Switch-Level Simulation.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 1985

An Object Oriented Architecture.
Proceedings of the 12th Annual Symposium on Computer Architecture, 1985

