Wen-Mei W. Hwu

ORCID: 0000-0003-2532-5349

Affiliations:
  • University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana-Champaign, IL, USA


According to our database, Wen-Mei W. Hwu authored at least 358 papers between 1985 and 2024.

Awards

ACM Fellow

ACM Fellow 2002, "For technical contributions and leadership in computer architecture."

IEEE Fellow

IEEE Fellow 1998, "For contributions to high performance compiler and microarchitecture technologies."

Bibliography

2024
Determining optimal channel partition for 2:4 fine grained structured sparsity.
Optim. Lett., December, 2024

Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses.
Proc. VLDB Endow., February, 2024

TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading.
CoRR, 2024

HiCCL: A Hierarchical Collective Communication Library.
CoRR, 2024

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme.
CoRR, 2024

Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level.
CoRR, 2024

CommBench: Micro-Benchmarking Hierarchical Networks with Multi-GPU, Multi-NIC Nodes.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

Hector: An Efficient Programming and Compilation Framework for Implementing Relational Graph Neural Networks in GPU Architectures.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

GMT: GPU Orchestrated Memory Tiering for the Big Data Era.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
RECO-ASCON: Reconfigurable ASCON hash functions for IoT applications.
Integr., November, 2023

CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs.
CoRR, 2023

PIGEON: Optimizing CUDA Code Generator for End-to-End Training and Inference of Relational Graph Neural Networks.
CoRR, 2023

RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design.
Proceedings of the 29th Symposium on Operating Systems Principles, 2023

BLTESTI: Benchmarking Lightweight TinyJAMBU on Embedded Systems for Trusted IoT.
Proceedings of the 36th IEEE International System-on-Chip Conference, 2023

IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research.
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023

RECO-LFSR: Reconfigurable Low-power Cryptographic processor based on LFSR for Trusted IoT platforms.
Proceedings of the 24th International Symposium on Quality Electronic Design, 2023

BEEP: Balanced Efficient subgraph Enumeration in Parallel.
Proceedings of the 52nd International Conference on Parallel Processing, 2023

FSSD: FPGA-Based Emulator for SSDs.
Proceedings of the 33rd International Conference on Field-Programmable Logic and Applications, 2023

GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Can Language Models Be Specific? How?
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023

Parallelizing Maximal Clique Enumeration on GPUs.
Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 2023

2022
GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture.
Dataset, October, 2022

MemXCT: Design, Optimization, Scaling, and Reproducibility of X-Ray Tomography Imaging.
IEEE Trans. Parallel Distributed Syst., 2022

Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022

An efficient GPU implementation and scaling for higher-order 3D stencils.
Inf. Sci., 2022

Submission-Aware Reviewer Profiling for Reviewer Recommender System.
CoRR, 2022

DKG: A Descriptive Knowledge Graph for Explaining Relationships between Entities.
CoRR, 2022

BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage.
CoRR, 2022

RECO-HCON: A High-Throughput Reconfigurable Compact ASCON Processor for Trusted IoT.
Proceedings of the 35th IEEE International System-on-Chip Conference, 2022

Graph Neural Network Training and Data Tiering.
Proceedings of the KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14, 2022

PARSEC: PARallel Subgraph Enumeration in CUDA.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Parallel K-clique counting on GPUs.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

DEER: Descriptive Knowledge Graph for Explaining Entity Relationships.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

Understanding Jargon: Combining Extraction and Generation for Definition Modeling.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2022

Open Relation Modeling: Learning to Define Relations between Entities.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, 2022

2021
Efficient Methods for Mapping Neural Machine Translator on FPGAs.
IEEE Trans. Parallel Distributed Syst., 2021

PyLog: An Algorithm-Centric Python-Based FPGA Programming and Synthesis Flow.
IEEE Trans. Computers, 2021

Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture.
Proc. VLDB Endow., 2021

Graph Neural Network Training with Data Tiering.
CoRR, 2021

MLHarness: A Scalable Benchmarking System for MLCommons.
CoRR, 2021

K-Clique Counting on GPUs.
CoRR, 2021

PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses.
CoRR, 2021

Safer Illinois and RokWall: Privacy Preserving University Health Apps for COVID-19.
CoRR, 2021

PhraseScope: An Effective and Unsupervised Framework for Mining High Quality Phrases.
Proceedings of the 2021 SIAM International Conference on Data Mining, 2021

FFT blitz: the tensor cores strike back.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Interpretable Visual Reasoning via Induced Symbolic Space.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

HyKernel: A Hybrid Selection of One/Two-Phase Kernels for Triangle Counting on GPUs.
Proceedings of the 2021 IEEE High Performance Extreme Computing Conference, 2021

TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes.
Proceedings of the HPDC '21: The 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021

Extending HLS with High-Level Descriptive Language for Configurable Algorithm-Level Spatial Structure Design.
Proceedings of the 29th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2021

Pseudo-IoU: Improving Label Assignment in Anchor-Free Object Detection.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators.
Proceedings of the ASPDAC '21: 26th Asia and South Pacific Design Automation Conference, 2021

Graviton: A Reconfigurable Memory-Compute Fabric for Data Intensive Applications.
Proceedings of the Applied Reconfigurable Computing. Architectures, Tools, and Applications, 2021

Measuring Fine-Grained Domain Relevance of Terms: A Hierarchical Core-Fringe Approach.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021

Accelerating Fourier and Number Theoretic Transforms using Tensor Cores and Warp Shuffles.
Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques, 2021

2020
PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-Efficient ReRAM.
IEEE Trans. Computers, 2020

EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs.
Proc. VLDB Endow., 2020

Fast CUDA-Aware MPI Datatypes without Platform Support.
CoRR, 2020

Tearing Down the Memory Wall.
CoRR, 2020

Efficient Inference on GPUs for the Sparse Deep Neural Network Graph Challenge 2020.
CoRR, 2020

MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale.
CoRR, 2020

DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs.
Proceedings of the ICPE '20: ACM/SPEC International Conference on Performance Engineering, 2020

Petascale XCT: 3D image reconstruction with hierarchical communications on multi-GPU nodes.
Proceedings of the International Conference for High Performance Computing, 2020

DLSpec: A Deep Learning Task Exchange Specification.
Proceedings of the 2020 USENIX Conference on Operational Machine Learning, 2020

SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems.
Proceedings of the Third Conference on Machine Learning and Systems, 2020

FReaC Cache: Folded-logic Reconfigurable Computing in the Last Level Cache.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

Node-Aware Stencil Communication for Heterogeneous Supercomputers.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Advancing Computing Infrastructure for Very Large-Scale Deep Learning at C3SR.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

Micro-GAGE: A Low-power Compact GAGE Hash Function Processor for IoT Applications.
Proceedings of the 27th IEEE International Conference on Electronics, Circuits and Systems, 2020

DNNExplorer: A Framework for Modeling and Exploring a Novel Paradigm of FPGA-based DNN Accelerator.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2020

At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation.
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020

Effective Algorithm-Accelerator Co-design for AI Solutions on Edge Devices.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020

Exploring Semantic Capacity of Terms.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020

EDD: Efficient Differentiable DNN Architecture and Implementation Co-search for Embedded AI Solutions.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Alleviating Semantic-level Shift: A Semi-supervised Domain Adaptation Method for Semantic Segmentation.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

Vertext: An End-to-end AI Powered Conversation Management System for Multi-party Chat Platforms.
Proceedings of the Companion Publication of the 2020 ACM Conference on Computer Supported Cooperative Work and Social Computing, 2020

The design and implementation of the wolfram language compiler.
Proceedings of the CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020

The Design and Implementation of a Scalable Deep Learning Benchmarking Platform.
Proceedings of the 13th IEEE International Conference on Cloud Computing, 2020

2019
The Design and Implementation of a Scalable DL Benchmarking Platform.
CoRR, 2019

Across-Stack Profiling and Characterization of Machine Learning Models on GPUs.
CoRR, 2019

SkyNet: A Champion Model for DAC-SDC on Low Power Object Detection.
CoRR, 2019

A Retrospective Recount of Computer Architecture Research with a Data-Driven Study of Over Four Decades of ISCA Publications.
CoRR, 2019

A Bi-Directional Co-Design Approach to Enable Deep Learning on IoT Devices.
CoRR, 2019

Challenges and Pitfalls of Reproducing Machine Learning Artifacts.
CoRR, 2019

Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects.
Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, 2019

Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures.
Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, 2019

MLModelScope: Evaluate and Introspect Cognitive Pipelines.
Proceedings of the 2019 IEEE World Congress on Services, 2019

MemXCT: memory-centric X-ray CT reconstruction with massive parallelization.
Proceedings of the International Conference for High Performance Computing, 2019

Reinforcement Learning Based Text Style Transfer without Parallel Training Corpus.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019

DeepStore: In-Storage Acceleration for Intelligent Queries.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

Near-Memory and In-Storage FPGA Acceleration for Emerging Cognitive Computing Workloads.
Proceedings of the 2019 IEEE Computer Society Annual Symposium on VLSI, 2019

Accelerating reduction and scan using tensor core units.
Proceedings of the ACM International Conference on Supercomputing, 2019

SPGNet: Semantic Prediction Guidance for Scene Parsing.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving.
Proceedings of the International Conference on Computer-Aided Design, 2019

Update on Triangle Counting on GPU.
Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

Accelerating Sparse Deep Neural Networks on FPGAs.
Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

Update on k-truss Decomposition on GPU.
Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

An Efficient GPU Implementation Technique for Higher-Order 3D Stencils.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

Analysis and Optimization of I/O Cache Coherency Strategies for SoC-FPGA Device.
Proceedings of the 29th International Conference on Field Programmable Logic and Applications, 2019

PaRe: A Paper-Reviewer Matching Approach Using a Common Topic Space.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019

FPGA/DNN Co-Design: An Efficient Design Methodology for IoT Intelligence on the Edge.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Automatic Generation of Warp-Level Primitives and Atomic Instructions for Fast and Portable Parallel Reduction on GPUs.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

FlatFlash: Exploiting the Byte-Accessibility of SSDs within a Unified Memory-Storage Hierarchy.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

Implementing neural machine translation with bi-directional GRU and attention mechanism on FPGAs using HLS.
Proceedings of the 24th Asia and South Pacific Design Automation Conference, 2019

TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function-as-a-Service.
Proceedings of the 12th IEEE International Conference on Cloud Computing, 2019

2018
Iterative Modulo Scheduling.
IEEE Micro, 2018

Accelerator Architectures: A Ten-Year Retrospective.
IEEE Micro, 2018

High-throughput Ant Colony Optimization on graphics processing units.
J. Parallel Distributed Comput., 2018

MLModelScope: Evaluate and Measure ML Models within AI Pipelines.
CoRR, 2018

TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments.
CoRR, 2018

A Simple Non-i.i.d. Sampling Approach for Efficient Training and Better Generalization.
CoRR, 2018

Decoupled Classification Refinement: Hard False Positive Suppression for Object Detection.
CoRR, 2018

SCOPE: C3SR Systems Characterization and Benchmarking Framework.
CoRR, 2018

Semi-Coherent DMA: An Alternative I/O Coherency Management for Embedded Systems.
IEEE Comput. Archit. Lett., 2018

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems.
Proceedings of the High Performance Computing, 2018

Application-Transparent Near-Memory Processing Architecture with Memory Channel Network.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

A Fast and Massively-Parallel Inverse Solver for Multiple-Scattering Tomographic Image Reconstruction.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018


DNNBuilder: an automated tool for building high-performance DNN hardware accelerators for FPGAs.
Proceedings of the International Conference on Computer-Aided Design, 2018

Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition.
Proceedings of the 2018 IEEE High Performance Extreme Computing Conference, 2018

Triangle Counting and Truss Decomposition using FPGA.
Proceedings of the 2018 IEEE High Performance Extreme Computing Conference, 2018

AccDNN: An IP-Based DNN Generator for FPGAs.
Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2018

2017
SAVI objects: sharing and virtuality incorporated.
Proc. ACM Program. Lang., 2017

Heterogeneous Computing Meets Near-Memory Acceleration and High-Level Synthesis in the Post-Moore Era.
IEEE Micro, 2017

Collaborative Computing for Heterogeneous Integrated Systems.
Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering, 2017

Enabling GPU Support for the COMPSs-Mobile Framework.
Proceedings of the Accelerator Programming Using Directives - 4th International Workshop, 2017

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts.
Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2017

Chai: Collaborative heterogeneous applications for integrated-architectures.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

Keynote: Architecture and software for emerging low-power systems.
Proceedings of the 2017 IEEE/ACM International Symposium on Low Power Electronics and Design, 2017

RAI: A Scalable Project Submission System for Parallel Programming Courses.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Rebooting the Data Access Hierarchy of Computing Systems.
Proceedings of the IEEE International Conference on Rebooting Computing, 2017

Generalize or Die: Operating Systems Support for Memristor-Based Accelerators.
Proceedings of the IEEE International Conference on Rebooting Computing, 2017

Collaborative (CPU + GPU) algorithms for triangle counting and truss decomposition on the Minsky architecture: Static graph challenge: Subgraph isomorphism.
Proceedings of the 2017 IEEE High Performance Extreme Computing Conference, 2017

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures.
Proceedings of the 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, 2017

Hardware Acceleration of the Pair-HMM Algorithm for DNA Variant Calling.
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017

2016
In-Place Matrix Transposition on GPUs.
IEEE Trans. Parallel Distributed Syst., 2016

FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2016

Common Bonds: MIPS, HPS, Two-Level Branch Prediction, and Compressed Code RISC Processor.
IEEE Micro, 2016

Platform choices and design demands for IoT platforms: cost, power, and performance tradeoffs.
IET Cyber-Phys. Syst.: Theory & Appl., 2016

BLESS 2: accurate, memory-efficient and fast error correction method.
Bioinform., 2016

Design of a power-efficient ARM processor with a timing-error detection and correction mechanism.
Proceedings of the 29th IEEE International System-on-Chip Conference, 2016

A programming system for future proofing performance critical libraries.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

KLAP: Kernel launch aggregation and promotion for optimizing dynamic parallelism.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

Efficient kernel synthesis for performance portable programming.
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016

AsHES 2016 Keynote.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

WebGPU: A Scalable Online Development Platform for GPU Programming Courses.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

Efficient and Scalable Workflows for Genomic Analyses.
Proceedings of the ACM International Workshop on Data-Intensive Distributed Computing, 2016

Acceleration of the Pair-HMM Algorithm for DNA Variant Calling.
Proceedings of the 24th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2016

SpaceJMP: Programming with Multiple Virtual Address Spaces.
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016

DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model.
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016

2015
Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications.
IEEE Trans. Parallel Distributed Syst., 2015

Optimized Data Transfers Based on the OpenCL Event Management Mechanism.
Sci. Program., 2015

Enhancing the Usability and Utilization of Accelerated Architectures via Docker.
Proceedings of the 8th IEEE/ACM International Conference on Utility and Cloud Computing, 2015

GPU-SM: shared memory multi-GPU programming.
Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015

Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

In-Place Data Sliding Algorithms for Many-Core Architectures.
Proceedings of the 44th International Conference on Parallel Processing, 2015

FPGA accelerated DNA error correction.
Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, 2015

Locality-centric thread scheduling for bulk-synchronous programming models on CPU architectures.
Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2015

2014
What is ahead for parallel computing.
J. Parallel Distributed Comput., 2014

BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads.
Bioinform., 2014

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

In-place transposition of rectangular matrices on accelerators.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Adaptive Cache Management for Energy-Efficient GPU Computing.
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

Adaptive Cache Bypass and Insertion for Many-core Accelerators.
Proceedings of the 2nd International Workshop on Many-core Embedded Systems, 2014

Automatic execution of single-GPU computations across multiple GPUs.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

A Guide for Implementing Tridiagonal Solvers on GPUs.
Proceedings of the Numerical Computations with GPUs, 2014

2013
Scalable SIMD-parallel memory allocation for many-core machines.
J. Supercomput., 2013

Efficient compilation of CUDA kernels for high-performance computing on FPGAs.
ACM Trans. Embed. Comput. Syst., 2013

More IMPATIENT: A gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs.
J. Parallel Distributed Comput., 2013

Rapid computation of sodium bioscales using gpu-accelerated image reconstruction.
Int. J. Imaging Syst. Technol., 2013

Rethinking computer architecture for throughput computing.
Proceedings of the 2013 International Conference on Embedded Computer Systems: Architectures, 2013

clMPI: An OpenCL Extension for Interoperation with the Message Passing Interface.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

Throughput-oriented kernel porting onto FPGAs.
Proceedings of the 50th Annual Design Automation Conference 2013, 2013

Comparison based sorting for systems with multiple GPUs.
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, 2013

2012
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU).
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01737-7, 2012

Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications.
Int. J. Parallel Program., 2012

Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems.
Computer, 2012

TIGER: tiled iterative genome assembler.
BMC Bioinform., 2012

A scalable, numerically stable, high-performance tridiagonal solver using GPUs.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Efficient Pattern-Based Time Series Classification on GPU.
Proceedings of the 12th IEEE International Conference on Data Mining, 2012

Design evaluation of OpenCL compiler framework for Coarse-Grained Reconfigurable Arrays.
Proceedings of the 2012 International Conference on Field-Programmable Technology, 2012

2011
Superscalar Processors.
Proceedings of the Encyclopedia of Parallel Computing, 2011

EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing.
Comput. Sci. Eng., 2011

Advanced MRI reconstruction toolbox with accelerating on GPU.
Proceedings of the Conference on Parallel Processing for Imaging Applications 2011, 2011

Impatient MRI: Illinois Massively Parallel Acceleration Toolkit for image reconstruction with enhanced throughput in MRI.
Proceedings of the 8th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2011

Panel Statement.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

A Scalable Tridiagonal Solver for GPUs.
Proceedings of the International Conference on Parallel Processing, 2011

Parallel implementation of Multi-dimensional Ensemble Empirical Mode Decomposition.
Proceedings of the IEEE International Conference on Acoustics, 2011

Multilevel Granularity Parallelism Synthesis on FPGAs.
Proceedings of the IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, 2011

2010
High-Performance Computing with Accelerators.
Comput. Sci. Eng., 2010

An adaptive performance modeling tool for GPU architectures.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture.
Proceedings of the Computer Architecture, 2010

Accelerating iterative field-compensated MR image reconstruction on GPUs.
Proceedings of the 2010 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2010

An effective GPU implementation of breadth-first search.
Proceedings of the 47th Design Automation Conference, 2010

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs.
Proceedings of the CGO 2010, 2010

An asymmetric distributed shared memory model for heterogeneous parallel systems.
Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010

Data layout transformation exploiting memory-level parallelism in structured grid many-core applications.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

Raising the level of many-core programming with compiler technology: meeting a grand challenge.
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, 2010

Exploiting More Parallelism from Applications Having Generalized Reductions on GPU Architectures.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

Programming Massively Parallel Processors - A Hands-on Approach.
Morgan Kaufmann, ISBN: 978-0-12-381472-2, 2010

2009
The parallelization of video processing.
IEEE Signal Process. Mag., 2009

Hardware-compiler co-design for adjustable data power savings.
Microprocess. Microsystems, 2009

Compute Unified Device Architecture Application Suitability.
Comput. Sci. Eng., 2009

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs.
Proceedings of the IEEE 7th Symposium on Application Specific Processors, 2009

Accelerating MR Image Reconstruction on GPUs.
Proceedings of the 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Boston, MA, USA, June 28, 2009

Long time-scale simulations of in vivo diffusion using GPU hardware.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Many-core parallel computing - Can compilers and tools do the heavy lifting?
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

High-performance CUDA kernel execution on FPGAs.
Proceedings of the 23rd international conference on Supercomputing, 2009

GPU clusters for high-performance computing.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

High performance computation and interactive display of molecular orbitals on GPUs and multi-core CPUs.
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009

Optimization of tele-immersion codes.
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009

2008
Guest Editors' Introduction: Accelerator Architectures.
IEEE Micro, 2008

Accelerating advanced MRI reconstructions on GPUs.
J. Parallel Distributed Comput., 2008

Program optimization carving for GPU computing.
J. Parallel Distributed Comput., 2008

Thousand-Core Chips [Roundtable].
IEEE Des. Test Comput., 2008

The Concurrency Challenge.
IEEE Des. Test Comput., 2008

Application Acceleration with the Explicitly Parallel Operations System - the EPOS Processor.
Proceedings of the IEEE Symposium on Application Specific Processors, 2008

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

CUDA-Lite: Reducing GPU Programming Complexity.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

CUBA: an architecture for efficient CPU/co-processor data communication.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Visualization and Analysis of GPU Summer School Applicants and Participants.
Proceedings of the Fourth International Conference on e-Science, 2008

Program optimization space pruning for a multithreaded GPU.
Proceedings of the Sixth International Symposium on Code Generation and Optimization (CGO 2008), 2008

GPU acceleration of cutoff pair potentials for molecular modeling applications.
Proceedings of the 5th Conference on Computing Frontiers, 2008

2007
Automatic Discovery of Coarse-Grained Parallelism in Media Applications.
Trans. High Perform. Embed. Archit. Compil., 2007

Toward Application-Aware Security and Reliability.
IEEE Secur. Priv., 2007

Iteration Disambiguation for Parallelism Identification in Time-Sliced Applications.
Proceedings of the Languages and Compilers for Parallel Computing, 2007

Corezilla: Build and Tame the Multicore Beast?
Proceedings of the 44th Design Automation Conference, 2007

Implicitly Parallel Programming Models for Thousand-Core Microprocessors.
Proceedings of the 44th Design Automation Conference, 2007

CIGAR: Application Partitioning for a CPU/Coprocessor Architecture.
Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT 2007), 2007

2006
Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining.
IEEE Trans. Computers, 2006

Tolerating Cache-Miss Latency with Multipass Pipelines.
IEEE Micro, 2006

2005
Guest Editors' Introduction.
IEEE Trans. Computers, 2005

"Flea-flicker" Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense.
Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), 2005

The Future of Computer Architecture Research: An Industrial Perspective.
Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11 2005), 2005

2004
Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis.
Proceedings of the Static Analysis, 11th International Symposium, 2004

Importance of heap specialization in pointer analysis.
Proceedings of the 2004 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis For Software Tools and Engineering, 2004

Trimaran: An Infrastructure for Research in Instruction-Level Parallelism.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

Field-testing IMPACT EPIC research results in Itanium 2.
Proceedings of the 31st International Symposium on Computer Architecture (ISCA 2004), 2004

2003
Energy saving and capacity improvement potential of power control in multi-hop wireless networks.
Comput. Networks, 2003

2002
Vacuum packing: extracting hardware-detected program phases for post-link optimization.
Proceedings of the 35th Annual International Symposium on Microarchitecture, 2002

Code coverage and input variability: effects on architecture and compiler research.
Proceedings of the International Conference on Compilers, 2002

2001
An Architectural Framework for Runtime Optimization.
IEEE Trans. Computers, 2001

Program decision logic optimization using predication and control speculation.
Proc. IEEE, 2001

Enhancing loop buffering of media and telecommunications applications using low-overhead predication.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

Modulo schedule buffers.
Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001

A Study of the Energy Saving and Capacity Improvement Potential of Power Control in Multi-Hop Wireless Networks.
Proceedings of the 26th Annual IEEE Conference on Local Computer Networks (LCN 2001), 2001

A Power Controlled Multiple Access Protocol for Wireless Packet Networks.
Proceedings of the IEEE INFOCOM 2001, 2001

Code Reordering and Speculation Support for Dynamic Optimization System.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

2000
Modular interprocedural pointer analysis using access paths: design, implementation, and evaluation.
Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2000

Accurate and efficient predicate analysis with binary decision diagrams.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Transmission Power Control for Multiple Access Wireless Packet Networks.
Proceedings of the 27th Conference on Local Computer Networks, 2000

A hardware mechanism for dynamic extraction and relayout of program hot spots.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Hardware Support for Dynamic Management of Compiler-Directed Computation Reuse.
Proceedings of the ASPLOS-IX Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000

1999
Architecture.
Proceedings of the VLSI Handbook., 1999

Run-Time Cache Bypassing.
IEEE Trans. Computers, 1999

Editors' Introduction.
Int. J. Parallel Program., 1999

Editor's Introduction.
Int. J. Parallel Program., 1999

The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication.
Int. J. Parallel Program., 1999

A New Framework for Debugging Globally Optimized Code.
Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1999

Compiler-Directed Dynamic Computation Reuse: Rationale and Initial Results.
Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999

An Empirical Study of Function Pointers Using SPEC Benchmarks.
Proceedings of the Languages and Compilers for Parallel Computing, 1999

A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

The Program Decision Logic Approach to Predicated Execution.
Proceedings of the 26th Annual International Symposium on Computer Architecture, 1999

An Architecture Framework for Introducing Predicated Execution into Embedded Microprocessors.
Proceedings of the Euro-Par '99 Parallel Processing, 5th International Euro-Par Conference, Toulouse, France, August 31, 1999

1998
Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation.
IEEE Trans. Computers, 1998

Optimization of Machine Descriptions for Efficient Use.
Int. J. Parallel Program., 1998

Foreword to the Special Issue.
Int. J. Parallel Program., 1998

Introduction to Predicated Execution.
Computer, 1998

Compiler-Directed Early Load-Address Generation.
Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture, 1998

Retrospective: HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

Retrospective: IMPACT: An Architectural Framework for Multiple-Instruction Issue.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

IMPACT: An Architectural Framework for Multiple-Instruction-Issue Processors.
Proceedings of the 25 Years of the International Symposia on Computer Architecture (Selected Papers)., 1998

Integrated Predicated and Speculative Execution in the IMPACT EPIC Architecture.
Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998

Run-Time Adaptive Cache Management.
Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, 1998

Improving Static Branch Prediction in a Compiler.
Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, 1998

1997
Region-based compilation: Introduction, motivation, and initial experience.
Int. J. Parallel Program., 1997

Optimizing NET Compilers for Improved Java Performance.
Computer, 1997

Run-Time Spatial Locality Detection and Optimization.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

A Framework for Balancing Control Flow and Predication.
Proceedings of the Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, 1997

Run-Time Adaptive Cache Hierarchy Management via Reference Analysis.
Proceedings of the 24th International Symposium on Computer Architecture, 1997

Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results.
Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture (HPCA '97), 1997

A study of the cache and branch performance issues with running Java on current hardware platforms.
Proceedings of the IEEE COMPCON 97, 1997

1996
Guest Editors' Introduction.
Int. J. Parallel Program., 1996

Modulo Scheduling of Loops in Control-intensive Non-numeric Programs.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

Java Bytecode to Native Code Translation: The Caffeine Prototype and Preliminary Results.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

Speculative Hedge: Regulating Compile-time Speculation Against Profile Variations.
Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, 1996

1995
Compiler-Based Multiple Instruction Retry.
IEEE Trans. Computers, 1995

Three Architectural Models for Compiler-Controlled Speculative Execution.
IEEE Trans. Computers, 1995

The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors.
IEEE Trans. Computers, 1995

Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer.
IEEE Trans. Computers, 1995

Compiler technology for future microprocessors.
Proc. IEEE, 1995

Advances in Benchmarking Techniques: New Standards and Quantitative Metrics.
Adv. Comput., 1995

Unrolling-based optimizations for modulo scheduling.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

Region-based compilation: an introduction and motivation.
Proceedings of the 28th Annual International Symposium on Microarchitecture, Ann Arbor, Michigan, USA, November 29, 1995

A Comparison of Full and Partial Predicated Execution Support for ILP Processors.
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995

A study of the effects of compiler-controlled speculation on instruction and data caches.
Proceedings of the 28th Annual Hawaii International Conference on System Sciences (HICSS-28), 1995

1994
The Susceptibility of Programs to Context Switching.
IEEE Trans. Computers, 1994

Incremental Compiler Transformations for Multiple Instruction Retry.
Softw. Pract. Exp., 1994

Performance Implications of Synchronization Support for Parallel Fortran Programs.
J. Parallel Distributed Comput., 1994

From the guest editors.
Int. J. Parallel Program., 1994

Profile-assisted instruction scheduling.
Int. J. Parallel Program., 1994

Data relocation and prefetching for programs with large data sets.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

Characterizing the impact of predicated execution on branch prediction.
Proceedings of the 27th Annual International Symposium on Microarchitecture, San Jose, California, USA, November 30, 1994

An Analytical Approach to Scheduling Code for Superscalar and VLIW Architectures.
Proceedings of the 1994 International Conference on Parallel Processing, 1994

Dynamic Memory Disambiguation Using the Memory Conflict Buffer.
Proceedings of the ASPLOS-VI Proceedings, 1994

1993
Sentinel Scheduling for VLIW and Superscalar Processors.
ACM Trans. Comput. Syst., 1993

The superblock: An effective technique for VLIW and superscalar compilation.
J. Supercomput., 1993

The Effect of Code Expanding Optimizations on Instruction Cache Design.
IEEE Trans. Computers, 1993

An execution Profiler for Window-oriented Applications.
Softw. Pract. Exp., 1993

Reverse If-Conversion.
Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation (PLDI), 1993

Superblock formation using static program analysis.
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Speculative execution exception recovery using write-back suppression.
Proceedings of the 26th Annual International Symposium on Microarchitecture, 1993

Register Connection: A New Approach to Adding Registers into Instruction Set Architectures.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

Application of Compiler-Assisted Rollback Recovery to Speculative Execution Repair.
Proceedings of the Hardware and Software Architectures for Fault Tolerance, 1993

1992
Efficient Instruction Sequencing with Inline Target Insertion.
IEEE Trans. Computers, 1992

Profile-guided Automatic Inline Expansion for C Programs.
Softw. Pract. Exp., 1992

Xprof: Profiling the Execution of X Window Programs.
Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, 1992

Compiler Code Transformations for Superscalar-Based High Performance Systems.
Proceedings of Supercomputing '92, 1992

Systematic prototyping of superscalar computer architectures.
Proceedings of the Third International Workshop on Rapid System Prototyping, 1992

Using Profile Information to Assist Advanced Compiler Optimization and Scheduling.
Proceedings of the Languages and Compilers for Parallel Computing, 1992

Tolerating data access latency with register preloading.
Proceedings of the 6th international conference on Supercomputing, 1992

Tolerating First Level Memory Access Latency in High-Performance Systems.
Proceedings of the 1992 International Conference on Parallel Processing, 1992

Executing Nested Parallel Loops on Shared-Memory Multiprocessors.
Proceedings of the 1992 International Conference on Parallel Processing, 1992

Branch Recovery with Compiler-Assisted Multiple Instruction Retry.
Proceedings of the Digest of Papers: FTCS-22, 1992

Sentinel Scheduling for VLIW and Superscalar Processors.
Proceedings of the ASPLOS-V Proceedings, 1992

1991
Using Profile Information to Assist Classic Code Optimizations.
Softw. Pract. Exp., 1991

A brief survey of benchmark usage in the architecture community.
SIGARCH Comput. Archit. News, 1991

Benchmark Characterization.
Computer, 1991

Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

The Effect of Compiler Optimizations on Available Parallelism in Scalar Programs.
Proceedings of the International Conference on Parallel Processing, 1991

1990
Snoopy cache test-and-test-and-set without excessive bus contention.
SIGARCH Comput. Archit. News, 1990

A software based approach to achieving optimal performance for signature control flow checking.
Proceedings of the 20th International Symposium on Fault-Tolerant Computing, 1990

1989
A Simulation Study of Simultaneous Vector Prefetch Performance in Multiprocessor Memory Subsystems (Extended Abstract).
Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 1989

Inline Function Expansion for Compiling C Programs.
Proceedings of the ACM SIGPLAN'89 Conference on Programming Language Design and Implementation (PLDI), 1989

Forward semantic: a compiler-assisted instruction fetch method for heavily pipelined processors.
Proceedings of the 22nd Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1989

Comparing Software and Hardware Schemes For Reducing the Cost of Branches.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

Achieving High Instruction Cache Performance with an Optimizing Compiler.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

Control flow optimization for supercomputer scalar processing.
Proceedings of the 3rd international conference on Supercomputing, 1989

1988
Trace selection for compiling large C application programs to microcode.
Proceedings of the 21st Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1988, San Diego, California, USA, November 28, 1988

Exploiting Parallel Microprocessor Microarchitectures With a Compiler Code Generator.
Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988

1987
Checkpoint Repair for High-Performance Out-of-Order Execution Machines.
IEEE Trans. Computers, 1987

On tuning the microarchitecture of an HPS implementation of the VAX.
Proceedings of the 20th Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1987

Exploiting horizontal and vertical concurrency via the HPSm microprocessor.
Proceedings of the 20th Annual Workshop and Symposium on Microprogramming and Microarchitecture, 1987

Checkpoint Repair for Out-of-order Execution Machines.
Proceedings of the 14th Annual International Symposium on Computer Architecture. Pittsburgh, 1987

1986
Run-time generation of HPS microinstructions from a VAX instruction stream.
Proceedings of the 19th annual workshop on Microprogramming, 1986

HPSm, a High Performance Restricted Data Flow Architecture Having Minimal Functionality.
Proceedings of the 13th Annual Symposium on Computer Architecture, Tokyo, Japan, June 1986, 1986

Experiments with HPS, a Restricted Data Flow Microarchitecture for High Performance Computers.
Proceedings of the Spring COMPCON'86, 1986

1985
Critical issues regarding HPS, a high performance microarchitecture.
Proceedings of the 18th annual workshop on Microprogramming, 1985

HPS, a new microarchitecture: rationale and introduction.
Proceedings of the 18th annual workshop on Microprogramming, 1985

