Guohao Dai

Orcid: 0000-0003-0849-3252

According to our database1, Guohao Dai authored at least 97 papers between 2013 and 2024.

Collaborative distances:

Timeline

Legend:

Book 
In proceedings 
Article 
PhD thesis 
Dataset
Other 

Links

On csauthors.net:

Bibliography

2024
Toward High-Accuracy and Real-Time Two-Stage Small Object Detection on FPGA.
IEEE Trans. Circuits Syst. Video Technol., September, 2024

GRAPHIC: Gather and Process Harmoniously in the Cache With High Parallelism and Flexibility.
IEEE Trans. Emerg. Top. Comput., 2024

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective.
CoRR, 2024

Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding.
CoRR, 2024

MARCA: Mamba Accelerator with ReConfigurable Architecture.
CoRR, 2024

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios.
CoRR, 2024

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs.
CoRR, 2024

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.
CoRR, 2024

Can LLMs Learn by Teaching? A Preliminary Study.
CoRR, 2024

DiTFastAttn: Attention Compression for Diffusion Transformer Models.
CoRR, 2024

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.
CoRR, 2024

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization.
CoRR, 2024

HetHub: A Heterogeneous distributed hybrid training system for large-scale models.
CoRR, 2024

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis.
CoRR, 2024

A Survey on Efficient Inference for Large Language Models.
CoRR, 2024

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better.
CoRR, 2024

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K.
CoRR, 2024

Efficient Deployment of Large Language Model across Cloud-Device Systems.
Proceedings of the 37th IEEE International System-on-Chip Conference, 2024

FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

Evaluating Quantized Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs.
Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024

DyPIM: Dynamic-Inference-Enabled Processing - In-Memory Accelerator.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024

FusionArch: A Fusion-Based Accelerator for Point-Based Point Cloud Neural Networks.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024

DySpMM: From Fix to Dynamic for Sparse Matrix-Matrix Multiplication Accelerators.
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024

FlashEval: Towards Fast and Accurate Evaluation of Text-to-Image Diffusion Generative Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
CoGNN: An Algorithm-Hardware Co-Design Approach to Accelerate GNN Inference With Minibatch Sampling.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., December, 2023

MNSIM 2.0: A Behavior-Level Modeling Tool for Processing-In-Memory Architectures.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., November, 2023

Gibbon: An Efficient Co-Exploration Framework of NN Model and Processing-In-Memory Architecture.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., November, 2023

Adaptive Multidimensional Parallel Fault Simulation Framework on Heterogeneous System.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., June, 2023

Sgap: towards efficient sparse tensor algebra compilation for GPU.
CCF Trans. High Perform. Comput., June, 2023

Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective.
IEEE Trans. Computers, May, 2023

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization.
CoRR, 2023

FlashDecoding++: Faster Large Language Model Inference on GPUs.
CoRR, 2023

CogDL: A Comprehensive Library for Graph Deep Learning.
Proceedings of the ACM Web Conference 2023, 2023

History-Detr: Optimize Query Initialization Strategy by Using Historical Information and Kinematics.
Proceedings of the ACM Multimedia Asia 2023, 2023

HyperGef: A Framework Enabling Efficient Fusion for Hypergraph Neural Network on GPUs.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

DF-GAS: a Distributed FPGA-as-a-Service Architecture towards Billion-Scale Graph-based Approximate Nearest Neighbor Search.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023

OPT: Optimal Proposal Transfer for Efficient Yield Optimization for Analog and SRAM Circuits.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023

A Point Transformer Accelerator with Fine-Grained Pipelines and Distribution-Aware Dynamic FPS.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023

Adam Accumulation to Reduce Memory Footprints of Both Activations and Gradients for Large-Scale DNN Training.
Proceedings of the ECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Kraków, Poland, 2023

Minimizing Communication Conflicts in Network-On-Chip Based Processing-In-Memory Architecture.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023

CLAP: Locality Aware and Parallel Triangle Counting with Content Addressable Memory.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023

PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Processing-In-Hierarchical-Memory Architecture for Billion-Scale Approximate Nearest Neighbor Search.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

An Efficient Accelerator for Point-based and Voxel-based Point Cloud Neural Networks.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Seeking the Yield Barrier: High-Dimensional SRAM Evaluation Through Optimal Manifold.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Memory-Efficient and Real-Time SPAD-based dToF Depth Sensor with Spatial and Statistical Correlation.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

TorchSparse++: Efficient Point Cloud Engine.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction.
Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023

NTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring.
Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023

2022
A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud.
ACM Trans. Reconfigurable Technol. Syst., 2022

Exploring the Potential of Low-Bit Training of Convolutional Neural Networks.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022

INCAME: Interruptible CNN Accelerator for Multirobot Exploration.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022

GRAPHIC: GatheR-And-Process in Highly parallel with In-SSD Compression Architecture in Very Large-Scale Graph.
CoRR, 2022

Heuristic Adaptability to Input Dynamics for SpMM on GPUs.
CoRR, 2022

Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective.
Proceedings of the Fifth Conference on Machine Learning and Systems, 2022

Optimizing Graph-based Approximate Nearest Neighbor Search: Stronger and Smarter.
Proceedings of the 23rd IEEE International Conference on Mobile Data Management, 2022

DIMMining: pruning-efficient and parallel graph mining on near-memory-computing.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022

Exploiting Parallelism with Vertex-Clustering in Processing-In-Memory-based GCN Accelerators.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022

Gibbon: Efficient Co-Exploration of NN Model and Processing-In-Memory Architecture.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022

Heuristic adaptability to input dynamics for SpMM on CPUs.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022

A one-for-all and <i>o</i>(<i>v</i> log(<i>v</i> ))-cost solution for parallel merge style operations on sorted key-value arrays.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022

2021
Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction.
CoRR, 2021

CogDL: An Extensive Toolkit for Deep Learning on Graphs.
CoRR, 2021

Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs.
Proceedings of the 39th IEEE International Conference on Computer Design, 2021

Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2021

3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud.
Proceedings of the FPGA '21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28, 2021

2020
GE-SpMM: general-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks.
Proceedings of the International Conference for High Performance Computing, 2020

GraphSDH: A General Graph Sampling Framework with Distribution and Hierarchy.
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020

LessMine: Reducing Sample Space and Data Access for Dense Pattern Mining.
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020

MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020

An Order Sampling Processing-in-Memory Architecture for Approximate Graph Pattern Mining.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020

Enable Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

INCAME: INterruptible CNN Accelerator for Multi-robot Exploration.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud.
Proceedings of the 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2020

INCA: INterruptible CNN Accelerator for Multi-tasking in Embedded Robots.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

2019
GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2019

HyVE: Hybrid Vertex-Edge Memory Hierarchy for Energy-Efficient Graph Processing.
IEEE Trans. Computers, 2019

Centrifuge: Evaluating full-system HLS-generated heterogenous-accelerator SoCs using FPGA-Acceleration.
Proceedings of the International Conference on Computer-Aided Design, 2019

A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Memory-Bound Proof-of-Work Acceleration for Blockchain Applications.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

GraphSAR: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs.
Proceedings of the 24th Asia and South Pacific Design Automation Conference, 2019

2018
GraphIA: an in-situ accelerator for large-scale graph processing.
Proceedings of the International Symposium on Memory Systems, 2018

NewGraph: Balanced Large-Scale Graph Processing on FPGAs with Low Preprocessing Overheads.
Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2018

HyVE: Hybrid vertex-edge memory hierarchy for energy-efficient graph processing.
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018

2017
ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture.
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017

2016
NXgraph: An efficient graph processing system on a single machine.
Proceedings of the 32nd IEEE International Conference on Data Engineering, 2016

Approximate Frequent Itemset Mining for streaming data on FPGA.
Proceedings of the 26th International Conference on Field Programmable Logic and Applications, 2016

FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search.
Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016

2015
A self-aware data compression system on FPGA in Hadoop.
Proceedings of the 2015 International Conference on Field Programmable Technology, 2015

2014
Online scheduling for FPGA computation in the Cloud.
Proceedings of the 2014 International Conference on Field-Programmable Technology, 2014

2013
DTW-Based Subsequence Similarity Search on AMD Heterogeneous Computing Platform.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013


  Loading...