2025
Fine-Grained Structured Sparse Computing for FPGA-Based AI Inference.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., July, 2025
Enabling Efficient Sparse Multiplications on GPUs With Heuristic Adaptability.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., June, 2025
HyCTor: A Hybrid CNN-Transformer Network Accelerator With Flexible Weight/Output Stationary Dataflow and Multicore Extension.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., May, 2025
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing.
CoRR, May, 2025
PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs.
CoRR, May, 2025
semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage.
CoRR, April, 2025
FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation.
CoRR, April, 2025
VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate.
CoRR, April, 2025
DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers.
CoRR, March, 2025
A Point Transformer Accelerator With Distribution-Aware Heuristic Distance Calculation.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., February, 2025
Megrez-Omni Technical Report.
CoRR, February, 2025
DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation.
CoRR, February, 2025
DeepGate4: Efficient and Effective Representation Learning for Circuit Design at Scale.
CoRR, February, 2025
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models.
CoRR, January, 2025
SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting.
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
TB-STC: Transposable Block-wise N: M Structured Sparse Tensor Core.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025
FMC-LLM: Enabling FPGAs for Efficient Batched Decoding of 70B+ LLMs with a Memory-Centric Streaming Architecture.
Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2025
FlightVGM: Efficient Video Generation Model Inference with Online Sparsification and Hybrid Precision on FPGAs.
Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2025
DyLGNN: Efficient LM-GNN Fine-Tuning with Dynamic Node Partitioning, Low-Degree Sparsity, and Asynchronous Sub-Batch.
Proceedings of the Design, Automation & Test in Europe Conference, 2025
AiSpGEMM: Accelerating Imbalanced SpGEMM on FPGAs with Flexible Interconnect and Intra-row Parallel Merging.
Proceedings of the Design, Automation & Test in Europe Conference, 2025
SoftmAP: Software-Hardware Co-Design for Integer-Only Softmax on Associative Processors.
Proceedings of the Design, Automation & Test in Europe Conference, 2025
Deploying Diffusion Models with Scheduling Space Search and Memory Overflow Prevention Based on Graph Optimization.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
Accelerator for LLM-Enhanced GNN with Product Quantization and Unified Indexing.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
LLSM: LLM-enhanced Logic Synthesis Model with EDA-guided CoT Prompting, Hybrid Embedding and AIG-tailored Acceleration.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
ViDA: Video Diffusion Transformer Acceleration with Differential Approximation and Adaptive Dataflow.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
2024
Toward High-Accuracy and Real-Time Two-Stage Small Object Detection on FPGA.
IEEE Trans. Circuits Syst. Video Technol., September, 2024
GRAPHIC: Gather and Process Harmoniously in the Cache With High Parallelism and Flexibility.
IEEE Trans. Emerg. Top. Comput., 2024
MBQ: Modality-Balanced Quantization for Large Vision-Language Models.
CoRR, 2024
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling.
CoRR, 2024
Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach.
CoRR, 2024
Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search.
CoRR, 2024
Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective.
CoRR, 2024
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding.
CoRR, 2024
CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios.
CoRR, 2024
Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs.
CoRR, 2024
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.
CoRR, 2024
Can LLMs Learn by Teaching? A Preliminary Study.
CoRR, 2024
DiTFastAttn: Attention Compression for Diffusion Transformer Models.
CoRR, 2024
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.
CoRR, 2024
HetHub: A Heterogeneous distributed hybrid training system for large-scale models.
CoRR, 2024
DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis.
CoRR, 2024
A Survey on Efficient Inference for Large Language Models.
CoRR, 2024
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better.
CoRR, 2024
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K.
CoRR, 2024
Efficient Deployment of Large Language Model across Cloud-Device Systems.
Proceedings of the 37th IEEE International System-on-Chip Conference, 2024
DiTFastAttn: Attention Compression for Diffusion Transformer Models.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Can LLMs Learn by Teaching for Better Reasoning? A Preliminary Study.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024
Evaluating Quantized Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Towards Floating Point-Based Attention-Free LLM: Hybrid PIM with Non-Uniform Data Format and Reduced Multiplications.
Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024
Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization.
Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024
MARCA: Mamba Accelerator with Reconfigurable Architecture.
Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs.
Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization.
Proceedings of the Computer Vision - ECCV 2024, 2024
DyPIM: Dynamic-Inference-Enabled Processing-In-Memory Accelerator.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024
FusionArch: A Fusion-Based Accelerator for Point-Based Point Cloud Neural Networks.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024
DySpMM: From Fix to Dynamic for Sparse Matrix-Matrix Multiplication Accelerators.
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024
FlashEval: Towards Fast and Accurate Evaluation of Text-to-Image Diffusion Generative Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
2023
CoGNN: An Algorithm-Hardware Co-Design Approach to Accelerate GNN Inference With Minibatch Sampling.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., December, 2023
MNSIM 2.0: A Behavior-Level Modeling Tool for Processing-In-Memory Architectures.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., November, 2023
Gibbon: An Efficient Co-Exploration Framework of NN Model and Processing-In-Memory Architecture.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., November, 2023
Adaptive Multidimensional Parallel Fault Simulation Framework on Heterogeneous System.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., June, 2023
Sgap: towards efficient sparse tensor algebra compilation for GPU.
CCF Trans. High Perform. Comput., June, 2023
Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective.
IEEE Trans. Computers, May, 2023
FlashDecoding++: Faster Large Language Model Inference on GPUs.
CoRR, 2023
CogDL: A Comprehensive Library for Graph Deep Learning.
Proceedings of the ACM Web Conference 2023, 2023
History-Detr: Optimize Query Initialization Strategy by Using Historical Information and Kinematics.
Proceedings of the ACM Multimedia Asia 2023, 2023
HyperGef: A Framework Enabling Efficient Fusion for Hypergraph Neural Network on GPUs.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023
Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023
DF-GAS: a Distributed FPGA-as-a-Service Architecture towards Billion-Scale Graph-based Approximate Nearest Neighbor Search.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
Ada3D: Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023
OPT: Optimal Proposal Transfer for Efficient Yield Optimization for Analog and SRAM Circuits.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023
A Point Transformer Accelerator with Fine-Grained Pipelines and Distribution-Aware Dynamic FPS.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023
Adam Accumulation to Reduce Memory Footprints of Both Activations and Gradients for Large-Scale DNN Training.
Proceedings of the ECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Kraków, Poland, 2023
Minimizing Communication Conflicts in Network-On-Chip Based Processing-In-Memory Architecture.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023
CLAP: Locality Aware and Parallel Triangle Counting with Content Addressable Memory.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023
PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Processing-In-Hierarchical-Memory Architecture for Billion-Scale Approximate Nearest Neighbor Search.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
An Efficient Accelerator for Point-based and Voxel-based Point Cloud Neural Networks.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Seeking the Yield Barrier: High-Dimensional SRAM Evaluation Through Optimal Manifold.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Memory-Efficient and Real-Time SPAD-based dToF Depth Sensor with Spatial and Statistical Correlation.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
TorchSparse++: Efficient Point Cloud Engine.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction.
Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023
NTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring.
Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023
2022
A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud.
ACM Trans. Reconfigurable Technol. Syst., 2022
Exploring the Potential of Low-Bit Training of Convolutional Neural Networks.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022
INCAME: Interruptible CNN Accelerator for Multirobot Exploration.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022
GRAPHIC: GatheR-And-Process in Highly parallel with In-SSD Compression Architecture in Very Large-Scale Graph.
CoRR, 2022
Heuristic Adaptability to Input Dynamics for SpMM on GPUs.
CoRR, 2022
Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective.
Proceedings of the Fifth Conference on Machine Learning and Systems, 2022
Optimizing Graph-based Approximate Nearest Neighbor Search: Stronger and Smarter.
Proceedings of the 23rd IEEE International Conference on Mobile Data Management, 2022
DIMMining: pruning-efficient and parallel graph mining on near-memory-computing.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022
Exploiting Parallelism with Vertex-Clustering in Processing-In-Memory-based GCN Accelerators.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022
Gibbon: Efficient Co-Exploration of NN Model and Processing-In-Memory Architecture.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022
Heuristic adaptability to input dynamics for SpMM on GPUs.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022
A one-for-all and o(v log(v))-cost solution for parallel merge style operations on sorted key-value arrays.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022
2021
Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction.
CoRR, 2021
CogDL: An Extensive Toolkit for Deep Learning on Graphs.
CoRR, 2021
Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs.
Proceedings of the 39th IEEE International Conference on Computer Design, 2021
Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2021
3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud.
Proceedings of the FPGA '21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28, 2021
2020
GE-SpMM: general-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks.
Proceedings of the International Conference for High Performance Computing, 2020
GraphSDH: A General Graph Sampling Framework with Distribution and Hierarchy.
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020
LessMine: Reducing Sample Space and Data Access for Dense Pattern Mining.
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020
MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020
An Order Sampling Processing-in-Memory Architecture for Approximate Graph Pattern Mining.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020
Enable Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020
INCAME: INterruptible CNN Accelerator for Multi-robot Exploration.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020
Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud.
Proceedings of the 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2020
INCA: INterruptible CNN Accelerator for Multi-tasking in Embedded Robots.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020
2019
GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2019
HyVE: Hybrid Vertex-Edge Memory Hierarchy for Energy-Efficient Graph Processing.
IEEE Trans. Computers, 2019
Centrifuge: Evaluating full-system HLS-generated heterogenous-accelerator SoCs using FPGA-Acceleration.
Proceedings of the International Conference on Computer-Aided Design, 2019
A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019
Memory-Bound Proof-of-Work Acceleration for Blockchain Applications.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019
GraphSAR: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs.
Proceedings of the 24th Asia and South Pacific Design Automation Conference, 2019
2018
GraphIA: an in-situ accelerator for large-scale graph processing.
Proceedings of the International Symposium on Memory Systems, 2018
NewGraph: Balanced Large-Scale Graph Processing on FPGAs with Low Preprocessing Overheads.
Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2018
HyVE: Hybrid vertex-edge memory hierarchy for energy-efficient graph processing.
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018
2017
ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture.
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017
2016
NXgraph: An efficient graph processing system on a single machine.
Proceedings of the 32nd IEEE International Conference on Data Engineering, 2016
Approximate Frequent Itemset Mining for streaming data on FPGA.
Proceedings of the 26th International Conference on Field Programmable Logic and Applications, 2016
FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search.
Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016
2015
A self-aware data compression system on FPGA in Hadoop.
Proceedings of the 2015 International Conference on Field Programmable Technology, 2015
2014
Online scheduling for FPGA computation in the Cloud.
Proceedings of the 2014 International Conference on Field-Programmable Technology, 2014
2013
DTW-Based Subsequence Similarity Search on AMD Heterogeneous Computing Platform.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013