2025
KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads.
CoRR, May, 2025
OneAdapt: Adaptive Compilation for Resource-Constrained Photonic One-Way Quantum Computing.
CoRR, April, 2025
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training.
CoRR, March, 2025
Hardware-aware Calibration Protocol for Quantum Computers.
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025
SwitchQNet: Optimizing Distributed Quantum Computing for Quantum Data Centers with Switch Networks.
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025
TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model.
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025
CaliQEC: In-situ Qubit Calibration for Surface Code Quantum Error Correction.
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025
Mutual Effort for Efficiency: A Similarity-based Token Pruning for Vision Transformers in Self-Supervised Learning.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025
QECC-Synth: A Layout Synthesizer for Quantum Error Correction Codes on Sparse Architectures.
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025
HetEC: Architectures for Heterogeneous Quantum Error Correction Codes.
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025
2024
Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration.
CoRR, 2024
OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model.
Proceedings of the 2024 USENIX Annual Technical Conference, 2024
RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules.
Proceedings of the International Conference for High Performance Computing, 2024
Surf-Deformer: Mitigating Dynamic Defects on Surface Code via Adaptive Deformation.
Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024
Soter: Analytical Tensor-Architecture Modeling and Automatic Tensor Program Tuning for Spatial Accelerators.
Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024
MECH: Multi-Entry Communication Highway for Superconducting Quantum Chiplets.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
OnePerc: A Randomness-aware Compiler for Photonic Quantum Computing.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
ZENO: A Type-based Optimization Framework for Zero Knowledge Neural Network Inference.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
2023
MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing.
ACM Trans. Archit. Code Optim., September, 2023
Comprehensive SNN Compression Using ADMM Optimization and Activity Regularization.
IEEE Trans. Neural Networks Learn. Syst., June, 2023
Exploring Adversarial Attack in Spiking Neural Networks With Spike-Compatible Gradient.
IEEE Trans. Neural Networks Learn. Syst., May, 2023
A Geometrical Approach to Evaluate the Adversarial Robustness of Deep Neural Networks.
ACM Trans. Multim. Comput. Commun. Appl., 2023
SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2023
SPG: Structure-Private Graph Database via SqueezePIR.
Proc. VLDB Endow., 2023
ReDCIM: Reconfigurable Digital Computing-In-Memory Processor With Unified FP/INT Pipeline for Cloud AI Acceleration.
IEEE J. Solid State Circuits, 2023
TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes.
IEEE J. Solid State Circuits, 2023
TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs.
Proceedings of the 2023 USENIX Annual Technical Conference, 2023
QASMTrans: A QASM Quantum Transpiler Framework for NISQ Devices.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023
Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023
MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms.
Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, 2023
ALCOP: Automatic Load-Compute Pipelining in Deep Learning Compiler for AI-GPUs.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023
QuComm: Optimizing Collective Communication for Distributed Quantum Computing.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
RM-STC: Row-Merge Dataflow Inspired GPU Sparse Tensor Core for Energy-Efficient Sparse Acceleration.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
OneQ: A Compilation Framework for Photonic One-Way Quantum Computation.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023
Q-BEEP: Quantum Bayesian Error Mitigation Employing Poisson Modeling over the Hamming Spectrum.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023
ECSSD: Hardware/Data Layout Co-Designed In-Storage-Computing Architecture for Extreme Classification.
Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023
On Adversarial Robustness of Point Cloud Semantic Segmentation.
Proceedings of the 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2023
2022
STPAcc: Structural TI-Based Pruning for Accelerating Distance-Related Algorithms on CPU-FPGA Platforms.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022
Rubik: A Hierarchical Architecture for Efficient Graph Neural Network Training.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022
Dynamic Sparse Attention for Scalable Transformer Acceleration.
IEEE Trans. Computers, 2022
A Systematic View of Model Leakage Risks in Deep Neural Network Systems.
IEEE Trans. Computers, 2022
Quantum and Post-Moore's Law Computing.
IEEE Internet Comput., 2022
Enabling Data Movement and Computation Pipelining in Deep Learning Compiler.
CoRR, 2022
Empowering GNNs with Fine-grained Communication-Computation Pipelining on Multi-GPU Platforms.
CoRR, 2022
CollComm: Enabling Efficient Collective Quantum Communication Based on EPR buffering.
CoRR, 2022
GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing.
CoRR, 2022
Heuristic Adaptability to Input Dynamics for SpMM on GPUs.
CoRR, 2022
MPU-Sim: A Simulator for In-DRAM Near-Bank Processing Architectures.
IEEE Comput. Archit. Lett., 2022
Faith: An Efficient Framework for Transformer Verification on GPUs.
Proceedings of the 2022 USENIX Annual Technical Conference, 2022
LightSeq2: Accelerated Training for Transformer-Based Models on GPUs.
Proceedings of the SC22: International Conference for High Performance Computing, 2022
EL-Rec: Efficient Large-Scale Recommendation Model Training via Tensor-Train Embedding Table.
Proceedings of the SC22: International Conference for High Performance Computing, 2022
QGTC: accelerating quantized graph neural networks via GPU tensor core.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022
Biologically Inspired Dynamic Thresholds for Spiking Neural Networks.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective.
Proceedings of the Fifth Conference on Machine Learning and Systems, 2022
AutoComm: A Framework for Enabling Efficient Communication in Distributed Quantum Programs.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022
A 28nm 15.59µJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes.
Proceedings of the IEEE International Solid-State Circuits Conference, 2022
A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration.
Proceedings of the IEEE International Solid-State Circuits Conference, 2022
A synthesis framework for stitching surface code with superconducting quantum devices.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022
EQC: ensembled quantum computing for variational quantum algorithms.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022
INSPIRE: in-storage private information retrieval via protocol and architecture co-design.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022
Shfl-BW: accelerating deep neural network inference with tensor-core aware weight pruning.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022
Heuristic adaptability to input dynamics for SpMM on GPUs.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022
DOTA: detect and omit weak attentions for scalable transformer acceleration.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022
Paulihedral: a generalized block-wise compiler optimization framework for Quantum simulation kernels.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022
2021
Effective and Efficient Batch Normalization Using a Few Uncorrelated Data for Statistics Estimation.
IEEE Trans. Neural Networks Learn. Syst., 2021
Reuse-centric k-means configuration.
Inf. Syst., 2021
ZEN: Efficient Zero-Knowledge Proofs for Neural Networks.
IACR Cryptol. ePrint Arch., 2021
Attacking Point Cloud Segmentation with Color-only Perturbation.
CoRR, 2021
TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs.
CoRR, 2021
Towards Efficient Ansatz Architecture for Variational Quantum Algorithms.
CoRR, 2021
Mapping Surface Code to Superconducting Quantum Processors.
CoRR, 2021
QECV: Quantum Error Correction Verification.
CoRR, 2021
Mitigating Noise-Induced Gradient Vanishing in Variational Quantum Algorithm Training.
CoRR, 2021
QGTC: Accelerating Quantized GNN via GPU Tensor Core.
CoRR, 2021
Transformer Acceleration with Dynamic Sparse Attention.
CoRR, 2021
Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction.
CoRR, 2021
MPU: Towards Bandwidth-abundant SIMT Processor via Near-bank Computing.
CoRR, 2021
Palleon: A Runtime System for Efficient Video Processing toward Dynamic Class Skew.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021
APNN-TC: accelerating arbitrary precision neural networks on ampere GPU tensor cores.
Proceedings of the International Conference for High Performance Computing, 2021
Efficient tensor core-based GPU kernels for structured sparsity under reduced precision.
Proceedings of the International Conference for High Performance Computing, 2021
EGEMM-TC: accelerating scientific computing on tensor cores with extended precision.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021
GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs.
Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation, 2021
On the Co-Design of Quantum Software and Hardware.
Proceedings of the NANOCOM '21: The Eighth Annual ACM International Conference on Nanoscale Computing and Communication, Virtual Event, Italy, September 7, 2021
ENMC: Extreme Near-Memory Classification via Approximate Screening.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021
Improving Streaming Graph Processing Performance using Input Knowledge.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021
DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021
Overcoming the Memory Hierarchy Inefficiencies in Graph Processing Applications.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2021
Saga: Sparse Adversarial Attack on EEG-Based Brain Computer Interface.
Proceedings of the IEEE International Conference on Acoustics, 2021
An Efficient Quantitative Approach for Optimizing Convolutional Neural Networks.
Proceedings of the CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1, 2021
TiAcc: Triangle-inequality based Hardware Accelerator for K-means on FPGAs.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021
UAG: Uncertainty-aware Attention Graph Neural Network for Defending Adversarial Attacks.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
2020
Projection-based runtime assertions for testing and debugging Quantum programs.
Proc. ACM Program. Lang., 2020
Rethinking the performance comparison between SNNs and ANNs.
Neural Networks, 2020
Tianjic: A Unified and Scalable Chip Bridging Spike-Based and Continuous Neural Computation.
IEEE J. Solid State Circuits, 2020
A novel ensemble pruning approach based on information exchange glowworm swarm optimization and complementarity measure.
J. Intell. Fuzzy Syst., 2020
Rubik: A Hierarchical Architecture for Efficient Graph Learning.
CoRR, 2020
Uncertainty-aware Attention Graph Neural Network for Defending Adversarial Attacks.
CoRR, 2020
Scalable Adversarial Attack on Graph Neural Networks with Alternating Direction Method of Multipliers.
CoRR, 2020
Optimizing Convolutional Neural Network Architecture via Information Field.
CoRR, 2020
GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs.
CoRR, 2020
Domain-adversarial multi-task framework for novel therapeutic property prediction of compounds.
Bioinform., 2020
A Close Look at Multi-tenant Parallel CNN Inference for Autonomous Driving.
Proceedings of the Network and Parallel Computing, 2020
DUET: Boosting Deep Neural Network Efficiency on Dual-Module Architecture.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020
iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020
SGQuant: Squeezing the Last Bit on Graph Neural Networks with Specialized Quantization.
Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, 2020
Boosting Deep Neural Network Efficiency with Dual-Module Inference.
Proceedings of the 37th International Conference on Machine Learning, 2020
Eliminating Redundant Computation in Noisy Quantum Computing Simulation.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020
Towards Efficient Superconducting Quantum Processor Architecture Design.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020
DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020
Weighted-Sampling Audio Adversarial Example Attack.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020
2019
DASM: Data-Streaming-Based Computing in Nonvolatile Memory Architecture for Embedded System.
IEEE Trans. Very Large Scale Integr. Syst., 2019
Poq: Projection-based Runtime Assertions for Debugging on a Quantum Computer.
CoRR, 2019
AccD: A Compiler-based Framework for Accelerating Distance-related Algorithms on CPU-FPGA Platforms.
CoRR, 2019
SANQ: A Simulation Framework for Architecting Noisy Intermediate-Scale Quantum Computing System.
CoRR, 2019
Neural Network Model Extraction Attacks in Edge Devices by Hearing Architectural Hints.
CoRR, 2019
Adversarial attack on Speech-to-Text Recognition Models.
CoRR, 2019
Reconciling Feature-Reuse and Overfitting in DenseNet with Specialized Dropout.
Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence, 2019
Dynamic Sparse Graph for Efficient Deep Learning.
Proceedings of the 7th International Conference on Learning Representations, 2019
KPynq: A Work-Efficient Triangle-Inequality Based K-Means on FPGA.
Proceedings of the 27th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2019
Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019
2018
Penetrating the Fog: the Path to Efficient CNN Models.
CoRR, 2018
Domain-Adversarial Multi-Task Framework for Novel Therapeutic Property Prediction of Compounds.
CoRR, 2018
Reconciling Feature-Reuse and Overfitting in DenseNet with Specialized Dropout.
CoRR, 2018
In-memory multiplication engine with SOT-MRAM based stochastic computing.
CoRR, 2018
SECS: Efficient Deep Stream Processing via Class Skew Dichotomy.
CoRR, 2018
Challenges Towards Deploying Data Intensive Scientific Applications on Extreme Heterogeneity Supercomputers.
CoRR, 2018
Reuse-Centric K-Means Configuration.
Proceedings of the 34th IEEE International Conference on Data Engineering, 2018
2017
GLORE: generalized loop redundancy elimination upon LER-notation.
Proc. ACM Program. Lang., 2017
Generalizations of the theory and deployment of triangular inequality for compiler-based strength reduction.
Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2017
Sweet KNN: An Efficient KNN on GPU through Reconciliation between Redundancy Removal and Regularity.
Proceedings of the 33rd IEEE International Conference on Data Engineering, 2017
2015
TOP: A Framework for Enabling Algorithmic Optimizations for Distance-Related Problems.
Proc. VLDB Endow., 2015
Autotuning algorithmic choice for input sensitivity.
Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2015
Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup.
Proceedings of the 32nd International Conference on Machine Learning, 2015
2014
Call sequence prediction through probabilistic calling automata.
Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, 2014
Finding the limit: examining the potential and complexity of compilation scheduling for JIT-based runtime systems.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2014
2013
Profmig: A framework for flexible migration of program profiles across software versions.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013