Libo Huang

Orcid: 0000-0002-8307-6742

According to our database, Libo Huang authored at least 136 papers between 2007 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
RVAM16: a low-cost multiple-ISA processor based on RISC-V and ARM Thumb.
Frontiers Comput. Sci., January, 2025

2024
PDD: Pruning Neural Networks During Knowledge Distillation.
Cogn. Comput., November, 2024

A Low-Cost Floating-Point FMA Unit Supporting Package Operations for HPC-AI Applications.
IEEE Trans. Circuits Syst. II Express Briefs, July, 2024

A survey of compute nodes with 100 TFLOPS and beyond for supercomputers.
CCF Trans. High Perform. Comput., June, 2024

EPHA: An Energy-efficient Parallel Hybrid Architecture for ANNs and SNNs.
ACM Trans. Design Autom. Electr. Syst., May, 2024

Automatical Spike Sorting With Low-Rank and Sparse Representation.
IEEE Trans. Biomed. Eng., May, 2024

SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs.
ACM Trans. Archit. Code Optim., March, 2024

MPRTA: An Efficient Multilevel Parallel Mobile Accelerator for High-Performance Ray Tracing.
IEEE Trans. Very Large Scale Integr. Syst., February, 2024

A Low-Cost Floating-Point Dot-Product-Dual-Accumulate Architecture for HPC-Enabled AI.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., February, 2024

UMT-Net: A Uniform Multi-Task Network With Adaptive Task Weighting.
IEEE Trans. Intell. Veh., January, 2024

Wavelet-based Mamba with Fourier Adjustment for Low-light Image Enhancement.
CoRR, 2024

Real-time Stereo-based 3D Object Detection for Streaming Perception.
CoRR, 2024

Continual Learning in the Frequency Domain.
CoRR, 2024

EGOR: Efficient Generated Objects Replay for incremental object detection.
CoRR, 2024

Exemplar-Free Class Incremental Learning via Incremental Representation.
CoRR, 2024

Relational Diffusion Distillation for Efficient Image Generation.
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024, 2024

Low-Precision Vectorized Arithmetic Unit Designs for Deep Learning.
Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2024

Online Policy Distillation with Decision-Attention.
Proceedings of the International Joint Conference on Neural Networks, 2024

KFC: Knowledge Reconstruction and Feedback Consolidation Enable Efficient and Effective Continual Generative Learning.
Proceedings of the Second Tiny Papers Track at ICLR 2024, 2024

Cost-Effective Value Predictor for ILP processors through Design Space Exploration.
Proceedings of the Great Lakes Symposium on VLSI 2024, 2024

ImSPU: Implicit Sharing of Computation Resources Between Vector and Scalar Processing Units.
Proceedings of the Euro-Par 2024: Parallel Processing, 2024

CLIP-KD: An Empirical Study of CLIP Model Distillation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Online Relational Knowledge Distillation for Image Classification.
Proceedings of the 27th International Conference on Computer Supported Cooperative Work in Design, 2024

Class-wise Image Mixture Guided Self-Knowledge Distillation for Image Classification.
Proceedings of the 27th International Conference on Computer Supported Cooperative Work in Design, 2024

QuickTree: A Fast Hardware BVH Construction Engine.
Proceedings of the 21st ACM International Conference on Computing Frontiers, 2024

Out-of-Order and Recursive RAS: A Return Address Stack Design on High Performance Processor.
Proceedings of the 35th IEEE International Conference on Application-specific Systems, 2024

eTag: Class-Incremental Learning via Embedding Distillation and Task-Oriented Generation.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
A Current Loop Model for the Fast Simulation of Ferrofluids.
IEEE Trans. Vis. Comput. Graph., December, 2023

MMsRT: A Hardware Architecture for Ray Tracing in the Mobile Domain.
J. Circuits Syst. Comput., July, 2023

Nonlinear Causal Discovery for High-Dimensional Deterministic Data.
IEEE Trans. Neural Networks Learn. Syst., May, 2023

Multiple-Mode-Supporting Floating-Point FMA Unit for Deep Learning Processors.
IEEE Trans. Very Large Scale Integr. Syst., February, 2023

RCFusion: Fusing 4-D Radar and Camera With Bird's-Eye View Features for 3-D Object Detection.
IEEE Trans. Instrum. Meas., 2023

Tracking of Multiple Static and Dynamic Targets for 4D Automotive Millimeter-Wave Radar Point Cloud in Urban Environments.
Remote. Sens., 2023

E2Net: Resource-Efficient Continual Learning with Elastic Expansion Network.
CoRR, 2023

CLIP-KD: An Empirical Study of Distilling CLIP Models.
CoRR, 2023

eTag: Class-Incremental Learning with Embedding Distillation and Task-Oriented Generation.
CoRR, 2023

A Survey on Causal Reinforcement Learning.
CoRR, 2023

SFDoP: A Scalable Fused BFloat16 Dot-Product Architecture for DNN.
Proceedings of the 41st IEEE International Conference on Computer Design, 2023

A Scalable BFloat16 Dot-Product Architecture for Deep Learning.
Proceedings of the Great Lakes Symposium on VLSI 2023, 2023

Low-Cost Multiple-Precision Multiplication Unit Design For Deep Learning.
Proceedings of the Great Lakes Symposium on VLSI 2023, 2023

Confidence Counter Modelling for Value Predictor.
Proceedings of the Great Lakes Symposium on VLSI 2023, 2023

A Multi-level Parallel Integer/Floating-Point Arithmetic Architecture for Deep Learning Instructions.
Proceedings of the Euro-Par 2023: Parallel Processing - 29th International Conference on Parallel and Distributed Computing, Limassol, Cyprus, August 28, 2023

2022
A fast unsmoothed aggregation algebraic multigrid framework for the large-scale simulation of incompressible flow.
ACM Trans. Graph., 2022

Multi-Lane Detection and Tracking Using Temporal-Spatial Model and Particle Filtering.
IEEE Trans. Intell. Transp. Syst., 2022

RV16: An Ultra-Low-Cost Embedded RISC-V Processor Core.
J. Comput. Sci. Technol., 2022

Lifelong Generative Learning via Knowledge Reconstruction.
CoRR, 2022

Stride Equality Prediction for Value Speculation.
IEEE Comput. Archit. Lett., 2022

SADD: A Novel Systolic Array Accelerator with Dynamic Dataflow for Sparse GEMM in Deep Learning.
Proceedings of the Network and Parallel Computing, 2022

Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion.
Proceedings of the Network and Parallel Computing, 2022

TJ4DRadSet: A 4D Radar Dataset for Autonomous Driving.
Proceedings of the 25th IEEE International Conference on Intelligent Transportation Systems, 2022

MMTP: Multi-Modal Trajectory Prediction with Interaction Attention and Adaptive Task Weighting.
Proceedings of the 25th IEEE International Conference on Intelligent Transportation Systems, 2022

Efficient Multiple-Precision and Mixed-Precision Floating-Point Fused Multiply-Accumulate Unit for HPC and AI Applications.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2022

PipeFB: An Optimized Pipeline Parallelism Scheme to Reduce the Peak Memory Usage.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2022

RTA: an Efficient SIMD Architecture for Ray Tracing.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

2021
Ships, splashes, and waves on a vast ocean.
ACM Trans. Graph., 2021

GraphPEG: Accelerating Graph Processing on GPUs.
ACM Trans. Archit. Code Optim., 2021

Dynamic Hand Gesture Recognition in In-Vehicle Environment Based on FMCW Radar and Transformer.
Sensors, 2021

A Joint 2D-3D Complementary Network for Stereo Matching.
Sensors, 2021

Fast and Accurate Lane Detection via Graph Structure and Disentangled Representation Learning.
Sensors, 2021

Radar Transformer: An Object Classification Network Based on 4D MMW Imaging Radar.
Sensors, 2021

Robust Target Detection and Tracking Algorithm Based on Roadside Radar and Camera.
Sensors, 2021

Fast Convolution based on Winograd Minimum Filtering: Introduction and Development.
CoRR, 2021

Multi-Scale Cost Volumes Cascade Network for Stereo Matching.
Proceedings of the IEEE International Conference on Robotics and Automation, 2021

Multi-Scale Cascade Disparity Refinement Stereo Network.
Proceedings of the IEEE International Conference on Acoustics, 2021

Unsupervised Hard Case Extraction Based on Image Perceptual Hash Encoding.
Proceedings of the CONF-CDS 2021: The 2nd International Conference on Computing and Data Science, 2021

2020
Surface-only ferrofluids.
ACM Trans. Graph., 2020

A quantitative evaluation of unified memory in GPUs.
J. Supercomput., 2020

HPE: Hierarchical Page Eviction Policy for Unified Memory in GPUs.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

DancerFly: An Order-Aware Network-on-Chip Router On-the-Fly Mitigating Multi-path Packet Reordering.
Int. J. Parallel Program., 2020

Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Spike Sorting Based On Low-Rank And Sparse Representation.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2020

2019
Coordinated DMA: Improving the DRAM Access Efficiency for Matrix Multiplication.
IEEE Trans. Parallel Distributed Syst., 2019

On the accurate large-scale simulation of ferrofluids.
ACM Trans. Graph., 2019

Efficient architectural exploration of TAGE branch predictor for embedded processors.
Microelectron. J., 2019

SIMD stealing: Architectural support for efficient data parallel execution on multicores.
Microprocess. Microsystems, 2019

MT-DMA: A DMA Controller Supporting Efficient Matrix Transposition for Digital Signal Processing.
IEEE Access, 2019

Hierarchical Page Eviction Policy for Unified Memory in GPUs.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

An Efficient Direct Memory Access (DMA) Controller for Scientific Computing Accelerators.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2019

Improving the DRAM Access Efficiency for Matrix Multiplication on Multicore Accelerators.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

MOOBench: towards massive open online workbench.
Proceedings of the ACM Turing Celebration Conference - China, 2019

2018
Moving from exascale to zettascale computing: challenges and techniques.
Frontiers Inf. Technol. Electron. Eng., 2018

CHAM: Improving Prefetch Efficiency Using a Composite Hierarchy-Aware Method.
J. Circuits Syst. Comput., 2018

FC-AMAT: factor-based C-AMAT analysis in memory system measurement.
Innov. Syst. Softw. Eng., 2018

The Design of NoC-Side Memory Access Scheduling for Energy-Efficient GPGPUs.
Int. J. Parallel Program., 2018

DyCache: Dynamic Multi-Grain Cache Management for Irregular Memory Accesses on GPU.
IEEE Access, 2018

Accelerating BFS via Data Structure-Aware Prefetching on GPU.
IEEE Access, 2018

Evaluating Memory Performance of Emerging Scale-Out Applications Using C-AMAT.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2018

HMCSP: Reducing Transaction Latency of CSR-based SPMV in Hybrid Memory Cube.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2018

Improving Branch Prediction Accuracy on Multi-Core Architectures for Big Data.
Proceedings of the IEEE International Conference on Parallel & Distributed Processing with Applications, 2018

Adaptive VC Partitioning for NoCs in GPGPUs.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2018

VISU: A Simple and Efficient Cache Coherence Protocol Based on Self-updating.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2018

Peer-Formulated Assignment Method for Experimental Projects in CS courses.
Proceedings of the IEEE Frontiers in Education Conference, 2018

CMH: compression management for improving capacity in the hybrid memory cube.
Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018

HASS: High Accuracy Spike Sorting with Wavelet Package Decomposition and Mutual Information.
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2018

2017
Improving the Efficiency of GPGPU Work-Queue Through Data Awareness.
ACM Trans. Archit. Code Optim., 2017

Factor-Based C-AMAT Analysis for Memory Optimization.
Proceedings of the Verification and Evaluation of Computer and Communication Systems, 2017

Motivating Students through Peer-Formulated Assignments in CS Experimental Courses.
Proceedings of the 18th Annual Conference on Information Technology Education and the 6th Annual Conference on Research in Information Technology, 2017

Improving Branch Prediction for Thread Migration on Multi-core Architectures.
Proceedings of the Network and Parallel Computing, 2017

SimpleBP: A Lightweight Branch Prediction Simulator for Effective Design Exploration.
Proceedings of the 2017 International Conference on Networking, Architecture, and Storage, 2017

Branch Prediction Migration for Multi-Core Architectures.
Proceedings of the 2017 International Conference on Networking, Architecture, and Storage, 2017

BPSim: An integrated missrate, area, and power simulator for branch predictor.
Proceedings of the 6th International Conference on Modern Circuits and Systems Technologies, 2017

Unleashing the power of GPU for physically-based rendering via dynamic ray shuffling.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Effective Optimization of Branch Predictors through Lightweight Simulation.
Proceedings of the 2017 IEEE International Conference on Computer Design, 2017

Trace-based method for big data memory characteristics research.
Proceedings of the 2017 International Conference on Advances in Computing, 2017

Design Space Exploration of TAGE Branch Predictor with Ultra-Small RAM.
Proceedings of the Great Lakes Symposium on VLSI 2017, 2017

BC-AMAT: Considering Blocked Time in Memory System Measurement.
Proceedings of the Computing Frontiers Conference, 2017

POSTER: DaQueue: A Data-Aware Work-Queue Design for GPGPUs.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

2016
A Methodology for Performance Verification of Microprocessors.
Proceedings of the Computer Engineering and Technology - 20th CCF Conference, 2016

2015
Efficient data management on 3D stacked memory for big data applications.
Proceedings of the 10th International Design & Test Symposium, 2015

A Study on Non-volatile 3D Stacked Memory for Big Data Applications.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2015

Fast FPGA system for microarchitecture optimization on synthesizable modern processor design.
Proceedings of the 25th International Conference on Field Programmable Logic and Applications, 2015

2014
Integrated Coherence Prediction: Towards Efficient Cache Coherence on NoC-Based Multicore Architectures.
ACM Trans. Design Autom. Electr. Syst., 2014

Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs.
IEEE Trans. Computers, 2014

Mac or Non-MAC: not a Problem.
J. Circuits Syst. Comput., 2014

Efficient Utilization of SIMD Engines for General-Purpose Processors.
Comput. J., 2014

Leveraging on-chip networks for efficient prediction on multicore coherence.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

2013
Dynamic Streamization Model Execution for SIMD Engines on Multicore Architectures.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2013

Adaptive communication mechanism for accelerating MPI functions in NoC-based multicore processors.
ACM Trans. Archit. Code Optim., 2013

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP.
Parallel Comput., 2013

VBON: Toward efficient on-chip networks via hierarchical virtual bus.
Microprocess. Microsystems, 2013

DCP: Improving the Throughput of Asynchronous Pipeline by Dual Control Path.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

2012
Low-Cost Binary128 Floating-Point FMA Unit Design with SIMD Support.
IEEE Trans. Computers, 2012

An optimized multicore cache coherence design for exploiting communication locality.
Proceedings of the Great Lakes Symposium on VLSI 2012, 2012

Accelerating NoC-Based MPI Primitives via Communication Architecture Customization.
Proceedings of the 23rd IEEE International Conference on Application-Specific Systems, 2012

2011
A specialized low-cost vectorized loop buffer for embedded processors.
Proceedings of the Design, Automation and Test in Europe, 2011

2010
Permutation optimization for SIMD devices.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2010), May 30, 2010

SV: Enhancing SIMD Architectures via Combined SIMD-Vector Approach.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2010

SIF: Overcoming the limitations of SIMD devices via implicit permutation.
Proceedings of the 16th International Conference on High-Performance Computer Architecture (HPCA-16 2010), 2010

2009
Optimal subgraph covering for customisable VLIW processors.
IET Comput. Digit. Tech., 2009

Implementation of OpenVG Path and Paint Algorithms on Synchronous Data Triggered Architecture with Optimization.
Proceedings of the International Conference on Networking, Architecture, and Storage, 2009

2008
Hierarchical memory system design for a heterogeneous multi-core processor.
Proceedings of the 2008 ACM Symposium on Applied Computing (SAC), 2008

A New CORDIC Algorithm and Software Implementation Based on Synchronized Data Triggering Architecture.
Proceedings of the 2008 International Conference on Multimedia and Ubiquitous Engineering (MUE 2008), 2008

Customizing computation accelerators for extensible multi-issue processors with effective optimization techniques.
Proceedings of the 45th Design Automation Conference, 2008

Memory System Design for a Multi-core Processor.
Proceedings of the Second International Conference on Complex, 2008

2007
Hardware Support for Arithmetic Units of Processor with Multimedia Extension.
Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering (MUE 2007), 2007

A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design.
Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH-18 2007), 2007

