Huayou Su

Orcid: 0000-0002-3587-0917

According to our database, Huayou Su authored at least 44 papers between 2009 and 2024.

Collaborative distances:
  • Dijkstra number of five.
  • Erdős number of four.







Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models.
CoRR, 2024

SyncIntellects: Orchestrating LLM Inference with Progressive Prediction and QoS-Friendly Control.
Proceedings of the 32nd IEEE/ACM International Symposium on Quality of Service, 2024

Prism: Decomposing Program Semantics for Code Clone Detection through Compilation.
Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024

Optimizing GNN Inference Processing on Very Long Vector Processor.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2023

Model Provenance Management in MLOps Pipeline.
Proceedings of the ICCDE 2022: The 8th International Conference on Computing and Data Engineering, Bangkok, Thailand, January 11, 2022

An Efficient Transformer Inference Engine on DSP.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2022

Optimizing GNN on ARM Multi-Core Processors.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

Optimize DGL Operations on x86-64 Multi-Core Processors.
Proceedings of the HP3C 2022: 6th International Conference on High Performance Compilation, 2022

Multilevel parallelism optimization of stencil computations on SIMDlized NUMA architectures.
J. Supercomput., 2021

Beyond AP: a new evaluation index for multiclass classification task accuracy.
Appl. Intell., 2021

Graphcomm: A Graph Neural Network Based Method for Multi-Agent Reinforcement Learning.
Proceedings of the IEEE International Conference on Acoustics, 2021

DWS-MKL: Depth-width-scaling multiple kernel learning for data classification.
Neurocomputing, 2020

P4 to FPGA-A Fast Approach for Generating Efficient Network Processors.
IEEE Access, 2020

A High-Throughput LDPC Decoder Based on GPUs for 5G New Radio.
Proceedings of the IEEE Symposium on Computers and Communications, 2020

Learning Network Representation Through Reinforcement Learning.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020

Data Layout Transformation for Stencil Computations Using ARM NEON Extension.
Proceedings of the 22nd IEEE International Conference on High Performance Computing and Communications; 18th IEEE International Conference on Smart City; 6th IEEE International Conference on Data Science and Systems, 2020

Optimization and Performance Modeling of Stencil Computations on ARM Architectures.
Proceedings of the 22nd IEEE International Conference on High Performance Computing and Communications; 18th IEEE International Conference on Smart City; 6th IEEE International Conference on Data Science and Systems, 2020

A Skewness-Aware Matrix Factorization Approach for Mesh-Structured Cloud Services.
IEEE/ACM Trans. Netw., 2019

Poster Abstract: A Template-based Framework for Generating Network Processor in FPGA.
Proceedings of the IEEE INFOCOM 2019, 2019

Author Disambiguation through Adversarial Network Representation Learning.
Proceedings of the International Joint Conference on Neural Networks, 2019

HPGraph: High-Performance Graph Analytics with Productivity on the GPU.
Sci. Program., 2018

Deep Discriminative Clustering Network.
Proceedings of the 2018 International Joint Conference on Neural Networks, 2018

High performance graph analytics with productivity on hybrid CPU-GPU platforms.
Proceedings of the 2nd International Conference on High Performance Compilation, 2018

A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC.
Sci. Program., 2017

Cryo-EM structure of the protein-conducting ERAD channel Hrd1 in complex with Hrd3.
Nat., 2017

Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit.
Frontiers Inf. Technol. Electron. Eng., 2017

High Performance Parallel Graph Coloring on GPGPUs.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

An analytical GPU performance model for 3D stencil computations from the angle of data traffic.
J. Supercomput., 2015

Towards simulation of subcellular calcium dynamics at nanometre resolution.
Int. J. High Perform. Comput. Appl., 2015

High efficient sedimentary basin simulations on hybrid CPU-GPU clusters.
Clust. Comput., 2014

Resource-efficient utilization of CPU/GPU-based heterogeneous supercomputers for Bayesian phylogenetic inference.
J. Supercomput., 2013

On the GPU Performance of 3D Stencil Computations Implemented in OpenCL.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

On the GPU-CPU Performance Portability of OpenCL for 3D Stencil Computations.
Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems, 2013

Performance of Sediment Transport Simulations on NVIDIA's Kepler Architecture.
Proceedings of the International Conference on Computational Science, 2013

Improving Performance of GPU Specific OpenCL Program on CPUs.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

A Parallel H.264 Encoder with CUDA: Mapping and Evaluation.
Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

Parallelization Design of Irregular Algorithms of Video Processing on GPUs.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, 2012

Using 1000+ GPUs and 10000+ CPUs for Sedimentary Basin Simulations.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

A high-efficient software parallel CAVLC encoder based on GPU.
Proceedings of the 34th International Conference on Telecommunications and Signal Processing (TSP 2011), 2011

High-efficient software parallel CAVLC encoder based on programmable stream processor.
Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28, 2011

A Multilevel Parallel Intra Coding for H.264/AVC Based on CUDA.
Proceedings of the Sixth International Conference on Image and Graphics, 2011

A Parallel Streaming Motion Estimation for Real-Time HD H.264 Encoding on Programmable Processors.
Proceedings of the Fifth International Conference on Frontier of Computer Science and Technology, 2010

SAT: A Stream Architecture Template for Embedded Applications.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

Streaming HD H.264 encoder on programmable processors.
Proceedings of the 17th International Conference on Multimedia 2009, 2009
