Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

MZ Core: An Enhanced Matrix Acceleration Engine for HPC/AI Applications.

Yasong Cao

Exploring ILP for VLIW Architecture by Quantified Modeling and Dynamic Programming-Based Instruction Scheduling.

Proceedings of the 27th Asia and South Pacific Design Automation Conference, 2022

2021

Sustaining Consumer Trust and Continuance Intention by Institutional Mechanisms: An Empirical Survey of DiDi in China.

IEEE Access, 2021

Automatic mapping and code optimization for OpenCL kernels on FT-matrix architecture (WIP paper).

Proceedings of the LCTES '21: 22nd ACM SIGPLAN/SIGBED International Conference on Languages, 2021

sRouting: Towards a Better Flow Size Estimation Performance through Routing and Sketch Configuration.

Yang Shi

Mei Wen

Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

SAI: Self-Adjusting Incremental Quantile Estimation for Sparse Training of Neural Networks on Hardware Accelerators.

Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

Embrace the Conflicts: Exploring the Integration of Single Port Memory in Systolic Array-based Accelerators.

2020

Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.

IEEE Trans. Parallel Distributed Syst., 2020

Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA.

IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Efficient Parallel TLD on CPU-GPU Platform for Real-Time Tracking.

KSII Trans. Internet Inf. Syst., 2020

P4 to FPGA-A Fast Approach for Generating Efficient Network Processors.

IEEE Access, 2020

Incremental Deployment of Programmable Switches for Sketch-based Network Measurement.

Yang Shi

Mei Wen

Chunyuan Zhang

Proceedings of the IEEE Symposium on Computers and Communications, 2020

Towards High-Efficiency Data Centers via Job-Aware Network Scheduling.

Yang Shi

Mei Wen

Chunyuan Zhang

Proceedings of the ICPP 2020: 49th International Conference on Parallel Processing, 2020

HybridSketch: A Memory-centric Precise Approach for Flow Measurement.

Proceedings of the 2020 IEEE International Conference on Communications, 2020

Optimized HybridSketch: More Efficient with Analysis and Algorithm.

Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Towards a Deep-Pipelined Architecture for Accelerating Deep GCN on a Multi-FPGA Platform.

Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Scalable FPGA-based Architecture for High-Performance Per-Flow Traffic Measurement.

Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

Towards Memory-Efficient Streaming Processing with Counter-Cascading Sketching on FPGA.

Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

2019

A Fast Approach for Generating Efficient Parsers on FPGAs.

Symmetry, 2019

Metaflow: A DAG-Based Network Abstraction for Distributed Applications.

CoRR, 2019

Application-Oriented Network Scheduling With Metaflow.

IEEE Access, 2019

Interleaved Sketch: Toward Consistent Network Telemetry for Commodity Programmable Switches.

IEEE Access, 2019

KVSwitch: An In-network Load Balancer for Key-Value Stores.

Proceedings of the 2019 IEEE Symposium on Computers and Communications, 2019

Towards a Uniform Architecture for the Efficient Implementation of 2D and 3D Deconvolutional Neural Networks on FPGAs.

Proceedings of the IEEE International Symposium on Circuits and Systems, 2019

Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters.

Proceedings of the IEEE INFOCOM 2019, 2019

Poster Abstract: A Template-based Framework for Generating Network Processor in FPGA.

Proceedings of the IEEE INFOCOM 2019, 2019

An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform.

Proceedings of the 48th International Conference on Parallel Processing, 2019

Metaflow: A Better Traffic Abstraction for Distributed Applications.

Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

TBSW: Time-Based Sliding Window Algorithm for Network Traffic Measurement.

SACC: Configuring Application-Level Cache Intelligently for In-Memory Database Based on Long Short-Term Memory.

SWAP: a sliding window algorithm for in-network packet measurement.

Proceedings of the 3rd International Conference on High Performance Compilation, 2019

Accelerating 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System.

Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019

GENIE: QoS-guided Dynamic Scheduling for CNN-based Tasks on SME Clusters.

Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

Scale-out Acceleration for 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System.

Proceedings of the 56th Annual Design Automation Conference 2019, 2019

2018

HPGraph: High-Performance Graph Analytics with Productivity on the GPU.

Sci. Program., 2018

MALMM: A multi-array architecture for large-scale matrix multiplication on FPGA.

IEICE Electron. Express, 2018

Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs.

Proceedings of the IEEE International Symposium on Circuits and Systems, 2018

Multiple CNN-based Tasks Scheduling across Shared GPU Platform in Research and Development Scenarios.

Proceedings of the 20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems, 2018

High performance graph analytics with productivity on hybrid CPU-GPU platforms.

Proceedings of the 2nd International Conference on High Performance Compilation, 2018

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA.

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018

2017

A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC.

Sci. Program., 2017

Exploiting a depth context model in visual tracking with correlation filter.

Frontiers Inf. Technol. Electron. Eng., 2017

Applying Detection Proposals to Visual Tracking for Scale and Aspect Ratio Adaptability.

Int. J. Comput. Vis., 2017

FPGA-accelerated deep convolutional neural networks for high throughput and energy efficiency.

Concurr. Comput. Pract. Exp., 2017

High Performance Implementation of 3D Convolutional Neural Networks on a GPU.

Comput. Intell. Neurosci., 2017

Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA.

Proceedings of the Network and Parallel Computing, 2017

RVNet: A fast and high energy efficiency network packet processing system on RISC-V.

Proceedings of the 28th IEEE International Conference on Application-specific Systems, 2017

2016

Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2.

Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2015

An analytical GPU performance model for 3D stencil computations from the angle of data traffic.

J. Supercomput., 2015

Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations.

Frontiers Inf. Technol. Electron. Eng., 2015

Towards simulation of subcellular calcium dynamics at nanometre resolution.

Int. J. High Perform. Comput. Appl., 2015

Enabling a Uniform OpenCL Device View for Heterogeneous Platforms.

IEICE Trans. Inf. Syst., 2015

Communication-hiding programming for clusters with multi-coprocessor nodes.

Concurr. Comput. Pract. Exp., 2015

Fast tracking via context depth model learning.

Proceedings of the 2015 IEEE International Conference on Image Processing, 2015

Unified Virtual Memory Support for Deep CNN Accelerator on SoC FPGA.

Proceedings of the Algorithms and Architectures for Parallel Processing, 2015

Enable Scale and Aspect Ratio Adaptability in Visual Tracking with Detection Proposals.

Proceedings of the British Machine Vision Conference 2015, 2015

2014

High efficient sedimentary basin simulations on hybrid CPU-GPU clusters.

Clust. Comput., 2014

Utilizing Multiple Xeon Phi Coprocessors on One Compute Node.

Proceedings of the Algorithms and Architectures for Parallel Processing, 2014

Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs.

Proceedings of the Euro-Par 2014 Parallel Processing, 2014

2013

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration.

J. Supercomput., 2013

Resource-efficient utilization of CPU/GPU-based heterogeneous supercomputers for Bayesian phylogenetic inference.

J. Supercomput., 2013

Simulating Cardiac Electrophysiology in the Era of GPU-Cluster Computing.

IEICE Trans. Inf. Syst., 2013

On the GPU Performance of 3D Stencil Computations Implemented in OpenCL.

Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

On the GPU-CPU Performance Portability of OpenCL for 3D Stencil Computations.

Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems, 2013

Performance of Sediment Transport Simulations on NVIDIA's Kepler Architecture.

Proceedings of the International Conference on Computational Science, 2013

Solving the Cardiac Model Using Multi-core CPU and Many Integrated Cores (MIC).

Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

Automatic Mapping Single-Device OpenCL Program to Heterogeneous Multi-device Platform.

Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

ACF: Networks-on-Chip Deadlock Recovery with Accurate Detection and Elastic Credit.

Proceedings of the Advanced Parallel Processing Technologies, 2013

2012

Improving Performance of GPU Specific OpenCL Program on CPUs.

Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

A Parallel H.264 Encoder with CUDA: Mapping and Evaluation.

Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

Parallelization Design of Irregular Algorithms of Video Processing on GPUs.

Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, 2012

Extending BORPH for shared memory reconfigurable computers.

Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012

The masala machine: accelerating thread-intensive and explicit memory management programs with dynamically reconfigurable FPGAs (abstract only).

Proceedings of the ACM/SIGDA 20th International Symposium on Field Programmable Gate Arrays, 2012

Using 1000+ GPUs and 10000+ CPUs for Sedimentary Basin Simulations.

Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011

Tiled Multi-Core Stream Architecture.

Trans. High Perform. Embed. Archit. Compil., 2011

Cross-Market Financial Risk Analysis: an Agent-Based Computational Finance.

Int. J. Inf. Technol. Decis. Mak., 2011

A high-efficient software parallel CAVLC encoder based on GPU.

Proceedings of the 34th International Conference on Telecommunications and Signal Processing (TSP 2011), 2011

High-efficient software parallel CAVLC encoder based on programmable stream processor.

Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28, 2011

A Multilevel Parallel Intra Coding for H.264/AVC Based on CUDA.

Proceedings of the Sixth International Conference on Image and Graphics, 2011

2010

A Parallel Streaming Motion Estimation for Real-Time HD H.264 Encoding on Programmable Processors.

Proceedings of the Fifth International Conference on Frontier of Computer Science and Technology, 2010

Software Managed Instruction Scratchpad Memory Optimization in Stream Architecture Based on Hot Code Analysis of Kernels.

Proceedings of the 13th Euromicro Conference on Digital System Design, 2010

SAT: A Stream Architecture Template for Embedded Applications.

Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

2009

Streaming HD H.264 encoder on programmable processors.

Proceedings of the 17th International Conference on Multimedia 2009, 2009

Cache streamization for high performance stream processor.

Proceedings of the 16th International Conference on High Performance Computing, 2009

Software parallel CAVLC encoder based on stream processing.

Proceedings of the 7th IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia, 2009

2008

On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator.

IEEE Micro, 2008

Load scheduling: Reducing pressure on distributed register files for free.

Proceedings of the 13th Asia South Pacific Design Automation Conference, 2008

FPGA-based Equivalent Simulation Technology (FEST) for clustered stream architecture.

Proceedings of the 13th Asia-Pacific Computer Systems Architecture Conference, 2008

2007

FT64: Scientific Computing with Streams.

Proceedings of the High Performance Computing, 2007

A Stream System-on-Chip Architecture for High Speed Target Recognition Based on Biologic Vision.

Proceedings of the Advances in Computer Systems Architecture, 2007

2006

Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

Optimization and Evaluating of StreamYGX2 on MASA Stream Processor.

Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

Analysis and Performance Results of a fluid dynamics Application on MASA Stream Processor.

Proceedings of the 5th Annual IEEE/ACIS International Conference on Computer and Information Science (ICIS 2006) and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, 2006

2005

Multiple-Morphs Adaptive Stream Architecture.

J. Comput. Sci. Technol., 2005

Accelerated Motion Estimation of H.264 on Imagine Stream Processor.
