Mei Wen

Orcid: 0000-0002-5875-3297

According to our database, Mei Wen authored at least 110 papers between 2004 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core.
IEEE Trans. Very Large Scale Integr. Syst., September, 2024

ESEN: Efficient GPU sharing of Ensemble Neural Networks.
Neurocomputing, 2024

Enhancing the PE Utilization for Multi-Precision Systolic Array via Optimizing Computation Latency.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2024

BitShare: An Efficient Precision-Scalable Accelerator with Combining-Like-Terms GEMM.
Proceedings of the 35th IEEE International Conference on Application-specific Systems, 2024

2023
Releasing the Potential of Tensor Core for Unstructured SpMM using Tiled-CSR Format.
Proceedings of the 41st IEEE International Conference on Computer Design, 2023

Automatic End-to-End Joint Optimization for Kernel Compilation on DSPs.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

2022
TILE-SIM: A Systematic Approach to Systolic Array-based Accelerator Evaluation.
Proceedings of the International IEEE Symposium on Performance Analysis of Systems and Software, 2022

S-SIM: A Simulator for Systolic Array-based DNN Accelerators with Tile Access Awareness.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2022

Mentha: Enabling Sparse-Packing Computation on Systolic Arrays.
Proceedings of the 51st International Conference on Parallel Processing, 2022

BP-Im2col: Implicit Im2col Supporting AI Backpropagation on Systolic Arrays.
Proceedings of the IEEE 40th International Conference on Computer Design, 2022

Light: A Component Enhances Faster and More Accurate Traffic Measurement.
Proceedings of the IEEE International Conference on Communications, 2022

CORF: Bridging the Gap of Complex Operator Fusion for Faster DNN Inference.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

MZ Core: An Enhanced Matrix Acceleration Engine for HPC/AI Applications.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

Exploring ILP for VLIW Architecture by Quantified Modeling and Dynamic Programming-Based Instruction Scheduling.
Proceedings of the 27th Asia and South Pacific Design Automation Conference, 2022

2021
Sustaining Consumer Trust and Continuance Intention by Institutional Mechanisms: An Empirical Survey of DiDi in China.
IEEE Access, 2021

Automatic mapping and code optimization for OpenCL kernels on FT-matrix architecture (WIP paper).
Proceedings of the LCTES '21: 22nd ACM SIGPLAN/SIGBED International Conference on Languages, 2021

sRouting: Towards a Better Flow Size Estimation Performance through Routing and Sketch Configuration.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

SAI: Self-Adjusting Incremental Quantile Estimation for Sparse Training of Neural Networks on Hardware Accelerators.
Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

Embrace the Conflicts: Exploring the Integration of Single Port Memory in Systolic Array-based Accelerators.
Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

2020
Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.
IEEE Trans. Parallel Distributed Syst., 2020

Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Efficient Parallel TLD on CPU-GPU Platform for Real-Time Tracking.
KSII Trans. Internet Inf. Syst., 2020

P4 to FPGA: A Fast Approach for Generating Efficient Network Processors.
IEEE Access, 2020

Incremental Deployment of Programmable Switches for Sketch-based Network Measurement.
Proceedings of the IEEE Symposium on Computers and Communications, 2020

Towards High-Efficiency Data Centers via Job-Aware Network Scheduling.
Proceedings of the ICPP 2020: 49th International Conference on Parallel Processing, 2020

HybridSketch: A Memory-centric Precise Approach for Flow Measurement.
Proceedings of the 2020 IEEE International Conference on Communications, 2020

Optimized HybridSketch: More Efficient with Analysis and Algorithm.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Towards a Deep-Pipelined Architecture for Accelerating Deep GCN on a Multi-FPGA Platform.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Scalable FPGA-based Architecture for High-Performance Per-Flow Traffic Measurement.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

Towards Memory-Efficient Streaming Processing with Counter-Cascading Sketching on FPGA.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

2019
A Fast Approach for Generating Efficient Parsers on FPGAs.
Symmetry, 2019

Metaflow: A DAG-Based Network Abstraction for Distributed Applications.
CoRR, 2019

Application-Oriented Network Scheduling With Metaflow.
IEEE Access, 2019

Interleaved Sketch: Toward Consistent Network Telemetry for Commodity Programmable Switches.
IEEE Access, 2019

KVSwitch: An In-network Load Balancer for Key-Value Stores.
Proceedings of the 2019 IEEE Symposium on Computers and Communications, 2019

Towards a Uniform Architecture for the Efficient Implementation of 2D and 3D Deconvolutional Neural Networks on FPGAs.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2019

Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters.
Proceedings of the IEEE INFOCOM 2019, 2019

Poster Abstract: A Template-based Framework for Generating Network Processor in FPGA.
Proceedings of the IEEE INFOCOM 2019, 2019

An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform.
Proceedings of the 48th International Conference on Parallel Processing, 2019

Metaflow: A Better Traffic Abstraction for Distributed Applications.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

TBSW: Time-Based Sliding Window Algorithm for Network Traffic Measurement.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

SACC: Configuring Application-Level Cache Intelligently for In-Memory Database Based on Long Short-Term Memory.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

SWAP: a sliding window algorithm for in-network packet measurement.
Proceedings of the 3rd International Conference on High Performance Compilation, 2019

Accelerating 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System.
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019

GENIE: QoS-guided Dynamic Scheduling for CNN-based Tasks on SME Clusters.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

Scale-out Acceleration for 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

2018
HPGraph: High-Performance Graph Analytics with Productivity on the GPU.
Sci. Program., 2018

MALMM: A multi-array architecture for large-scale matrix multiplication on FPGA.
IEICE Electron. Express, 2018

Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2018

Multiple CNN-based Tasks Scheduling across Shared GPU Platform in Research and Development Scenarios.
Proceedings of the 20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems, 2018

High performance graph analytics with productivity on hybrid CPU-GPU platforms.
Proceedings of the 2nd International Conference on High Performance Compilation, 2018

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA.
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018

2017
A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC.
Sci. Program., 2017

Exploiting a depth context model in visual tracking with correlation filter.
Frontiers Inf. Technol. Electron. Eng., 2017

Applying Detection Proposals to Visual Tracking for Scale and Aspect Ratio Adaptability.
Int. J. Comput. Vis., 2017

FPGA-accelerated deep convolutional neural networks for high throughput and energy efficiency.
Concurr. Comput. Pract. Exp., 2017

High Performance Implementation of 3D Convolutional Neural Networks on a GPU.
Comput. Intell. Neurosci., 2017

Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA.
Proceedings of the Network and Parallel Computing, 2017

RVNet: A fast and high energy efficiency network packet processing system on RISC-V.
Proceedings of the 28th IEEE International Conference on Application-specific Systems, 2017

2016
Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2015
An analytical GPU performance model for 3D stencil computations from the angle of data traffic.
J. Supercomput., 2015

Improving performance portability for GPU-specific OpenCL kernels on multi-core/many-core CPUs by analysis-based transformations.
Frontiers Inf. Technol. Electron. Eng., 2015

Towards simulation of subcellular calcium dynamics at nanometre resolution.
Int. J. High Perform. Comput. Appl., 2015

Enabling a Uniform OpenCL Device View for Heterogeneous Platforms.
IEICE Trans. Inf. Syst., 2015

Communication-hiding programming for clusters with multi-coprocessor nodes.
Concurr. Comput. Pract. Exp., 2015

Fast tracking via context depth model learning.
Proceedings of the 2015 IEEE International Conference on Image Processing, 2015

Unified Virtual Memory Support for Deep CNN Accelerator on SoC FPGA.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2015

Enable Scale and Aspect Ratio Adaptability in Visual Tracking with Detection Proposals.
Proceedings of the British Machine Vision Conference 2015, 2015

2014
High efficient sedimentary basin simulations on hybrid CPU-GPU clusters.
Clust. Comput., 2014

Utilizing Multiple Xeon Phi Coprocessors on One Compute Node.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2014

Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs.
Proceedings of the Euro-Par 2014 Parallel Processing, 2014

2013
Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration.
J. Supercomput., 2013

Resource-efficient utilization of CPU/GPU-based heterogeneous supercomputers for Bayesian phylogenetic inference.
J. Supercomput., 2013

Simulating Cardiac Electrophysiology in the Era of GPU-Cluster Computing.
IEICE Trans. Inf. Syst., 2013

On the GPU Performance of 3D Stencil Computations Implemented in OpenCL.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

On the GPU-CPU Performance Portability of OpenCL for 3D Stencil Computations.
Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems, 2013

Performance of Sediment Transport Simulations on NVIDIA's Kepler Architecture.
Proceedings of the International Conference on Computational Science, 2013

Solving the Cardiac Model Using Multi-core CPU and Many Integrated Cores (MIC).
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

Automatic Mapping Single-Device OpenCL Program to Heterogeneous Multi-device Platform.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

ACF: Networks-on-Chip Deadlock Recovery with Accurate Detection and Elastic Credit.
Proceedings of the Advanced Parallel Processing Technologies, 2013

2012
Improving Performance of GPU Specific OpenCL Program on CPUs.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

A Parallel H.264 Encoder with CUDA: Mapping and Evaluation.
Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

Parallelization Design of Irregular Algorithms of Video Processing on GPUs.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, 2012

Extending BORPH for shared memory reconfigurable computers.
Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012

The masala machine: accelerating thread-intensive and explicit memory management programs with dynamically reconfigurable FPGAs (abstract only).
Proceedings of the ACM/SIGDA 20th International Symposium on Field Programmable Gate Arrays, 2012

Using 1000+ GPUs and 10000+ CPUs for Sedimentary Basin Simulations.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011
Tiled Multi-Core Stream Architecture.
Trans. High Perform. Embed. Archit. Compil., 2011

Cross-Market Financial Risk Analysis: an Agent-Based Computational Finance.
Int. J. Inf. Technol. Decis. Mak., 2011

A high-efficient software parallel CAVLC encoder based on GPU.
Proceedings of the 34th International Conference on Telecommunications and Signal Processing (TSP 2011), 2011

High-efficient software parallel CAVLC encoder based on programmable stream processor.
Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28, 2011

A Multilevel Parallel Intra Coding for H.264/AVC Based on CUDA.
Proceedings of the Sixth International Conference on Image and Graphics, 2011

2010
A Parallel Streaming Motion Estimation for Real-Time HD H.264 Encoding on Programmable Processors.
Proceedings of the Fifth International Conference on Frontier of Computer Science and Technology, 2010

Software Managed Instruction Scratchpad Memory Optimization in Stream Architecture Based on Hot Code Analysis of Kernels.
Proceedings of the 13th Euromicro Conference on Digital System Design, 2010

SAT: A Stream Architecture Template for Embedded Applications.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

2009
Streaming HD H.264 encoder on programmable processors.
Proceedings of the 17th International Conference on Multimedia 2009, 2009

Cache streamization for high performance stream processor.
Proceedings of the 16th International Conference on High Performance Computing, 2009

Software parallel CAVLC encoder based on stream processing.
Proceedings of the 7th IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia, 2009

2008
On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator.
IEEE Micro, 2008

Load scheduling: Reducing pressure on distributed register files for free.
Proceedings of the 13th Asia South Pacific Design Automation Conference, 2008

FPGA-based Equivalent Simulation Technology (FEST) for clustered stream architecture.
Proceedings of the 13th Asia-Pacific Computer Systems Architecture Conference, 2008

2007
FT64: Scientific Computing with Streams.
Proceedings of the High Performance Computing, 2007

A Stream System-on-Chip Architecture for High Speed Target Recognition Based on Biologic Vision.
Proceedings of the Advances in Computer Systems Architecture, 2007

2006
Register Allocation on Stream Processor with Local Register File.
Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

Optimization and Evaluating of StreamYGX2 on MASA Stream Processor.
Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

Analysis and Performance Results of a fluid dynamics Application on MASA Stream Processor.
Proceedings of the 5th Annual IEEE/ACIS International Conference on Computer and Information Science (ICIS 2006) and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, 2006

2005
Multiple-Morphs Adaptive Stream Architecture.
J. Comput. Sci. Technol., 2005

Accelerated Motion Estimation of H.264 on Imagine Stream Processor.
Proceedings of the Image Analysis and Recognition, Second International Conference, 2005

A Stream Architecture Supporting Multiple Stream Execution Models.
Proceedings of the Advances in Computer Systems Architecture, 10th Asia-Pacific Conference, 2005

2004
A Parallel Reed-Solomon Decoder on the Imagine Stream Processor.
Proceedings of the Parallel and Distributed Processing and Applications, 2004

Multiple-Dimension Scalable Adaptive Stream Architecture.
Proceedings of the Advances in Computer Systems Architecture, 9th Asia-Pacific Conference, 2004
