Tao Tang

Orcid: 0009-0009-2883-6997

  • National University of Defense Technology, College of Computer, Changsha, China (PhD 2011)

According to our database, Tao Tang authored at least 55 papers between 2007 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.



SNCL: a supernode OpenCL implementation for hybrid computing arrays.
J. Supercomput., May, 2024

Optimizing General Matrix Multiplications on Modern Multi-core DSPs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Optimizing Stencil Computation on Multi-core DSPs.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

A Motion Trace Decomposition-based overset grid method for parallel CFD simulations with moving boundaries.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

VLASPH: Smoothed Particle Hydrodynamics on VLA SIMD Architectures.
Proceedings of the Euro-Par 2024: Parallel Processing, 2024

Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000.
Frontiers Inf. Technol. Electron. Eng., 2023

Optimizing Direct Convolutions on ARM Multi-Cores.
Proceedings of the International Conference for High Performance Computing, 2023

MT-3000: a heterogeneous multi-zone processor for HPC.
CCF Trans. High Perform. Comput., 2022

VISPR-online: a web-based interactive tool to visualize CRISPR screening experiments.
BMC Bioinform., 2021

Large-Scale Parallel Alignment Algorithm for SMRT Reads.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2021

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures.
IEEE Trans. Parallel Distributed Syst., 2020

clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization.
Future Gener. Comput. Syst., 2020

Parallel Programming Models for Heterogeneous Many-Cores: A Survey.
CoRR, 2020

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach.
CoRR, 2020

Parallel programming models for heterogeneous many-cores: a comprehensive survey.
CCF Trans. High Perform. Comput., 2020

Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

Orchestrating parallel detection of strongly connected components on GPUs.
Parallel Comput., 2018

Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach.
CoRR, 2018

Auto-tuning Streamed Applications on Intel Xeon Phi.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

MOCL: an efficient OpenCL implementation for the Matrix-2000 architecture.
Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018

Implementation and Performance Evaluation of Recommender Algorithms Based on Multi-/Many-core Platforms.
计算机科学 (Computer Science), 2017

Performance Analysis of GPU Programs Towards Better Memory Hierarchy Design.
计算机科学 (Computer Science), 2017

Efficient and high-quality sparse graph coloring on GPUs.
Concurr. Comput. Pract. Exp., 2017

LU factorization on heterogeneous systems: an energy-efficient approach towards high performance.
Computing, 2017

High Performance Detection of Strongly Connected Components in Sparse Graphs on GPUs.
Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, 2017

Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU.
Proceedings of the 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), 2017

Efficient and Portable ALS Matrix Factorization for Recommender Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

High Performance Coordinate Descent Matrix Factorization for Recommender Systems.
Proceedings of the Computing Frontiers Conference, 2017

Evaluating Multiple Streams on Heterogeneous Platforms.
Parallel Process. Lett., 2016

Streaming Applications on Heterogeneous Platforms.
Proceedings of the Network and Parallel Computing, 2016

Evaluating the Performance Impact of Multiple Streams on the MIC-Based Heterogeneous Platform.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

High Performance Parallel Graph Coloring on GPGPUs.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

An Energy-Efficient Implementation of LU Factorization on Heterogeneous Systems.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

An Efficient Clique-Based Algorithm of Compute Nodes Allocation for In-memory Checkpoint System.
Proceedings of the High Performance Computing - 30th International Conference, 2015

OpenMC: Towards Simplifying Programming for TianHe Supercomputers.
J. Comput. Sci. Technol., 2014

Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system.
J. Parallel Distributed Comput., 2013

OpenACC to Intel Offload: Automatic Translation and Optimization.
Proceedings of the Computer Engineering and Technology - 17th CCF Conference, 2013

MIC acceleration of short-range molecular dynamics simulations.
Proceedings of the First International Workshop on Code Optimisation for Multi and Many Cores, 2013

MPtostream: an OpenMP compiler for CPU-GPU heterogeneous parallel systems.
Sci. China Inf. Sci., 2012

Power Optimization for GPU Programs Based on Software Prefetching.
Proceedings of the IEEE 10th International Conference on Trust, 2011

Cache Miss Analysis for GPU Programs Based on Stack Distance Profile.
Proceedings of the 2011 International Conference on Distributed Computing Systems, 2011

Optimization and Implementation of LBM Benchmark on Multithreaded GPU.
Proceedings of the International Conference on Data Storage and Data Engineering, 2010

Improving scratchpad allocation with demand-driven data tiling.
Proceedings of the 2010 International Conference on Compilers, 2010

Optimizing Stencil Application on Multi-thread GPU Architecture Using Stream Programming Model.
Proceedings of the Architecture of Computing Systems, 2010

A Data Communication Scheduler for Stream Programs on CPU-GPU Platform.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

SRF Coloring: Stream Register File Allocation via Graph Coloring.
J. Comput. Sci. Technol., 2009

Program Optimization of Stencil Based Application on the GPU-Accelerated System.
Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2009

Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+.
Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009

Optimizing scientific application loops on stream processors.
Proceedings of the 2008 ACM SIGPLAN/SIGBED Conference on Languages, 2008

Model-guided strip size selection for minimal execution time on Imagine stream processor.
Proceedings of 8th IEEE International Conference on Computer and Information Technology, 2008

Implementation and Optimization of Dense LU Decomposition on the Stream Processor.
Proceedings of the Parallel Processing and Applied Mathematics, 2007

Implementation and Optimization of Sparse Matrix-Vector Multiplication on Imagine Stream Processor.
Proceedings of the Parallel and Distributed Processing and Applications, 2007

Architecture-Based Optimization for Mapping Scientific Applications to Imagine.
Proceedings of the Parallel and Distributed Processing and Applications, 2007

Evaluation of Transcendental Functions on Imagine Architecture.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

Implementation and Evaluation of Jacobi Iteration on the Imagine Stream Processor.
Proceedings of the High Performance Computing, 2007
