Guangming Tan

Orcid: 0000-0002-6361-5948

According to our database, Guangming Tan authored at least 154 papers between 2005 and 2024.

Bibliography

2024
Towards connection-scalable RNIC architecture.
J. Supercomput., July, 2024

Fast and scalable all-optical network architecture for distributed deep learning.
J. Opt. Commun. Netw., March, 2024

10-Million Atoms Simulation of First-Principle Package LS3DF.
J. Comput. Sci. Technol., March, 2024

FILL: a heterogeneous resource scheduling system addressing the low throughput problem in GROMACS.
CCF Trans. High Perform. Comput., February, 2024

Special issue of HPCChina 2023.
CCF Trans. High Perform. Comput., February, 2024

Scaling Molecular Dynamics with ab initio Accuracy to 149 Nanoseconds per Day.
CoRR, 2024

Overcoming Memory Constraints in Quantum Circuit Simulation with a High-Fidelity Compression Framework.
CoRR, 2024

JingZhao: A Framework for Rapid NIC Prototyping in the Domain-Specific-Network Era.
CoRR, 2024

POSTER: FineCo: Fine-grained Heterogeneous Resource Management for Concurrent DNN Inferences.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

Exploiting Fine-Grained Redundancy in Set-Centric Graph Pattern Mining.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

Training one DeePMD Model in Minutes: a Step towards Online Learning.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

POSTER: Optimizing Sparse Tensor Contraction with Revisiting Hash Table Design.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

Accelerate Distributed Deep Learning with a Fast Reconfigurable Optical Network.
Proceedings of the Optical Fiber Communications Conference and Exhibition, 2024

A Coordinated Strategy for GNN Combining Computational Graph and Operator Optimizations.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

FNCC: Fast Notification Congestion Control in Data Center Networks.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

ElasticRoom: Multi-Tenant DNN Inference Engine via Co-design with Resource-constrained Compilation and Strong Priority Scheduling.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

Accelerating Large-Scale Sparse LU Factorization for RF Circuit Simulation.
Proceedings of the Euro-Par 2024: Parallel Processing, 2024

BeeZip: Towards An Organized and Scalable Architecture for Data Compression.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023
Editorial for the special issue on architecture, algorithms and applications of high performance sparse matrix computations.
CCF Trans. High Perform. Comput., June, 2023

Adaptive Workload-Balanced Scheduling Strategy for Global Ocean Data Assimilation on Massive GPUs.
Proceedings of the International Conference for High Performance Computing, 2023

Enhance the Strong Scaling of LAMMPS on Fugaku.
Proceedings of the International Conference for High Performance Computing, 2023

Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph.
Proceedings of the 37th International Conference on Supercomputing, 2023

GraphPar: Efficient Workload-Aware Subgraph Matching System on Multiple GPUs.
Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems, 2023

JetEsti: A New DLT Job Scheduling Simulator Based on Fine-Grained Process Modeling.
Proceedings of the 43rd IEEE International Conference on Distributed Computing Systems, 2023

DeletePop: A DLT Execution Time Predictor Based on Comprehensive Modeling.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2023

NvWa: Enhancing Sequence Alignment Accelerator Throughput via Hardware Scheduling.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

Integrative Drug Discovery Platform: A Modular Approach for Efficient and Automated Virtual Screening.
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2023

RLEKF: An Optimizer for Deep Potential with Ab Initio Accuracy.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022
A Pattern-Based SpGEMM Library for Multi-Core and Many-Core Architectures.
IEEE Trans. Parallel Distributed Syst., 2022

Double precision is not necessary for LSQR for solving discrete linear ill-posed problems.
CoRR, 2022

Backward error analysis of the Lanczos bidiagonalization with reorthogonalization.
CoRR, 2022

Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms.
CoRR, 2022

Fast and accurate variable batch size convolution neural network training on large scale distributed systems.
Concurr. Comput. Pract. Exp., 2022

Improvement of AI forecast of gridded PM2.5 forecast in China through ConvLSTM and Attention.
CCF Trans. High Perform. Comput., 2022

W-Cycle SVD: A Multilevel Algorithm for Batched SVD on GPUs.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

2.5 Million-Atom Ab Initio Electronic-Structure Simulation of Complex Metallic Heterostructures with DGDFT.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

A W-cycle algorithm for efficient batched SVD on GPUs.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

Extending the limit of molecular dynamics with ab initio accuracy to 10 billion atoms.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

CSAM: A Channel and Spatial Attention Mechanism for Impervious Surface Extraction in Difficult Areas.
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2022

MegTaiChi: dynamic tensor-based memory management optimization for DNN training.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs.
Proceedings of the 51st International Conference on Parallel Processing, 2022

csRNA: Connection-Scalable RDMA NIC Architecture in Datacenter Environment.
Proceedings of the IEEE 40th International Conference on Computer Design, 2022

MetaZip: a high-throughput and efficient accelerator for DEFLATE.
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022

2021
Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems.
IEEE Trans. Parallel Distributed Syst., 2021

A New Optoelectronic Hybrid Network Based on Scheduling Optimization of Optical Links.
IEEE Trans. Computers, 2021

PIM-Align: A Processing-in-Memory Architecture for FM-Index Search Algorithm.
J. Comput. Sci. Technol., 2021

Guest Editorial: Special issue on Network and Parallel Computing for Emerging Architectures and Applications.
Int. J. Parallel Program., 2021

High-performance Migration Tool for Live Container in a Workflow.
Int. J. Parallel Program., 2021

Editorial for the special issue on large-scale AI in classical HPC environment and AI for science.
CCF Trans. High Perform. Comput., 2021

I/O lower bounds for auto-tuning of convolutions in CNNs.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

A Multi-GPU Design for Large Size Cryo-EM 3D Reconstruction.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Spatio-Temporal Features Processing Network for Change Detection in Remote Sensing Images.
Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2021

Deep Reinforcement Agent for Failure-aware Job scheduling in High-Performance Computing.
Proceedings of the 27th IEEE International Conference on Parallel and Distributed Systems, 2021

Building Agile Workflow Microservice System for HPC Applications Based on Fast-start OSv.
Proceedings of the 27th IEEE International Conference on Parallel and Distributed Systems, 2021

WidePipe: High-Throughput Deep Learning Inference System on a Cluster of Neural Processing Units.
Proceedings of the 39th IEEE International Conference on Computer Design, 2021

2020
Fast Data-Obtaining Algorithm for Data Assimilation with Large Data Set.
Int. J. Parallel Program., 2020

Towards a heterogeneous architecture solver for the incompressible Navier-Stokes equations.
CCF Trans. High Perform. Comput., 2020

Editorial for the special issue on HPC algorithms and applications.
CCF Trans. High Perform. Comput., 2020

Communication Lower Bounds of Convolutions in CNNs.
Proceedings of the SPAA '20: 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 2020

Revisiting linpack algorithm on large-scale CPU-GPU heterogeneous systems.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

FEB<sup>3D</sup>: An Efficient FPGA-Accelerated Compression Framework for Microscopy Images.
Proceedings of the Network and Parallel Computing, 2020

2019
Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication.
Int. J. Parallel Program., 2019

Editorial for the special issue on innovations in supercomputing techniques.
CCF Trans. High Perform. Comput., 2019

Wormhole optical network: a new architecture to solve long diameter problem in exascale computer.
CCF Trans. High Perform. Comput., 2019

S-EnKF: co-designing for scalable ensemble Kalman filter.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

A pattern based algorithmic autotuner for graph processing on GPUs.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Tensor Layout Optimization of Convolution for Inference on Digital Signal Processor.
Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2019

A Variable Batch Size Strategy for Large Scale Distributed DNN Training.
Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2019

IA-SpGEMM: an input-aware auto-tuning framework for parallel sparse matrix-matrix multiplication.
Proceedings of the ACM International Conference on Supercomputing, 2019

T2HT: Traffic-Driven Machine Learning Based Hierarchical Topology Generation Model.
Proceedings of the 25th IEEE International Conference on Parallel and Distributed Systems, 2019

A New Traffic Offloading Method with Slow Switching Optical Device in Exascale Computer.
Proceedings of the 37th IEEE International Conference on Computer Design, 2019

OeIM: An Optoelectronic Interconnection Middleware for the Exascale Computer.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

2018
Quadboost: A Scalable Concurrent Quadtree.
IEEE Trans. Parallel Distributed Syst., 2018

An Autotuning Protocol to Rapidly Build Autotuners.
ACM Trans. Parallel Comput., 2018

Design and Implementation of Adaptive SpMV Library for Multicore and Many-Core Architecture.
ACM Trans. Math. Softw., 2018

Automated and precise event detection method for big data in biomedical imaging with support vector machine.
Comput. Syst. Sci. Eng., 2018

Register-based implementation of the sparse general matrix-matrix multiplication on GPUs.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

High-performance genomic analysis framework with in-memory computing.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Routing and Spectrum Allocation for Time Varying Traffic by Artificial Bee Colony Algorithm in Elastic Optical Networks.
Proceedings of the IEEE International Conference on Parallel & Distributed Processing with Applications, 2018

Communication-Avoiding for Dynamical Core of Atmospheric General Circulation Model.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Accelerating FM-index Search for Genomic Data Processing.
Proceedings of the 47th International Conference on Parallel Processing, 2018

2017
Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

A performance analysis framework for exploiting GPU microarchitectural capability.
Proceedings of the International Conference on Supercomputing, 2017

RING: NUMA-Aware Message-Batching Runtime for Data-Intensive Applications.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017

Quantifying and Mitigating Computational Inefficiency of Genomics Data Analysis.
Proceedings of the 19th IEEE International Conference on High Performance Computing and Communications; 15th IEEE International Conference on Smart City; 3rd IEEE International Conference on Data Science and Systems, 2017

2016
Graphine: Programming Graph-Parallel Computation of Large Natural Graphs for Multicore Clusters.
IEEE Trans. Parallel Distributed Syst., 2016

Accelerating Irregular Computation in Massive Short Reads Mapping on FPGA Co-Processor.
IEEE Trans. Parallel Distributed Syst., 2016

Parallelization of Hydrostatic Numerical Forecasting Model of Marginal Sea (边缘海静力数值预报模式并行算法研究).
计算机科学 (Computer Science), 2016

Locality of Computation for Stencil Optimization.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2016

Accelerating large-scale genomic analysis with Spark.
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2016

2015
SuperDragon: A Heterogeneous Parallel System for Accelerating 3D Reconstruction of Cryo-Electron Microscopy Images.
ACM Trans. Reconfigurable Technol. Syst., 2015

Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance.
Int. J. High Perform. Comput. Appl., 2015

FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Bit Flipping Errors in High Performance Linpack at Exascale and Beyond.
Proceedings of the 44th International Conference on Parallel Processing, 2015

Study on Partitioning Real-World Directed Graphs of Skewed Degree Distribution.
Proceedings of the 44th International Conference on Parallel Processing, 2015

Implementation of Short Read Alignment Algorithm in OpenCL on Xeon Phi Coprocessor.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Application Taxonomy via Algorithmic Commonality for Domain-Specific Architecture Design.
Proceedings of the 22nd IEEE International Conference on High Performance Computing, 2015

A Reliable Distributed Convolutional Neural Network for Biology Image Segmentation.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014
Exploiting fine-grained parallelism in graph traversal algorithms via lock virtualization on multi-core architecture.
J. Supercomput., 2014

Accelerating massive short reads mapping for next generation sequencing (abstract only).
Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2014

Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems.
Proceedings of the 17th IEEE International Conference on Computational Science and Engineering, 2014

Optimizing stencil code via locality of computation.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Scalability study of molecular dynamics simulation on Godson-T many-core architecture.
J. Parallel Distributed Comput., 2013

Optimizing Parallel Sn Sweeps on Unstructured Grids for Multi-Core Clusters.
J. Comput. Sci. Technol., 2013

Understanding parallelism in graph traversal on multi-core clusters.
Comput. Sci. Res. Dev., 2013

GRE: A Graph Runtime Engine for Large-Scale Distributed Graph-Parallel Applications.
CoRR, 2013

A Study of Leveraging Memory Level Parallelism for DRAM System on Multi-core/Many-Core Architecture.
Proceedings of the 12th IEEE International Conference on Trust, 2013

SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013

ParaInsight: An Assistant for Quantitatively Analyzing Multi-granularity Parallel Region.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

Vlock: Lock virtualization mechanism for exploiting fine-grained parallelism in graph traversal algorithms.
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

2012
SMAT: An Input Adaptive Sparse Matrix-Vector Multiplication Auto-Tuner.
CoRR, 2012

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems.
CoRR, 2012

A lightweight hybrid hardware/software approach for object-relative memory profiling.
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Investigating Memory Optimization of Hash-index for Next Generation Sequencing on Multi-core Architecture.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

PDSEC Introduction.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs.
Proceedings of the International Conference on Supercomputing, 2012

A coarse-grained stream architecture for cryo-electron microscopy images 3D reconstruction.
Proceedings of the ACM/SIGDA 20th International Symposium on Field Programmable Gate Arrays, 2012

Accelerating Millions of Short Reads Mapping on a Heterogeneous Architecture with FPGA Accelerator.
Proceedings of the 2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines, 2012

2011
Analysis and performance results of computing betweenness centrality on IBM Cyclops64.
J. Supercomput., 2011

Revisiting Multiple Pattern Matching Algorithms for Multi-Core Architecture.
J. Comput. Sci. Technol., 2011

Dawning Nebulae: A PetaFLOPS Supercomputer with a Heterogeneous Structure.
J. Comput. Sci. Technol., 2011

Numerical assessment of flood hazard risk to people and vehicles in flash floods.
Environ. Model. Softw., 2011

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism.
CoRR, 2011

Fast implementation of DGEMM on Fermi GPU.
Proceedings of the Conference on High Performance Computing Networking, 2011

Poster: revisiting virtual channel memory for performance and fairness on multi-core architecture.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Experience of parallelizing cryo-EM 3D reconstruction on a CPU-GPU heterogeneous system.
Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing, 2011

Building algorithmically nonstop fault tolerant MPI programs.
Proceedings of the 18th International Conference on High Performance Computing, 2011

Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor.
Proceedings of the 8th Conference on Computing Frontiers, 2011

2010
Automatically Tuned Dynamic Programming with an Algorithm-by-Blocks.
Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems, 2010

Preliminary Investigation of Accelerating Molecular Dynamics Simulation on Godson-T Many-Core Processor.
Proceedings of the Euro-Par 2010 Parallel Processing Workshops, 2010

2009
Improving Performance of Dynamic Programming via Parallelism and Locality on Multicore Architectures.
IEEE Trans. Parallel Distributed Syst., 2009

Extending Amdahl's law in the multicore era.
SIGMETRICS Perform. Evaluation Rev., 2009

Characterizing Betweenness Centrality Algorithm on Multi-core Architectures.
Proceedings of the IEEE International Symposium on Parallel and Distributed Processing with Applications, 2009

Single-particle 3d reconstruction from cryo-electron microscopy images on GPU.
Proceedings of the 23rd international conference on Supercomputing, 2009

A Parallel Algorithm for Computing Betweenness Centrality.
Proceedings of the ICPP 2009, 2009

High Performance Matrix Multiplication on Many Cores.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

2008
Experience on optimizing irregular computation for memory hierarchy in manycore architecture.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture.
Proceedings of the Languages and Compilers for Parallel Computing, 2008

2007
Cache oblivious algorithms for nonserial polyadic programming.
J. Supercomput., 2007

Regular Paper: A Study of Architectural Optimization Methods in Bioinformatics Applications.
Int. J. High Perform. Comput. Appl., 2007

A parallel dynamic programming algorithm on a multi-core architecture.
Proceedings of the SPAA 2007: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2007

Implementation of the Smith-Waterman algorithm on a reconfigurable supercomputing platform.
Proceedings of the 1st international workshop on High-performance reconfigurable computing technology and applications, 2007

2006
Improvement of Performance of MegaBlast Algorithm for DNA Sequence Alignment.
J. Comput. Sci. Technol., 2006

Biology - Locality and parallelism optimization for dynamic programming algorithm in bioinformatics.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

An experimental study of optimizing bioinformatics applications.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Improving locality of nonserial polyadic dynamic programming.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Load Balancing and Parallel Multiple Sequence Alignment with Tree Accumulation.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

2005
Load Balancing Algorithm in Cluster-based RNA secondary structure Prediction.
Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC 2005), 2005

An Optimized Algorithm of High Spatial-temporal Efficiency for MegaBlast.
Proceedings of the 11th International Conference on Parallel and Distributed Systems, 2005

An Efficient Dynamic Programming Algorithm and Implementation for RNA Secondary Structure Prediction.
Proceedings of the Computational Science, 2005

Exploiting Parallelization for RNA Secondary Structure Prediction in Cluster.
Proceedings of the Computational Science, 2005

