Shigang Li

ORCID: 0000-0003-0022-7865

Affiliations:
  • Chinese Academy of Sciences, Institute of Computing Technology, Beijing, China
  • University of Science and Technology Beijing, China (PhD 2014)


According to our database, Shigang Li authored at least 60 papers between 2010 and 2024.

Bibliography

2024
AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost.
IEEE Trans. Parallel Distributed Syst., August, 2024

POSTER: ParGNN: Efficient Training for Large-Scale Graph Neural Network on GPU Clusters.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

2023
AGCM-3DLF: Accelerating Atmospheric General Circulation Model via 3-D Parallelization and Leap-Format.
IEEE Trans. Parallel Distributed Syst., March, 2023

ASDL: A Unified Interface for Gradient Preconditioning in PyTorch.
CoRR, 2023

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication.
CoRR, 2023

Large-Scale Simulation of Structural Dynamics Computing on GPU Clusters.
Proceedings of the International Conference for High Performance Computing, 2023

ANT-MOC: Scalable Neutral Particle Transport Using 3D Method of Characteristics on Multi-GPU Systems.
Proceedings of the International Conference for High Performance Computing, 2023

Co-design Hardware and Algorithm for Vector Search.
Proceedings of the International Conference for High Performance Computing, 2023

A Scalable Hybrid Total FETI Method for Massively Parallel FEM Simulations.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

Asynch-SGBDT: Train Stochastic Gradient Boosting Decision Trees in an Asynchronous Parallel Manner.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

2022
VenusAI: An artificial intelligence platform for scientific discovery on supercomputers.
J. Syst. Archit., 2022

Efficient Quantized Sparse Matrix Operations on Tensor Cores.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

HammingMesh: A Network Topology for Large-Scale Deep Learning.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Near-optimal sparse allreduce for distributed deep learning.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

A data-centric optimization framework for machine learning.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

2021
Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging.
IEEE Trans. Parallel Distributed Syst., 2021

Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms.
IEEE Trans. Parallel Distributed Syst., 2021

Flare: flexible in-network allreduce.
Proceedings of the International Conference for High Performance Computing, 2021

Chimera: efficiently training large-scale neural networks with bidirectional pipelines.
Proceedings of the International Conference for High Performance Computing, 2021

Asynchronous Decentralized SGD with Quantized and Local Updates.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Data Movement Is All You Need: A Case Study on Optimizing Transformers.
Proceedings of the Fourth Conference on Machine Learning and Systems, 2021

2020
FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations.
J. Supercomput., 2020

WP-SGD: Weighted parallel SGD for distributed unbalanced-workload training system.
J. Parallel Distributed Comput., 2020

The static parallel distribution algorithms for hybrid density-functional calculations in HONPAS package.
Int. J. High Perform. Comput. Appl., 2020

Deep Learning for Post-Processing Ensemble Weather Forecasts.
CoRR, 2020

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging.
CoRR, 2020

Taming unbalanced training workloads in deep learning with partial collective operations.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

A Highly Efficient Dynamical Core of Atmospheric General Circulation Model based on Leap-Format.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

2019
Correction to: FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations.
J. Supercomput., 2019

Efficient parallel optimizations of a high-performance SIFT on GPUs.
J. Parallel Distributed Comput., 2019

Predicting Weather Uncertainty with Deep Convnets.
CoRR, 2019

The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters.
CoRR, 2019

OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight.
Proceedings of the International Conference for High Performance Computing, 2019

swMD: Performance Optimizations for Molecular Dynamics Simulation on Sunway Taihulight.
Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2019

Using Gradient Based Multikernel Gaussian Process and Meta-Acquisition Function to Accelerate SMBO.
Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence, 2019

2018
Cache-Oblivious MPI All-to-All Communications Based on Morton Order.
IEEE Trans. Parallel Distributed Syst., 2018

Using Known Information to Accelerate HyperParameters Optimization Based on SMBO.
CoRR, 2018

Asynchronous Parallel Sampling Gradient Boosting Decision Tree.
CoRR, 2018

Communication-Avoiding for Dynamical Core of Atmospheric General Circulation Model.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Massively Scaling the Metal Microscopic Damage Simulation on Sunway TaihuLight Supercomputer.
Proceedings of the 47th International Conference on Parallel Processing, 2018

AGCM3D: A Highly Scalable Finite-Difference Dynamical Core of Atmospheric General Circulation Model Based on 3D Decomposition.
Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems, 2018

2017
Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation.
Comput. Phys. Commun., 2017

Kernel optimization for short-range molecular dynamics.
Comput. Phys. Commun., 2017

Asynchronous COMID: the theoretic basis for transmitted data sparsification tricks on Parameter Server.
CoRR, 2017

Weighted parallel SGD for distributed unbalanced-workload training system.
CoRR, 2017

POSTER: Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

2016
A Cross-Platform SpMV Framework on Many-Core Architectures.
ACM Trans. Archit. Code Optim., 2016

Parallel Processing Systems for Big Data: A Survey.
Proc. IEEE, 2016

2015
Automatic tuning of sparse matrix-vector multiplication on multicore clusters.
Sci. China Inf. Sci., 2015

Fast Convolution Operations on Many-Core Architectures.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014
Improved MPI collectives for MPI processes in shared address spaces.
Clust. Comput., 2014

2013
Asynchronous Work Stealing on Distributed Memory Systems.
Proceedings of the 21st Euromicro International Conference on Parallel, 2013

NUMA-aware shared-memory collective communication for MPI.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

2011
Management of Non-functional Attributes of Parallel Components.
Proceedings of the International Conference on Computational Science, 2011

Extending Synchronization Constructs in OpenMP to Exploit Pipeline Parallelism on Heterogeneous Multi-core.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2011

Scheduling Multi-paradigm and Multi-grain Parallel Components on Heterogeneous Platforms.
Proceedings of the Sixth Chinagrid Annual Conference, ChinaGrid 2011, Dalian, Liaoning, 2011

2010
Support for OpenMP Tasks on Cell Architecture.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2010

