2024
Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization.
CoRR, 2024

FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment.
CoRR, 2024

DataSculpt: Crafting Data Landscapes for LLM Post-Training through Multi-objective Partitioning.
CoRR, 2024

BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline.
CoRR, 2024

PQCache: Product Quantization-based KVCache for Long Context LLM Inference.
CoRR, 2024

Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs.
CoRR, 2024

Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge.
CoRR, 2024

Enabling Parallelism Hot Switching for Efficient Training of Large Language Models.
Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 2024

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing.
Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), 2024

2023
Hetu: a highly efficient automatic parallel distributed deep learning system.
Sci. China Inf. Sci., 2023

Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent.
Proc. VLDB Endow., 2023

FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement.
Proc. ACM Manag. Data, 2023

Improving Automatic Parallel Training via Balanced Memory Workload Optimization.
CoRR, 2023

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning.
Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023

2022
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism.
Proc. VLDB Endow., 2022

OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning.
CoRR, 2022

HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System.
CoRR, 2022

HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training.
Proceedings of the SIGMOD '22: International Conference on Management of Data, 2022

TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

2021
HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework.
Proc. VLDB Endow., 2021

Dense-to-Sparse Gate for Mixture-of-Experts.
CoRR, 2021

Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce.
Proceedings of the SIGMOD '21: International Conference on Management of Data, 2021