2025

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler.

Size Zheng

Wenlei Bao

CoRR, April, 2025

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism.

CoRR, April, 2025

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.

CoRR, March, 2025

Minder: Faulty Machine Detection for Large-scale Distributed Model Training.

Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025

2024

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.

Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

2022

Collie: Finding Performance Anomalies in RDMA Subsystems.

Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, 2022

2020

EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform.

Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

2016

RDMA over Commodity Ethernet at Scale.

Proceedings of the ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22-26, 2016