2025
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler.
CoRR, April, 2025

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism.
CoRR, April, 2025

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.
CoRR, March, 2025

Minder: Faulty Machine Detection for Large-scale Distributed Model Training.
Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025

2024
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024

2022
Collie: Finding Performance Anomalies in RDMA Subsystems.
Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, 2022

2020
EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

2016
RDMA over Commodity Ethernet at Scale.
Proceedings of the ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22-26, 2016