Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler.
CoRR, April, 2025
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism.
CoRR, April, 2025
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives.
CoRR, March, 2025
Minder: Faulty Machine Detection for Large-scale Distributed Model Training.
Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation, 2025
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.
Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, 2024
Collie: Finding Performance Anomalies in RDMA Subsystems.
Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, 2022
EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020
RDMA over Commodity Ethernet at Scale.
Proceedings of the ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22-26, 2016