Toward Efficient Online Scheduling for Distributed Machine Learning Systems.
IEEE Trans. Netw. Sci. Eng., 2022
On scheduling ring-all-reduce learning jobs in multi-tenant GPU clusters with communication contention.
Proceedings of MobiHoc '22: The Twenty-third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, Seoul, Republic of Korea, October 17, 2022
GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs.
Proceedings of IEEE INFOCOM 2022, 2022
A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling.
Proceedings of the 40th IEEE Conference on Computer Communications, 2021