semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage.
CoRR, April 2025
FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation.
CoRR, April 2025
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024
FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
FlashDecoding++: Faster Large Language Model Inference on GPUs.
CoRR, 2023