2025

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage.

Ke Hong

Lufang Chen

CoRR, April, 2025

FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation.

CoRR, April, 2025

2024

FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.

Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024

FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning.

Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024

2023

FlashDecoding++: Faster Large Language Model Inference on GPUs.

CoRR, 2023