Hao Zhang

Orcid: 0009-0003-8392-3977

Affiliations:
  • University of California, San Diego, CA, USA
  • University of California, Berkeley, CA, USA


According to our database, Hao Zhang authored at least 24 papers between 2021 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.


Bibliography

2024
Efficient LLM Scheduling by Learning to Rank.
CoRR, 2024

MPC-Minimized Secure LLM Inference.
CoRR, 2024

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU.
CoRR, 2024

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput.
CoRR, 2024

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length.
CoRR, 2024

Toward Inference-optimal Mixture-of-Expert Large Language Models.
CoRR, 2024

MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving.
CoRR, 2024

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.
Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

Online Speculative Decoding.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models.
Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

WiP: Efficient LLM Prefilling with Mobile NPU.
Proceedings of the Workshop on Edge and Mobile Foundation Models, 2024

2023
LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers.
CoRR, 2023

Efficient Memory Management for Large Language Model Serving with PagedAttention.
Proceedings of the 29th Symposium on Operating Systems Principles, 2023

AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving.
Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation, 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

On Optimizing the Communication of Model Parallelism.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023

2022
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning.
CoRR, 2022

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning.
Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, 2022

2021
Simple and Automatic Distributed Machine Learning on Ray.
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), 2021

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models.
Proceedings of the 38th International Conference on Machine Learning, 2021
