Yuxiong He

Orcid: 0000-0001-8887-7752

According to our database, Yuxiong He authored at least 141 papers between 2004 and 2024.

Bibliography

2024
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation.
CoRR, 2024

STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning.
CoRR, 2024

FastPersist: Accelerating Model Checkpointing in Deep Learning.
CoRR, 2024

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design.
CoRR, 2024

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.
CoRR, 2024

Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs.
Proceedings of the 2024 USENIX Annual Technical Conference, 2024

System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, 2024

System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

ZeRO++: Extremely Efficient Collective Communication for Large Model Training.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024

2023
SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Networks.
ACM Trans. Embed. Comput. Syst., March, 2023

ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks.
CoRR, 2023

ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers.
CoRR, 2023

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies.
CoRR, 2023

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
CoRR, 2023

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
CoRR, 2023

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model.
CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats.
CoRR, 2023

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training.
CoRR, 2023

Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important?
CoRR, 2023

A Comprehensive Study on Post-Training Quantization for Large Language Models.
CoRR, 2023

A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.
CoRR, 2023

Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases.
CoRR, 2023

MCR-DL: Mix-and-Match Communication Runtime for Deep Learning.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.
Proceedings of the 37th International Conference on Supercomputing, 2023

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs.
Proceedings of the 37th International Conference on Supercomputing, 2023

Understanding Int4 Quantization for Language Models: Latency Speedup, Composability, and Failure Cases.
Proceedings of the International Conference on Machine Learning, 2023

DySR: Adaptive Super-Resolution via Algorithm and System Co-design.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam.
Proceedings of the Eleventh International Conference on Learning Representations, 2023

Scaling Vision-Language Models with Sparse Mixture of Experts.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

Revisiting the Efficiency-Accuracy Tradeoff in Adapting Transformer Models via Adversarial Fine-Tuning.
Proceedings of the ECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Kraków, Poland, 2023

2022
DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing.
CoRR, 2022

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers.
CoRR, 2022

BiFeat: Supercharge GNN Training via Graph Feature Quantization.
CoRR, 2022

Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding.
CoRR, 2022

Extreme Compression for Pre-trained Transformers Made Simple and Efficient.
CoRR, 2022

ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise.
CoRR, 2022

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model.
CoRR, 2022

GraSP: Optimizing Graph-based Nearest Neighbor Search with Subgraph Sampling and Pruning.
Proceedings of the WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21, 2022

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
Proceedings of the International Conference on Machine Learning, 2022

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

Task Offloading Based on GRU Model in IoT.
Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering, 2022

Adversarial Data Augmentation for Task-Specific Knowledge Distillation of Pre-trained Transformers.
Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, 2022

2021
Scalable and Efficient MoE Training for Multitask Multilingual Models.
CoRR, 2021

Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training.
CoRR, 2021

ZeRO-Offload: Democratizing Billion-Scale Model Training.
Proceedings of the 2021 USENIX Annual Technical Conference, 2021

ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning.
Proceedings of the International Conference for High Performance Computing, 2021

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

NxMTransformer: Semi-Structured Sparsification for Natural Language Understanding via ADMM.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
Proceedings of the 38th International Conference on Machine Learning, 2021

2020
Fast LSTM by dynamic decomposition on cloud and distributed systems.
Knowl. Inf. Syst., 2020

Local trend discovery on real-time microblogs with uncertain locations in tight memory environments.
GeoInformatica, 2020

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm.
CoRR, 2020

Improving Approximate Nearest Neighbor Search through Learned Adaptive Early Termination.
Proceedings of the 2020 International Conference on Management of Data, 2020

ZeRO: memory optimizations toward training trillion parameter models.
Proceedings of the International Conference for High Performance Computing, 2020

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

2019
Communication-Aware Scheduling of Precedence-Constrained Tasks.
SIGMETRICS Perform. Evaluation Rev., 2019

LSTM-Sharp: An Adaptable, Energy-Efficient Hardware Accelerator for Long Short-Term Memory.
CoRR, 2019

ZeRO: Memory Optimization Towards Training A Trillion Parameter Models.
CoRR, 2019

AntMan: Sparse Low-Rank Compression to Accelerate RNN inference.
CoRR, 2019

Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft.
Proceedings of the 2019 USENIX Conference on Operational Machine Learning, 2019

Deep Learning Inference Service at Microsoft.
Proceedings of the 2019 USENIX Conference on Operational Machine Learning, 2019

Fast LSTM Inference by Dynamic Decomposition on Cloud Systems.
Proceedings of the 2019 IEEE International Conference on Data Mining, 2019

GRNN: Low-Latency and Scalable RNN Inference on GPUs.
Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, March 25-28, 2019, 2019

GRIP: Multi-Store Capacity-Optimized High-Performance Nearest Neighbor Search for Vector Search Engine.
Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019

2018
Efficient Deep Neural Network Serving: Fast and Furious.
IEEE Trans. Netw. Serv. Manag., 2018

Stochastic Modeling and Optimization of Stragglers.
IEEE Trans. Cloud Comput., 2018

Zoom: SSD-based Vector Search for Optimizing Accuracy, Latency and Memory.
CoRR, 2018

Better Caching in Search Advertising Systems with Rapid Refresh Predictions.
Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster.
Proceedings of the 2018 USENIX Annual Technical Conference, 2018

Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Learning Intrinsic Sparse Structures within Long Short-Term Memory.
Proceedings of the 6th International Conference on Learning Representations, 2018

2017
Obtaining and Managing Answer Quality for Online Data-Intensive Services.
ACM Trans. Model. Perform. Evaluation Comput. Syst., 2017

Learning Intrinsic Sparse Structures within Long Short-term Memory.
CoRR, 2017

Optimal Reissue Policies for Reducing Tail Latency.
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, 2017

BitFunnel: Revisiting Signatures for Search.
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017

HyperDrive: exploring hyperparameters with POP scheduling.
Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, Las Vegas, NV, USA, December 11, 2017

Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency.
Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, Las Vegas, NV, USA, December 11, 2017

Exploiting heterogeneity for tail latency and energy efficiency.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

When Good Enough Is Better: Energy-Aware Scheduling for Multicore Servers.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Workload analysis and caching strategies for search advertising systems.
Proceedings of the 2017 Symposium on Cloud Computing, SoCC 2017, Santa Clara, CA, USA, 2017

Optimizing CNNs on Multicores for Scalability, Performance and Goodput.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016
Prediction and Predictability for Search Query Acceleration.
ACM Trans. Web, 2016

Venus: Scalable Real-Time Spatial Queries on Microblogs with Adaptive Load Shedding.
IEEE Trans. Knowl. Data Eng., 2016

SERF: efficient scheduling for fast deep neural network serving via judicious parallelism.
Proceedings of the International Conference for High Performance Computing, 2016

Work stealing for interactive services to meet target latency.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

GeoTrend: spatial trending queries on real-time microblogs.
Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS 2016, Burlingame, California, USA, October 31, 2016

TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services.
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016

2015
Online Resource Management for Carbon-Neutral Cloud Computing.
Proceedings of the Handbook on Data Centers, 2015

Processing and Optimizing Main Memory Spatial-Keyword Queries.
Proc. VLDB Endow., 2015

Delayed-Dynamic-Selective (DDS) Prediction for Reducing Extreme Tail Latency in Web Search.
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015

Optimal Aggregation Policy for Reducing Tail Latency of Web Search.
Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015

BATS: Budget-Constrained Autoscaling for Cloud Performance Optimization.
Proceedings of the 23rd IEEE International Symposium on Modeling, 2015

Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems.
Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

Measuring and Managing Answer Quality for Online Data-Intensive Services.
Proceedings of the 2015 IEEE International Conference on Autonomic Computing, 2015

Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014
Energy-efficient multiprocessor scheduling for flow time and makespan.
Theor. Comput. Sci., 2014

Hybrid query execution engine for large attributed graphs.
Inf. Syst., 2014

A Theoretical Foundation for Scheduling and Designing Heterogeneous Processors for Interactive Applications.
Proceedings of the Distributed Computing - 28th International Symposium, 2014

Predictive parallelization: taming tail latencies in web search.
Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2014

Mercury: A memory-constrained spatio-temporal real-time search on microblogs.
Proceedings of the IEEE 30th International Conference on Data Engineering, Chicago, 2014

Mars: Real-time spatio-temporal queries on microblogs.
Proceedings of the IEEE 30th International Conference on Data Engineering, Chicago, 2014

2013
Horton+: A Distributed System for Processing Declarative Reachability Queries over Partitioned Graphs.
Proc. VLDB Endow., 2013

Solving Graph Isomorphism Using Parameterized Matching.
Proceedings of the String Processing and Information Retrieval, 2013

COCA: online distributed resource management for cost minimization and carbon neutrality in data centers.
Proceedings of the International Conference for High Performance Computing, 2013

Energy-Efficient Scheduling for Best-Effort Interactive Services to Achieve High Response Quality.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Power-efficient resource allocation in MapReduce clusters.
Proceedings of the 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 2013

Performance Inconsistency in Large Scale Data Processing Clusters.
Proceedings of the 10th International Conference on Autonomic Computing, 2013

Exploiting Processor Heterogeneity in Interactive Services.
Proceedings of the 10th International Conference on Autonomic Computing, 2013

Adaptive parallelism for web search.
Proceedings of the Eighth EuroSys Conference 2013, 2013

Topic 3: Scheduling and Load Balancing - (Introduction).
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

QACO: exploiting partial execution in web servers.
Proceedings of the ACM Cloud and Autonomic Computing Conference, 2013

2012
Horton: Online Query Execution Engine for Large Distributed Graphs.
Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE 2012), 2012

Provably-Efficient Job Scheduling for Energy and Fairness in Geographically Distributed Data Centers.
Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems, 2012

Budget-based control for interactive services with adaptive execution.
Proceedings of the 9th International Conference on Autonomic Computing, 2012

Zeta: scheduling interactive services with partial execution.
Proceedings of the ACM Symposium on Cloud Computing, SOCC '12, 2012

G-SPARQL: a hybrid engine for querying large attributed graphs.
Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012

2011
Speed Scaling for Energy and Performance with Instantaneous Parallelism.
Proceedings of the Theory and Practice of Algorithms in (Computer) Systems, 2011

Scheduling Functionally Heterogeneous Systems with Utilization Balancing.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Tians Scheduling: Using Partial Processing in Best-Effort Applications.
Proceedings of the 2011 International Conference on Distributed Computing Systems, 2011

Scheduling for data center interactive services.
Proceedings of the 49th Annual Allerton Conference on Communication, 2011

Position Paper: Embracing Heterogeneity - Improving Energy Efficiency for Interactive Services on Heterogeneous Data Center Hardware.
Proceedings of the AI for Data Center Management and Cloud Computing, 2011

2010
Improved results for scheduling batched parallel jobs by using a generalized analysis framework.
J. Parallel Distributed Comput., 2010

Energy-Efficient Multiprocessor Scheduling for Flow Time and Makespan
CoRR, 2010

The Cilkview scalability analyzer.
Proceedings of the SPAA 2010: Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2010

2008
Provably Efficient Online Nonclairvoyant Adaptive Scheduling.
IEEE Trans. Parallel Distributed Syst., 2008

Adaptive work-stealing with parallelism feedback.
ACM Trans. Comput. Syst., 2008

2007
Adaptive work stealing with parallelism feedback.
Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007

Provably Efficient Online Non-clairvoyant Adaptive Scheduling.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Adaptive Scheduling with Parallelism Feedback.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Adaptive Scheduling of Parallel Jobs on Functionally Heterogeneous Resources.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

2006
Provably Efficient Two-Level Adaptive Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2006

An Empirical Evaluation of Work Stealing with Parallelism Feedback.
Proceedings of the 26th IEEE International Conference on Distributed Computing Systems (ICDCS 2006), 2006

2004
Secure communications between bandwidth brokers.
ACM SIGOPS Oper. Syst. Rev., 2004
