Ammar Ahmad Awan
ORCID: 0000-0002-6272-3760
Affiliations:
- The Ohio State University, Columbus, OH, USA
According to our database, Ammar Ahmad Awan authored at least 47 papers between 2012 and 2024.
Online presence:
- ORCID profile on orcid.org
- Author profile on dl.acm.org
Bibliography
2024
CoRR, 2024
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference.
CoRR, 2024
2023
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies.
CoRR, 2023
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.
CoRR, 2023
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023
A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.
CoRR, 2023
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.
Proceedings of the 37th International Conference on Supercomputing, 2023
2022
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.
Proceedings of the SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
Proceedings of the International Conference on Machine Learning, 2022
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022
2021
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed.
Proceedings of the 38th International Conference on Machine Learning, 2021
2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020
HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow.
Proceedings of the High Performance Computing - 35th International Conference, 2020
GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020
Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020
NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
2019
IEEE Trans. Parallel Distributed Syst., 2019
Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019
HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow.
CoRR, 2019
OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.
Proceedings of the 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019
Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera.
Proceedings of the Third IEEE/ACM Workshop on Deep Learning on Supercomputers, 2019
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019
Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2019
2018
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018
OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018
2017
An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures.
Proceedings of the Machine Learning on HPC Environments, 2017
S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017
Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017
2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016
Proceedings of the 23rd European MPI Users' Group Meeting, 2016
CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016
2015
Designing Non-blocking Personalized Collectives with Near Perfect Overlap for RDMA-Enabled Clusters.
Proceedings of the High Performance Computing - 30th International Conference, 2015
GPU-Aware Design, Implementation, and Evaluation of Non-blocking Collective Benchmarks.
Proceedings of the 22nd European MPI Users' Group Meeting, 2015
A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015
Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
2013
J. Supercomput., 2013
CoRR, 2013
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2013
REECH-ME: Regional Energy Efficient Cluster Heads Based on Maximum Energy Routing Protocol for WSNs.
Proceedings of the 2013 Eighth International Conference on Broadband and Wireless Computing, Communication and Applications, 2013
DREEM-ME: Distributed Regional Energy Efficient Multi-hop Routing Protocol Based on Maximum Energy in WSNs.
Proceedings of the 2013 Eighth International Conference on Broadband and Wireless Computing, Communication and Applications, 2013
2012
Proceedings of the 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2012
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, 2012