Ching-Hsiang Chu
Orcid: 0000-0002-6752-3135
According to our database, Ching-Hsiang Chu authored at least 37 papers between 2011 and 2024.
Bibliography
2024
Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression.
Proceedings of the International Conference for High Performance Computing, 2024
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024
2023
Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE.
Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, 2023
2022
Software-hardware co-design for fast and scalable training of deep learning recommendation models.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022
2021
The MVAPICH project: Transforming research into high-performance MPI library for HPC community.
J. Comput. Sci., 2021
High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models.
CoRR, 2021
Proceedings of the High Performance Computing - 36th International Conference, 2021
Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems.
Proceedings of the 21st IEEE/ACM International Symposium on Cluster, 2021
2020
Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects.
IEEE Micro, 2020
FALCON-X: Zero-copy MPI derived datatype processing on modern CPU and GPU architectures.
J. Parallel Distributed Comput., 2020
NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
Proceedings of the IEEE International Conference on Cluster Computing, 2020
2019
IEEE Trans. Parallel Distributed Syst., 2019
Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?
Parallel Comput., 2019
Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences.
Proceedings of the High Performance Computing, 2019
OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019
C-GDR: High-Performance Container-Aware GPUDirect MPI Communication Schemes on RDMA Networks.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019
Designing a Profiling and Visualization Tool for Scalable and In-depth Analysis of High-Performance GPU Clusters.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019
High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems.
Proceedings of the 26th IEEE International Conference on High Performance Computing, 2019
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019
Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures.
Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, 2019
2018
Distributed Topology Control for Energy-Efficient and Reliable Wireless Communications.
IEEE Syst. J., 2018
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Proceedings of the 25th European MPI Users' Group Meeting, 2018
Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM.
Proceedings of the OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Extreme Heterogeneity, 2018
OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018
2017
MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling.
Proceedings of the 46th International Conference on Parallel Processing, 2017
Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning.
Proceedings of the 46th International Conference on Parallel Processing, 2017
2016
CUDA-Aware OpenSHMEM: Extensions and Designs for High Performance OpenSHMEM on GPU Clusters.
Parallel Comput., 2016
Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications.
Proceedings of the First International Workshop on Communication Optimizations in HPC, 2016
Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters.
Proceedings of the 28th International Symposium on Computer Architecture and High Performance Computing, 2016
Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016
2015
A Case for Non-blocking Collectives in OpenSHMEM: Design, Implementation, and Performance Evaluation using MVAPICH2-X.
Proceedings of the OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, 2015
Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
2014
Proceedings of the IEEE International Conference on Communications, 2014
2013
EURASIP J. Wirel. Commun. Netw., 2013
2011
Improving SCTP Performance by Jitter-Based Congestion Control over Wired-Wireless Networks.
EURASIP J. Wirel. Commun. Netw., 2011