2025
Optimizing Nuclear Configuration Interaction Calculations on GPUs: A Comparative Performance Study of Programming Models.
Proceedings of the ISC High Performance 2025 Research Paper Proceedings (40th International Conference), 2025

Maximizing Power-Constrained Supercomputing Throughput.
Proceedings of the ISC High Performance 2025 Research Paper Proceedings (40th International Conference), 2025

2024
Evaluating the potential of disaggregated memory systems for HPC applications.
Concurr. Comput. Pract. Exp., August, 2024

Performance Modeling and Analysis of a de Bruijn Graph Based Local Assembly Kernel on Multiple Vendor GPUs.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

A Workflow Roofline Model for End-to-End Workflow Performance Analysis.
Proceedings of the International Conference for High Performance Computing, 2024

2023
Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters.
Proceedings of the International Conference for High Performance Computing, 2023

Evaluating the Performance of One-sided Communication on CPUs and GPUs.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

2022
Instruction Roofline: An insightful visual performance model for GPUs.
Concurr. Comput. Pract. Exp., 2022

A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures.
Proceedings of the IEEE/ACM International Workshop on Performance Modeling, 2022

2021
Accelerating large scale de novo metagenome assembly using GPUs.
Proceedings of the International Conference for High Performance Computing, 2021

Evaluating Performance and Portability of a core bioinformatics kernel on multiple vendor GPUs.
Proceedings of the International Workshop on Performance, 2021

A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver.
Proceedings of the 2021 SIAM Conference on Applied and Computational Discrete Algorithms, 2021

2020
APMT: an automatic hardware counter-based performance modeling tool for HPC applications.
CCF Trans. High Perform. Comput., 2020

Leveraging One-Sided Communication for Sparse Triangular Solvers.
Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing, 2020

LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

GPU accelerated partial order multiple sequence alignment for long reads self-correction.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops, 2020

2019
An automatic performance model-based scheduling tool for coupled climate system models.
J. Parallel Distributed Comput., 2019

An Instruction Roofline Model for GPUs.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

2017
Redesigning CAM-SE for peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight.
Proceedings of the International Conference for High Performance Computing, 2017

2016
Refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer.
Proceedings of the International Conference for High Performance Computing, 2016

2014
CESMTuner: An Auto-tuning Framework for the Community Earth System Model.
Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014