2025
Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications.
IEEE Trans. Parallel Distributed Syst., February, 2025
Triton-Viz: Visualizing GPU Programming in AI Courses.
Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1, 2025
2024
DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads.
CoRR, 2024
Purpose Enhanced Reasoning through Iterative Prompting: Uncover Latent Robustness of ChatGPT on Code Comprehension.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024
AutoST: Training-Free Neural Architecture Search For Spiking Transformers.
Proceedings of the IEEE International Conference on Acoustics, 2024
EasyView: Bringing Performance Profiles into Integrated Development Environments.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024
DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024
2023
DrGPU: A Top-Down Profiler for GPU Applications.
Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering, 2023
DroidPerf: Profiling Memory Objects on Android Devices.
Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, 2023
DJXPerf: Identifying Memory Inefficiencies via Object-Centric Profiling for Java.
Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 2023
2022
BinGo: Pinpointing Concurrency Bugs in Go via Binary Analysis.
CoRR, 2022
Graph Neural Networks Based Memory Inefficiency Detection Using Selective Sampling.
Proceedings of the SC22: International Conference for High Performance Computing, 2022
OJXPERF: Featherlight Object Replica Detection for Java Programs.
Proceedings of the 44th IEEE/ACM International Conference on Software Engineering, 2022
ValueExpert: exploring value patterns in GPU-accelerated applications.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022
2021
NumaPerf: Predictive and Full NUMA Profiling.
CoRR, 2021
Toward efficient interactions between Python and native libraries.
Proceedings of the ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021
NumaPerf: predictive NUMA profiling.
Proceedings of the ICS '21: 2021 International Conference on Supercomputing, 2021
2020
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect.
IEEE Trans. Parallel Distributed Syst., 2020
Efficient Abortable-locking Protocol for Multi-level NUMA Systems: Design and Correctness.
ACM Trans. Parallel Comput., 2020
GVProf: a value profiler for GPU-based clusters.
Proceedings of the International Conference for High Performance Computing, 2020
DrCCTProf: a fine-grained call path profiler for ARM-based clusters.
Proceedings of the International Conference for High Performance Computing, 2020
ZeroSpy: exploring software inefficiency with redundant zeros.
Proceedings of the International Conference for High Performance Computing, 2020
ScalAna: automating scaling loss detection with graph analysis.
Proceedings of the International Conference for High Performance Computing, 2020
Identifying scalability bottlenecks for large-scale parallel programs with graph analysis.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020
What every scientific programmer should know about compiler optimizations?
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
ATMem: adaptive data placement in graph applications on heterogeneous memories.
Proceedings of the CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020
2019
Intelligent-Unrolling: Exploiting Regular Patterns in Irregular Applications.
CoRR, 2019
Pinpointing performance inefficiencies in Java.
Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019
Pinpointing performance inefficiencies via lightweight variance profiling.
Proceedings of the International Conference for High Performance Computing, 2019
Lightweight hardware transactional memory profiling.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019
Redundant loads: a software inefficiency indicator.
Proceedings of the 41st International Conference on Software Engineering, 2019
Can we trust profiling results?: understanding and fixing the inaccuracy in modern profilers.
Proceedings of the ACM International Conference on Supercomputing, 2019
CPpf: a prefetch aware LLC partitioning approach.
Proceedings of the 48th International Conference on Parallel Processing, 2019
Featherlight Reuse-Distance Measurement.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019
Transforming Query Sequences for High-Throughput B+ Tree Processing on Many-Core Processors.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019
2018
LWPTool: A Lightweight Profiler to Guide Data Layout Optimization.
IEEE Trans. Parallel Distributed Syst., 2018
NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks.
ACM Trans. Archit. Code Optim., 2018
Start Late or Finish Early: A Distributed Graph Processing System with Redundancy Reduction.
Proc. VLDB Endow., 2018
An Evaluation of Vectorization and Cache Reuse Tradeoffs on Modern CPUs.
Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, 2018
Featherlight on-the-fly false-sharing detection.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018
Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018
ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems.
Proceedings of the 32nd International Conference on Supercomputing, 2018
Towards Efficient SpMV on Sunway Manycore Architectures.
Proceedings of the 32nd International Conference on Supercomputing, 2018
CVR: efficient vectorization of SpMV on x86 processors.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018
Lightweight detection of cache conflicts.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018
Watching for Software Inefficiencies with Witch.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018
2017
An Efficient Abortable-locking Protocol for Multi-level NUMA Systems.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017
DR-BW: Identifying Bandwidth Contention in NUMA Architectures with Supervised Learning.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017
FLEP: Enabling Flexible and Efficient Preemption on GPUs.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017
REDSPY: Exploring Value Locality in Software.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017
Locality-Aware CTA Clustering for Modern GPUs.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017
2016
Correctness of Hierarchical MCS Locks with Timeout.
CoRR, 2016
Characterizing emerging heterogeneous memory.
Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, Santa Barbara, CA, USA, June 14, 2016
HIPS Introduction and Committees.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016
SMT-Aware Instantaneous Footprint Optimization.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016
Understanding Data Analytics Workloads on Intel(R) Xeon Phi(R).
Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, 2016
StructSlim: a lightweight profiler to guide structure splitting.
Proceedings of the 2016 International Symposium on Code Generation and Optimization, 2016
Cheetah: detecting false sharing efficiently and effectively.
Proceedings of the 2016 International Symposium on Code Generation and Optimization, 2016
memif: Towards Programming Heterogeneous Memory Asynchronously.
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016
2015
ScaAnalyzer: a tool to identify memory scalability bottlenecks in parallel programs.
Proceedings of the International Conference for High Performance Computing, 2015
Characterizing Data Analytics Workloads on Intel Xeon Phi.
Proceedings of the 2015 IEEE International Symposium on Workload Characterization, 2015
Towards Hybrid Programming in Big Data.
Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing, 2015
Runtime Value Numbering: A Profiling Technique to Pinpoint Redundant Computations.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015
2014
A tool to analyze the performance of multithreaded programs on NUMA architectures.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014
Call Paths for Pin Tools.
Proceedings of the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014
ArrayTool: a lightweight profiler to guide array regrouping.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014
2013
A data-centric profiler for parallel programs.
Proceedings of the International Conference for High Performance Computing, 2013
OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis.
Proceedings of the OpenMP in the Era of Low Power Devices and Accelerators, 2013
Pinpointing data locality bottlenecks with low overhead.
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013
A new approach for performance analysis of OpenMP programs.
Proceedings of the International Conference on Supercomputing, 2013
Evaluating task scheduling in Hadoop-based cloud systems.
Proceedings of the 2013 IEEE International Conference on Big Data (IEEE BigData 2013), 2013
2011
Automatic performance debugging of SPMD-style parallel programs.
J. Parallel Distributed Comput., 2011
Towards quantitative analysis of data intensive computing: a case study of Hadoop.
Proceedings of the 8th International Conference on Autonomic Computing, 2011
Pinpointing data locality problems using data-centric analysis.
Proceedings of the CGO 2011, 2011
2010
Automatic Performance Debugging of SPMD Parallel Programs.
CoRR, 2010
2009
Similarity Analysis in Automatic Performance Debugging of SPMD Parallel Programs.
CoRR, 2009
2008
A Fast-Start, Fault-Tolerant MPI Launcher on Dawning Supercomputers.
Proceedings of the Ninth International Conference on Parallel and Distributed Computing, 2008
A Dynamic Provisioning Framework for Multi-tier Internet Applications in Virtualized Data Center.
Proceedings of the Ninth International Conference on Parallel and Distributed Computing, 2008