Developing an Interactive OpenMP Programming Book with Large Language Models.
Proceedings of the Advancing OpenMP for Future Accelerators, 2024

RTune: Towards Automated and Coordinated Optimization of Computing and Computational Objectives of Parallel Iterative Applications.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2024

Exploring OpenMP GPU Offloading for Implementing Convolutional Neural Networks.
Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, 2023

Generating and Analyzing Program Call Graphs using Ontology.
Proceedings of the IEEE/ACM Workshop on Programming and Performance Visualization Tools, 2022

Exploring source-to-source compiler transformation of OpenMP SIMD constructs for Intel AVX and Arm SVE vector architectures.
Proceedings of the Thirteenth International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM@PPoPP 2022), Virtual Event / Seoul, Republic of Korea, April 2, 2022

Stacking Feature Maps of Multi-scaled Medical Images in U-Net for 3D Head and Neck Tumor Segmentation.
Proceedings of the Head and Neck Tumor Segmentation and Outcome Prediction, 2022

Applying Quadratic Penalty Method for Intensity-Based Deformable Image Registration on BraTS-Reg Challenge 2022.
Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 2022

Experimenting FedML and NVFLARE for Federated Tumor Segmentation Challenge.
Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 2022

UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022

Extending OpenMP for Machine Learning-Driven Adaptation.
Proceedings of the Accelerator Programming Using Directives - 8th International Workshop, 2021

RDS: a cloud-based metaservice for detecting data races in parallel programs.
Proceedings of the 2021 IEEE/ACM 14th International Conference on Utility and Cloud Computing (UCC '21), Leicester, United Kingdom, December 6, 2021

CUDAMicroBench: Microbenchmarks to Assist CUDA Performance Programming.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

An Ensemble Approach to Automatic Brain Tumor Segmentation.
Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, 2021

Enhancing DataRaceBench for Evaluating Data Race Detection Tools.
Proceedings of the 4th IEEE/ACM International Workshop on Software Correctness for HPC Applications, 2020

Extending FreeCompilerCamp.org as an Online Self-Learning Platform for Compiler Development.
Proceedings of the IEEE/ACM Workshop on Education for High-Performance Computing, 2020

Supporting Data Shuffle Between Threads in OpenMP.
Proceedings of the OpenMP: Portable Multi-Level Parallelism on Modern Systems, 2020

Extending OpenMP Map Clause to Bridge Storage and Device Memory.
Proceedings of the 2019 IEEE/ACM Workshop on Memory Centric High Performance Computing, 2019

Ompparser: A Standalone and Unified OpenMP Parser.
Proceedings of the OpenMP: Conquering the Full Hardware Spectrum, 2019

Extending OpenMP Metadirective Semantics for Runtime Adaptation.
Proceedings of the OpenMP: Conquering the Full Hardware Spectrum, 2019

A Cross-Layer Solution in Scientific Workflow System for Tackling Data Movement Challenge.
CoRR, 2018

Principles of Memory-Centric Programming for High Performance Computing.
Proceedings of the Workshop on Memory Centric Programming for HPC, 2017

Evaluation of Knight Landing High Bandwidth Memory for HPC Workloads.
Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms, 2017

HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Comparison of Threading Programming Models.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Compiler transformation of nested loops for general purpose GPUs.
Concurr. Comput. Pract. Exp., 2016

A Proposal to OpenMP for Addressing the CPU Oversubscription Challenge.
Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016

Comparison of Spark Resource Managers and Distributed File Systems.
Proceedings of the 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), 2016

Programming Models, Languages, and Compilers for Manycore and Heterogeneous Architectures.
Sci. Program., 2015

Supporting multiple accelerators in high-level programming models.
Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, 2015

Reduction Operations in Parallel Loops for GPGPUs.
Proceedings of the 2014 PPOPP International Workshop on Programming Models and Applications for Multicores and Manycores, 2014

NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model.
Proceedings of the Languages and Compilers for Parallel Computing, 2014

Predicting Cache Contention for Multithread Applications at Compile Time.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Compile Time Modeling of Off-Chip Memory Bandwidth for Parallel Loops.
Proceedings of the Languages and Compilers for Parallel Computing, 2013

Compiling a High-Level Directive-Based Programming Model for GPGPUs.
Proceedings of the Languages and Compilers for Parallel Computing, 2013

Early Experiences with the OpenMP Accelerator Model.
Proceedings of the OpenMP in the Era of Low Power Devices and Accelerators, 2013

A Prototype Implementation of OpenMP Task Dependency Support.
Proceedings of the OpenMP in the Era of Low Power Devices and Accelerators, 2013

Integrating Asynchronous Task Parallelism with MPI.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Performance and Power Characteristics of Matrix Multiplication Algorithms on Multicore and Shared Memory Machines.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Compile-Time Detection of False Sharing via Loop Cost Modeling.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Acceleration of bulk memory operations in a heterogeneous multicore architecture.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

Integrating MPI with Asynchronous Task Parallelism.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Hardware and Software Tradeoffs for Task Synchronization on Manycore Architectures.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

The habanero multicore software research project.
Proceedings of the Companion to the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, 2009

Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement.
Proceedings of the Languages and Compilers for Parallel Computing, 2009

JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Scientific workflow scheduling in computational grids - Planning, reservation, and data/network-awareness.
Proceedings of the 8th IEEE/ACM International Conference on Grid Computing (GRID 2007), 2007

Campus Grids Meet Applications: Modeling, Metascheduling and Integration.
J. Grid Comput., 2006

A Feature-Rich Workflow Description Language that Supports Resource Co-allocations.
Proceedings of the High Performance Computing and Grids in Action, 2006

An OGSI-compliant portal for campus grids.
Enhanced Interoperable Systems: Proceedings of the 10th ISPE International Conference on Concurrent Engineering (ISPE CE 2003), 2003