2025
Balanced and Elastic End-to-end Training of Dynamic LLMs.
CoRR, May, 2025
2024
The Landscape of GPU-Centric Communication.
CoRR, 2024
A Sparse Tensor Generator with Efficient Feature Extraction.
CoRR, 2024
Optimizing GNN-Based Multiple Object Tracking on a Graphcore IPU.
Proceedings of the High Performance Computing. ISC High Performance 2024 International Workshops, 2024
P-MoVE: Performance Monitoring and Visualization with Encoded Knowledge.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024
Autonomous Execution for Multi-GPU Systems: Compiler Support.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024
Snoopie: A Multi-GPU Communication Profiler and Visualizer.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024
2023
Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison.
IEEE Trans. Parallel Distributed Syst., May, 2023
Precise event sampling-based data locality tools for AMD multicore architectures.
Concurr. Comput. Pract. Exp., 2023
Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs.
Proceedings of the International Conference for High Performance Computing, 2023
Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge.
Proceedings of the 37th International Conference on Supercomputing, 2023
2022
ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer.
ACM Trans. Archit. Code Optim., 2022
Mixed and Multi-Precision SpMV for GPUs with Row-wise Precision Selection.
Proceedings of the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2022
2021
A Split Execution Model for SpTRSV.
IEEE Trans. Parallel Distributed Syst., 2021
A computational-graph partitioning method for training memory-constrained DNNs.
Parallel Comput., 2021
Structured Adaptive Mesh Refinement Adaptations to Retain Performance Portability With Increasing Heterogeneity.
Comput. Sci. Eng., 2021
Monitoring Collective Communication Among GPUs.
Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021
Low-Overhead Reuse Distance Profiling Tool for Multicore.
Proceedings of the Euro-Par 2021: Parallel Processing Workshops, 2021
2020
TIGER: Topology-aware Assignment using Ising machines Application to Classical Algorithm Tasks and Quantum Circuit Gates.
CoRR, 2020
Adaptive Level Binning: A New Algorithm for Solving Sparse Triangular Systems.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020
Tiling-Based Programming Model for Structured Grids on GPU Clusters.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020
A Prediction Framework for Fast Sparse Triangular Solves.
Proceedings of the Euro-Par 2020: Parallel Processing, 2020
ComScribe: Identifying Intra-node GPU Communication.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2020
2019
Communication analysis and optimization of 3D front tracking method for multiphase flow simulations.
Int. J. High Perform. Comput. Appl., 2019
Asynchronous AMR on Multi-GPUs.
Proceedings of the High Performance Computing, 2019
ComDetective: a lightweight communication detection tool for threads.
Proceedings of the International Conference for High Performance Computing, 2019
Program analysis for process migration.
Proceedings of the 8th ACM SIGPLAN International Workshop on State Of the Art in Program Analysis, 2019
2018
Load Balancing for Parallel Multiphase Flow Simulation.
Sci. Program., 2018
Output nondeterminism detection for programming models combining dataflow with shared memory.
Parallel Comput., 2018
Special issue on High performance computing conference (BASARIM-2017).
Concurr. Comput. Pract. Exp., 2018
BindMe: A thread binding library with advanced mapping algorithms.
Concurr. Comput. Pract. Exp., 2018
Fast multidimensional reduction and broadcast operations on GPU for machine learning.
Concurr. Comput. Pract. Exp., 2018
Phase asynchronous AMR execution for productive and performant astrophysical flows.
Proceedings of the International Conference for High Performance Computing, 2018
Phase-Based Data Placement Scheme for Heterogeneous Memory Systems.
Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018
Runtime Determinacy Race Detection for OpenMP Tasks.
Proceedings of the Euro-Par 2018: Parallel Processing, 2018
2017
Trends in Data Locality Abstractions for HPC Systems.
IEEE Trans. Parallel Distributed Syst., 2017
Access pattern-aware data placement for hybrid DRAM/NVM.
Turkish J. Electr. Eng. Comput. Sci., 2017
Object Placement for High Bandwidth Memory Augmented with High Capacity Memory.
Proceedings of the 29th International Symposium on Computer Architecture and High Performance Computing, 2017
EmbedSanitizer: Runtime Race Detection Tool for 32-bit Embedded ARM.
Proceedings of the Runtime Verification - 17th International Conference, 2017
Overlapping Data Transfers with Computation on GPU with Tiles.
Proceedings of the 46th International Conference on Parallel Processing, 2017
Nonintrusive AMR Asynchrony for Communication Optimization.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017
2016
BoxLib with Tiling: An Adaptive Mesh Refinement Software Framework.
SIAM J. Sci. Comput., 2016
BoxLib with Tiling: An AMR Software Framework.
CoRR, 2016
TiDA: High-Level Programming Abstractions for Data Locality Management.
Proceedings of the High Performance Computing - 31st International Conference, 2016
Perilla: metadata-based optimizations of an asynchronous runtime for adaptive mesh refinement.
Proceedings of the International Conference for High Performance Computing, 2016
2015
ExaSAT: An exascale co-design tool for performance modeling.
Int. J. High Perform. Comput. Appl., 2015
2014
Abstract machine models and proxy architectures for exascale computing.
Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, 2014
2013
A new approach to interactive viewpoint selection for volume data sets.
Inf. Vis., 2013
Modeling and predicting performance of high performance computing applications on hardware accelerators.
Int. J. High Perform. Comput. Appl., 2013
Software Design Space Exploration for Exascale Combustion Co-design.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013
2012
Domain-specific translator and optimizer for massive on-chip parallelism.
PhD thesis, 2012
Hands-on Performance Tuning of 3D Finite Difference Earthquake Simulation on GPU Fermi Chipset.
Proceedings of the International Conference on Computational Science, 2012
Accelerating a 3D Finite-Difference Earthquake Simulation with a C-to-CUDA Translator.
Comput. Sci. Eng., 2012
Interactive data-centric viewpoint selection.
Proceedings of the Visualization and Data Analysis 2012, 2012
2011
Modeling and predicting application performance on hardware accelerators.
Proceedings of the 2011 IEEE International Symposium on Workload Characterization, 2011
Mint: realizing CUDA performance in 3D stencil methods with annotated C.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011
2009
An Adaptive Sub-sampling Method for In-memory Compression of Scientific Data.
Proceedings of the 2009 Data Compression Conference (DCC 2009), 2009