Abdelhalim Amer

ACM Trans. Parallel Comput., 2020

GVProf: a value profiler for GPU-based clusters.

Keren Zhou

Yueming Hao

Xiaozhu Meng

Proceedings of the International Conference for High Performance Computing, 2020

DrCCTProf: a fine-grained call path profiler for ARM-based clusters.

Qidong Zhao

Proceedings of the International Conference for High Performance Computing, 2020

ZeroSpy: exploring software inefficiency with redundant zeros.

Proceedings of the International Conference for High Performance Computing, 2020

ScalAna: automating scaling loss detection with graph analysis.

Proceedings of the International Conference for High Performance Computing, 2020

Identifying scalability bottlenecks for large-scale parallel programs with graph analysis.

Proceedings of PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

What every scientific programmer should know about compiler optimizations?

Proceedings of ICS '20: 2020 International Conference on Supercomputing, 2020

ATMem: adaptive data placement in graph applications on heterogeneous memories.

Proceedings of CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020

2019

Intelligent-Unrolling: Exploiting Regular Patterns in Irregular Applications.

CoRR, 2019

Pinpointing performance inefficiencies in Java.

Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019

Pinpointing performance inefficiencies via lightweight variance profiling.

Proceedings of the International Conference for High Performance Computing, 2019

Lightweight hardware transactional memory profiling.

Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Redundant loads: a software inefficiency indicator.

Proceedings of the 41st International Conference on Software Engineering, 2019

Can we trust profiling results?: understanding and fixing the inaccuracy in modern profilers.

Proceedings of the ACM International Conference on Supercomputing, 2019

CPpf: a prefetch aware LLC partitioning approach.

Jun Xiao

Andy D. Pimentel

Proceedings of the 48th International Conference on Parallel Processing, 2019

Featherlight Reuse-Distance Measurement.

Qingsen Wang

Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

Transforming Query Sequences for High-Throughput B+ Tree Processing on Many-Core Processors.

Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

2018

LWPTool: A Lightweight Profiler to Guide Data Layout Optimization.

IEEE Trans. Parallel Distributed Syst., 2018

NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks.

Shuaiwen Leon Song

Sriram Krishnamoorthy

Abhinav Vishnu

Dipanjan Sengupta

ACM Trans. Archit. Code Optim., 2018

Start Late or Finish Early: A Distributed Graph Processing System with Redundancy Reduction.

Proc. VLDB Endow., 2018

An Evaluation of Vectorization and Cache Reuse Tradeoffs on Modern CPUs.

Du Shen

Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, 2018

Featherlight on-the-fly false-sharing detection.

Shasha Wen

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite.

Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems.

Proceedings of the 32nd International Conference on Supercomputing, 2018

Towards Efficient SpMV on Sunway Manycore Architectures.

Proceedings of the 32nd International Conference on Supercomputing, 2018

CVR: efficient vectorization of SpMV on x86 processors.

Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs.

Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

Lightweight detection of cache conflicts.

Shuaiwen Leon Song

Sriram Krishnamoorthy

Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

Watching for Software Inefficiencies with Witch.

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

2017

An Efficient Abortable-locking Protocol for Multi-level NUMA Systems.

Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

DR-BW: Identifying Bandwidth Contention in NUMA Architectures with Supervised Learning.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

FLEP: Enabling Flexible and Efficient Preemption on GPUs.

Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

REDSPY: Exploring Value Locality in Software.

Shasha Wen

Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

Locality-Aware CTA Clustering for Modern GPUs.

Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016

Correctness of Hierarchical MCS Locks with Timeout.

CoRR, 2016

Characterizing emerging heterogeneous memory.

Du Shen

Felix Xiaozhu Lin

Proceedings of the 2016 ACM SIGPLAN International Symposium on Memory Management, Santa Barbara, CA, USA, June 14, 2016

HIPS Introduction and Committees.

David Böhme

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

SMT-Aware Instantaneous Footprint Optimization.

Shuaiwen Leon Song

Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Understanding Data Analytics Workloads on Intel(R) Xeon Phi(R).

Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, 2016

StructSlim: a lightweight profiler to guide structure splitting.

Proceedings of the 2016 International Symposium on Code Generation and Optimization, 2016

Cheetah: detecting false sharing efficiently and effectively.

Tongping Liu

Proceedings of the 2016 International Symposium on Code Generation and Optimization, 2016

memif: Towards Programming Heterogeneous Memory Asynchronously.

Felix Xiaozhu Lin

Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 2016

2015

ScaAnalyzer: a tool to identify memory scalability bottlenecks in parallel programs.

Bo Wu

Proceedings of the International Conference for High Performance Computing, 2015

Characterizing Data Analytics Workloads on Intel Xeon Phi.

Proceedings of the 2015 IEEE International Symposium on Workload Characterization, 2015

Towards Hybrid Programming in Big Data.

Proceedings of the 7th USENIX Workshop on Hot Topics in Cloud Computing, 2015

Runtime Value Numbering: A Profiling Technique to Pinpoint Redundant Computations.

Shasha Wen

Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014

A tool to analyze the performance of multithreaded programs on NUMA architectures.

Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Call Paths for Pin Tools.

Proceedings of the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014

ArrayTool: a lightweight profiler to guide array regrouping.

Kamal Sharma

Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013

A data-centric profiler for parallel programs.

Alexandre E. Eichenberger

Proceedings of the International Conference for High Performance Computing, 2013

OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis.

Proceedings of the OpenMP in the Era of Low Power Devices and Accelerators, 2013

Pinpointing data locality bottlenecks with low overhead.

Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

A new approach for performance analysis of OpenMP programs.

Michael W. Fagan

Proceedings of the International Conference on Supercomputing, 2013

Evaluating task scheduling in Hadoop-based cloud systems.

Proceedings of the 2013 IEEE International Conference on Big Data (IEEE BigData 2013), 2013

2011

Automatic performance debugging of SPMD-style parallel programs.

J. Parallel Distributed Comput., 2011

Towards quantitative analysis of data intensive computing: a case study of Hadoop.

Proceedings of the 8th International Conference on Autonomic Computing, 2011

Pinpointing data locality problems using data-centric analysis.
