John M. Mellor-Crummey

Orcid: 0000-0002-9026-5453

Affiliations:
  • Rice University, Houston, USA


According to our database, John M. Mellor-Crummey authored at least 138 papers between 1987 and 2024.

Awards

ACM Fellow

ACM Fellow 2013, "For contributions to parallel and high performance computing."

Bibliography

2024
Refining HPCToolkit for application performance analysis at exascale.
Int. J. High Perform. Comput. Appl., 2024

Matrix-Free Finite Volume Kernels on a Dataflow Architecture.
CoRR, 2024

Priority Sampling of Large Language Models for Compilers.
Proceedings of the 4th Workshop on Machine Learning and Systems, 2024

2023
Towards Accelerating High-Order Stencils on Modern GPUs and Emerging Architectures with a Portable Framework.
CoRR, 2023

LoopTune: Optimizing Tensor Computations with Reinforcement Learning.
CoRR, 2023

2022
An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications.
IEEE Trans. Parallel Distributed Syst., 2022

Accelerating high-order stencils on GPUs.
Concurr. Comput. Pract. Exp., 2022

Improving Tool Support for Nested Parallel Regions with Introspection Consistency.
Proceedings of the OpenMP in a Modern World: From Multi-device Support to Meta Programming, 2022

Low overhead and context sensitive profiling of GPU-accelerated applications.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

Preparing for performance analysis at exascale.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

ValueExpert: exploring value patterns in GPU-accelerated applications.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022

2021
Measurement and analysis of GPU-accelerated applications with HPCToolkit.
Parallel Comput., 2021

Measurement and Analysis of GPU-Accelerated OpenCL Computations on Intel GPUs.
Proceedings of the IEEE/ACM International Workshop on Programming and Performance Visualization Tools, 2021

Parallel binary code analysis.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Using the Semi-Stencil Algorithm to Accelerate High-Order Stencils on GPUs.
Proceedings of the 2021 International Workshop on Performance Modeling, 2021

GPA: A GPU Performance Advisor Based on Instruction Sampling.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

2020
Parallelizing Binary Code Analysis.
CoRR, 2020

GVProf: a value profiler for GPU-based clusters.
Proceedings of the International Conference for High Performance Computing, 2020

A tool for top-down performance analysis of GPU-accelerated applications.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Using sample-based time series data for automated diagnosis of scalability losses in parallel programs.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Accelerating High-Order Stencils on GPUs.
Proceedings of the 2020 IEEE/ACM Performance Modeling, 2020

Tools for top-down performance analysis of GPU-accelerated applications.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020

2019
Understanding congestion in high performance interconnection networks using sampling.
Proceedings of the International Conference for High Performance Computing, 2019

Lightweight, Packet-Centric Monitoring of Network Traffic and Congestion Implemented in P4.
Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects, 2019

A Tool for Performance Analysis of GPU-Accelerated Applications.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

2018
Dynamic data race detection for OpenMP programs.
Proceedings of the International Conference for High Performance Computing, 2018

Automated Analysis of Time Series Data to Understand Parallel Program Behaviors.
Proceedings of the 32nd International Conference on Supercomputing, 2018

2016
MPI-ACC: Accelerator-Aware MPI for Scientific Applications.
IEEE Trans. Parallel Distributed Syst., 2016

Performance Analysis and Optimization of a Hybrid Distributed Reverse Time Migration Application.
CoRR, 2016

A Practical Solution to the Cactus Stack Problem.
Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, 2016

A wait-free queue as fast as fetch-and-add.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

Contention-conscious, locality-preserving locks.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

Performance Analysis and Optimization of a Hybrid Seismic Imaging Application.
Proceedings of the International Conference on Computational Science 2016, 2016

Design and Verification of Distributed Phasers.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

2015
Distributed Phasers.
CoRR, 2015

Barrier elision for production parallel programs.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

High performance locks for multi-level NUMA systems.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

Communication Avoiding Algorithms: Analysis and Code Generation for Parallel Systems.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014
Portable, MPI-interoperable coarray fortran.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

A tool to analyze the performance of multithreaded programs on NUMA architectures.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Test-driven repair of data races in structured parallel programs.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014

Autotuning Tensor Transposition.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Author retrospective: compilation techniques for block-cyclic distributions.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014

Call Paths for Pin Tools.
Proceedings of the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014

ArrayTool: a lightweight profiler to guide array regrouping.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
A data-centric profiler for parallel programs.
Proceedings of the International Conference for High Performance Computing, 2013

Effective sampling-driven performance tools for GPU-accelerated supercomputers.
Proceedings of the International Conference for High Performance Computing, 2013

OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis.
Proceedings of the OpenMP in the Era of Low Power Devices and Accelerators, 2013

Pinpointing data locality bottlenecks with low overhead.
Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems & Software, 2013

Managing Asynchronous Operations in Coarray Fortran 2.0.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

A new approach for performance analysis of OpenMP programs.
Proceedings of the International Conference on Supercomputing, 2013

On the efficacy of GPU-integrated MPI for scientific applications.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

2012
DeadSpy: a tool to pinpoint program inefficiencies.
Proceedings of the 10th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2012

2011
Using Sampling to Understand Parallel Program Performance.
Proceedings of the Tools for High Performance Computing 2011, 2011

HIPS Keynote.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Implementation and Performance Evaluation of the HPC Challenge Benchmarks in Coarray Fortran 2.0.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Scalable fine-grained call path tracing.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Pinpointing data locality problems using data-centric analysis.
Proceedings of the CGO 2011, 2011

2010
Teaching parallel programming: a roundtable discussion.
XRDS, 2010

HPCTOOLKIT: tools for performance analysis of optimized parallel programs.
Concurr. Comput. Pract. Exp., 2010

Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles.
Proceedings of the Conference on High Performance Computing Networking, 2010

Analyzing lock contention in multithreaded applications.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Hiding latency in Coarray Fortran 2.0.
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, 2010

Effectively Presenting Call Path Profiles of Application Performance.
Proceedings of the 39th International Conference on Parallel Processing, 2010

2009
Identifying Performance Bottlenecks in Work-Stealing Computations.
Computer, 2009

Diagnosing performance bottlenecks in emerging petascale applications.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Effective performance measurement and analysis of multithreaded applications.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Binary analysis for measurement and attribution of program performance.
Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2009

2008
Where will all the threads come from?
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Pinpointing and Exploiting Opportunities for Enhancing Data Reuse.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2008

2007
Application Insight Through Performance Modeling.
Proceedings of the 26th IEEE International Performance Computing and Communications Conference, 2007

Scalability analysis of SPMD codes using expectations.
Proceedings of the 21st Annual International Conference on Supercomputing, 2007

2006
Automatic tuning of whole applications using direct search and a performance-based transformation system.
J. Supercomput., 2006

Experiences with Sweep3D implementations in Co-array Fortran.
J. Supercomput., 2006

PRec-I-DCM3: a parallel framework for fast and accurate large-scale phylogeny reconstruction.
Int. J. Bioinform. Res. Appl., 2006

2005
SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion.
ACM Trans. Math. Softw., 2005

Telescoping Languages: A System for Automatic Generation of Domain Languages.
Proc. IEEE, 2005

New Grid Scheduling and Rescheduling Methods in the GrADS Project.
Int. J. Parallel Program., 2005

Improving Performance by Reducing the Memory Footprint of Scientific Applications.
Int. J. High Perform. Comput. Appl., 2005

An evaluation of global address space languages: co-array fortran and unified parallel C.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005

Effective communication coalescing for data-parallel applications.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005

Representation-independent program analysis.
Proceedings of the 2005 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis For Software Tools and Engineering, 2005

COTS Clusters vs. the Earth Simulator: An Application Study Using IMPACT-3D.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Low-overhead call path profiling of unmodified, optimized code.
Proceedings of the 19th Annual International Conference on Supercomputing, 2005

Scheduling strategies for mapping application workflows onto the grid.
Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing, 2005

Reconstructing Phylogenetic Networks Using Maximum Parsimony.
Proceedings of the Fourth International IEEE Computer Society Computational Systems Bioinformatics Conference, 2005

Space-filling Curve Generation: A Table-based Approach.
Proceedings of the 2005 International Conference on Algorithmic Mathematics and Computer Science, 2005

2004
Optimizing Sparse Matrix - Vector Product Computations Using Unroll and Jam.
Int. J. High Perform. Comput. Appl., 2004

Cross-architecture performance predictions for scientific applications using parameterized models.
Proceedings of the International Conference on Measurements and Modeling of Computer Systems, 2004

Experiences with Co-array Fortran on Hardware Shared Memory Platforms.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

Scheduling workflow applications in GrADS.
Proceedings of the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004

A Multi-Platform Co-Array Fortran Compiler.
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT 2004), 29 September, 2004

2003
Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations.
J. Parallel Distributed Comput., 2003

An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications.
J. Instr. Level Parallelism, 2003

Co-array Fortran Performance and Potential: An NPB Experimental Study.
Proceedings of the Languages and Compilers for Parallel Computing, 2003

2002
HPCVIEW: A Tool for Top-down Analysis of Node Performance.
J. Supercomput., 2002

Advanced optimization strategies in the Rice dHPF compiler.
Concurr. Comput. Pract. Exp., 2002

Toward a Framework for Preparing and Executing Adaptive Grid Programs.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Generalized Multipartitioning for Multi-Dimensional Arrays.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Experiences tuning SMG98: a semicoarsening multigrid benchmark based on the hypre library.
Proceedings of the 16th international conference on Supercomputing, 2002

2001
Telescoping Languages: A Strategy for Automatic Generation of Scientific Problem-Solving Systems from Annotated Libraries.
J. Parallel Distributed Comput., 2001

Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings.
Int. J. Parallel Program., 2001

The GrADS Project: Software Support for High-Level Grid Application Development.
Int. J. High Perform. Comput. Appl., 2001

On providing useful information for analyzing and tuning applications.
Proceedings of the Joint International Conference on Measurements and Modeling of Computer Systems, 2001

Increasing temporal locality with skewing and recursive blocking.
Proceedings of the 2001 ACM/IEEE conference on Supercomputing, 2001

Tools for application-oriented performance tuning.
Proceedings of the 15th international conference on Supercomputing, 2001

Data-Parallel Compiler Support for Multipartitioning.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

Advanced Code Generation for High Performance Fortran.
Proceedings of the Compiler Optimizations for Scalable Parallel Systems Languages, 2001

2000
Compilation and Runtime-Optimizations for Software Distributed Shared Memory.
Proceedings of the Languages, 2000

Toward Compiler Support for Scalable Parallelism Using Multipartitioning.
Proceedings of the Languages, 2000

1999
An Evaluation of Computing Paradigms for N-Body Simulations on Distributed Memory Architectures.
Proceedings of the 1999 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'99), 1999

Improving memory hierarchy performance for irregular applications.
Proceedings of the 13th international conference on Supercomputing, 1999

1998
Simplifying Control Flow in Compiler-Generated Parallel Code.
Int. J. Parallel Program., 1998

High Performance Fortran Compilation Techniques for Parallelizing Scientific Codes.
Proceedings of the ACM/IEEE Conference on Supercomputing, 1998

Using Integer Sets for Data-Parallel Program Analysis and Optimization.
Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), 1998

Compiler-Optimization of Implicit Reductions for Distributed Memory Multiprocessors.
Proceedings of the 12th International Parallel Processing Symposium / 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP '98), March 30, 1998

1997
Compiling Stencils in High Performance Fortran.
Proceedings of the ACM/IEEE Conference on Supercomputing, 1997

1995
An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs.
Proceedings of Supercomputing '95, San Diego, CA, USA, December 4-8, 1995

Optimizing Fortran 90 Shift Operations on Distributed-Memory Multicomputers.
Proceedings of the Languages and Compilers for Parallel Computing, 1995

1994
Fast, contention-free combining tree barriers for shared-memory multiprocessors.
Int. J. Parallel Program., 1994

Requirements for DataParallel Programming Environments.
IEEE Parallel Distributed Technol. Syst. Appl., 1994

Compilation techniques for block-cyclic distributions.
Proceedings of the 8th international conference on Supercomputing, 1994

Automatic Data Layout for Distributed-Memory Machines in the D Programming Environment.
Proceedings of the Automatic Parallelization: New Approaches to Code Generation, 1994

1993
The ParaScope parallel programming environment.
Proc. IEEE, 1993

Compile-Time Support for Efficient Data Race Detection in Shared-Memory Parallel Programs.
Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, 1993

FIAT: A Framework for Interprocedural Analysis and Transformation.
Proceedings of the Languages and Compilers for Parallel Computing, 1993

1992
Automatic software cache coherence through vectorization.
Proceedings of the 6th international conference on Supercomputing, 1992

1991
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.
ACM Trans. Comput. Syst., 1991

On-the-fly detection of data races for programs with nested fork-join parallelism.
Proceedings of Supercomputing '91, 1991

Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors.
Proceedings of the Third ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), 1991

Synchronization without Contention.
Proceedings of ASPLOS-IV, 1991

1990
Analyzing Parallel Program Executions Using Multiple Views.
J. Parallel Distributed Comput., 1990

Parallel program debugging with on-the-fly anomaly detection.
Proceedings of Supercomputing '90, New York, NY, USA, November 12-16, 1990

1989
The Elmwood Multiprocessor Operating System.
Softw. Pract. Exp., 1989

A Software Instruction Counter.
Proceedings of ASPLOS-III, 1989

1988
An Integrated Approach to Parallel Program Debugging and Performance Analysis of Large-Scale Multiprocessors.
Proceedings of the ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, 1988

Experience with the BBN Butterfly.
Proceedings of the COMPCON'88, Digest of Papers, Thirty-Third IEEE Computer Society International Conference, San Francisco, California, USA, February 29, 1988

1987
Debugging Parallel Programs with Instant Replay.
IEEE Trans. Computers, 1987
