John D. Owens

ORCID: 0000-0001-6582-8237

  • University of California, Davis, US

According to our database, John D. Owens authored at least 150 papers between 1998 and 2024.

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms.
CoRR, 2024

The EDGE Language: Extended General Einsums for Graph Algorithms.
CoRR, 2024

Helping Faculty Teach Software Performance Engineering.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks.
CoRR, 2023

BOBA: A Parallel Lightweight Graph Reordering Algorithm with Heavyweight Implications.
CoRR, 2023

Harmonic CUDA: Asynchronous Programming on GPUs.
Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores, 2023

A Programming Model for GPU Load Balancing.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Maximum Clique Enumeration on the GPU.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract).
Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing, 2023

Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Analyzing and Implementing GPU Hash Tables.
Proceedings of the 2023 Symposium on Algorithmic Principles of Computer Systems, 2023

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU.
ACM Trans. Math. Softw., 2022

Supporting Unified Shader Specialization by Co-opting C++ Features.
Proc. ACM Comput. Graph. Interact. Tech., 2022

Scalable Irregular Parallelism with GPUs: Getting CPUs Out of the Way.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Essentials of Parallel Graph Analytics.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Atos: A Task-Parallel GPU Scheduler for Graph Analytics.
Proceedings of the 51st International Conference on Parallel Processing, 2022

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs.
Proceedings of the 29th IEEE International Conference on High Performance Computing, 2022

A GPU Multiversion B-Tree.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022

RXMesh: a GPU mesh data structure.
ACM Trans. Graph., 2021

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations.
CoRR, 2021

Unified Shader Programming in C++.
CoRR, 2021

Better GPU Hash Tables.
CoRR, 2021

Towards Flexible and Compiler-Friendly Layer Fusion for CNNs on Multicore CPUs.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021

VoroCrust: Voronoi Meshing Without Clipping.
ACM Trans. Graph., 2020

Fast Gunrock Subgraph Matching (GSM) on GPUs.
CoRR, 2020

Energy-based Out-of-distribution Detection.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Dynamic Graphs on the GPU.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Staged metaprogramming for shader system development.
ACM Trans. Graph., 2019

Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection.
IEEE Trans. Intell. Veh., 2019

Unsupervised Object Segmentation with Explicit Localization Module.
CoRR, 2019

RDMA vs. RPC for Implementing Distributed Data Structures.
Proceedings of the 9th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms, 2019

Engineering a high-performance GPU B-Tree.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

Graph Coloring on the GPU.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019

Fast BFS-Based Triangle Counting on GPUs.
Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

Accelerating DNN Inference with GraphBLAS and the GPU.
Proceedings of the 2019 IEEE High Performance Extreme Computing Conference, 2019

Object Localization and Motion Transfer learning with Capsules.
CoRR, 2018

Technical perspective: Graphs, betweenness centrality, and the GPU.
Commun. ACM, 2018

Benchmarking Deep Learning Frameworks with FPGA-suitable Models on a Traffic Sign Dataset.
Proceedings of the 2018 IEEE Intelligent Vehicles Symposium, 2018

FPGA versus GPU for Speed-Limit-Sign Recognition.
Proceedings of the 21st International Conference on Intelligent Transportation Systems, 2018

Scalable Breadth-First Search on a GPU Cluster.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Quotient Filters: Approximate Membership Queries on the GPU.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

GPU LSM: A Dynamic Dictionary Data Structure for the GPU.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

A Dynamic Hash Table for the GPU.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Implementing Push-Pull Efficiently in GraphBLAS.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Design Principles for Sparse Matrix Multiplication on the GPU.
Proceedings of the Euro-Par 2018: Parallel Processing, 2018

VoroCrust Illustrated: Theory and Challenges (Multimedia Exposition).
Proceedings of the 34th International Symposium on Computational Geometry, 2018

Sampling Conditions for Conforming Voronoi Meshing by the VoroCrust Algorithm.
Proceedings of the 34th International Symposium on Computational Geometry, 2018

Gunrock: GPU Graph Analytics.
ACM Trans. Parallel Comput., 2017

GPU Multisplit: An Extended Study of a Parallel Algorithm.
ACM Trans. Parallel Comput., 2017

Methods for multitasking among real-time embedded compute tasks running on the GPU.
Concurr. Comput. Pract. Exp., 2017

A Constrained Resampling Strategy for Mesh Improvement.
Comput. Graph. Forum, 2017

Mini-Gunrock: A Lightweight Graph Analytics Framework on the GPU.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Multi-GPU Graph Analytics.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Multidisciplinary simulation acceleration using multiple shared memory graphical processing units.
Int. J. High Perform. Comput. Appl., 2016

Fast parallel skew and prefix-doubling suffix array construction on the GPU.
Concurr. Comput. Pract. Exp., 2016

Disk Density Tuning of a Maximal Random Packing.
Comput. Graph. Forum, 2016

Parallel Approaches to the String Matching Problem on the GPU.
Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, 2016

Multitasking Real-time Embedded GPU Computing Tasks.
Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores, 2016

GPU multisplit.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

A Comparative Study on Exact Triangle Counting Algorithms on the GPU.
Proceedings of the ACM Workshop on High Performance Graph Processing, 2016

Real-time GPU-based timing channel detection using entropy.
Proceedings of the 2016 IEEE Conference on Communications and Network Security, 2016

Piko: a framework for authoring programmable graphics pipelines.
ACM Trans. Graph., 2015

Parallel Reyes-style adaptive subdivision with bounded memory usage.
Proceedings of the 19th Symposium on Interactive 3D Graphics and Games, San Francisco, CA, USA, February 27, 2015

Gunrock: a high-performance graph processing library on the GPU.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Performance Characterization of High-Level Programming Models for GPU Graph Analytics.
Proceedings of the 2015 IEEE International Symposium on Workload Characterization, 2015

Fast Parallel Suffix Array on the GPU.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015

Efficient dense reconstruction using geometry and image consistency constraints.
Proceedings of the 2015 IEEE Applied Imagery Pattern Recognition Workshop, 2015

Exercises in High-Dimensional Sampling: Maximal Poisson-Disk Sampling and k-d Darts.
Proceedings of the Green in Software Engineering, 2015

k-d Darts: Sampling by k-dimensional flat searches.
ACM Trans. Graph., 2014

Piko: A Design Framework for Programmable Graphics Pipelines.
CoRR, 2014

GPU-accelerated and efficient multi-view triangulation for scene reconstruction.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2014

Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

WTF, GPU! computing twitter's who-to-follow on the GPU.
Proceedings of the second ACM conference on Online social networks, 2014

A Comparative Study of GPU-Accelerated Multi-view Sequential Reconstruction Triangulation Methods for Large-Scale Scenes.
Proceedings of the Computer Vision - ACCV 2014 Workshops, 2014

k-d Darts: Sampling by k-Dimensional Flat Searches.
CoRR, 2013

A GPU Implementation for Two-Dimensional Shallow Water Modeling.
CoRR, 2013

Sifted Disks.
Comput. Graph. Forum, 2013

Finding Convex Hulls Using Quickhull on the GPU.
CoRR, 2012

A GPU Task-Parallel Model with Dependency Resolution.
Computer, 2012

A Simple Algorithm for Maximal Poisson-Disk Sampling in High Dimensions.
Comput. Graph. Forum, 2012

Plane-dependent error diffusion on a GPU.
Proceedings of the Image Processing: Algorithms and Systems X; and Parallel Processing for Imaging Applications II, 2012

High-Quality Parallel Depth-of-Field Using Line Samples.
Proceedings of the EUROGRAPHICS Conference on High Performance Graphics 2012, 2012

kANN on the GPU with Shifted Sorting.
Proceedings of the EUROGRAPHICS Conference on High Performance Graphics 2012, 2012

Efficient maximal poisson-disk sampling.
ACM Trans. Graph., 2011

Acceleration of 2-D Compressible Flow Solvers with Graphics Processing Unit Clusters.
J. Aerosp. Comput. Inf. Commun., 2011

Efficient Synchronization Primitives for GPUs.
CoRR, 2011

Efficient and good Delaunay meshes from random points.
Comput. Aided Des., 2011

Efficient adaptive tiling for programmable rendering.
Proceedings of the Symposium on Interactive 3D Graphics and Games, 2011

A parallel error diffusion implementation on a GPU.
Proceedings of the Conference on Parallel Processing for Imaging Applications 2011, 2011

Feature-based speed limit sign detection using a graphics processing unit.
Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2011

Multi-GPU MapReduce on GPU Clusters.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

An Auto-tuned Method for Solving Large Tridiagonal Systems on the GPU.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

A quantitative performance analysis model for GPU architectures.
Proceedings of the 17th International Conference on High-Performance Computer Architecture (HPCA-17 2011), 2011

Compute & memory optimizations for high-quality speech recognition on low-end GPU processors.
Proceedings of the 18th International Conference on High Performance Computing, 2011

Lessons Learned from Exploring the Backtracking Paradigm on the GPU.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Register packing for cyclic reduction: a case study.
Proceedings of 4th Workshop on General Purpose Processing on Graphics Processing Units, 2011

Fragment-Parallel Composite and Filter.
Comput. Graph. Forum, 2010

Fast tridiagonal solvers on the GPU.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

Toward Techniques for Auto-tuning GPU Algorithms.
Proceedings of the Applied Parallel and Scientific Computing, 2010

Multi-GPU volume rendering using MapReduce.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010

GPU-to-CPU Callbacks.
Proceedings of the Euro-Par 2010 Parallel Processing Workshops, 2010

Task management for irregular-parallel workloads on the GPU.
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on High Performance Graphics 2010, 2010

A Template-Based Approach for Real-Time Speed-Limit-Sign Recognition on an Embedded System Using GPU Computing.
Proceedings of the Pattern Recognition, 2010

Efficient Parallel Scan Algorithms for Manycore GPUs.
Proceedings of the Scientific Computing with Multicore and Accelerators, 2010

Real-time parallel hashing on the GPU.
ACM Trans. Graph., 2009

Out-of-core Data Management for Path Tracing on Hybrid Resources.
Comput. Graph. Forum, 2009

Data Parallel Bin-Based Indexing for Answering Queries on Multi-core Architectures.
Proceedings of the Scientific and Statistical Database Management, 2009

Message passing on data-parallel architectures.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

HCW 2009 keynote talk: GPU computing: Heterogeneous computing for future systems.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Parallel view-dependent tessellation of Catmull-Clark subdivision surfaces.
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on High Performance Graphics 2009, 2009

Three-layer optimizations for fast GMM computations on GPU-like parallel processors.
Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, 2009

Real-time Reyes-style adaptive surface subdivision.
ACM Trans. Graph., 2008

GPU Computing.
Proc. IEEE, 2008

Distributed Texture Memory in a Multi-GPU Environment.
Comput. Graph. Forum, 2008

Parallel programming models overview.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2008

Beyond programmable shading: fundamentals.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2008

Efficient computation of sum-products on GPUs through software-managed cache.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Fast Deformable Registration on the GPU: A CUDA Implementation of Demons.
Proceedings of the Selected Papers of the Sixth International Conference on Computational Sciences and Its Applications, 2008

Resolution-matched shadow maps.
ACM Trans. Graph., 2007

Research Challenges for On-Chip Interconnection Networks.
IEEE Micro, 2007

Data-parallel algorithms and data structures.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2007

GPU architecture overview.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2007

Scan primitives for GPU computing.
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware 2007, 2007

Discrete Sibson Interpolation.
IEEE Trans. Vis. Comput. Graph., 2006

Glift: Generic, efficient, random-access GPU data structures.
ACM Trans. Graph., 2006

S07 - GPGPU: general-purpose computation on graphics hardware.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

The Virtual Pheromone Communication Primitive.
Proceedings of the Distributed Computing in Sensor Systems, 2006

General Purpose Computation on Graphics Hardware.
Proceedings of the 16th IEEE Visualization Conference, 2005

Streaming architectures and technology trends.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2005

Dynamic adaptive shadow maps on graphics hardware.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2005

Octree textures on graphics hardware.
Proceedings of the International Conference on Computer Graphics and Interactive Techniques, 2005

A Survey of General-Purpose Computation on Graphics Hardware.
Proceedings of the 26th Annual Conference of the European Association for Computer Graphics, 2005

Mio: fast multipass partitioning via priority-based instruction scheduling.
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware 2004, 2004

Programmable Stream Processors.
Computer, 2003

Exploring the VLSI Scalability of Stream Processors.
Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA'03), 2003

A Stream Processor Development Platform.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

Media Processing Applications on the Imagine Stream Processor.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

The Imagine Stream Processor.
Proceedings of the 20th International Conference on Computer Design (ICCD 2002), 2002

Comparing Reyes and OpenGL on a Stream Architecture.
Proceedings of the 2002 ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002

Imagine: Media Processing with Streams.
IEEE Micro, 2001

Efficient conditional operations for data-parallel architectures.
Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, 2000

Memory access scheduling.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

Register Organization for Media Processing.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

Polygon Rendering on a Stream Architecture.
Proceedings of the 2000 ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware, 2000

Communication Scheduling.
Proceedings of ASPLOS-IX, the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000

A Bandwidth-efficient Architecture for Media Processing.
Proceedings of the 31st Annual IEEE/ACM International Symposium on Microarchitecture, 1998
