Samuel Williams

ORCID: 0000-0002-8327-5717

  • Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  • University of California at Berkeley, CA, USA (PhD 2008)

According to our database, Samuel Williams authored at least 110 papers between 2001 and 2024.

Evaluating the potential of disaggregated memory systems for HPC applications.
Concurr. Comput. Pract. Exp., August, 2024

Bricks: A high-performance portability layer for computations on block-structured grids.
Int. J. High Perform. Comput. Appl., 2024

LPSim: Large Scale Multi-GPU Parallel Computing based Regional Scale Traffic Simulation Framework.
CoRR, 2024

FTL: Transfer Learning Nonlinear Plasma Dynamic Transitions in Low Dimensional Embeddings via Deep Neural Networks.
CoRR, 2024

Comprehensive Performance Modeling and System Design Insights for Foundation Models.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Expediting Higher Fidelity Plasma State Reconstructions for the DIII-D National Fusion Facility Using Leadership Class Computing Resources.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

System-Wide Roofline Profiling - a Case Study on NERSC's Perlmutter Supercomputer.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

High-Performance, Scalable Geometric Multigrid via Fine-Grain Data Blocking for GPUs.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Performance Portable Optimizations of an Ice-sheet Modeling Code on GPU-supercomputers.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

A Workflow Roofline Model for End-to-End Workflow Performance Analysis.
Proceedings of the International Conference for High Performance Computing, 2024

BrickDL: Graph-Level Optimizations for DNNs with Fine-Grained Data Blocking on GPUs.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

Performance-Portable GPU Acceleration of the EFIT Tokamak Plasma Equilibrium Reconstruction Code.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Performance Portability Evaluation of Blocked Stencil Computations on GPUs.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters.
Proceedings of the International Conference for High Performance Computing, 2023

Evaluating the Performance of One-sided Communication on CPUs and GPUs.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

A Comprehensive Methodology to Optimize FPGA Designs via the Roofline Model.
IEEE Trans. Computers, 2022

Understanding the Impact of Input Entropy on FPU, CPU, and GPU Power.
CoRR, 2022

FPGA-based HPC accelerators: An evaluation on performance and energy efficiency.
Concurr. Comput. Pract. Exp., 2022

Instruction Roofline: An insightful visual performance model for GPUs.
Concurr. Comput. Pract. Exp., 2022

A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures.
Proceedings of the IEEE/ACM International Workshop on Performance Modeling, 2022

Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations.
Proceedings of the IEEE/ACM Workshop on Memory Centric High Performance Computing, 2022

Hierarchical Roofline Performance Analysis for Deep Learning Applications.
Proceedings of the Intelligent Computing, 2021

Improving communication by optimizing on-node data movement with data layout.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Architectural Requirements for Deep Learning Workloads in HPC Environments.
Proceedings of the 2021 International Workshop on Performance Modeling, 2021

Experiences Porting the SU3_Bench Microbenchmark to the Intel Arria 10 and Xilinx Alveo U280 FPGAs.
Proceedings of the IWOCL'21: International Workshop on OpenCL, Munich, Germany, April 2021

A Message-Driven, Multi-GPU Parallel Sparse Triangular Solver.
Proceedings of the 2021 SIAM Conference on Applied and Computational Discrete Algorithms, 2021

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system.
Concurr. Comput. Pract. Exp., 2020

Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight.
Clust. Comput., 2020

Timemory: Modular Performance Analysis for HPC.
Proceedings of the High Performance Computing - 35th International Conference, 2020

Time-Based Roofline for Deep Learning Performance Analysis.
Proceedings of the Fourth IEEE/ACM Workshop on Deep Learning on Supercomputers, 2020

Leveraging One-Sided Communication for Sparse Triangular Solvers.
Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing, 2020

The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing.
Proceedings of the 2020 IEEE/ACM Performance Modeling, 2020

Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches.
Proceedings of the 2020 IEEE/ACM Performance Modeling, 2020

A Case Study of Porting HPGMG from CUDA to OpenMP Target Offload.
Proceedings of the OpenMP: Portable Multi-Level Parallelism on Modern Systems, 2020

Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization.
Proceedings of the International Conference on Rebooting Computing, 2020

A CAD-based methodology to optimize HLS code via the Roofline model.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2020

AMReX: a framework for block-structured adaptive mesh refinement.
J. Open Source Softw., 2019

Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers.
Int. J. High Perform. Comput. Appl., 2019

Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs.
Proceedings of the International Conference for High Performance Computing, 2019

An Instruction Roofline Model for GPUs.
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019

Performance Analysis of GPU Programming Models Using the Roofline Scaling Trajectories.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2019

A Novel Multi-level Integrated Roofline Model Approach for Performance Characterization.
Proceedings of the High Performance Computing - 33rd International Conference, 2018

Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression.
Proceedings of the 2018 IEEE/ACM Performance Modeling, 2018

SIMD code generation for stencils on brick decompositions.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Roofline Scaling Trajectories: A Method for Parallel Application and Architectural Performance Analysis.
Proceedings of the 2018 International Conference on High Performance Computing & Simulation, 2018

A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations.
IEEE Trans. Parallel Distributed Syst., 2017

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers.
Parallel Comput., 2017

Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends.
J. Parallel Distributed Comput., 2017

Reaching bandwidth saturation using transparent injection parallelization.
Int. J. High Perform. Comput. Appl., 2017

Analyzing Performance of Selected NESAP Applications on the Cori HPC System.
Proceedings of the High Performance Computing, 2017

Performance Variability on Xeon Phi.
Proceedings of the High Performance Computing, 2017

Performance analysis and optimization of the RAMPAGE metal alloy potential generation software.
Proceedings of the 4th ACM SIGPLAN International Workshop on Software Engineering for Parallel Systems, 2017

Snowflake: A Lightweight Portable Stencil DSL.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

A Locality-Based Threading Algorithm for the Configuration-Interaction Method.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Simultaneously Solving Swarms of Small Sparse Systems on SIMD Silicon.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling.
SIAM J. Sci. Comput., 2016

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication.
SIAM J. Sci. Comput., 2016

Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor.
Proceedings of the High Performance Computing, 2016

Extreme scale plasma turbulence simulations on top supercomputers worldwide.
Proceedings of the International Conference for High Performance Computing, 2016

Experiences of Applying One-Sided Communication to Nearest-Neighbor Communication.
Proceedings of the 2016 PGAS Applications Workshop, 2016

OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms.
Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016

Parallel processing of filtered queries in attributed semantic graphs.
J. Parallel Distributed Comput., 2015

ExaSAT: An exascale co-design tool for performance modeling.
Int. J. High Perform. Comput. Appl., 2015

An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling.
CoRR, 2015

Parallel implementation and performance optimization of the configuration-interaction method.
Proceedings of the International Conference for High Performance Computing, 2015

Thread-level parallelization and optimization of NWChem for the Intel MIC architecture.
Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, 2015

Exploiting communication concurrency on high performance computing systems.
Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, 2015

Comparative Performance Analysis of Coarse Solvers for Algebraic Multigrid on Multicore and Manycore Architectures.
Proceedings of the Parallel Processing and Applied Mathematics, 2015

Compiler-Directed Transformation for Higher-Order Stencils.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Parallel Performance Optimizations on Unstructured Mesh-based Simulations.
Proceedings of the International Conference on Computational Science, 2015

Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

Evaluation of PGAS Communication Paradigms with Geometric Multigrid.
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, 2014

s-Step Krylov Subspace Methods as Bottom Solvers for Geometric Multigrid.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Collective memory transfers for multi-core chips.
Proceedings of the 2014 International Conference on Supercomputing, 2014

Analysis and tuning of libtensor framework on multicore architectures.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Analysis and optimization of gyrokinetic toroidal simulations on homogenous and heterogenous platforms.
Int. J. High Perform. Comput. Appl., 2013

Kinetic turbulence simulations at extreme scale on leadership-class systems.
Proceedings of the International Conference for High Performance Computing, 2013

Loop Chaining: A Programming Abstraction for Balancing Locality and Parallelism.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

High-Productivity and High-Performance Analysis of Filtered Semantic Graphs.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Compiler generation and autotuning of communication-avoiding operators for geometric multigrid.
Proceedings of the 20th Annual International Conference on High Performance Computing, 2013

Optimization of Parallel Particle-to-Grid Interpolation on Leading Multicore Platforms.
IEEE Trans. Parallel Distributed Syst., 2012

Optimization of geometric multigrid for emerging multi- and manycore processors.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Poster: Advances in Gyrokinetic Particle in Cell Simulation for Fusion Plasmas to Extreme Scale.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Advances in Gyrokinetic Particle in Cell Simulation for Fusion Plasmas to Extreme Scale.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

High-performance analysis of filtered semantic graphs.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms.
Parallel Comput., 2011

Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning.
Proceedings of the Conference on High Performance Computing Networking, 2011

Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

Hardware/software co-design for energy-efficient seismic modeling.
Proceedings of the Conference on High Performance Computing Networking, 2011

Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

An auto-tuning framework for parallel multicore stencil computations.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Sparse Matrix-Vector Multiplication on Multicore and Accelerators.
Proceedings of the Scientific Computing with Multicore and Accelerators, 2010

Auto-Tuning Stencil Computations on Multicore and Accelerators.
Proceedings of the Scientific Computing with Multicore and Accelerators, 2010

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors.
SIAM Rev., 2009

Optimization of sparse matrix-vector multiplication on emerging multicore platforms.
Parallel Comput., 2009

Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms.
J. Parallel Distributed Comput., 2009

The impact of IBM Cell technology on the programming paradigm in the context of computer systems for climate and weather models.
Concurr. Comput. Pract. Exp., 2009

Roofline: an insightful visual performance model for multicore architectures.
Commun. ACM, 2009

A design methodology for domain-optimized power-efficient supercomputing.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Improving Memory Subsystem Performance Using ViVA: Virtual Vector Architecture.
Proceedings of the Architecture of Computing Systems, 2009

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Lattice Boltzmann simulation optimization on leading multicore platforms.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Scientific Computing Kernels on the Cell Processor.
Int. J. Parallel Program., 2007

The potential of the cell processor for scientific computing.
Proceedings of the Third Conference on Computing Frontiers, 2006

Implicit and explicit optimizations for stencil computations.
Proceedings of the 2006 workshop on Memory System Performance and Correctness, 2006

Hardware/compiler codevelopment for an embedded media processor.
Proc. IEEE, 2001
