Gerhard Wellein
ORCID: 0000-0001-7371-3026
Affiliations:
- University of Erlangen-Nuremberg, Germany
According to our database, Gerhard Wellein authored at least 131 papers between 2002 and 2024.
Online presence:
- on orcid.org
- on hpc.fau.de
Bibliography
2024
Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa.
CoRR, 2024
Alya towards Exascale: Optimal OpenACC Performance of the Navier-Stokes Finite Element Assembly on GPUs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024
2023
MD-Bench: A performance-focused prototyping harness for state-of-the-art short-range molecular dynamics algorithms.
Future Gener. Comput. Syst., December, 2023
Making applications faster by asynchronous execution: Slowing down processes or relaxing MPI collectives.
Future Gener. Comput. Syst., November, 2023
J. Parallel Distributed Comput., March, 2023
IEEE Trans. Parallel Distributed Syst., February, 2023
The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs.
IEEE Trans. Parallel Distributed Syst., February, 2023
CoRR, 2023
MD-Bench: Engineering the in-core performance of short-range molecular dynamics kernels from state-of-the-art simulation packages.
CoRR, 2023
SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023
2022
Execution-Cache-Memory modeling and performance tuning of sparse matrix-vector multiplication and Lattice quantum chromodynamics on A64FX.
Concurr. Comput. Pract. Exp., 2022
Concurr. Comput. Pract. Exp., 2022
MD-Bench: A Generic Proxy-App Toolbox for State-of-the-Art Molecular Dynamics Algorithms.
Proceedings of the Parallel Processing and Applied Mathematics, 2022
Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications.
Proceedings of the Parallel Processing and Applied Mathematics, 2022
Proceedings of the SIGSIM-PADS '22: SIGSIM Conference on Principles of Advanced Discrete Simulation, Atlanta, GA, USA, June 8, 2022
2021
Int. J. High Perform. Comput. Appl., 2021
Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs.
Int. J. High Perform. Comput. Appl., 2021
Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact.
Proceedings of the High Performance Computing - 36th International Conference, 2021
Proceedings of the 33rd IEEE International Symposium on Computer Architecture and High Performance Computing, 2021
YaskSite: Stencil Optimization Techniques Applied to Explicit ODE Methods on Modern Architectures.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021
2020
EXASTEEL: Towards a Virtual Laboratory for the Multiscale Simulation of Dual-Phase Steel Using High-Performance Computing.
Proceedings of the Software for Exascale Computing - SPPEXA 2016-2019, 2020
Proceedings of the Software for Exascale Computing - SPPEXA 2016-2019, 2020
A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector Multiplication.
ACM Trans. Parallel Comput., 2020
ACM Trans. Math. Softw., 2020
Bridging the Architecture Gap: Abstracting Performance-Relevant Properties of Modern Server Processors.
Supercomput. Front. Innov., 2020
Int. J. High Perform. Comput. Appl., 2020
An analytic performance model for overlapping execution of memory-bound loop kernels on multicore CPUs.
CoRR, 2020
Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors.
Proceedings of the High Performance Computing - 35th International Conference, 2020
Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs.
Proceedings of the High Performance Computing - 35th International Conference, 2020
Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX.
Proceedings of the 2020 IEEE/ACM Performance Modeling, 2020
2019
CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance.
IEEE Trans. Parallel Distributed Syst., 2019
Supercomput. Front. Innov., 2019
Delay Propagation and Overlapping Mechanisms on Clusters: A Case Study of Idle Periods based on Workload, Communication, and Delay Granularity.
CoRR, 2019
CoRR, 2019
Proceedings of the 2019 IEEE/ACM Performance Modeling, 2019
Proceedings of the International Conference for High Performance Computing, 2019
Proceedings of the Parallel Processing and Applied Mathematics, 2019
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019
2018
Int. J. High Perform. Comput. Appl., 2018
Int. J. High Perform. Comput. Appl., 2018
CoRR, 2018
Proceedings of the High Performance Computing - 33rd International Conference, 2018
Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures.
Proceedings of the 2018 IEEE/ACM Performance Modeling, 2018
Multicore Performance Engineering of Sparse Triangular Solves Using a Modified Roofline Model.
Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018
2017
GHOST: Building Blocks for High Performance Sparse Linear Algebra on Heterogeneous Systems.
Int. J. Parallel Program., 2017
CoRR, 2017
Validation of hardware events for successful performance pattern identification in High Performance Computing.
CoRR, 2017
Performance analysis of the Kahan-enhanced scalar product on current multi-core and many-core processors.
Concurr. Comput. Pract. Exp., 2017
An Analysis of Core- and Chip-Level Architectural Features in Four Generations of Intel Server Processors.
Proceedings of the High Performance Computing - 32nd International Conference, 2017
LIKWID Monitoring Stack: A Flexible Framework Enabling Job Specific Performance monitoring for the masses.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017
2016
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016
Performance Engineering and Energy Efficiency of Building Blocks for Large, Sparse Eigenvalue Computations on Heterogeneous Supercomputers.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016
High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue computations.
J. Comput. Phys., 2016
Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors.
CoRR, 2016
Chip-level and multi-node analysis of energy-optimized lattice Boltzmann CFD simulations.
Concurr. Comput. Pract. Exp., 2016
Exploring performance and power properties of modern multi-core chips via simple machine models.
Concurr. Comput. Pract. Exp., 2016
Concurr. Comput. Pract. Exp., 2016
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016
Analysis of Intel's Haswell Microarchitecture Using the ECM Model and Microbenchmarks.
Proceedings of the Architecture of Computing Systems - ARCS 2016, 2016
2015
SIAM J. Sci. Comput., 2015
SIAM J. Sci. Comput., 2015
Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero.
CoRR, 2015
Performance analysis of the Kahan-enhanced scalar product on current multicore processors.
CoRR, 2015
Proceedings of the 6th International Workshop on Performance Modeling, 2015
Performance Analysis of the Kahan-Enhanced Scalar Product on Current Multicore Processors.
Proceedings of the Parallel Processing and Applied Mathematics, 2015
Proceedings of the Parallel Computing: On the Road to Exascale, 2015
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015
Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015
2014
A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units.
SIAM J. Sci. Comput., 2014
Modeling and analyzing performance for highly optimized propagation steps of the lattice Boltzmann method on sparse lattices.
CoRR, 2014
Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems.
CoRR, 2014
Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips.
Proceedings of the 2014 Workshop on Programming models for SIMD/Vector processing, 2014
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014
Performance Engineering for a Medical Imaging Application on the Intel Xeon Phi Accelerator.
Proceedings of the ARCS 2014, 2014
2013
Parallel Process. Lett., 2013
Pushing the limits for medical image reconstruction on recent standard multicore processors.
Int. J. High Perform. Comput. Appl., 2013
An analysis of energy-optimized lattice-Boltzmann CFD simulations from the chip to the highly parallel level
CoRR, 2013
CoRR, 2013
Comput. Math. Appl., 2013
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013
2012
Exploring performance and power properties of modern multicore chips via simple machine models
CoRR, 2012
Best practices for HPM-assisted performance engineering on modern multicore processors
CoRR, 2012
Proceedings of the Recent Advances in the Message Passing Interface, 2012
Sparse Matrix-vector Multiplication on GPGPU Clusters: A New Storage Format and a Scalable Implementation.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012
Performance Patterns and Hardware Metrics on Modern Multicore Processors: Best Practices for Performance Engineering.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012
2011
Hybrid-Parallel Sparse Matrix-Vector Multiplication with Explicit Communication Overlap on Current Multicore-Based Systems.
Parallel Process. Lett., 2011
A flexible Patch-based lattice Boltzmann parallelization approach for heterogeneous GPU-CPU clusters.
Parallel Comput., 2011
Efficient multicore-aware parallelization strategies for iterative stencil computations.
J. Comput. Sci., 2011
Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results
CoRR, 2011
Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations
CoRR, 2011
CoRR, 2011
Performance analysis and optimization strategies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA.
Adv. Eng. Softw., 2011
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011
likwid-bench: An Extensible Microbenchmarking Platform for x86 Multicore Compute Nodes.
Proceedings of the Tools for High Performance Computing 2011, 2011
Parallel Sparse Matrix-Vector Multiplication as a Test Case for Hybrid MPI+OpenMP Programming.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011
Chapman and Hall / CRC computational science series, CRC Press, ISBN: 978-1-439-81192-4, 2011
2010
Leveraging Shared Caches for Parallel Temporal Blocking of Stencil Codes on Multicore Processors and Clusters.
Parallel Process. Lett., 2010
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010
LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments.
Proceedings of the 39th International Conference on Parallel Processing, 2010
Proceedings of the Competence in High Performance Computing 2010, 2010
2009
Benchmark Analysis and Application Results for Lattice Boltzmann Simulations on NEC SX Vector and Intel Nehalem Systems.
Parallel Process. Lett., 2009
Multi-core architectures: Complexities of performance prediction and the impact of cache topology
CoRR, 2009
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009
Efficient Temporal Blocking for Stencil Computations by Multicore-Aware Wavefront Parallelization.
Proceedings of the 33rd Annual IEEE International Computer Software and Applications Conference, 2009
2008
Parallel Process. Lett., 2008
Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems.
Int. J. Comput. Sci. Eng., 2008
Data access optimizations for highly threaded multi-core CPUs with multiple memory controllers.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008
Vector Computers in a World of Commodity Clusters, Massively Parallel Systems and Many-Core Many-Threaded CPUs: Recent Experience Based on an Advanced Lattice Boltzmann Flow Solver.
Proceedings of the High Performance Computing in Science and Engineering '08, 2008
2007
Hierarchical hybrid grids: achieving TERAFLOP performance on large scale finite element simulations.
Int. J. Parallel Emergent Distributed Syst., 2007
RZBENCH: Performance evaluation of current HPC architectures using low-level and application benchmarks
CoRR, 2007
2004
Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures.
Proceedings of the ACM/IEEE SC2004 Conference on High Performance Networking and Computing, 2004
2003
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures.
Int. J. High Perform. Comput. Appl., 2003
Proceedings of the Modeling, 2003
Exact Numerical Treatment of Finite Quantum Systems Using Leading-Edge Supercomputers.
Proceedings of the Modeling, 2003
2002
Proceedings of the High Performance Computing for Computational Science, 2002