Alexander Heinecke

According to our database, Alexander Heinecke authored at least 72 papers between 2007 and 2024.

Bibliography

2024
Towards a high-performance AI compiler with upstream MLIR.
CoRR, 2024

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

2023
Microscaling Data Formats for Deep Learning.
CoRR, 2023

Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures.
CoRR, 2023

2022
Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning and HPC Workloads.
Frontiers Appl. Math. Stat., 2022

FP8 Formats for Deep Learning.
CoRR, 2022

FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems.
CoRR, 2022

FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems.
IEEE Comput. Archit. Lett., 2022

Accelerating Deep Learning based Identification of Chromatin Accessibility from noisy ATAC-seq Data.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2022

Next-Generation Local Time Stepping for the ADER-DG Finite Element Method.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

2021
PolyDL: Polyhedral Optimizations for Creation of High-performance DL Primitives.
ACM Trans. Archit. Code Optim., 2021

Efficient and Generic 1D Dilated Convolution Layer for Deep Learning.
CoRR, 2021

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads.
CoRR, 2021

DistGNN: scalable distributed training for large-scale graph neural networks.
Proceedings of the International Conference for High Performance Computing, 2021

Tensor processing primitives: a programming abstraction for efficiency and portability in deep learning workloads.
Proceedings of the International Conference for High Performance Computing, 2021

2020
PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives.
CoRR, 2020

Performance study of sustained petascale direct numerical simulation on Cray XC40 systems.
Concurr. Comput. Pract. Exp., 2020

Optimizing deep learning recommender systems training on CPU cluster architectures.
Proceedings of the International Conference for High Performance Computing, 2020

Harnessing Deep Learning via a Single Building Block.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

2019
Optimizing Deep Learning RNN Topologies on Intel Architecture.
Supercomput. Front. Innov., 2019

Tensor-optimized hardware accelerates fused discontinuous Galerkin simulations.
Parallel Comput., 2019

Training Neural Machine Translation (NMT) Models using Tensor Train Decomposition on TensorFlow (T3F).
CoRR, 2019

High-Performance Deep Learning via a Single Building Block.
CoRR, 2019

A Study of BFLOAT16 for Deep Learning Training.
CoRR, 2019

Petaflop Seismic Simulations in the Public Cloud.
Proceedings of the High Performance Computing - 34th International Conference, 2019

Training Google Neural Machine Translation on an Intel CPU Cluster.
Proceedings of the 2019 IEEE International Conference on Cluster Computing, 2019

ISA mapper: a compute and hardware agnostic deep learning compiler.
Proceedings of the 16th ACM International Conference on Computing Frontiers, 2019

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations.
Proceedings of the 26th IEEE Symposium on Computer Arithmetic, 2019

2018
Anatomy of high-performance deep learning convolutions on SIMD architectures.
Proceedings of the International Conference for High Performance Computing, 2018

Mixed Precision Training of Convolutional Neural Networks using Integer Operations.
Proceedings of the 6th International Conference on Learning Representations, 2018

2017
Accelerating Seismic Simulations Using the Intel Xeon Phi Knights Landing Processor.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

EDGE: Extreme Scale Fused Seismic Simulations with the Discontinuous Galerkin Method.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

2016
Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors.
Int. J. High Perform. Comput. Appl., 2016

Data mining on vast data sets as a cluster system benchmark.
Concurr. Comput. Pract. Exp., 2016

Efficiency of High Order Spectral Element Methods on Petascale Architectures.
Proceedings of the High Performance Computing - 31st International Conference, 2016

High Order Seismic Simulations on the Intel Xeon Phi Processor (Knights Landing).
Proceedings of the High Performance Computing - 31st International Conference, 2016

LIBXSMM: accelerating small matrix multiplications by runtime code generation.
Proceedings of the International Conference for High Performance Computing, 2016

Petascale Local Time Stepping for the ADER-DG Finite Element Method.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

2015
Supercomputing for Molecular Dynamics Simulations - Handling Multi-Trillion Particles in Nanofluidics.
Springer Briefs in Computer Science, Springer, ISBN: 978-3-319-17148-7, 2015

Beacon: Deployment and Application of Intel Xeon Phi Coprocessors for Scientific Computing.
Comput. Sci. Eng., 2015

Cache-oblivious matrix algorithms in the age of multicores and many cores.
Concurr. Comput. Pract. Exp., 2015

High-Order ADER-DG Minimizes Energy- and Time-to-Solution of SeisSol.
Proceedings of the High Performance Computing - 30th International Conference, 2015

Full correlation matrix analysis of fMRI data on Intel® Xeon Phi™ coprocessors.
Proceedings of the International Conference for High Performance Computing, 2015

Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Optimized Force Calculation in Molecular Dynamics Simulations for the Intel Xeon Phi.
Proceedings of the Euro-Par 2015: Parallel Processing Workshops, 2015

2014
Boosting Scientific Computing Applications through Leveraging Data Parallel Architectures.
PhD thesis, 2014

ls1 mardyn: The massively parallel molecular dynamics code for large systems.
CoRR, 2014

Parallelizing a Black-Scholes solver based on finite elements and sparse grids.
Concurr. Comput. Pract. Exp., 2014

Sustained Petascale Performance of Seismic Simulations with SeisSol on SuperMUC.
Proceedings of the Supercomputing - 29th International Conference, 2014

Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices.
Proceedings of the International Conference for High Performance Computing, 2014

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers.
Proceedings of the International Conference for High Performance Computing, 2014

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

2013
Emerging Architectures Enable to Boost Massively Parallel Data Mining Using Adaptive Sparse Grids.
Int. J. Parallel Program., 2013

591 TFLOPS Multi-trillion Particles Simulation on SuperMUC.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Many-core architectures boost the pricing of basket options on adaptive sparse grids.
Proceedings of WHPCF'13: 6th Workshop on High Performance Computational Finance, 2013

Accelerating SeisSol by Generating Vectorized Code for Sparse Matrix Operators.
Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Accelerators in scientific computing is it worth the effort?
Proceedings of the International Conference on High Performance Computing & Simulation, 2013

2012
Option pricing with a direct adaptive sparse grid approach.
J. Comput. Appl. Math., 2012

A highly parallel Black-Scholes solver based on adaptive sparse grids.
Int. J. Comput. Math., 2012

From GPGPU to Many-Core: Nvidia Fermi and Intel Many Integrated Core Architecture.
Comput. Sci. Eng., 2012

Exploiting State-of-the-Art x86 Architectures in Scientific Computing.
Proceedings of the 11th International Symposium on Parallel and Distributed Computing, 2012

HPCS 2012 panels: Panel I: Energy efficient systems in next generation high performance data and compute centers.
Proceedings of the 2012 International Conference on High Performance Computing & Simulation, 2012

Sparse grid classifiers as base learners for AdaBoost.
Proceedings of the 2012 International Conference on High Performance Computing & Simulation, 2012

Solving High-Dimensional Problems on Processors with Integrated GPU.
Proceedings of the Facing the Multicore-Challenge, 2012

An efficient vectorization of linked-cell particle simulations.
Proceedings of the Computing Frontiers Conference, CF'12, 2012

2011
Making TifaMMy fit for tomorrow: Towards future shared memory systems and beyond.
Proceedings of the 2011 International Conference on High Performance Computing & Simulation, 2011

Towards High-Performance Implementations of a Custom HPC Kernel Using Intel® Array Building Blocks.
Proceedings of the Facing the Multicore - Challenge II, 2011

Extending a Highly Parallel Data Mining Algorithm to the Intel® Many Integrated Core Architecture.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Multi- and many-core data mining with adaptive sparse grids.
Proceedings of the 8th Conference on Computing Frontiers, 2011

2010
Porting existing cache-oblivious linear algebra HPC modules to larrabee architecture.
Proceedings of the 7th Conference on Computing Frontiers, 2010

2007
Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves.
Proceedings of the Parallel Processing and Applied Mathematics, 2007

