Toshiyuki Imamura

CoRR, 2024

2023

Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor.

Masatoshi Kawai

Proceedings of the 16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2023

A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners.

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2023

2022

High Performance Parallel LOBPCG Method for Large Hamiltonian Derived from Hubbard Model on Multi-GPU Systems.

Proceedings of the Supercomputing Frontiers - 7th Asian Conference, 2022

GPU Optimization of Lattice Boltzmann Method with Local Ensemble Transform Kalman Filter.

Proceedings of the IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems, 2022

Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors.

Proceedings of the Parallel Processing and Applied Mathematics, 2022

2021

MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems.

CoRR, 2021

Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations.

Proceedings of the 12th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2021

MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems.

Proceedings of the IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, 2021

Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs.

Yusuke Hirota

Proceedings of the 14th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2021

Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme.

Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

A Rapid Euclidean Norm Calculation Algorithm that Reduces Overflow and Underflow.

Proceedings of the Computational Science and Its Applications - ICCSA 2021, 2021

2020

White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing.

CoRR, 2020

Error Analysis of the Cholesky QR-Based Block Orthogonalization Process for the One-Sided Block Jacobi SVD Algorithm.

Shuhei Kudo

Yusaku Yamamoto

Comput. Informatics, 2020

Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results?

Proceedings of the Software Verification - 12th International Conference, 2020

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions.

Proceedings of the High Performance Computing - 35th International Conference, 2020

Implementation and Numerical Techniques for One EFlop/s HPL-AI Benchmark on Fugaku.

Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2020

A 1024-member ensemble data assimilation with 3.5-km mesh global weather simulations.

Proceedings of the International Conference for High Performance Computing, 2020

Acceleration of fusion plasma turbulence simulations using the mixed-precision communication-avoiding Krylov method.

Proceedings of the International Conference for High Performance Computing, 2020

An FPGA-based Sound Field Rendering System.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

Prompt Report on Exa-Scale HPL-AI Benchmark.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

2019

High Performance Eigenvalue Solver for Hubbard Model: Tuning Strategies for LOBPCG Method on CUDA GPU.

Proceedings of the Parallel Computing: Technology Trends, 2019

Design of an FPGA-Based Matrix Multiplier with Task Parallelism.

Proceedings of the Parallel Computing: Technology Trends, 2019

Batched 3D-Distributed FFT Kernels Towards Practical DNS Codes.

Masaaki Aoki

Mitsuo Yokokawa

Proceedings of the Parallel Computing: Technology Trends, 2019

Cache-efficient implementation and batching of tridiagonalization on manycore CPUs.

Shuhei Kudo

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2019

2018

High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner.

Proceedings of the Supercomputing Frontiers - 4th Asian Conference, 2018

Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code.

Proceedings of the Supercomputing Frontiers - 4th Asian Conference, 2018

Optimization of Reordering Procedures in HOTRG for Distributed Parallel Computing.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

A Case Study on Modeling the Performance of Dense Matrix Computation: Tridiagonalization in the EigenExa Eigensolver on the K Computer.

Takeshi Fukaya

Yusaku Yamamoto

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster.

Proceedings of the Computational Science - ICCS 2018, 2018

Performance Evaluation of a Toolkit for Sparse Tensor Decomposition.

Poster Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

2017

Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms.

Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer.

Proceedings of the Parallel Processing and Applied Mathematics, 2017

Parallel Divide-and-Conquer Algorithm for Solving Tridiagonal Eigenvalue Problems on Manycore Systems.

Yusuke Hirota

Proceedings of the Parallel Processing and Applied Mathematics, 2017

Communication Avoiding Neumann Expansion Preconditioner for LOBPCG Method: Convergence Property of Exact Diagonalization Method for Hubbard Model.

Proceedings of the Parallel Computing is Everywhere, 2017

Design Towards Modern High Performance Numerical LA Library Enabling Heterogeneity and Flexible Data Formats.

Proceedings of the Parallel Computing is Everywhere, 2017

Quadruple-Precision BLAS Using Bailey's Arithmetic with FMA Instruction: Its Performance and Applications.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

An energy-efficient FPGA-based matrix multiplier.

Proceedings of the 24th IEEE International Conference on Electronics, Circuits and Systems, 2017

2016

Parallel implementation of 3D FFT with volumetric decomposition schemes for efficient molecular dynamics simulations.

Comput. Phys. Commun., 2016

Left-Preconditioned Communication-Avoiding Conjugate Gradient Methods for Multiphase CFD Simulations on the K Computer.

Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2016

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs.

Daisuke Takahashi

Proceedings of the 10th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2016

Reduced-Precision Floating-Point Formats on GPUs for High Performance and Energy Efficient Computation.

Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015

Performance Analysis of the Chebyshev Basis Conjugate Gradient Method on the K Computer.

Proceedings of the Parallel Processing and Applied Mathematics, 2015

Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs.

Daisuke Takahashi

Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

High Performance Eigenvalue Solver in Exact-diagonalization Method for Hubbard Model on CUDA GPU.

Proceedings of the Parallel Computing: On the Road to Exascale, 2015

CAHTR: Communication-Avoiding Householder TRidiagonalization.

Proceedings of the Parallel Computing: On the Road to Exascale, 2015

Performance Evaluation of the Eigen Exa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization.

Takeshi Fukaya

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

2014

Implementation of d-Spline-based incremental performance parameter estimation method with ppOpen-AT.

Sci. Program., 2014

Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer.

Int. J. High Perform. Comput. Appl., 2014

Performance Analysis of the Householder-Type Parallel Tall-Skinny QR Factorizations Toward Automatic Algorithm Selection.

Takeshi Fukaya

Yusaku Yamamoto

Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

A Study of Parallel Data Compression Using Proper Orthogonal Decomposition on the K Computer.

Proceedings of the 14th Eurographics Symposium on Parallel Graphics and Visualization, 2014

2013

Eigen-G: GPU-Based Eigenvalue Solver for Real-Symmetric Dense Matrices.

Proceedings of the Parallel Processing and Applied Mathematics, 2013

Parallel Computing Design for Exact Diagonalization Scheme on Multi-band Hubbard Cluster Models.

Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

Proper orthogonal decomposition based parallel compression for visualizing big data on the K computer.

Proceedings of the IEEE Symposium on Large-Scale Data Analysis and Visualization, 2013

2012

A High Performance SYMV Kernel on a Fermi-core GPU.

Proceedings of the High Performance Computing for Computational Science, 2012

Poster: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Poster: Communication Overlap Techniques for Improved Strong Scaling of Gyrokinetic Eulerian Code beyond 100k Cores on the K-Computer.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Communication Overlap Techniques for Improved Strong Scaling of Gyrokinetic Eulerian Code beyond 100k Cores on the K-Computer.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

2011

Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D quantum strongly-correlated systems.

Proceedings of the Conference on High Performance Computing Networking, 2011

2010

High-Performance Quantum Simulation for Coupled Josephson Junctions on the Earth Simulator: a Challenge To the Schrödinger Equation On 256^4 Grids.

Int. J. High Perform. Comput. Appl., 2010

2009

Narrow-band reduction approach of a DRSM eigensolver on a multicore-based cluster system.

Proceedings of the Parallel Computing: From Multicores and GPU's to Petascale, 2009

2007

Recursive multi-factoring algorithm for MPI allreduce.

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2007

2006

Gordon Bell finalists I - High-performance computing for exact numerical approaches to quantum many-body problems on the earth simulator.

Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

2005

16.447 TFlops and 159-Billion-dimensional Exact-diagonalization for Trapped Fermion-Hubbard Model on the Earth Simulator.

Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

10TFLOPS Eigenvalue Solver for Strongly-Correlated Fermions on the Earth Simulator.

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

C-Stab: Cache Stabilizing Algorithm for a Numerical Library.

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

An Evaluation Towards Automatically Tuned Eigensolvers.

Ken Naono

Proceedings of the Large-Scale Scientific Computing, 5th International Conference, 2005

Automatic Tuning Technique Exploring Within the Hardware-Specific Constrained Parameters.

Ken Naono

Proceedings of the Large-Scale Scientific Computing, 5th International Conference, 2005

16.14 TFLOPS Eigenvalue Solver on the Earth Simulator: Exact Diagonalization for Ultra Largescale Hamiltonian Matrix.

Proceedings of the High-Performance Computing - 6th International Symposium, 2005

2003

MPI-2 Support in Heterogeneous Computing Environment Using an SCore Cluster System.

Proceedings of the Parallel and Distributed Processing and Applications, 2003

A Visual Resource Integration Environment for Distributed Applications on the ITBL System.

Proceedings of the High Performance Computing, 5th International Symposium, 2003

Grid Computing Supporting System on ITBL Project.

Proceedings of the High Performance Computing, 5th International Symposium, 2003

2002

Stampi-I/O: A Flexible Parallel-I/O Library for Heterogeneous Computing Environment.

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29, 2002

2000

An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization.

J. Supercomput., 2000

An Architecture of Stampi: MPI Library on a Cluster of Parallel Computers.
