Toshiyuki Imamura

Orcid: 0000-0003-1601-9710

According to our database, Toshiyuki Imamura authored at least 76 papers between 2000 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit.
CoRR, 2024

2023
Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor.
Proceedings of the 16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2023

A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2023

2022
High Performance Parallel LOBPCG Method for Large Hamiltonian Derived from Hubbard Model on Multi-GPU Systems.
Proceedings of the Supercomputing Frontiers - 7th Asian Conference, 2022

GPU Optimization of Lattice Boltzmann Method with Local Ensemble Transform Kalman Filter.
Proceedings of the IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems, 2022

Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors.
Proceedings of the Parallel Processing and Applied Mathematics, 2022

2021
MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems.
CoRR, 2021

Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations.
Proceedings of the 12th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2021


Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs.
Proceedings of the 14th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2021

Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

A Rapid Euclidean Norm Calculation Algorithm that Reduces Overflow and Underflow.
Proceedings of the Computational Science and Its Applications - ICCSA 2021, 2021

2020
White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing.
CoRR, 2020

Error Analysis of the Cholesky QR-Based Block Orthogonalization Process for the One-Sided Block Jacobi SVD Algorithm.
Comput. Informatics, 2020

Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results?
Proceedings of the Software Verification - 12th International Conference, 2020

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions.
Proceedings of the High Performance Computing - 35th International Conference, 2020

Implementation and Numerical Techniques for One EFlop/s HPL-AI Benchmark on Fugaku.
Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2020

A 1024-member ensemble data assimilation with 3.5-km mesh global weather simulations.
Proceedings of the International Conference for High Performance Computing, 2020

Acceleration of fusion plasma turbulence simulations using the mixed-precision communication-avoiding Krylov method.
Proceedings of the International Conference for High Performance Computing, 2020

An FPGA-based Sound Field Rendering System.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

Prompt Report on Exa-Scale HPL-AI Benchmark.
Proceedings of the IEEE International Conference on Cluster Computing, 2020

2019
High Performance Eigenvalue Solver for Hubbard Model: Tuning Strategies for LOBPCG Method on CUDA GPU.
Proceedings of the Parallel Computing: Technology Trends, 2019

Design of an FPGA-Based Matrix Multiplier with Task Parallelism.
Proceedings of the Parallel Computing: Technology Trends, 2019

Batched 3D-Distributed FFT Kernels Towards Practical DNS Codes.
Proceedings of the Parallel Computing: Technology Trends, 2019

Cache-efficient implementation and batching of tridiagonalization on manycore CPUs.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2019

2018
High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner.
Proceedings of the Supercomputing Frontiers - 4th Asian Conference, 2018

Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code.
Proceedings of the Supercomputing Frontiers - 4th Asian Conference, 2018

Optimization of Reordering Procedures in HOTRG for Distributed Parallel Computing.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

A Case Study on Modeling the Performance of Dense Matrix Computation: Tridiagonalization in the EigenExa Eigensolver on the K Computer.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster.
Proceedings of the Computational Science - ICCS 2018, 2018

Performance Evaluation of a Toolkit for Sparse Tensor Decomposition.
Proceedings of the Poster Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

2017
Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms.
Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer.
Proceedings of the Parallel Processing and Applied Mathematics, 2017

Parallel Divide-and-Conquer Algorithm for Solving Tridiagonal Eigenvalue Problems on Manycore Systems.
Proceedings of the Parallel Processing and Applied Mathematics, 2017

Communication Avoiding Neumann Expansion Preconditioner for LOBPCG Method: Convergence Property of Exact Diagonalization Method for Hubbard Model.
Proceedings of the Parallel Computing is Everywhere, 2017

Design Towards Modern High Performance Numerical LA Library Enabling Heterogeneity and Flexible Data Formats.
Proceedings of the Parallel Computing is Everywhere, 2017

Quadruple-Precision BLAS Using Bailey's Arithmetic with FMA Instruction: Its Performance and Applications.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

An energy-efficient FPGA-based matrix multiplier.
Proceedings of the 24th IEEE International Conference on Electronics, Circuits and Systems, 2017

2016
Parallel implementation of 3D FFT with volumetric decomposition schemes for efficient molecular dynamics simulations.
Comput. Phys. Commun., 2016

Left-Preconditioned Communication-Avoiding Conjugate Gradient Methods for Multiphase CFD Simulations on the K Computer.
Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2016

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs.
Proceedings of the 10th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2016

Reduced-Precision Floating-Point Formats on GPUs for High Performance and Energy Efficient Computation.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015
Performance Analysis of the Chebyshev Basis Conjugate Gradient Method on the K Computer.
Proceedings of the Parallel Processing and Applied Mathematics, 2015

Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs.
Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

High Performance Eigenvalue Solver in Exact-diagonalization Method for Hubbard Model on CUDA GPU.
Proceedings of the Parallel Computing: On the Road to Exascale, 2015

CAHTR: Communication-Avoiding Householder TRidiagonalization.
Proceedings of the Parallel Computing: On the Road to Exascale, 2015

Performance Evaluation of the Eigen Exa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

2014
Implementation of d-Spline-based incremental performance parameter estimation method with ppOpen-AT.
Sci. Program., 2014

Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer.
Int. J. High Perform. Comput. Appl., 2014

Performance Analysis of the Householder-Type Parallel Tall-Skinny QR Factorizations Toward Automatic Algorithm Selection.
Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

A Study of Parallel Data Compression Using Proper Orthogonal Decomposition on the K Computer.
Proceedings of the 14th Eurographics Symposium on Parallel Graphics and Visualization, 2014

2013
Eigen-G: GPU-Based Eigenvalue Solver for Real-Symmetric Dense Matrices.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Parallel Computing Design for Exact Diagonalization Scheme on Multi-band Hubbard Cluster Models.
Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

Proper orthogonal decomposition based parallel compression for visualizing big data on the K computer.
Proceedings of the IEEE Symposium on Large-Scale Data Analysis and Visualization, 2013

2012
A High Performance SYMV Kernel on a Fermi-core GPU.
Proceedings of the High Performance Computing for Computational Science, 2012

Poster: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Poster: Communication Overlap Techniques for Improved Strong Scaling of Gyrokinetic Eulerian Code beyond 100k Cores on the K-Computer.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Communication Overlap Techniques for Improved Strong Scaling of Gyrokinetic Eulerian Code beyond 100k Cores on the K-Computer.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

2011
Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D quantum strongly-correlated systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

2010
High-Performance Quantum Simulation for Coupled Josephson Junctions on the Earth Simulator: a Challenge To the Schrödinger Equation On 256^4 Grids.
Int. J. High Perform. Comput. Appl., 2010

2009
Narrow-band reduction approach of a DRSM eigensolver on a multicore-based cluster system.
Proceedings of the Parallel Computing: From Multicores and GPU's to Petascale, 2009

2007
Recursive multi-factoring algorithm for MPI allreduce.
Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2007

2006
Gordon Bell finalists I - High-performance computing for exact numerical approaches to quantum many-body problems on the Earth Simulator.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

2005
16.447 TFlops and 159-Billion-dimensional Exact-diagonalization for Trapped Fermion-Hubbard Model on the Earth Simulator.
Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

10TFLOPS Eigenvalue Solver for Strongly-Correlated Fermions on the Earth Simulator.
Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

C-Stab: Cache Stabilizing Algorithm for a Numerical Library.
Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

An Evaluation Towards Automatically Tuned Eigensolvers.
Proceedings of the Large-Scale Scientific Computing, 5th International Conference, 2005

Automatic Tuning Technique Exploring Within the Hardware-Specific Constrained Parameters.
Proceedings of the Large-Scale Scientific Computing, 5th International Conference, 2005

16.14 TFLOPS Eigenvalue Solver on the Earth Simulator: Exact Diagonalization for Ultra Largescale Hamiltonian Matrix.
Proceedings of the High-Performance Computing - 6th International Symposium, 2005

2003
MPI-2 Support in Heterogeneous Computing Environment Using an SCore Cluster System.
Proceedings of the Parallel and Distributed Processing and Applications, 2003

A Visual Resource Integration Environment for Distributed Applications on the ITBL System.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

Grid Computing Supporting System on ITBL Project.
Proceedings of the High Performance Computing, 5th International Symposium, 2003

2002
Stampi-I/O: A Flexible Parallel-I/O Library for Heterogeneous Computing Environment.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29, 2002

2000
An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization.
J. Supercomput., 2000

An Architecture of Stampi: MPI Library on a Cluster of Parallel Computers.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2000

