2025

Accelerating General Relativistic Radiation Magnetohydrodynamic Simulations with GPUs.

[DOI]

Ryohei Kobayashi

Hiroyuki R. Takahashi

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2025

2024

FCUFS: Core-Level Frequency Tuning for Energy Optimization on Intel Processors.

[DOI]

Hongjian Zhang

Akira Nukada

Qiucheng Liao

Proceedings of the IEEE International Conference on Cluster Computing, 2024

Preliminary Performance Evaluation of Grace-Hopper GH200.

[DOI]

Proceedings of the IEEE International Conference on Cluster Computing, 2024

2023

Efficient checkpoint/Restart of CUDA applications.

[DOI]

Akira Nukada

Taichiro Suzuki

Satoshi Matsuoka

Parallel Comput., 2023

2022

Efficient high-precision integer multiplication on the GPU.

[DOI]

Int. J. High Perform. Comput. Appl., 2022

Accelerating data transfer between host and device using idle GPU.

[DOI]

Yuya Tatsugi

Akira Nukada

Proceedings of the 14th Workshop on General Purpose Processing Using GPU (GPGPU@PPoPP 2022), 2022

2021

Performance Optimization of Allreduce Operation for Multi-GPU Systems.

[DOI]

Akira Nukada

Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021

2019

Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks.

[DOI]

Proceedings of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2019

2018

Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations.

[DOI]

Parallel Comput., 2018

MRG8: Random Number Generation for the Exascale Era.

[DOI]

Proceedings of the Platform for Advanced Scientific Computing Conference, 2018

Efficient Solving of Scan Primitive on Multi-GPU Systems.

[DOI]

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Optimizing Preconditioned Conjugate Gradient on TaihuLight for OpenFOAM.

[DOI]

Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2018

2017

High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU.

[DOI]

Yusuke Nagasaka

Akira Nukada

Satoshi Matsuoka

Proceedings of the 46th International Conference on Parallel Processing, 2017

Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor.

[DOI]

Proceedings of the 46th International Conference on Parallel Processing, 2017

2016

Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU.

[DOI]

Yusuke Nagasaka

Akira Nukada

Satoshi Matsuoka

Proceedings of the International Conference on Computational Science 2016, 2016

2015

Efficient Execution of Multiple CUDA Applications Using Transparent Suspend, Resume and Migration.

[DOI]

Taichiro Suzuki

Akira Nukada

Satoshi Matsuoka

Proceedings of the Euro-Par 2015: Parallel Processing, 2015

Modeling Gather and Scatter with Hardware Performance Counters for Xeon Phi.

[DOI]

James Lin

Akira Nukada

Satoshi Matsuoka

Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015

2014

Mixed-Precision AMG method for Many Core Accelerators.

[DOI]

Proceedings of the 21st European MPI Users' Group Meeting, 2014

Cache-aware sparse matrix formats for Kepler GPU.

[DOI]

Yusuke Nagasaka

Akira Nukada

Satoshi Matsuoka

Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

TSUBAME-KFC: A modern liquid submersion cooling prototype towards exascale becoming the greenest supercomputer in the world.

[DOI]

Toshio Endo

Akira Nukada

Satoshi Matsuoka

Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

2012

Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer.

[DOI]

Akira Nukada

Kento Sato

Satoshi Matsuoka

Proceedings of the SC Conference on High Performance Computing Networking, Storage and Analysis, 2012

High performance 3-D FFT using multiple CUDA GPUs.

[DOI]

Akira Nukada

Yutaka Maruyama

Satoshi Matsuoka

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, 2012

2011

Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer.

[DOI]

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA.

[DOI]

Akira Nukada

Hiroyuki Takizawa

Satoshi Matsuoka

Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Hamming Color Code for Dense and Robust One-shot 3D Scanning.

[DOI]

Shuntaro Yamazaki

Akira Nukada

Masaaki Mochimaru

Proceedings of the British Machine Vision Conference, 2011

2010

High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning.

[DOI]

Ali Cevahir

Akira Nukada

Satoshi Matsuoka

Comput. Sci. Res. Dev., 2010

An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code.

[DOI]

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2010

A high-performance fault-tolerant software framework for memory on commodity GPUs.

[DOI]

Naoya Maruyama

Akira Nukada

Satoshi Matsuoka

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Linpack evaluation on a supercomputer with heterogeneous accelerators.

[DOI]

Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Low-overhead diskless checkpoint for hybrid computing systems.

[DOI]

Leonardo Arturo Bautista-Gomez

Proceedings of the 2010 International Conference on High Performance Computing, 2010

Statistical power modeling of GPU kernels using performance counters.

[DOI]

Proceedings of the International Green Computing Conference 2010, 2010

Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework.

[DOI]

Proceedings of the Software Automatic Tuning, From Concepts to State-of-the-Art Results, 2010

2009

Auto-tuning 3-D FFT library for CUDA GPUs.

[DOI]

Akira Nukada

Satoshi Matsuoka

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Fast Conjugate Gradients with Multiple GPUs.

[DOI]

Ali Cevahir

Akira Nukada

Satoshi Matsuoka

Proceedings of the Computational Science, 2009

Aspects of GPU for general purpose high performance computing.

[DOI]

Proceedings of the 14th Asia South Pacific Design Automation Conference, 2009

2008

Bandwidth intensive 3-D FFT kernel for GPUs using CUDA.

[DOI]

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

2007

Cloth Simulation in the SILC Matrix Computation Framework: A Case Study.

[DOI]

Proceedings of the Parallel Processing and Applied Mathematics, 2007

High Performance 3D Convolution for Protein Docking on IBM Blue Gene.

[DOI]

Proceedings of the Parallel and Distributed Processing and Applications, 2007

High Performance FFT on SGI Altix 3700.

[DOI]

Proceedings of the High Performance Computing and Communications, 2007

2006

Poster reception - Scalable software infrastructure project.

[DOI]

Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Distributed SILC: An Easy-to-Use Interface for MPI-Based Parallel Matrix Computation Libraries.

[DOI]

Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing, 2006

FFTSS: A High Performance Fast Fourier Transform Library.

[DOI]

Akira Nukada

Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing, 2006

2005

SILC: A Flexible and Environment-Independent Interface for Matrix Computation Libraries.

[DOI]

Proceedings of the Parallel Processing and Applied Mathematics, 2005

Performance Evaluation of Parallel Sparse Matrix-Vector Products on SGI Altix3700.

[DOI]

Proceedings of the OpenMP Shared Memory Parallel Programming - International Workshops, 2005