2025
Accelerating General Relativistic Radiation Magnetohydrodynamic Simulations with GPUs.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2025
2024
FCUFS: Core-Level Frequency Tuning for Energy Optimization on Intel Processors.
Proceedings of the IEEE International Conference on Cluster Computing, 2024
Preliminary Performance Evaluation of Grace-Hopper GH200.
Proceedings of the IEEE International Conference on Cluster Computing, 2024
2023
Efficient checkpoint/restart of CUDA applications.
Parallel Comput., 2023
2022
Efficient high-precision integer multiplication on the GPU.
Int. J. High Perform. Comput. Appl., 2022
Accelerating data transfer between host and device using idle GPU.
Proceedings of the 14th Workshop on General Purpose Processing Using GPU (GPGPU@PPoPP 2022), 2022
2021
Performance Optimization of Allreduce Operation for Multi-GPU Systems.
Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021
2019
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019
2018
Evaluating the SW26010 many-core processor with a micro-benchmark suite for performance optimizations.
Parallel Comput., 2018
MRG8: Random Number Generation for the Exascale Era.
Proceedings of the Platform for Advanced Scientific Computing Conference, 2018
Efficient Solving of Scan Primitive on Multi-GPU Systems.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018
Optimizing Preconditioned Conjugate Gradient on TaihuLight for OpenFOAM.
Proceedings of the 18th IEEE/ACM International Symposium on Cluster, 2018
2017
High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU.
Proceedings of the 46th International Conference on Parallel Processing, 2017
Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor.
Proceedings of the 46th International Conference on Parallel Processing, 2017
2016
Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU.
Proceedings of the International Conference on Computational Science 2016, 2016
2015
Efficient Execution of Multiple CUDA Applications Using Transparent Suspend, Resume and Migration.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015
Modeling Gather and Scatter with Hardware Performance Counters for Xeon Phi.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015
2014
Mixed-Precision AMG method for Many Core Accelerators.
Proceedings of the 21st European MPI Users' Group Meeting, 2014
Cache-aware sparse matrix formats for Kepler GPU.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014
TSUBAME-KFC: A modern liquid submersion cooling prototype towards exascale becoming the greenest supercomputer in the world.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014
2012
Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer.
Proceedings of the SC Conference on High Performance Computing Networking, 2012
High performance 3-D FFT using multiple CUDA GPUs.
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, 2012
2011
Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer.
Proceedings of the Conference on High Performance Computing Networking, 2011
NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011
Hamming Color Code for Dense and Robust One-shot 3D Scanning.
Proceedings of the British Machine Vision Conference, 2011
2010
High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning.
Comput. Sci. Res. Dev., 2010
An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code.
Proceedings of the Conference on High Performance Computing Networking, 2010
A high-performance fault-tolerant software framework for memory on commodity GPUs.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010
Linpack evaluation on a supercomputer with heterogeneous accelerators.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010
Low-overhead diskless checkpoint for hybrid computing systems.
Proceedings of the 2010 International Conference on High Performance Computing, 2010
Statistical power modeling of GPU kernels using performance counters.
Proceedings of the International Green Computing Conference 2010, 2010
Toward Automatic Performance Tuning for Numerical Simulations in the SILC Matrix Computation Framework.
Proceedings of the Software Automatic Tuning, From Concepts to State-of-the-Art Results, 2010
2009
Auto-tuning 3-D FFT library for CUDA GPUs.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009
Fast Conjugate Gradients with Multiple GPUs.
Proceedings of the Computational Science, 2009
Aspects of GPU for general purpose high performance computing.
Proceedings of the 14th Asia South Pacific Design Automation Conference, 2009
2008
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008
2007
Cloth Simulation in the SILC Matrix Computation Framework: A Case Study.
Proceedings of the Parallel Processing and Applied Mathematics, 2007
High Performance 3D Convolution for Protein Docking on IBM Blue Gene.
Proceedings of the Parallel and Distributed Processing and Applications, 2007
High Performance FFT on SGI Altix 3700.
Proceedings of the High Performance Computing and Communications, 2007
2006
Poster reception - Scalable software infrastructure project.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006
Distributed SILC: An Easy-to-Use Interface for MPI-Based Parallel Matrix Computation Libraries.
Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing, 2006
FFTSS: A High Performance Fast Fourier Transform Library.
Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing, 2006
2005
SILC: A Flexible and Environment-Independent Interface for Matrix Computation Libraries.
Proceedings of the Parallel Processing and Applied Mathematics, 2005
Performance Evaluation of Parallel Sparse Matrix-Vector Products on SGI Altix3700.
Proceedings of the OpenMP Shared Memory Parallel Programming - International Workshops, 2005