Naoya Maruyama

According to our database, Naoya Maruyama authored at least 65 papers between 2006 and 2024.


A Low Power ΔΣ Modulator with Low Voltage OTA for Wearable Applications.
Proceedings of the International Conference on Microelectronics, 2024

The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs With Hybrid Parallelism.
IEEE Trans. Parallel Distributed Syst., 2021

Co-design Center for Exascale Machine Learning Technologies (ExaLearn).
Int. J. High Perform. Comput. Appl., 2021

AIMES: Advanced Computation and I/O Methods for Earth-System Simulations.
Proceedings of the Software for Exascale Computing - SPPEXA 2016-2019, 2020


Channel and filter parallelism for large-scale CNN training.
Proceedings of the International Conference for High Performance Computing, 2019

Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Effective Quantization Approaches for Recurrent Neural Networks.
Proceedings of the 2018 International Joint Conference on Neural Networks, 2018

A Scalable Multi-Granular Data Model for Data Parallel Workflows.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2018

A Portability Layer of an All-pairs Operation for Hierarchical N-Body Algorithm Framework Tapas.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2018

Trends in Data Locality Abstractions for HPC Systems.
IEEE Trans. Parallel Distributed Syst., 2017

Efficient Breadth-First Search on Massively Parallel and Distributed-Memory Machines.
Data Sci. Eng., 2017

Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor.
Proceedings of the 46th International Conference on Parallel Processing, 2017

Evaluating high-level design strategies on FPGAs for high-performance computing.
Proceedings of the 27th International Conference on Field Programmable Logic and Applications, 2017

High-performance conjugate gradient performance improvement on the K computer.
Int. J. High Perform. Comput. Appl., 2016

Evaluating and optimizing OpenCL kernels for high performance computing with FPGAs.
Proceedings of the International Conference for High Performance Computing, 2016

Daino: a high-level framework for parallel and efficient AMR on GPUs.
Proceedings of the International Conference for High Performance Computing, 2016

Scaling FMM with Data-Driven OpenMP Tasks on Multicore Architectures.
Proceedings of the OpenMP: Memory, Devices, and Tasks, 2016

Tapas: An Implicitly Parallel Programming Framework for Hierarchical N-Body Algorithms.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

A Directive-Based Data Layout Abstraction for Performance Portability of OpenACC Applications.
Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, 2016

From FLOPS to BYTES: disruptive change in high-performance computing towards the post-moore era.
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, 2016

Extreme scale breadth-first search on supercomputers.
Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), 2016

Data-centric GPU-based adaptive mesh refinement.
Proceedings of the 5th Workshop on Irregular Applications - Architectures and Algorithms, 2015

PDSEC Keynote.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Scalable Kernel Fusion for Memory-Bound GPU Applications.
Proceedings of the International Conference for High Performance Computing, 2014

An OpenACC extension for data layout transformation.
Proceedings of the First Workshop on Accelerator Programming using Directives, 2014

Evaluation of Asynchronous MPI Communication in Map-Reduce System on the K Computer.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers.
Proceedings of the 14th IEEE/ACM International Symposium on Cluster, 2014

Fork-Join and Data-Driven Execution Models on Multi-core Architectures: Case Study of the FMM.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

Integrating Multi-GPU Execution in an OpenACC Compiler.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

Topic 15: GPU and Accelerator Computing - (Introduction).
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

Highly optimized full GPU-acceleration of non-hydrostatic weather model SCALE-LES.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

A Multi GPU Read Alignment Algorithm with Model-Based Performance Optimization.
Proceedings of the High Performance Computing for Computational Science, 2012

A Task Parallel Implementation of Fast Multipole Methods.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Design and modeling of a non-blocking checkpointing system.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Sequence Alignment on Massively Parallel Heterogeneous Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Multi-GPU Implementation of the NICAM Atmospheric Model.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Scalable Reed-Solomon-Based Reliable Local Storage for HPC Applications on IaaS Clouds.
Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

Hierarchical Clustering Strategies for Fault Tolerance in Large Scale HPC Systems.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Design and Implementation of Portable and Efficient Non-blocking Collective Communication.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

An exact algorithm for energy-efficient acceleration of task trees on CPU/GPU architectures.
Proceedings of SYSTOR 2011: The 4th Annual Haifa Experimental Systems Conference, Haifa, Israel, May 30, 2011

Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer.
Proceedings of the Conference on High Performance Computing Networking, 2011

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers.
Proceedings of the Conference on High Performance Computing Networking, 2011

Poster: fast GPU read alignment with burrows wheeler transform based index.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

FTI: high performance fault tolerance interface for hybrid systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

Model-based Fault Localization: Finding Behavioral Outliers in Large-scale Computing Systems.
New Gener. Comput., 2010

An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code.
Proceedings of the Conference on High Performance Computing Networking, 2010

A high-performance fault-tolerant software framework for memory on commodity GPUs.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Linpack evaluation on a supercomputer with heterogeneous accelerators.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Low-overhead diskless checkpoint for hybrid computing systems.
Proceedings of the 2010 International Conference on High Performance Computing, 2010

Statistical power modeling of GPU kernels using performance counters.
Proceedings of the International Green Computing Conference 2010, 2010

Distributed Diskless Checkpoint for Large Scale Systems.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

Adaptive Resource Indexing Technique for Unstructured Peer-to-Peer Networks.
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009

An efficient, model-based CPU-GPU heterogeneous FFT library.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Model-based fault localization in large-scale computing systems.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Access-pattern and bandwidth aware file replication algorithm in a grid environment.
Proceedings of the 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, September 29, 2008

Model-based resource selection for efficient virtual cluster deployment.
Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, 2007

Virtual Clusters on the Fly - Fast, Scalable, and Flexible Installation.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

Scalable systems software - Problem diagnosis in large-scale computing environments.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Making Wide-Area, Multi-site MPI Feasible Using Xen VM.
Proceedings of the Frontiers of High Performance Computing and Networking, 2006
