2022
MT-3000: a heterogeneous multi-zone processor for HPC.
CCF Trans. High Perform. Comput., 2022
2021
Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions.
CCF Trans. High Perform. Comput., 2021
2019
Pair-HMM accelerator based on non-cooperative structure.
IEICE Electron. Express, 2019
MT-DMA: A DMA Controller Supporting Efficient Matrix Transposition for Digital Signal Processing.
IEEE Access, 2019
An Efficient Direct Memory Access (DMA) Controller for Scientific Computing Accelerators.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2019
Efficient Large-Scale 1D FFT Vectorization on Multi-Core Vector Accelerator.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019
2018
A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition.
IEEE Trans. Very Large Scale Integr. Syst., 2018
2017
Low Latency and Low Error Floating-Point Sine/Cosine Function Based TCORDIC Algorithm.
IEEE Trans. Circuits Syst. I Regul. Pap., 2017
Platform-Adaptive High-Throughput Surveillance Video Condensation on Heterogeneous Processor Clusters.
Proceedings of the Advanced Parallel Processing Technologies, 2017
2016
Classification of Hyperspectral Remote Sensing Image Using Hierarchical Local-Receptive-Field-Based Extreme Learning Machine.
IEEE Geosci. Remote. Sens. Lett., 2016
An efficient and effective convolutional auto-encoder extreme learning machine network for 3d feature learning.
Neurocomputing, 2016
PR-ELM: Parallel regularized extreme learning machine based on cluster.
Neurocomputing, 2016
Multi-bit transient fault control for NoC links using 2D fault coding method.
Proceedings of the Tenth IEEE/ACM International Symposium on Networks-on-Chip, 2016
Single/Double Precision Floating-Point Division and Square Root Unit Based on SRT-8 Algorithm.
Proceedings of the Computer Engineering and Technology - 20th CCF Conference, 2016
2015
A deeply-pipelined FPGA-based SpMV accelerator with a hardware-friendly storage scheme.
IEICE Electron. Express, 2015
An efficient multi-standard QC-LDPC decoder based on the row-layered decoding algorithm.
IEICE Electron. Express, 2015
Accelerating Molecular Dynamics Simulations on Heterogeneous Architecture.
Proceedings of the Computer Engineering and Technology - 19th CCF Conference, 2015
Designing Parallel Sparse Matrix Transposition Algorithm Using ELLPACK-R for GPUs.
Proceedings of the Computer Engineering and Technology - 19th CCF Conference, 2015
2014
FPGA Implementation of a Special-Purpose VLIW Structure for Double-Precision Elementary Function.
ACM Trans. Reconfigurable Technol. Syst., 2014
Transpose-free variable-size FFT accelerator based on-chip SRAM.
IEICE Electron. Express, 2014
CPU-GPU hybrid parallel strategy for cosmological simulations.
Concurr. Comput. Pract. Exp., 2014
2013
FPGA implementation of an exact dot product and its application in variable-precision floating-point arithmetic.
J. Supercomput., 2013
VLIW coprocessor for IEEE-754 quadruple-precision elementary functions.
ACM Trans. Archit. Code Optim., 2013
Window Memory Layout Scheme for Alternate Row-Wise/Column-Wise Matrix Access.
IEICE Trans. Inf. Syst., 2013
2012
Design and Implementation of the Parameterized Multi-Standard High-Throughput Radix-4 Viterbi Decoder on FPGA.
IEICE Trans. Commun., 2012
2011
FPGA-Specific Custom VLIW Architecture for Arbitrary Precision Floating-Point Arithmetic.
IEICE Trans. Inf. Syst., 2011
Special-purposed VLIW architecture for IEEE-754 quadruple precision elementary functions on FPGA.
Proceedings of the IEEE 29th International Conference on Computer Design, 2011
VPFPAP: A Special-Purpose VLIW Processor for Variable-Precision Floating-Point Arithmetic.
Proceedings of the International Conference on Field Programmable Logic and Applications, 2011
FPGA Implementation of Variable-Precision Floating-Point Arithmetic.
Proceedings of the Advanced Parallel Processing Technologies - 9th International Symposium, 2011
2010
A Unified Co-Processor Architecture for Matrix Decomposition.
J. Comput. Sci. Technol., 2010
FPGA accelerating double/quad-double high precision floating-point applications for ExaScale computing.
Proceedings of the 24th International Conference on Supercomputing, 2010
2009
FPGA accelerating three QR decomposition algorithms in the unified pipelined framework.
Proceedings of the 19th International Conference on Field Programmable Logic and Applications, 2009
A Fine-grained Pipelined Implementation of the LINPACK Benchmark on FPGAs.
Proceedings of the FCCM 2009, 2009
A Fine-Grained Pipelined Implementation for Large-Scale Matrix Inversion on FPGA.
Proceedings of the Advanced Parallel Processing Technologies, 8th International Symposium, 2009
2008
Dynamic Configurable Floating-Point FFT Pipelines and Hybrid-Mode CORDIC on FPGA.
Proceedings of the International Conference on Embedded Software and Systems, 2008
Double Precision Hybrid-Mode Floating-Point FPGA CORDIC Co-processor.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, 2008
Hybrid-Mode Floating-Point FPGA CORDIC Co-processor.
Proceedings of the Reconfigurable Computing: Architectures, 2008
Area and throughput trade-offs in design of arithmetic encoder for JPEG2000.
Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, 2008
2007
FPGA SAR Processor with Window Memory Accesses.
Proceedings of the IEEE International Conference on Application-Specific Systems, 2007