Adrián Castelló

Orcid: 0000-0002-8576-8451

Affiliations:
  • Universitat Jaume I de Castello, Spain


According to our database, Adrián Castelló authored at least 53 papers between 2014 and 2025.

Bibliography

2025
Experience-guided, mixed-precision matrix multiplication with Apache TVM for ARM processors.
J. Supercomput., January, 2025

2024
Communication-Avoiding Fusion of GEMM-Based Convolutions for Deep Learning in the RISC-V GAP8 MCU.
IEEE Internet Things J., November, 2024

Automatic generation of ARM NEON micro-kernels for matrix multiplication.
J. Supercomput., July, 2024

Parallel GEMM-based convolution for deep learning on multicore RISC-V processors.
J. Supercomput., June, 2024

Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM.
ACM Trans. Math. Softw., March, 2024

RED-SEA Project: Towards a new-generation European interconnect.
Microprocess. Microsystems, 2024

Parallel GEMM-based convolutions for deep learning on multicore ARM and RISC-V architectures.
J. Syst. Archit., 2024

Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors.
Int. J. High Perform. Comput. Appl., 2024

Inference with Transformer Encoders on ARM and RISC-V Multicore Processors.
Proceedings of the Euro-Par 2024: Parallel Processing, 2024

QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

Tackling the Matrix Multiplication Micro-Kernel Generation with Exo.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024

2023
Efficient and portable Winograd convolutions for multi-core processors.
J. Supercomput., July, 2023

Performance-energy trade-offs of deep learning convolution algorithms on ARM processors.
J. Supercomput., June, 2023

Micro-kernels for portable and efficient matrix multiplication in deep learning.
J. Supercomput., May, 2023

Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks.
Computing, May, 2023

Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs.
Computing, May, 2023

Reformulating the direct convolution for high-performance deep learning inference on ARM processors.
J. Syst. Archit., February, 2023

Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM.
CoRR, 2023

Automatic Generation of Micro-kernels for Performance Portability of Matrix Multiplication on RISC-V Vector Processors.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

2022
A BLIS-like matrix multiplication for machine learning in the RISC-V ISA-based GAP8 processor.
J. Supercomput., 2022

BestOf: an online implementation selector for the training and inference of deep neural networks.
J. Supercomput., 2022

High performance and energy efficient inference for deep learning on multicore ARM processors using general optimization techniques and BLIS.
J. Syst. Archit., 2022

Performance Analysis of Matrix Multiplication for Deep Learning on the Edge.
Proceedings of the High Performance Computing. ISC High Performance 2022 International Workshops - Hamburg, Germany, May 29, 2022

QR Factorization Using Malleable BLAS on Multicore Processors.
Proceedings of the High Performance Computing. ISC High Performance 2022 International Workshops - Hamburg, Germany, May 29, 2022

Towards Portable Realizations of Winograd-based Convolution with Vector Intrinsics and OpenMP.
Proceedings of the 30th Euromicro International Conference on Parallel, 2022

Anatomy of the BLIS Family of Algorithms for Matrix Multiplication.
Proceedings of the 30th Euromicro International Conference on Parallel, 2022


2021
PyDTNN: A user-friendly and extensible framework for distributed deep learning.
J. Supercomput., 2021

Acoustic Echo Cancellation using Residual U-Nets.
CoRR, 2021

High performance and energy efficient inference for deep learning on ARM processors.
CoRR, 2021

Accelerating distributed deep neural network training with pipelined MPI allreduce.
Clust. Comput., 2021

Evaluation of MPI Allreduce for Distributed Training of Convolutional Neural Networks.
Proceedings of the 29th Euromicro International Conference on Parallel, 2021

Performance Modeling for Distributed Training of Convolutional Neural Networks.
Proceedings of the 29th Euromicro International Conference on Parallel, 2021

A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

2020
Analysis of Threading Libraries for High Performance Computing.
IEEE Trans. Computers, 2020

High Performance and Portable Convolution Operators for ARM-based Multicore Processors.
CoRR, 2020

Programming parallel dense matrix factorizations with look-ahead and OpenMP.
Clust. Comput., 2020

High Performance and Portable Convolution Operators for Multicore Processors.
Proceedings of the 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, 2020

2019
Analysis of model parallelism for distributed neural networks.
Proceedings of the 26th European MPI Users' Group Meeting, 2019

Theoretical Scalability Analysis of Distributed Deep Convolutional Neural Networks.
Proceedings of the 19th IEEE/ACM International Symposium on Cluster, 2019

2018
Unification of Lightweight Thread Solutions and their Application in High Performance Programming.
PhD thesis, 2018

Argobots: A Lightweight Low-Level Threading and Tasking Framework.
IEEE Trans. Parallel Distributed Syst., 2018

Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models.
J. Supercomput., 2018

On the adequacy of lightweight thread approaches for high-level parallel programming models.
Future Gener. Comput. Syst., 2018

2017
GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations.
Proceedings of the 46th International Conference on Parallel Processing, 2017

GLT: A Unified API for Lightweight Thread Libraries.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

2016
A Review of Lightweight Thread Approaches for High Performance Computing.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

Enabling GPU Virtualization in Cloud Environments.
Proceedings of the CLOSER 2016, 2016

2015
Improving the user experience of the rCUDA remote GPU virtualization framework.
Concurr. Comput. Pract. Exp., 2015

Exploiting Task-Parallelism on GPU Clusters via OmpSs and rCUDA Virtualization.
Proceedings of the 2015 IEEE TrustCom/BigDataSE/ISPA, 2015

Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA.
Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2014
SLURM Support for Remote GPU Virtualization: Implementation and Performance Study.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Boosting the performance of remote GPU virtualization using InfiniBand Connect-IB and PCIe 3.0.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

