Francisco D. Igual

Orcid: 0000-0003-4480-9517

According to our database, Francisco D. Igual authored at least 103 papers between 2008 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.






Balanced segmentation of CNNs for multi-TPU inference.
J. Supercomput., January, 2025

Experience-guided, mixed-precision matrix multiplication with Apache TVM for ARM processors.
J. Supercomput., January, 2025

Automatic generation of ARM NEON micro-kernels for matrix multiplication.
J. Supercomput., July, 2024

Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM.
ACM Trans. Math. Softw., March, 2024

Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors.
Int. J. High Perform. Comput. Appl., 2024

Acceleration and energy consumption optimization in cascading classifiers for face detection on low-cost ARM big.LITTLE asymmetric architectures.
CoRR, 2024

Performance Analysis of BERT on RISC-V Processors with SIMD Units.
Proceedings of the High Performance Computing. ISC High Performance 2024 International Workshops, 2024

Inference with Transformer Encoders on ARM and RISC-V Multicore Processors.
Proceedings of the Euro-Par 2024: Parallel Processing, 2024

Micro-kernels for portable and efficient matrix multiplication in deep learning.
J. Supercomput., May, 2023

Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures.
J. Parallel Distributed Comput., May, 2023

Dynamic power budget redistribution under a power cap on multi-application environments.
Sustain. Comput. Informatics Syst., April, 2023

Algorithm 1033: Parallel Implementations for Computing the Minimum Distance of a Random Linear Code on Distributed-memory Architectures.
ACM Trans. Math. Softw., March, 2023

Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM.
CoRR, 2023

Co-Design of the Dense Linear Algebra Software Stack for Multicore Processors.
CoRR, 2023

Fine-grain task-parallel algorithms for matrix factorizations and inversion on many-threaded CPUs.
Concurr. Comput. Pract. Exp., 2023

Automatic Generation of Micro-kernels for Performance Portability of Matrix Multiplication on RISC-V Vector Processors.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

Improving inference time in multi-TPU systems with profiled model segmentation.
Proceedings of the 31st Euromicro International Conference on Parallel, 2023

Algorithm 1022: Efficient Algorithms for Computing a Rank-Revealing UTV Factorization on Parallel Computing Architectures.
ACM Trans. Math. Softw., 2022

QR Factorization Using Malleable BLAS on Multicore Processors.
Proceedings of the High Performance Computing. ISC High Performance 2022 International Workshops - Hamburg, Germany, May 29, 2022

NUMA-Aware Dense Matrix Factorizations and Inversion with Look-Ahead on Multicore Processors.
Proceedings of the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2022

Anatomy of the BLIS Family of Algorithms for Matrix Multiplication.
Proceedings of the 30th Euromicro International Conference on Parallel, 2022

Applying Game-Learning Environments to Power Capping Scenarios via Reinforcement Learning.
Proceedings of the Cloud Computing, Big Data & Emerging Topics - 10th Conference, 2022

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors.
J. Supercomput., 2021

Efficient algorithms for computing a rank-revealing UTV factorization on parallel computing architectures.
CoRR, 2021

A New Generation of Task-Parallel Algorithms for Matrix Inversion in Many-Threaded CPUs.
Proceedings of the PMAM@PPoPP 2021: Proceedings of the Twelfth International Workshop on Programming Models and Applications for Multicores and Manycores, 2021

Scalable Hybrid Loop- and Task-Parallel Matrix Inversion for Multicore Processors.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

Resource Management for Power-Constrained HEVC Transcoding Using Reinforcement Learning.
IEEE Trans. Parallel Distributed Syst., 2020

Integration and exploitation of intra-routine malleability in BLIS.
J. Supercomput., 2020

STEEL-RT: combining single task-single executor model and expanded scheduling to ease heterogeneity exploitation.
J. Supercomput., 2020

Leveraging knowledge-as-a-service (KaaS) for QoS-aware resource management in multi-user video transcoding.
J. Supercomput., 2020

Programming parallel dense matrix factorizations with look-ahead and OpenMP.
Clust. Comput., 2020

Towards a Malleable Tensorflow Implementation.
Proceedings of the Cloud Computing, Big Data & Emerging Topics - 8th Conference, 2020

Algorithm 994: Fast Implementations of the Brouwer-Zimmermann Algorithm for the Computation of the Minimum Distance of a Random Linear Code.
ACM Trans. Math. Softw., 2019

Variable intra-task threading for power-constrained performance and energy optimization in DAG scheduling.
J. Supercomput., 2019

Accelerating the SRP-PHAT algorithm on multi- and many-core platforms using OpenCL.
J. Supercomput., 2019

Portability Study of an OpenCL Algorithm for Automatic Target Detection in Hyperspectral Images.
IEEE Trans. Geosci. Remote. Sens., 2019

Practical Considerations for Acoustic Source Localization in the IoT Era: Platforms, Energy Efficiency, and Performance.
IEEE Internet Things J., 2019

Parallel Implementations for Computing the Minimum Distance of a Random Linear Code on Multicomputers.
CoRR, 2019

Detecting Time-Fragmented Cache Attacks Against AES Using Performance Monitoring Counters.
Proceedings of the 7th Conference on Cloud Computing & Big Data, 2019

MAMUT: Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-User Video Transcoding.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

Optimized Fundamental Signal Processing Operations For Energy Minimization on Heterogeneous Mobile Devices.
IEEE Trans. Circuits Syst. I Regul. Pap., 2018

Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors.
J. Comput. Sci., 2018

Acceleration and energy consumption optimization in cascading classifiers for face detection on low-cost ARM big.LITTLE asymmetric architectures.
Int. J. Circuit Theory Appl., 2018

Time and energy modeling of a high-performance multi-threaded Cholesky factorization.
J. Supercomput., 2017

Solving Weighted Least Squares (WLS) problems on ARM-based architectures.
J. Supercomput., 2017

Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations.
Parallel Comput., 2017

Performance-Power Evaluation of an OpenCL Implementation of the Simplex Growing Algorithm for Hyperspectral Unmixing.
IEEE Geosci. Remote. Sens. Lett., 2017

Energy Efficiency Optimization of Task-Parallel Codes on Asymmetric Architectures.
Proceedings of the 2017 International Conference on High Performance Computing & Simulation, 2017

Performance and Scalability Study of FMM Kernels on Novel Multi- and Many-core Architectures.
Proceedings of the International Conference on Computational Science, 2017

On the Use of a GPU-Accelerated Mobile Device Processor for Sound Source Localization.
Proceedings of the International Conference on Computational Science, 2017

The BLIS Framework: Experiments in Portability.
ACM Trans. Math. Softw., 2016

Analytical Modeling Is Enough for High-Performance BLIS.
ACM Trans. Math. Softw., 2016

Fast Algorithms for the Computation of the Minimum Distance of a Random Linear Code.
CoRR, 2016

Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors.
Clust. Comput., 2016

Refactoring Conventional Task Schedulers to Exploit Asymmetric ARM big.LITTLE Architectures in Dense Linear Algebra.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

HeSP: A Simulation Framework for Solving the Task Scheduling-Partitioning Problem on Heterogeneous Architectures.
Proceedings of the Euro-Par 2016: Parallel Processing, 2016

Time and energy modeling of high-performance Level-3 BLAS on x86 architectures.
Simul. Model. Pract. Theory, 2015

Speeding up the log-polar transform with inexpensive parallel hardware: graphics units and multi-core architectures.
J. Real Time Image Process., 2015

Accelerating fluid-solid simulations (Lattice-Boltzmann & Immersed-Boundary) on heterogeneous architectures.
J. Comput. Sci., 2015

A power measurement environment for PCIe accelerators.
Comput. Sci. Res. Dev., 2015

Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra.
CoRR, 2015

Performance and Energy Optimization of Matrix Multiplication on Asymmetric big.LITTLE Processors.
CoRR, 2015

Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors.
CoRR, 2015

Non-negative Matrix Factorization on Low-Power Architectures and Accelerators: A Comparative Study.
Comput. Electr. Eng., 2015

Balancing task- and data-level parallelism to improve performance and energy consumption of matrix computations on the Intel Xeon Phi.
Comput. Electr. Eng., 2015

Vectorization of binaural sound virtualization on the ARM Cortex-A15 architecture.
Proceedings of the 23rd European Signal Processing Conference, 2015

Hyperspectral Unmixing on Multicore DSPs: Trading Off Performance for Energy.
IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., 2014

Enhancing performance and energy consumption of runtime schedulers for dense linear algebra.
Concurr. Comput. Pract. Exp., 2014

Author's retrospective for biomedical image analysis on a cooperative cluster of gpus and multicores.
Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume, 2014

Parallel performance and energy efficiency of modern video encoders on multithreaded architectures.
Proceedings of the 22nd European Signal Processing Conference, 2014

Matrix computations on graphics processors and clusters of gpus.
PhD thesis, 2013

Robust motion estimation on a low-power multi-core DSP.
EURASIP J. Adv. Signal Process., 2013

Scheduling algorithms-by-blocks on small clusters.
Concurr. Comput. Pract. Exp., 2013

Non-negative matrix factorization on low-power architectures: a comparative study.
Proceedings of the 20th European MPI Users' Group Meeting, 2013

Runtime Scheduling of the LU Factorization: Performance and Energy.
Proceedings of the Energy Efficiency in Large Scale Distributed Systems, 2013

A Runtime System for Programming Out-of-Core Matrix Algorithms-by-Tiles on Multithreaded Architectures.
ACM Trans. Math. Softw., 2012

The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations.
J. Parallel Distributed Comput., 2012

Color and texture analysis on emerging parallel architectures.
Int. J. High Perform. Comput. Appl., 2012

Optimization of power consumption in the iterative solution of sparse linear systems on graphics processors.
Comput. Sci. Res. Dev., 2012

DVFS-control techniques for dense linear algebra operations on multi-core processors.
Comput. Sci. Res. Dev., 2012

Solving dense generalized eigenproblems on multi-threaded architectures.
Appl. Math. Comput., 2012

Unleashing the high-performance and low-power of multi-core DSPs for general-purpose HPC.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Level-3 BLAS on the TI C6678 Multi-core DSP.
Proceedings of the IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Saving Energy in the LU Factorization with Partial Pivoting on Multi-core Processors.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

Reducing Energy Consumption of Dense Linear Algebra Operations on Hybrid CPU-GPU Platforms.
Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2012

Color and texture analysis using emerging parallel architectures.
Int. J. High Perform. Comput. Appl., 2011

Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures.
Concurr. Comput. Pract. Exp., 2011

Power-aware Dense Linear Algebra Implementations on Multi-core and Many-core Processors.
Proceedings of the 3rd Many-core Applications Research Community (MARC) Symposium, 2011

Extending OpenMP to Survive the Heterogeneous Multi-Core Era.
Int. J. Parallel Program., 2010

Retargeting PLAPACK to clusters with hardware accelerators.
Proceedings of the 2010 International Conference on High Performance Computing & Simulation, 2010

Out-of-core solution of linear systems on graphics processors.
Int. J. Parallel Emergent Distributed Syst., 2009

Exploiting the capabilities of modern GPUs for dense matrix computations.
Concurr. Comput. Pract. Exp., 2009

Solving dense linear systems on platforms with multiple hardware accelerators.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures.
Proceedings of the Parallel Processing and Applied Mathematics, 2009

Exploring the GPU for Enhancing Parallelism on Color and Texture Analysis.
Proceedings of the Parallel Computing: From Multicores and GPU's to Petascale, 2009

A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures.
Proceedings of the Evolving OpenMP in an Age of Extreme Parallelism, 2009

Fast development of dense linear algebra codes on graphics processors.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

An Efficient Implementation of GPU Virtualization in High Performance Clusters.
Proceedings of the Euro-Par 2009, 2009

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs.
Proceedings of the Euro-Par 2009 Parallel Processing, 2009

Attaining High Performance in General-Purpose Computations on Current Graphics Processors.
Proceedings of the High Performance Computing for Computational Science, 2008

Evaluation and tuning of the Level 3 CUBLAS for graphics processors.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Biomedical image analysis on a cooperative cluster of GPUs and multicores.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

Solving Dense Linear Systems on Graphics Processors.
Proceedings of the Euro-Par 2008, 2008
