Leonel Sousa

Orcid: 0000-0002-8066-221X

  • University of Lisbon, Instituto Superior Tecnico, INESC-ID, Portugal

According to our database, Leonel Sousa authored at least 332 papers between 1997 and 2025.

LTE: Lightweight and Timing-Efficient Unequal-Sized Polynomial Multiplication Accelerators.
IEEE Trans. Circuits Syst. II Express Briefs, January, 2025

SpEpistasis: A sparse approach for three-way epistasis detection.
J. Parallel Distributed Comput., 2025

Transient-Execution Attacks: A Computer Architect Perspective.
ACM Comput. Surv., March, 2024

Deadline-aware task offloading in vehicular networks using deep reinforcement learning.
Expert Syst. Appl., 2024

Hardware for converting floating-point to the microscaling (MX) format.
CoRR, 2024

Energy-aware QoS-based dynamic virtual machine consolidation approach based on RL and ANN.
Clust. Comput., 2024

Sparse Matrix-Vector Multiplication Based on Online Arithmetic.
IEEE Access, 2024

A Comprehensive Approach and Analysis of Reverse Converters for a Class of Moduli Sets.
Proceedings of the 15th IEEE Latin America Symposium on Circuits and Systems, 2024

Most Significant Digit First Multiply-and-Accumulate Unit for Neural Networks.
Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2024

IPU-EpiDet: Identifying Gene Interactions on Massively Parallel Graph-Based AI Accelerators.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis.
Proceedings of the IEEE International Symposium on Workload Characterization, 2024

COPMA: Compact and Optimized Polynomial Multiplier Accelerator for High-Performance Implementation of LWR-Based PQC.
IEEE Trans. Very Large Scale Integr. Syst., April, 2023

QR TPM in Programmable Low-Power Devices.
CoRR, 2023

A Course chapter on Quantum Computing for Master Students in Engineering.
CoRR, 2023

Special issue: 20th international workshop on algorithms, models and tools for parallel computing on heterogeneous platforms (HeteroPar'22).
Concurr. Comput. Pract. Exp., 2023

CoDi$: Randomized Caches Through Confusion and Diffusion.
IEEE Access, 2023

Performance Modelling-Driven Optimization of RISC-V Hardware for Efficient SpMV.
Proceedings of the High Performance Computing, 2023

Social and Environmental Effects of Post-COVID-19 Computer Science Virtual Conferencing: The Euro-Par Case.
Proceedings of the International Conference on ICT for Sustainability, 2023

Proceedings of the 33rd International Conference on Field-Programmable Logic and Applications, 2023

A Performance Modelling-Driven Approach to Hardware Resource Scaling.
Proceedings of the Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28, 2023

Sparse-Aware CARM: Rooflining Locality of Sparse Computations.
Proceedings of the Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28, 2023

Interpreting High Order Epistasis Using Sparse Transformers.
Proceedings of the IEEE/ACM Conference on Connected Health: Applications, 2023

Supporting RISC-V Performance Counters Through Linux Performance Analysis Tools.
Proceedings of the 34th IEEE International Conference on Application-specific Systems, 2023

Scalable architecture of constant division on FPGA.
Proceedings of the 30th IEEE Symposium on Computer Arithmetic, 2023

Guest Editorial: Special Issue on Advances in Signal Processing Systems.
J. Signal Process. Syst., 2022

Introduction to the Special Section on FPL 2020.
ACM Trans. Reconfigurable Technol. Syst., 2022

A genetic-based approach for service placement in fog computing.
J. Supercomput., 2022

Guest Editorial: Special Section on Emerging and Impacting Trends on Computer Arithmetic.
IEEE Trans. Emerg. Top. Comput., 2022

Inter-Algorithm Multiobjective Cooperation for Phylogenetic Reconstruction on Amino Acid Data.
IEEE Trans. Cybern., 2022

NTT Architecture for a Linux-Ready RISC-V Fully-Homomorphic Encryption Accelerator.
IEEE Trans. Circuits Syst. I Regul. Pap., 2022

Modeling and evaluation of dispatching policies in IaaS cloud data centers using SANs.
Sustain. Comput. Informatics Syst., 2022

Uncertainty Estimation via Monte Carlo Dropout in CNN-Based mmWave MIMO Localization.
IEEE Signal Process. Lett., 2022

Exploiting multi-level parallel metaheuristics and heterogeneous computing to boost phylogenetics.
Future Gener. Comput. Syst., 2022

Unlocking Personalized Healthcare on Modern CPUs/GPUs: Three-way Gene Interaction Study.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Tensor-Accelerated Fourth-Order Epistasis Detection on GPUs.
Proceedings of the 51st International Conference on Parallel Processing, 2022

Low-Precision Floating-Point Formats: From General-Purpose to Application-Specific.
Proceedings of the Approximate Computing, 2022

Modeling Epidemic Routing: Capturing Frequently Visited Locations While Preserving Scalability.
IEEE Trans. Veh. Technol., 2021

Retargeting Tensor Accelerators for Epistasis Detection.
IEEE Trans. Parallel Distributed Syst., 2021

Mansard Roofline Model: Reinforcing the Accuracy of the Roofs.
ACM Trans. Model. Perform. Evaluation Comput. Syst., 2021

ROTed: Random Oblivious Transfer for embedded devices.
IACR Trans. Cryptogr. Hardw. Embed. Syst., 2021

Editorial on the Special Section on Algorithms, Circuits, and Systems for Signal Processing at the Edge.
IEEE Open J. Circuits Syst., 2021

Supporting RISC-V Performance Counters through Performance analysis tools for Linux (Perf).
CoRR, 2021

Variable Latency Carry Speculative Adders with Input-based Dynamic Configuration.
Comput. Electr. Eng., 2021

Fourth-Order Exhaustive Epistasis Detection for the xPU Era.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

Number Theoretic Transform Architecture suitable to Lattice-based Fully-Homomorphic Encryption.
Proceedings of the 32nd IEEE International Conference on Application-specific Systems, 2021

Modeling and Evaluation of Service Composition in Commercial Multiclouds Using Timed Colored Petri Nets.
IEEE Trans. Syst. Man Cybern. Syst., 2020

GPU acceleration of Fitch's parsimony on protein data: from Kepler to Turing.
J. Supercomput., 2020

Improving the Efficiency of SVM Classification With FHE.
IEEE Trans. Inf. Forensics Secur., 2020

Towards the Integration of Reverse Converters into the RNS Channels.
IEEE Trans. Computers, 2020

Deep Learning Architectures for Accurate Millimeter Wave Positioning in 5G.
Neural Process. Lett., 2020

Parallelism exploration for 3D high-efficiency video coding depth modeling mode one.
J. Real Time Image Process., 2020

Parallel evolutionary computation for multiobjective gene interaction analysis.
J. Comput. Sci., 2020

Temperature-aware core management in MPSoCs: modelling and evaluation using MRMs.
IET Comput. Digit. Tech., 2020

Application-driven Cache-Aware Roofline Model.
Future Gener. Comput. Syst., 2020

Dethroning GPS: Low-Power Accurate 5G Positioning Systems Using Machine Learning.
IEEE J. Emerg. Sel. Topics Circuits Syst., 2020

Can 5G and Machine Learning Replace the Global Positioning System?
ERCIM News, 2020

Multicore Parallelism Exploration Targeting 3D-HEVC Intra-Frame Prediction.
IEEE Des. Test, 2020

Performance Modeling of Epidemic Routing in Mobile Social Networks with Emphasis on Scalability.
CoRR, 2020

A hybrid algorithm for task scheduling on heterogeneous multiprocessor embedded systems.
Appl. Soft Comput., 2020

The Role of Non-Positional Arithmetic on Efficient Emerging Cryptographic Algorithms.
IEEE Access, 2020

Raising the Abstraction Level of a Deep Learning Design on FPGAs.
IEEE Access, 2020

Accelerating 3-Way Epistasis Detection with CPU+GPU Processing.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2020

Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Software Emulation of Quantum Resistant Trusted Platform Modules.
Proceedings of the 17th International Joint Conference on e-Business and Telecommunications, 2020

Heterogeneous CPU+iGPU Processing for Efficient Epistasis Detection.
Proceedings of the Euro-Par 2020: Parallel Processing, 2020

An asymptotically faster version of FV supported on HPR.
Proceedings of the 27th IEEE Symposium on Computer Arithmetic, 2020

Sign Identifier for the Enhanced Three Moduli Set {2^(n+k), 2^n - 1, 2^(n+1) - 1}.
J. Signal Process. Syst., 2019

Efficient Modular Adder Designs Based on Thermometer and One-Hot Coding.
IEEE Trans. Very Large Scale Integr. Syst., 2019

Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model.
IEEE Trans. Parallel Distributed Syst., 2019

Comparative assessment of GPGPU technologies to accelerate objective functions: A case study on parsimony.
J. Parallel Distributed Comput., 2019

A multiobjective adaptive approach for the inference of evolutionary relationships in protein-based scenarios.
Inf. Sci., 2019

New energy-efficient hybrid wide-operand adder architecture.
IET Circuits Devices Syst., 2019

A Lattice-based Enhanced Privacy ID.
IACR Cryptol. ePrint Arch., 2019

Note on the noise growth of the RNS variants of the BFV scheme.
IACR Cryptol. ePrint Arch., 2019

An HPR variant of the FV scheme: Computationally Cheaper, Asymptotically Faster.
IACR Cryptol. ePrint Arch., 2019

A methodical FHE-based cloud computing model.
Future Gener. Comput. Syst., 2019

More efficient, provably-secure direct anonymous attestation from lattices.
Future Gener. Comput. Syst., 2019

On the Design of RNS Inter-Modulo Processing Units for the Arithmetic-Friendly Moduli Sets {2^(n+k), 2^n - 1, 2^(n+1) - 1}.
Comput. J., 2019

Scalable Performance Analysis of Epidemic Routing Considering Skewed Location Visiting Preferences.
Proceedings of the 27th IEEE International Symposium on Modeling, 2019

Enhancing Beamformed Fingerprint Outdoor Positioning with Hierarchical Convolutional Neural Networks.
Proceedings of the IEEE International Conference on Acoustics, 2019

Analysis of MOEA/D Approaches for Inferring Ancestral Relationships.
Proceedings of the Hybrid Artificial Intelligent Systems - 14th International Conference, 2019

HyPoRes: An Hybrid Representation System for ECC.
Proceedings of the 26th IEEE Symposium on Computer Arithmetic, 2019

Proceedings of the Ultrascale Computing Systems, 2019

Multiobjective Frog-Leaping Optimization for the Study of Ancestral Relationships in Protein Data.
IEEE Trans. Evol. Comput., 2018

Highly parallel HEVC decoding for heterogeneous systems with CPU and GPU.
Signal Process. Image Commun., 2018

Temperature-aware dynamic voltage and frequency scaling enabled MPSoC modeling using Stochastic Activity Networks.
Microprocess. Microsystems, 2018

Guest Editors' Introduction.
Int. J. Semantic Comput., 2018

MrBayes sMC^3.
Int. J. High Perform. Comput. Appl., 2018

A Survey on Fully Homomorphic Encryption: An Engineering Perspective.
ACM Comput. Surv., 2018

Performability-Based Workflow Scheduling in Grids.
Comput. J., 2018

Beamformed Fingerprint Learning for Accurate Millimeter Wave Positioning.
Proceedings of the 88th IEEE Vehicular Technology Conference, 2018

Cache-Aware Roofline Model and Medical Image Processing Optimizations in GPUs.
Proceedings of the High Performance Computing, 2018

3D-HEVC DMM-1 Parallelism Exploration Targeting Multicore Systems.
Proceedings of the 31st Symposium on Integrated Circuits and Systems Design, 2018

Analysis of Scheduling Policies in Metaheuristics for Evolutionary Biology.
Proceedings of the 6th International Workshop on Parallelism in Bioinformatics, 2018

Towards Efficient Modular Adders based on Reversible Circuits.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2018

Configurable N-fold Hardware Architecture for Convolutional Neural Networks.
Proceedings of the 2018 International Conference on Biomedical Engineering and Applications, 2018

Data-Aided Fast Beamforming Selection for 5G.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018

Phylogenetic Reconstructions Using an Indicator-Based Bat Algorithm for Multicore Processors.
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2018

Accelerating CNN computation: quantisation tuning and network resizing.
Proceedings of the 2nd Workshop on AutotuniNg and aDaptivity AppRoaches for Energy efficient HPC Systems, 2018

An Efficient Component for Designing Signed Reverse Converters for a Class of RNS Moduli Sets of Composite Form {2^k, 2^P - 1}.
IEEE Trans. Very Large Scale Integr. Syst., 2017

GHEVC: An Efficient HEVC Decoder for Graphics Processing Units.
IEEE Trans. Multim., 2017

A Reduced-Bias Approach With a Lightweight Hard-Multiple Generator to Design a Radix-8 Modulo 2^n + 1 Multiplier.
IEEE Trans. Circuits Syst. II Express Briefs, 2017

Arithmetical Improvement of the Round-Off for Cryptosystems in High-Dimensional Lattices.
IEEE Trans. Computers, 2017

Beyond the Roofline: Cache-Aware Power and Energy-Efficiency Modeling for Multi-Cores.
IEEE Trans. Computers, 2017

Special issue on real-time energy-aware circuits and systems for HEVC and for its 3D and SVC extensions.
J. Real Time Image Process., 2017

Performance and power modeling and evaluation of virtualized servers in IaaS clouds.
Inf. Sci., 2017

GPU Parallelization of HEVC In-Loop Filters.
Int. J. Parallel Program., 2017

Efficient reductions in cyclotomic rings - Application to R-LWE based FHE schemes.
IACR Cryptol. ePrint Arch., 2017

Cache-aware Roofline Model in Intel® Advisor.
ERCIM News, 2017

Sign Detection and Number Comparison on RNS 3-Moduli Sets {2^n - 1, 2^(n+x), 2^n + 1}.
Circuits Syst. Signal Process., 2017

Accelerating the phylogenetic parsimony function on heterogeneous systems.
Concurr. Comput. Pract. Exp., 2017

Energy-aware mechanism for stencil-based MPDATA algorithm with constraints.
Concurr. Comput. Pract. Exp., 2017

A Multifunctional Unit for Designing Efficient RNS-Based Datapaths.
IEEE Access, 2017

Design Space Exploration of LDPC Decoders Using High-Level Synthesis.
IEEE Access, 2017

A stochastic number representation for fully homomorphic cryptography.
Proceedings of the 2017 IEEE International Workshop on Signal Processing Systems, 2017

Modeling Large Compute Nodes with Heterogeneous Memories with Cache-Aware Roofline Model.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2017

Pipelined FPGA coprocessor for elliptic curve cryptography based on residue number system.
Proceedings of the 2017 International Conference on Embedded Computer Systems: Architectures, 2017

Efficient Reductions in Cyclotomic Rings - Application to Ring-LWE Based FHE Schemes.
Proceedings of the Selected Areas in Cryptography - SAC 2017, 2017

Energy-efficient motion estimation with approximate arithmetic.
Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, 2017

Exploring GPU performance, power and energy-efficiency bounds with Cache-aware Roofline Modeling.
Proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software, 2017

Analyzing Performance of Multi-cores and Applications with Cache-aware Roofline Model.
Proceedings of the 2017 International Conference on High Performance Computing & Simulation, 2017

Performance Analysis with Cache-Aware Roofline Model in Intel Advisor.
Proceedings of the 2017 International Conference on High Performance Computing & Simulation, 2017

On Boosting Energy-Efficiency of Heterogeneous Embedded Systems via Game Theory.
Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, 2017

TrustZone-backed bitcoin wallet.
Proceedings of the Fourth Workshop on Cryptography and Security in Computing Systems, 2017

Adaptive Scheduling Framework for Real-Time Video Encoding on Heterogeneous Systems.
IEEE Trans. Circuits Syst. Video Technol., 2016

A Framework for Application-Guided Task Management on Heterogeneous Embedded Systems.
ACM Trans. Archit. Code Optim., 2016

GPU-assisted HEVC intra decoder.
J. Real Time Image Process., 2016

Exploiting task and data parallelism for advanced video coding on hybrid CPU + GPU platforms.
J. Real Time Image Process., 2016

Method for designing two levels RNS reverse converters for large dynamic ranges.
Integr., 2016

Guest Editors' Introduction.
Int. J. Semantic Comput., 2016

Ubiquitous Multimedia: Emerging Research on Multimedia Computing.
IEEE Multim., 2016

A Survey on Programmable LDPC Decoders.
IEEE Access, 2016

HPC on the Intel Xeon Phi: Homomorphic Word Searching.
Proceedings of the High Performance Computing for Computational Science - VECPAR 2016, 2016

Efficient HEVC decoder for heterogeneous CPU with GPU systems.
Proceedings of the 18th IEEE International Workshop on Multimedia Signal Processing, 2016

Area-delay-power-aware adder placement method for RNS reverse converter design.
Proceedings of the IEEE 7th Latin American Symposium on Circuits & Systems, 2016

Enhancing Data Parallelism of Fully Homomorphic Encryption.
Proceedings of the Information Security and Cryptology - ICISC 2016 - 19th International Conference, Seoul, South Korea, November 30, 2016

High-Level Designs of Complex FIR Filters on FPGAs for the SKA.
Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, 2016

Reverse Converter Design via Parallel-Prefix Adders: Novel Components, Methodology, and Implementations.
IEEE Trans. Very Large Scale Integr. Syst., 2015

Arithmetic-Based Binary-to-RNS Converter Modulo {2^n ± k} for jn-bit Dynamic Range.
IEEE Trans. Very Large Scale Integr. Syst., 2015

Base Transformation With Injective Residue Mapping for Dynamic Range Reduction in RNS.
IEEE Trans. Circuits Syst. I Regul. Pap., 2015

2^n RNS Scalers for Extended 4-Moduli Sets.
IEEE Trans. Computers, 2015

Real-time implementation of remotely sensed hyperspectral image unmixing on GPUs.
J. Real Time Image Process., 2015

Attaining performance fairness in big.LITTLE systems.
Proceedings of the 12th International Workshop on Intelligent Solutions in Embedded Systems, 2015

Accelerating Phylogenetic Inference on Heterogeneous OpenCL Platforms.
Proceedings of the 2015 IEEE TrustCom/BigDataSE/ISPA, 2015

HEVC in-loop filters GPU parallelization in embedded systems.
Proceedings of the 2015 International Conference on Embedded Computer Systems: Architectures, 2015

Run-Time Machine Learning for HEVC/H.265 Fast Partitioning Decision.
Proceedings of the 2015 IEEE International Symposium on Multimedia, 2015

Featuring Immediate Revocation in Mikey-Sakke (FIRM).
Proceedings of the 2015 IEEE International Symposium on Multimedia, 2015

RNS reverse converters based on the new Chinese Remainder Theorem I.
Proceedings of the 2015 IEEE International Symposium on Circuits and Systems, 2015

High performance IP core for HEVC quantization.
Proceedings of the 2015 IEEE International Symposium on Circuits and Systems, 2015

Towards GPU HEVC intra decoding: Seizing fine-grain parallelism.
Proceedings of the 2015 IEEE International Conference on Multimedia and Expo, 2015

Stretching the limits of Programmable Embedded Devices for Public-key Cryptography.
Proceedings of the Second Workshop on Cryptography and Security in Computing Systems, 2015

GPU acceleration of the HEVC decoder inter prediction module.
Proceedings of the 2015 IEEE Global Conference on Signal and Information Processing, 2015

Programmable RNS lattice-based parallel cryptographic decryption.
Proceedings of the 26th IEEE International Conference on Application-specific Systems, 2015

An Efficient Scalable RNS Architecture for Large Dynamic Ranges.
J. Signal Process. Syst., 2014

A Flexible Architecture for Modular Arithmetic Hardware Accelerators based on RNS.
J. Signal Process. Syst., 2014

Dynamic Load Balancing for Real-Time Video Encoding on Heterogeneous CPU+GPU Systems.
IEEE Trans. Multim., 2014

Efficient Method for Designing Modulo {2^n ± k} Multipliers.
J. Circuits Syst. Comput., 2014

Unified transform architecture for AVC, AVS, VC-1 and HEVC high-performance codecs.
EURASIP J. Adv. Signal Process., 2014

Method for Designing Efficient Mixed Radix Multipliers.
Circuits Syst. Signal Process., 2014

Cache-aware Roofline model: Upgrading the loft.
IEEE Comput. Archit. Lett., 2014

On the Evaluation of Multi-core Systems with SIMD Engines for Public-Key Cryptography.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing Workshop, 2014

Performance-Aware Task Management and Frequency Scaling in Embedded Systems.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Accelerating Phylogenetic Inference on GPUs: an OpenACC and CUDA comparison.
Proceedings of the International Work-Conference on Bioinformatics and Biomedical Engineering, 2014

ROM-less RNS-to-binary converter moduli {2^(2n) - 1, 2^(2n) + 1, 2^n - 3, 2^n + 3}.
Proceedings of the 2014 International Symposium on Integrated Circuits (ISIC), 2014

Method for designing multi-channel RNS architectures to prevent power analysis SCA.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2014

FEVES: Framework for Efficient Parallel Video Encoding on Heterogeneous Systems.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Collaborative inter-prediction on CPU+GPU systems.
Proceedings of the 2014 IEEE International Conference on Image Processing, 2014

Reconfigurable data flow engine for HEVC motion estimation.
Proceedings of the 2014 IEEE International Conference on Image Processing, 2014

Cooperative CPU+GPU deblocking filter parallelization for high performance HEVC video codecs.
Proceedings of the IEEE International Conference on Acoustics, 2014

OpenCL parallelization of the HEVC de-quantization and inverse transform for heterogeneous platforms.
Proceedings of the 22nd European Signal Processing Conference, 2014

Nonlinear system identification using constellation based multiple model adaptive estimators.
Proceedings of the 22nd European Signal Processing Conference, 2014

SchedMon: A Performance and Energy Monitoring Tool for Modern Multi-cores.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

Combining flexibility with low power: Dataflow and wide-pipeline LDPC decoding engines in the Gbit/s era.
Proceedings of the IEEE 25th International Conference on Application-Specific Systems, 2014

Finite-Difference in Time-Domain Scalable Implementations on CUDA and OpenCL.
Proceedings of the Numerical Computations with GPUs, 2014

On the Design of RNS Reverse Converters for the Four-Moduli Set {2^n + 1, 2^n - 1, 2^n, 2^(n+1) + 1}.
IEEE Trans. Very Large Scale Integr. Syst., 2013

A Lab Project on the Design and Implementation of Programmable and Configurable Embedded Systems.
IEEE Trans. Educ., 2013

Method to Design General RNS Reverse Converters for Extended Moduli Sets.
IEEE Trans. Circuits Syst. II Express Briefs, 2013

RNS Reverse Converters for Moduli Sets With Dynamic Ranges up to (8n+1)-bit.
IEEE Trans. Circuits Syst. I Regul. Pap., 2013

The CRNS framework and its application to programmable and reconfigurable cryptography.
ACM Trans. Archit. Code Optim., 2013

Scalable Unified Transform Architecture for Advanced Video Coding Embedded Systems.
Int. J. Parallel Program., 2013

Randomised multi-modulo residue number system architecture for double-and-add to prevent power analysis side channel attacks.
IET Circuits Devices Syst., 2013

Monitoring Performance and Power for Application Characterization with the Cache-Aware Roofline Model.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Stressing the BER simulation of LDPC codes in the error floor region using GPU clusters.
Proceedings of the ISWCS 2013, 2013

A comparison of computing architectures and parallelization frameworks based on a two-dimensional FDTD.
Proceedings of the International Conference on High Performance Computing & Simulation, 2013

An RNS-based architecture targeting hardware accelerators for modular arithmetic.
Proceedings of the IEEE International Conference on Acoustics, 2013

Open the Gates: Using High-level Synthesis towards programmable LDPC decoders on FPGAs.
Proceedings of the IEEE Global Conference on Signal and Information Processing, 2013

Accelerating the Computation of Induced Dipoles for Molecular Mechanics with Dataflow Engines.
Proceedings of the 21st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2013

High performance multi-standard architecture for DCT computation in H.264/AVC High Profile and HEVC codecs.
Proceedings of the 2013 Conference on Design and Architectures for Signal and Image Processing, 2013

DARNS: A randomized multi-modulo RNS architecture for double-and-add in ECC to prevent power analysis side channel attacks.
Proceedings of the 18th Asia and South Pacific Design Automation Conference, 2013

A compact and scalable RNS architecture.
Proceedings of the 24th International Conference on Application-Specific Systems, 2013

Corrections to "MRC-Based RNS Reverse Converters for the Four-Moduli Sets 2^n + 1, 2^n - 1, 2^n, 2^(2n+1) - 1 and 2^n + 1, 2^n - 1, 2^(2n), 2^(2n+1) - 1".
IEEE Trans. Circuits Syst. II Express Briefs, 2012

MRC-Based RNS Reverse Converters for the Four-Moduli Sets 2^n + 1, 2^n - 1, 2^n, 2^(2n+1) - 1 and 2^n + 1, 2^n - 1, 2^(2n), 2^(2n+1) - 1.
IEEE Trans. Circuits Syst. II Express Briefs, 2012

Portable LDPC Decoding on Multicores Using OpenCL [Applications Corner].
IEEE Signal Process. Mag., 2012

Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems.
Parallel Comput., 2012

Computation of Induced Dipoles in Molecular Mechanics Simulations Using Graphics Processors.
J. Chem. Inf. Model., 2012

Configurable M-factor VLSI DVB-S2 LDPC decoder architecture with optimized memory tiling design.
EURASIP J. Wirel. Commun. Netw., 2012

RNS-Based Elliptic Curve Point Multiplication for Massive Parallel Architectures.
Comput. J., 2012

Energy efficient stream-based configurable architecture for embedded platforms.
Proceedings of the 2012 International Conference on Embedded Computer Systems: Architectures, 2012

On Realistic Divisible Load Scheduling in Highly Heterogeneous Distributed Systems.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

Simultaneous Multi-Level Divisible Load Balancing for Heterogeneous Desktop Systems.
Proceedings of the 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2012

Multi-level Parallelization of Advanced Video Coding on Hybrid CPU+GPU Platforms.
Proceedings of the Euro-Par 2012: Parallel Processing Workshops, 2012

Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters.
Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

VLSI Reverse Converter for RNS Based on the Moduli Set.
Proceedings of the 15th Euromicro Conference on Digital System Design, 2012

RNS Arithmetic Units for Modulo {2^n ± k}.
Proceedings of the 15th Euromicro Conference on Digital System Design, 2012

High Performance Unified Architecture for Forward and Inverse Quantization in H.264/AVC.
Proceedings of the 15th Euromicro Conference on Digital System Design, 2012

Efficient implementation of multi-moduli architectures for Binary-to-RNS conversion.
Proceedings of the 17th Asia and South Pacific Design Automation Conference, 2012

Modeling and Evaluating Non-shared Memory CELL/BE Type Multi-core Architectures for Local Image and Video Processing.
J. Signal Process. Syst., 2011

Massively LDPC Decoding on Multicore Architectures.
IEEE Trans. Parallel Distributed Syst., 2011

A flexible architecture for the computation of direct and inverse transforms in H.264/AVC video codecs.
IEEE Trans. Consumer Electron., 2011

A tutorial overview on the properties of the discrete cosine transform for encoded image and video processing.
Signal Process., 2011

Parallel Computing - Special Issue.
Parallel Comput., 2011

CHPS: An Environment for Collaborative Execution on Heterogeneous Desktop Systems.
Int. J. Netw. Comput., 2011

High throughput and scalable architecture for unified transform coding in embedded H.264/AVC video coding systems.
Proceedings of the 2011 International Conference on Embedded Computer Systems: Architectures, 2011

A new approach to system identification and parameter tuning with multiple model adaptive estimators.
Proceedings of the 7th International Symposium on Image and Signal Processing and Analysis, 2011

Real-time DVB-S2 LDPC decoding on many-core GPU accelerators.
Proceedings of the IEEE International Conference on Acoustics, 2011

Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Scheduling Divisible Loads on Heterogeneous Desktop Systems with Limited Memory.
Proceedings of the Euro-Par 2011: Parallel Processing Workshops - CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29, 2011

Binary-to-RNS Conversion Units for moduli {2^n ± 3}.
Proceedings of the 14th Euromicro Conference on Digital System Design, 2011

Virtualization for Morphable Multi-Cores.
Proceedings of the ARCS 2011, 2011

Measuring and Extraction of Biological Information on New Handheld Biochip-Based Microsystem.
IEEE Trans. Instrum. Meas., 2010

On the Modeling of New Tunnel Junction Magnetoresistive Biosensors.
IEEE Trans. Instrum. Meas., 2010

A quantitative analysis of firing rate estimators: Unveiling bias sources.
Neurocomputing, 2010

An improved RNS generator 2^n ± k based on threshold logic.
Proceedings of the 18th IEEE/IFIP VLSI-SoC 2010, 2010

Unifying stream based and reconfigurable computing to design application accelerators.
Proceedings of the 18th IEEE/IFIP VLSI-SoC 2010, 2010

Embedded multicore architectures for LDPC decoding.
Proceedings of the 2010 International Conference on Embedded Computer Systems: Architectures, 2010

Programming Cell/BE and GPUs systems for real-time video encoding.
Proceedings of the Real-Time Image and Video Processing 2010, 2010

p264: open platform for designing parallel H.264/AVC video encoders on multi-core systems.
Proceedings of the Network and Operating System Support for Digital Audio and Video, 2010

H.264/AVC framework for multi-core embedded video encoders.
Proceedings of the 2010 International Symposium on System on Chip, SoC 2010, Tampere, 2010

An improved RNS reverse converter for the {2^(2n+1) - 1, 2^n, 2^n - 1} moduli set.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2010), May 30, 2010

Collaborative execution environment for heterogeneous parallel systems.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Exploiting SIMD extensions for linear image processing with OpenCL.
Proceedings of the 28th International Conference on Computer Design, 2010

High-Performance Computing on Heterogeneous Systems: Database Queries on CPU and GPU.
Proceedings of the High Performance Computing: From Grids and Clouds to Exascale, 2010

Arithmetic Units for RNS Moduli {2^n-3} and {2^n+3} Operations.
Proceedings of the 13th Euromicro Conference on Digital System Design, 2010

Hardware/software co-design of H.264/AVC encoders for multi-core embedded systems.
Proceedings of the 2010 Conference on Design & Architectures for Signal & Image Processing, 2010

Iterative induced dipoles computation for molecular mechanics on GPUs.
Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units, 2010

Elliptic Curve point multiplication on GPUs.
Proceedings of the 21st IEEE International Conference on Application-specific Systems Architectures and Processors, 2010

Efficient Independent Component Analysis on a GPU.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

A Feature Selection Algorithm for the Regularization of Neuron Models.
IEEE Trans. Instrum. Meas., 2009

A Portable and Autonomous Magnetic Detection Platform for Biosensing.
Sensors, 2009

Modelling and programming stream-based distributed computing based on the meta-pipeline approach.
Int. J. Parallel Emergent Distributed Syst., 2009

Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach.
J. Comput. Sci. Technol., 2009

Neural code metrics: Analysis and application to the assessment of neural models.
Neurocomputing, 2009

Development and evaluation of scalable video motion estimators on GPU.
Proceedings of the IEEE Workshop on Signal Processing Systems, 2009

Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study.
Proceedings of the Embedded Computer Systems: Architectures, 2009

On the design of distributed autonomous embedded systems for biomedical applications.
Proceedings of the 3rd International Conference on Pervasive Computing Technologies for Healthcare, 2009

CaravelaMPI: Message Passing Interface for Parallel GPU-Based Applications.
Proceedings of the Eighth International Symposium on Parallel and Distributed Computing, 2009

Distributed Software Platform for Automation and Control of General Anaesthesia.
Proceedings of the Eighth International Symposium on Parallel and Distributed Computing, 2009

How GPUs can outperform ASICs for fast LDPC decoding.
Proceedings of the 23rd international conference on Supercomputing, 2009

Fine-grain Parallelism Using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function.
Proceedings of the ICPP 2009, 2009

Multi-core platforms for signal processing: source and channel coding.
Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, 2009

Parallel LDPC Decoding on the Cell/B.E. Processor.
Proceedings of the High Performance Embedded Architectures and Compilers, 2009

Compact and Flexible Microcoded Elliptic Curve Processor for Reconfigurable Devices.
Proceedings of the FCCM 2009, 2009

Proceedings of the Euro-Par 2009, 2009

Cost-Efficient SHA Hardware Accelerators.
IEEE Trans. Very Large Scale Integr. Syst., 2008

Statistical Analysis of a Spike Train Distance in Poisson Models.
IEEE Signal Process. Lett., 2008

Parallel Advanced Video Coding: Motion Estimation on Multi-cores.
Scalable Comput. Pract. Exp., 2008

Massive parallel LDPC decoding on GPU.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

Edge Stream Oriented LDPC Decoding.
Proceedings of the 16th Euromicro International Conference on Parallel, 2008

Heuristic Optimization Methods for Improving Performance of Recursive General Purpose Applications on GPUs.
Proceedings of the 7th International Symposium on Parallel and Distributed Computing (ISPDC 2008), 2008

Distributed Web-based Platform for Computer Architecture Simulation.
Proceedings of the 7th International Symposium on Parallel and Distributed Computing (ISPDC 2008), 2008

Design and implementation of a tool for modeling and programming deadlock free meta-pipeline applications.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

BRAM-LUT Tradeoff on a Polymorphic DES Design.
Proceedings of the High Performance Embedded Architectures and Compilers, 2008

Efficient FPGA elliptic curve cryptographic processor over GF(2^m).
Proceedings of the 2008 International Conference on Field-Programmable Technology, 2008

On-the-fly attestation of reconfigurable hardware.
Proceedings of the FPL 2008, 2008

Application Specific Programmable IP Core for Motion Estimation: Technology Comparison Targeting Efficient Embedded Co-Processing Units.
Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, 2008

An RNS based Specific Processor for Computing the Minimum Sum-of-Absolute-Differences.
Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, 2008

Merged Computation for Whirlpool Hashing.
Proceedings of the Design, Automation and Test in Europe, 2008

A Parallel Algorithm for Advanced Video Motion Estimation on Multicore Architectures.
Proceedings of the Second International Conference on Complex, 2008

Low power microarchitecture with instruction reuse.
Proceedings of the 5th Conference on Computing Frontiers, 2008

Towards a Unified Model for the Retina - Static vs Dynamic Integrate and Fire Models.
Proceedings of the First International Conference on Biomedical Electronics and Devices, 2008

Reconfigurable architectures and processors for real-time video motion estimation.
J. Real Time Image Process., 2007

Improving residue number system multiplication with more balanced moduli sets and enhanced modular arithmetic structures.
IET Comput. Digit. Tech., 2007

Embedded Systems for Portable and Mobile Video Platforms.
EURASIP J. Embed. Syst., 2007

Adaptive Motion Estimation Processor for Autonomous Video Devices.
EURASIP J. Embed. Syst., 2007

Efficient Hybrid DCT-Domain Algorithm for Video Spatial Downscaling.
EURASIP J. Adv. Signal Process., 2007

Caravela: A Novel Stream-Based Distributed Computing Environment.
Computer, 2007

Developing and Integrating Lab Projects as Important Learning Components in an Embedded Systems Course.
Proceedings of the IEEE International Conference on Microelectronic Systems Education, 2007

Meta-Pipeline: A New Execution Mechanism for Distributed Pipeline Processing.
Proceedings of the 6th International Symposium on Parallel and Distributed Computing (ISPDC 2007), 2007

A New Handheld Biochip-based Microsystem.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2007), 2007

Generic Architecture Designed for Biomedical Embedded Systems.
Proceedings of the Embedded System Design: Topics, Techniques and Trends, IFIP TC10 Working Conference: International Embedded Systems Symposium (IESS), May 30, 2007

Additive Logistic Regression Applied to Retina Modelling.
Proceedings of the International Conference on Image Processing, 2007

An Efficient Expectation-Maximisation Algorithm for Spike Classification.
Proceedings of the 15th International Conference on Digital Signal Processing, 2007

Adaptive Motion Estimation Algorithm for H.264/AVC.
Proceedings of the 15th International Conference on Digital Signal Processing, 2007

A Run-time Reconfigurable Processor for Video Motion Estimation.
Proceedings of the FPL 2007, 2007

Stochastic integrate-and-fire model for the retina.
Proceedings of the 15th European Signal Processing Conference, 2007

Data buffering optimization methods toward a uniform programming interface for GPU-based applications.
Proceedings of the 4th Conference on Computing Frontiers, 2007

Design and implementation of a stream-based distributed computing platform using graphics processing units.
Proceedings of the 4th Conference on Computing Frontiers, 2007

Efficient Method for Magnitude Comparison in RNS Based on Two Pairs of Conjugate Moduli.
Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH-18 2007), 2007

Toward a Realistic Task Scheduling Model.
IEEE Trans. Parallel Distributed Syst., 2006

A New Hand-Held Microsystem Architecture for Biological Analysis.
IEEE Trans. Circuits Syst. I Regul. Pap., 2006

Maestro2: Experimental Evaluation of Communication Performance Improvement Techniques in the Link Layer.
J. Interconnect. Networks, 2006

Rescheduling for Optimized SHA-1 Calculation.
Proceedings of the Embedded Computer Systems: Architectures, 2006

Low Power Distance Measurement Unit for Real-Time Hardware Motion Estimators.
Proceedings of the Integrated Circuit and System Design. Power and Timing Modeling, 2006

Reconfigurable memory based AES co-processor.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Application Specific Instruction Set Processor for Adaptive Video Motion Estimation.
Proceedings of the Ninth Euromicro Conference on Digital System Design: Architectures, Methods and Tools (DSD 2006), 30 August, 2006

Improving SHA-2 Hardware Implementations.
Proceedings of the Cryptographic Hardware and Embedded Systems, 2006

Configurable Embedded Core for Controlling Electro-Mechanical Systems.
Proceedings of the Reconfigurable Computing: Architectures and Applications, 2006

Communication Contention in Task Scheduling.
IEEE Trans. Parallel Distributed Syst., 2005

Corrections to "A Universal Architecture for Designing Efficient Modulo 2^n+1 Multipliers".
IEEE Trans. Circuits Syst. I Regul. Pap., 2005

A universal architecture for designing efficient modulo 2^n+1 multipliers.
IEEE Trans. Circuits Syst. I Regul. Pap., 2005

Visual neuroprosthesis: a non invasive system for stimulating the cortex.
IEEE Trans. Circuits Syst. I Regul. Pap., 2005

Efficient VLSI Architecture for Real-Time Motion Estimation in Advanced Video Coding.
Proceedings of the 2005 IEEE International SOC Conference, 2005

On the Implementation and Evaluation of Berkeley Sockets on Maestro2 cluster computing environment.
Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC 2005), 2005

Least squares motion estimation algorithm in the compressed DCT domain for H.26x/MPEG-x video sequences.
Proceedings of the Advanced Video and Signal Based Surveillance, 2005

The Midlifekicker Microarchitecture Evaluation Metric.
Proceedings of the 16th IEEE International Conference on Application-Specific Systems, 2005

On Task Scheduling Accuracy: Evaluation Methodology and Results.
J. Supercomput., 2004

List scheduling: extension for contention awareness and evaluation of node priorities for heterogeneous cluster architectures.
Parallel Comput., 2004

A programmable cellular neural network circuit.
Proceedings of the 17th Annual Symposium on Integrated Circuits and Systems Design, 2004

Task Scheduling: Considering the Processor Involvement in Communication.
Proceedings of the 3rd International Symposium on Parallel and Distributed Computing (ISPDC 2004), 2004

Distributed Shared Memory System Based on the Maestro2 High Performance Cluster Network.
Proceedings of the 3rd International Symposium on Parallel and Distributed Computing (ISPDC 2004), 2004

On the performance of Maestro2 high performance network equipment, using new improvement techniques.
Proceedings of the 23rd IEEE International Performance Computing and Communications Conference, 2004

{2^n+1, 2^{n+k}, 2^n-1}: A New RNS Moduli Set Extension.
Proceedings of the 2004 Euromicro Symposium on Digital Systems Design (DSD 2004), Architectures, Methods and Tools, 31 August, 2004

Automatic Synthesis of Motion Estimation Processors Based on a New Class of Hardware Architectures.
J. VLSI Signal Process., 2003

Fast transcoding architectures for insertion of non-regular shaped objects in the compressed DCT-domain.
Signal Process. Image Commun., 2003

An FPL Bioinspired Visual Encoding System to Stimulate Cortical Neurons in Real-Time.
Proceedings of the Field Programmable Logic and Application, 13th International Conference, 2003

Customisable Core-Based Architectures for Real-Time Motion Estimation on FPGAs.
Proceedings of the Field Programmable Logic and Application, 13th International Conference, 2003

RDSP: A RISC DSP based on Residue Number System.
Proceedings of the 2003 Euromicro Symposium on Digital Systems Design (DSD 2003), 2003

Efficient and configurable full-search block-matching processors.
IEEE Trans. Circuits Syst. Video Technol., 2002

Video coding by using the 3D zero-tree approach in the wavelet transform domain.
Proceedings of the 14th International Conference on Digital Signal Processing, 2002

Insertion of irregular-shaped logos in the compressed DCT domain.
Proceedings of the 14th International Conference on Digital Signal Processing, 2002

A New Efficient VLSI Architecture for Full Search Block Matching Motion Estimation.
Proceedings of the SOC Design Methodologies, 2001

Comparison of Contention Aware List Scheduling Heuristics for Cluster Computing.
Proceedings of the 30th International Workshops on Parallel Processing (ICPP 2001 Workshops), 2001

Scheduling Task Graphs on Arbitrary Processor Architectures Considering Contention.
Proceedings of the High-Performance Computing and Networking, 9th International Conference, 2001

Exploiting Unused Time Slots in List Scheduling Considering Communication Contention.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

Synchronous Non-local Image Processing on Orthogonal Multiprocessor Systems.
Proceedings of the Vector and Parallel Processing, 2000

A Platform Independent Parallelising Tool Based on Graph Theoretic Models.
Proceedings of the Vector and Parallel Processing, 2000

On the Development and Evaluation of Specialized Processors for Computing High-Order 2-D Image Moments in Real-Time.
Proceedings of the Fifth International Workshop on Computer Architectures for Machine Perception (CAMP 2000), 2000

Low-power array architectures for motion estimation.
Proceedings of the Third IEEE Workshop on Multimedia Signal Processing, 1999

Applying Conditional Processing to Design Low-Power Array Processors for Motion Estimation.
Proceedings of the 1999 International Conference on Image Processing, 1999

On the Development of a Video CODEC for Low Bitrate Communication in General Purpose Computers.
Proceedings of the 17th IASTED International Conference on Applied Informatics, 1999

Bidirectional systolic arrays for digital recursive filters.
Proceedings of the 5th IEEE International Conference on Electronics, Circuits and Systems, 1998

A new orthogonal multiprocessor and its application to image processing.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997
