2025
Instruction-Aware Cooperative TLB and Cache Replacement Policies.
Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2025
2024
AmgT: Algebraic Multigrid Solver on Tensor Cores.
Proceedings of the International Conference for High Performance Computing, 2024
Practically Tackling Memory Bottlenecks of Graph-Processing Workloads.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024
Exploiting Vector Code Semantics for Efficient Data Cache Prefetching.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024
Extending Sparse Patterns to Improve Inverse Preconditioning on GPU Architectures.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024
2023
HPCG on long-vector architectures: Evaluation and optimization on NEC SX-Aurora and RISC-V.
Future Gener. Comput. Syst., June, 2023
Compressed Real Numbers for AI: a case-study using a RISC-V CPU.
CoRR, 2023
Open-Source GEMM Hardware Kernels Generator: Toward Numerically-Tailored Computations.
CoRR, 2023
Characterizing the impact of last-level cache replacement policies on big-data workloads.
CoRR, 2023
Optimization of SpGEMM with Risc-V vector instructions.
CoRR, 2023
Efficient Direct Convolution Using Long SIMD Instructions.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023
Efficient Execution of SpGEMM on Long Vector Architectures.
Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023
An Open-Source Framework for Efficient Numerically-Tailored Computations.
Proceedings of the 33rd International Conference on Field-Programmable Logic and Applications, 2023
2022
Compiler-Assisted Compaction/Restoration of SIMD Instructions.
IEEE Trans. Parallel Distributed Syst., 2022
A BF16 FMA is All You Need for DNN Training.
IEEE Trans. Emerg. Top. Comput., 2022
Optimization of the Sparse Multi-Threaded Cholesky Factorization for A64FX.
CoRR, 2022
TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models.
Proceedings of the SC22: International Conference for High Performance Computing, 2022
FASE: A Fast, Accurate and Seamless Emulator for Custom Numerical Formats.
Proceedings of the Machine Learning and Knowledge Discovery in Databases, 2022
Page Size Aware Cache Prefetching.
Proceedings of the 55th IEEE/ACM International Symposium on Microarchitecture, 2022
Task-based Acceleration of Bidirectional Recurrent Neural Networks on Multi-core Architectures.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022
Communication-aware Sparse Patterns for the Factorized Approximate Inverse Preconditioner.
Proceedings of the HPDC '22: The 31st International Symposium on High-Performance Parallel and Distributed Computing, Minneapolis, MN, USA, 27 June 2022, 2022
A Generator of Numerically-Tailored and High-Throughput Accelerators for Batched GEMMs.
Proceedings of the 30th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2022
A Selective Nesting Approach for the Sparse Multi-threaded Cholesky Factorization.
Proceedings of the 7th IEEE/ACM International Workshop on Extreme Scale Programming Models and Middleware, 2022
2021
Intelligent Adaptation of Hardware Knobs for Improving Performance and Power Consumption.
IEEE Trans. Computers, 2021
Efficiently running SpMV on long vector architectures.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021
Multilevel simulation-based co-design of next generation HPC microprocessors.
Proceedings of the 2021 International Workshop on Performance Modeling, 2021
Morrigan: A Composite Instruction TLB Prefetcher.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021
Exploiting Page Table Locality for Agile TLB Prefetching.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021
Dynamically Adapting Floating-Point Precision to Accelerate Deep Neural Network Training.
Proceedings of the 20th IEEE International Conference on Machine Learning and Applications, 2021
Cache-aware Sparse Patterns for the Factorized Sparse Approximate Inverse Preconditioner.
Proceedings of the HPDC '21: The 30th International Symposium on High-Performance Parallel and Distributed Computing, 2021
PrioRAT: Criticality-Driven Prioritization Inside the On-Chip Memory Hierarchy.
Proceedings of the Euro-Par 2021: Parallel Processing, 2021
2020
Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies.
J. Supercomput., 2020
Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs.
J. Supercomput., 2020
Using Arm's scalable vector extension on stencil codes.
J. Supercomput., 2020
Semi-automatic validation of cycle-accurate simulation infrastructures: The case for gem5-x86.
Future Gener. Comput. Syst., 2020
Generating Efficient DNN-Ensembles with Evolutionary Computation.
CoRR, 2020
Reducing Data Motion to Accelerate the Training of Deep Neural Networks.
CoRR, 2020
Runtime-guided ECC protection using online estimation of memory vulnerability.
Proceedings of the International Conference for High Performance Computing, 2020
Cost-aware prediction of uncorrected DRAM errors in the field.
Proceedings of the International Conference for High Performance Computing, 2020
Characterizing the impact of last-level cache replacement policies on big-data workloads.
Proceedings of the IEEE International Symposium on Workload Characterization, 2020
Wavefront parallelization of recurrent neural networks on multi-core architectures.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
RICH: implementing reductions in the cache hierarchy.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
Modeling and optimizing NUMA effects and prefetching with machine learning.
Proceedings of the ICS '20: 2020 International Conference on Supercomputing, 2020
Evaluating Mixed-Precision Arithmetic for 3D Generative Adversarial Networks to Simulate High Energy Physics Detectors.
Proceedings of the 19th IEEE International Conference on Machine Learning and Applications, 2020
Improving Predication Efficiency through Compaction/Restoration of SIMD Instructions.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020
2019
Design trade-offs for emerging HPC processors based on mobile market technology.
J. Supercomput., 2019
Sampled Simulation of Task-Based Programs.
IEEE Trans. Computers, 2019
Special issue on the message passing interface.
Parallel Comput., 2019
On the maturity of parallel applications for asymmetric multi-core processors.
J. Parallel Distributed Comput., 2019
Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions.
Int. J. High Perform. Comput. Appl., 2019
Optimizing computation-communication overlap in asynchronous task-based programs: poster.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019
On the Benefits of Tasking with OpenMP.
Proceedings of the OpenMP: Conquering the Full Hardware Spectrum, 2019
Design Space Exploration of Next-Generation HPC Machines.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019
A Vulnerability Factor for ECC-protected Memory.
Proceedings of the 25th IEEE International Symposium on On-Line Testing and Robust System Design, 2019
Open-Source Shared Memory implementation of the HPCG benchmark: analysis, improvements and evaluation on Cavium ThunderX2.
Proceedings of the 17th International Conference on High Performance Computing & Simulation, 2019
Power efficient job scheduling by predicting the impact of processor manufacturing variability.
Proceedings of the ACM International Conference on Supercomputing, 2019
Optimizing computation-communication overlap in asynchronous task-based programs.
Proceedings of the ACM International Conference on Supercomputing, 2019
Convolutional Neural Network Training with Dynamic Epoch Ordering.
Proceedings of the Artificial Intelligence Research and Development, 2019
POSTER: An Optimized Predication Execution for SIMD Extensions.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019
2018
Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers.
IEEE Trans. Parallel Distributed Syst., 2018
Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach.
IEEE Trans. Parallel Distributed Syst., 2018
Performance and energy effects on task-based parallelized applications - User-directed versus manual vectorization.
J. Supercomput., 2018
Memory Vulnerability: A Case for Delaying Error Reporting.
CoRR, 2018
Low-Precision Floating-Point Schemes for Neural Network Training.
CoRR, 2018
TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism.
Proceedings of the High Performance Computing - 33rd International Conference, 2018
Approximating a Multi-Grid Solver.
Proceedings of the 2018 IEEE/ACM Performance Modeling, 2018
Runtime-assisted cache coherence deactivation in task parallel programs.
Proceedings of the International Conference for High Performance Computing, 2018
Graph partitioning applied to DAG scheduling to reduce NUMA effects.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018
Data Prefetching on In-order Processors.
Proceedings of the 2018 International Conference on High Performance Computing & Simulation, 2018
Reducing Data Movement on Large Shared Memory Systems by Exploiting Computation Dependencies.
Proceedings of the 32nd International Conference on Supercomputing, 2018
Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs.
Proceedings of the 32nd International Conference on Supercomputing, 2018
Architectural Support for Task Dependence Management with Flexible Software Scheduling.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018
Stencil codes on a vector length agnostic architecture.
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, 2018
2017
Task Scheduling Techniques for Asymmetric Multi-Core Systems.
IEEE Trans. Parallel Distributed Syst., 2017
Prediction of the impact of network switch utilization on application performance via active measurement.
Parallel Comput., 2017
iQ: An Efficient and Flexible Queue-Based Simulation Framework.
Proceedings of the 25th IEEE International Symposium on Modeling, 2017
ATM: Approximate Task Memoization in the Runtime System.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017
Iteration-fusing conjugate gradient.
Proceedings of the International Conference on Supercomputing, 2017
libPRISM: an intelligent adaptation of prefetch and SMT levels.
Proceedings of the International Conference on Supercomputing, 2017
Evaluating Scientific Workflow Execution on an Asymmetric Multicore Processor.
Proceedings of the Euro-Par 2017: Parallel Processing Workshops, 2017
Runtime-Assisted Shared Cache Insertion Policies Based on Re-reference Intervals.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017
2016
Evaluation of HPC Applications' Memory Resource Consumption via Active Measurement.
IEEE Trans. Parallel Distributed Syst., 2016
PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite.
ACM Trans. Archit. Code Optim., 2016
MUSA: a multi-level simulation approach for next-generation HPC machines.
Proceedings of the International Conference for High Performance Computing, 2016
TaskPoint: Sampled simulation of task-based programs.
Proceedings of the 2016 IEEE International Symposium on Performance Analysis of Systems and Software, 2016
CATA: Criticality Aware Task Acceleration for Multicore Processors.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016
Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes.
Proceedings of the 2016 International Conference on Supercomputing, 2016
POSTER: Exploiting Asymmetric Multi-Core Processors with Flexible System Software.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016
Reducing Cache Coherence Traffic with Hierarchical Directory Cache and NUMA-Aware Runtime Scheduling.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016
2015
A framework for evaluating comprehensive fault resilience mechanisms in numerical programs.
J. Supercomput., 2015
Adaptive and application dependent runtime guided hardware prefetcher reconfiguration on the IBM POWER7.
CoRR, 2015
Exploiting asynchrony from exact forward recovery for DUE in iterative solvers.
Proceedings of the International Conference for High Performance Computing, 2015
Evaluating the Impact of OpenMP 4.0 Extensions on Relevant Parallel Workloads.
Proceedings of the OpenMP: Heterogenous Execution and Data Movements, 2015
Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures.
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015
Runtime-Aware Architectures.
Proceedings of the Euro-Par 2015: Parallel Processing, 2015
Runtime-Guided Management of Scratchpad Memories in Multicore Architectures.
Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015
2014
Runtime-Aware Architectures: A First Approach.
Supercomput. Front. Innov., 2014
Active Measurement of Memory Resource Consumption.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
Active Measurement of the Impact of Network Switch Utilization on Application Performance.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014
Evaluating Execution Time Predictability of Task-Based Programs on Multi-Core Processors.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014
2013
Performance Analysis Techniques for the Exascale Co-Design Process.
Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013
2012
Poster: Autonomic Modeling of Data-Driven Application Behavior.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012
Abstract: Autonomic Modeling of Data-Driven Application Behavior.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012
Fault resilience of the algebraic multi-grid solver.
Proceedings of the International Conference on Supercomputing, 2012
2011
Simulating Whole Supercomputer Applications.
IEEE Micro, 2011
Extracting the optimal sampling frequency of applications using spectral analysis.
Concurr. Comput. Pract. Exp., 2011
Trace Spectral Analysis toward Dynamic Levels of Detail.
Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems, 2011
2010
Spectral analysis of executions of computer programs and its applications on performance analysis.
PhD thesis, 2010
Automatic Phase Detection and Structure Extraction of MPI Applications.
Int. J. High Perform. Comput. Appl., 2010
2008
Automatic analysis of speedup of MPI applications.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008
Prediction of behavior of MPI applications.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008
2007
Automatic Phase Detection of MPI Applications.
Proceedings of the Parallel Computing: Architectures, 2007
Automatic Structure Extraction from MPI Applications Tracefiles.
Proceedings of the Euro-Par 2007, 2007