Lorenzo Baraldi

Orcid: 0000-0001-5125-4957

  • University of Pisa, Pisa, Toscana, Italy - professor

According to our database, Lorenzo Baraldi authored at least 123 papers between 2013 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.







Towards Retrieval-Augmented Architectures for Image Captioning.
ACM Trans. Multim. Comput. Commun. Appl., August, 2024

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets.
Int. J. Comput. Vis., May, 2024

Video Surveillance and Privacy: A Solvable Paradox?
Computer, March, 2024

Are Learnable Prompts the Right Way of Prompting? Adapting Vision-and-Language Models with Memory Optimization.
IEEE Intell. Syst., 2024

Fluent and Accurate Image Captioning with a Self-Trained Reward Model.
CoRR, 2024

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization.
CoRR, 2024

UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation.
CoRR, 2024

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues.
CoRR, 2024

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities.
CoRR, 2024

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs.
CoRR, 2024

AIGeN: An Adversarial Approach for Instruction Generation in VLN.
CoRR, 2024

Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation.
CoRR, 2024

The (R)Evolution of Multimodal Large Language Models: A Survey.
CoRR, 2024

What's Outside the Intersection? Fine-grained Error Analysis for Semantic Segmentation Beyond IoU.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

Mapping High-level Semantic Regions in Indoor Environments without Object Recognition.
Proceedings of the IEEE International Conference on Robotics and Automation, 2024

The Revolution of Multimodal Large Language Models: A Survey.
Proceedings of the Findings of the Association for Computational Linguistics, 2024

Sharing Cultural Heritage - The Case of the Lodovico Media Library.
Multimodal Technol. Interact., December, 2023

Fully-attentive iterative networks for region-based controllable image and video captioning.
Comput. Vis. Image Underst., December, 2023

Evaluating synthetic pre-Training for handwriting processing tasks.
Pattern Recognit. Lett., August, 2023

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates.
Sensors, February, 2023

From Show to Tell: A Survey on Deep Learning-Based Image Captioning.
IEEE Trans. Pattern Anal. Mach. Intell., 2023

Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation.
CoRR, 2023

Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation.
CoRR, 2023

Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training.
CoRR, 2023

Multi-Class Explainable Unlearning for Image Classification via Weight Filtering.
CoRR, 2023

Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images.
CoRR, 2023

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation.
CoRR, 2023

Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

Where Research meets Industry: New Challenges and Opportunities at AImageLab.
Proceedings of the Italia Intelligenza Artificiale, 2023

Embodied Agents for Efficient Exploration and Smart Scene Description.
Proceedings of the IEEE International Conference on Robotics and Automation, 2023

Towards Explainable Navigation and Recounting.
Proceedings of the Image Analysis and Processing - ICIAP 2023, 2023

Unveiling the Impact of Image Transformations on Deepfake Detection: An Experimental Analysis.
Proceedings of the Image Analysis and Processing - ICIAP 2023, 2023

SynthCap: Augmenting Transformers with Synthetic Data for Image Captioning.
Proceedings of the Image Analysis and Processing - ICIAP 2023, 2023

Enhancing Open-Vocabulary Semantic Segmentation with Prototype Retrieval.
Proceedings of the Image Analysis and Processing - ICIAP 2023, 2023

With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Superpixel Positional Encoding to Improve ViT-based Semantic Segmentation Models.
Proceedings of the 34th British Machine Vision Conference 2023, 2023

Matching Faces and Attributes Between the Artistic and the Real Domain: the PersonArt Approach.
ACM Trans. Multim. Comput. Commun. Appl., 2022

A computational approach for progressive architecture shrinkage in action recognition.
Softw. Pract. Exp., 2022

Focus on Impact: Indoor Exploration With Intrinsic Motivation.
IEEE Robotics Autom. Lett., 2022

Boosting modern and historical handwritten text recognition with deformable convolutions.
Int. J. Document Anal. Recognit., 2022

Explaining transformer-based image captioning models: An empirical analysis.
AI Commun., 2022

Spot the Difference: A Novel Task for Embodied Agents in Changing Environments.
Proceedings of the 26th International Conference on Pattern Recognition, 2022

The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition.
Proceedings of the 26th International Conference on Pattern Recognition, 2022

CaMEL: Mean Teacher Learning for Image Captioning.
Proceedings of the 26th International Conference on Pattern Recognition, 2022

Investigating Bidimensional Downsampling in Vision Transformer Models.
Proceedings of the Image Analysis and Processing - ICIAP 2022, 2022

Embodied Navigation at the Art Gallery.
Proceedings of the Image Analysis and Processing - ICIAP 2022, 2022

Dual-Branch Collaborative Transformer for Virtual Try-On.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022

The Unreasonable Effectiveness of CLIP Features for Image Captioning: An Experimental Analysis.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022

Retrieval-Augmented Transformer for Image Captioning.
Proceedings of the CBMI 2022: International Conference on Content-based Multimedia Indexing, Graz, Austria, September 14, 2022

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval.
Proceedings of the CBMI 2022: International Conference on Content-based Multimedia Indexing, Graz, Austria, September 14, 2022

Working Memory Connections for LSTM.
Neural Networks, 2021

Video action detection by learning graph-based spatio-temporal interactions.
Comput. Vis. Image Underst., 2021

Multimodal attention networks for low-level vision-and-language navigation.
Comput. Vis. Image Underst., 2021

Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation.
CoRR, 2021

From Show to Tell: A Survey on Image Captioning.
CoRR, 2021

Learning to Select: A Fully Attentive Approach for Novel Object Captioning.
Proceedings of the ICMR '21: International Conference on Multimedia Retrieval, 2021

Improving Indoor Semantic Segmentation with Boundary-Level Objectives.
Proceedings of the Advances in Computational Intelligence, 2021

Estimating (and Fixing) the Effect of Face Obfuscation in Video Recognition.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021

Revisiting the Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2021

Learning to Read L'Infinito: Handwritten Text Recognition with Synthetic Training Data.
Proceedings of the Computer Analysis of Images and Patterns, 2021

Out of the Box: Embodied Navigation in the Real World.
Proceedings of the Computer Analysis of Images and Patterns, 2021

Assessing the Role of Boundary-Level Objectives in Indoor Semantic Segmentation.
Proceedings of the Computer Analysis of Images and Patterns, 2021

Spaghetti Labeling: Directed Acyclic Graphs for Block-Based Connected Components Labeling.
IEEE Trans. Image Process., 2020

Explaining digital humanities by aligning images and textual descriptions.
Pattern Recognit. Lett., 2020

A unified cycle-consistent neural model for text and image retrieval.
Multim. Tools Appl., 2020

Toward reliable experiments on the performance of Connected Components Labeling algorithms.
J. Real Time Image Process., 2020

Inter-Homines: Distance-Based Risk Estimation for Human Safety.
CoRR, 2020

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability.
Proceedings of the 2020 IEEE International Conference on Robotics and Automation, 2020

RMS-Net: Regression and Masking for Soccer Event Spotting.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

A Novel Attention-based Aggregation Function to Combine Vision and Language.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

Explore and Explain: Self-supervised Navigation and Recounting.
Proceedings of the 25th International Conference on Pattern Recognition, 2020

Meshed-Memory Transformer for Image Captioning.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

AI4AR: An AI-Based Mobile Application for the Automatic Generation of AR Contents.
Proceedings of the Augmented Reality, Virtual Reality, and Computer Graphics, 2020

M-VAD names: a dataset for video captioning with naming.
Multim. Tools Appl., 2019

M²: Meshed-Memory Transformer for Image Captioning.
CoRR, 2019

STAGE: Spatio-Temporal Attention on Graph Entities for Video Action Detection.
CoRR, 2019

Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation.
CoRR, 2019

Image-to-Image Translation to Unfold the Reality of Artworks: An Empirical Analysis.
Proceedings of the Image Analysis and Processing - ICIAP 2019, 2019

Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain.
Proceedings of the Image Analysis and Processing - ICIAP 2019, 2019

Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-To-Image Translation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

A Deep-learning-based approach to VM behavior Identification in Cloud Systems.
Proceedings of the 9th International Conference on Cloud Computing and Services Science, 2019

Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters.
Proceedings of the 30th British Machine Vision Conference 2019, 2019

Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention.
ACM Trans. Multim. Comput. Commun. Appl., 2018

Predicting Human Eye Fixations via an LSTM-Based Saliency Attentive Model.
IEEE Trans. Image Process., 2018

Attentive models in vision: Computing saliency maps in the deep learning era.
Intelligenza Artificiale, 2018

Connected Components Labeling on DRAGs: Implementation and Reproducibility Notes.
Proceedings of the Reproducible Research in Pattern Recognition, 2018

Automatic Image Cropping and Selection Using Saliency: An Application to Historical Manuscripts.
Proceedings of the Digital Libraries and Multimedia Archives, 2018

A Hierarchical Quasi-Recurrent approach to Video Captioning.
Proceedings of the IEEE International Conference on Image Processing, 2018

Connected Components Labeling on DRAGs.
Proceedings of the 24th International Conference on Pattern Recognition, 2018

Aligning Text and Document Illustrations: Towards Visually Explainable Digital Humanities.
Proceedings of the 24th International Conference on Pattern Recognition, 2018

What Was Monet Seeing While Painting? Translating Artworks to Photo-Realistic Images.
Proceedings of the Computer Vision - ECCV 2018 Workshops, 2018

Towards Cycle-Consistent Models for Text and Image Retrieval.
Proceedings of the Computer Vision - ECCV 2018 Workshops, 2018

Visual-Semantic Alignment Across Domains Using a Semi-Supervised Approach.
Proceedings of the Computer Vision - ECCV 2018 Workshops, 2018

SAM: Pushing the Limits of Saliency Prediction Models.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018

LAMV: Learning to Align and Match Videos With Kernelized Temporal Layers.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

Recognizing and Presenting the Storytelling Video Structure With Deep Multimodal Networks.
IEEE Trans. Multim., 2017

A Video Library System Using Scene Detection and Automatic Tagging.
Proceedings of the Digital Libraries and Archives, 2017

Modeling multimodal cues in a deep learning-based framework for emotion recognition in the wild.
Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017

Visual saliency for image captioning in new multimedia services.
Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops, 2017

Towards Video Captioning with Naming: A Novel Dataset and a Multi-modal Approach.
Proceedings of the Image Analysis and Processing - ICIAP 2017, 2017

Hierarchical Boundary-Aware Neural Encoder for Video Captioning.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

NeuralStory: an Interactive Multimedia System for Video Indexing and Re-use.
Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, 2017

Shot, Scene and Keyframe Ordering for Interactive Video Re-use.
Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016), 2016

A Browsing and Retrieval System for Broadcast Videos using Scene Detection and Automatic Annotation.
Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features.
Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

Layout Analysis and Content Classification in Digitized Books.
Proceedings of the Digital Libraries and Multimedia Archives, 2016

YACCLAB - Yet Another Connected Components Labeling Benchmark.
Proceedings of the 23rd International Conference on Pattern Recognition, 2016

A deep multi-level network for saliency prediction.
Proceedings of the 23rd International Conference on Pattern Recognition, 2016

Historical document digitization through layout analysis and deep content classification.
Proceedings of the 23rd International Conference on Pattern Recognition, 2016

Context Change Detection for an Ultra-Low Power Low-Resolution Ego-Vision Imager.
Proceedings of the Computer Vision - ECCV 2016 Workshops, 2016

Multi-level Net: A Visual Saliency Prediction Model.
Proceedings of the Computer Vision - ECCV 2016 Workshops, 2016

Optimized Connected Components Labeling with Pixel Prediction.
Proceedings of the Advanced Concepts for Intelligent Vision Systems, 2016

A Deep Siamese Network for Scene Detection in Broadcast Videos.
Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, Brisbane, Australia, October 26, 2015

Analysis and Re-Use of Videos in Educational Digital Libraries with Automatic Scene Detection.
Proceedings of the Digital Libraries on the Move, 2015

Scene segmentation using temporal clustering for accessing and re-using broadcast video.
Proceedings of the 2015 IEEE International Conference on Multimedia and Expo, 2015

Measuring Scene Detection Performance.
Proceedings of the Pattern Recognition and Image Analysis - 7th Iberian Conference, 2015

Shot and Scene Detection via Hierarchical Clustering for Re-using Broadcast Video.
Proceedings of the Computer Analysis of Images and Patterns, 2015

Gesture Recognition in Ego-centric Videos Using Dense Trajectories and Hand Segmentation.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014

Hand segmentation for gesture recognition in EGO-vision.
Proceedings of the 3rd ACM international workshop on Interactive multimedia on mobile & portable devices, 2013
