2025
Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration.
CoRR, April, 2025
MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique.
CoRR, April, 2025
PRM-BAS: Enhancing Multimodal Reasoning through PRM-guided Beam Annealing Search.
CoRR, April, 2025
Latent Swap Joint Diffusion for Long-Form Audio Generation.
CoRR, February, 2025
Skeleton and Font Generation Network for Zero-shot Chinese Character Generation.
CoRR, January, 2025
Count, decompose and correct: A new approach to handwritten Chinese character error correction.
Pattern Recognit., 2025
Bidirectional trained tree-structured decoder for Handwritten Mathematical Expression Recognition.
Pattern Recognit., 2025
DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking head Video Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
DocMamba: Efficient Document Pre-training with State Space Model.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
RFL: Simplifying Chemical Structure Recognition with Ring-Free Language.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
2024
Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement.
Int. J. Document Anal. Recognit., September, 2024
SEMv2: Table separation line detection based on instance segmentation.
Pattern Recognit., 2024
EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion.
CoRR, 2024
See then Tell: Enhancing Key Information Extraction with Vision Grounding.
CoRR, 2024
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition.
CoRR, 2024
SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
SEMv3: A Fast and Robust Approach to Table Separation Line Detection.
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024
Maths: Multimodal Transformer-Based Human-Readable Solver.
Proceedings of the IEEE International Conference on Multimedia and Expo, 2024
Radical Similarity Based Model Optimization and Post-correction for Chinese Character Recognition.
Proceedings of the Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30, 2024
ICDAR 2024 Competition on Recognition of Chemical Structures.
Proceedings of the Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30, 2024
Viewing Writing as Video: Optical Flow based Multi-Modal Handwritten Mathematical Expression Recognition.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024
UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024
2023
Multimodal Pre-Training Based on Graph Attention Network for Document Understanding.
IEEE Trans. Multim., 2023
Count, Decode and Fetch: A New Approach to Handwritten Chinese Character Error Correction.
CoRR, 2023
SEMv2: Table Separation Line Detection Based on Conditional Convolution.
CoRR, 2023
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023.
Proceedings of the 31st ACM International Conference on Multimedia, 2023
Group, Contrast and Recognize: A Self-supervised Method for Chinese Character Recognition.
Proceedings of the Document Analysis and Recognition - ICDAR 2023, 2023
Enhancing Math Word Problem Solving Through Salient Clue Prioritization: A Joint Token-Phrase-Level Feature Integration Approach.
Proceedings of the International Conference on Asian Language Processing, 2023
USTC-iFLYTEK at DocILE: A Multi-modal Approach Using Domain-specific GraphDoc.
Proceedings of the Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), 2023
HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures.
Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023
2022
GMN: Generative Multi-modal Network for Practical Document Information Extraction.
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022
Query-driven Generative Network for Document Information Extraction in the Wild.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
2021
An Open-Source Library of 2D-GMM-HMM Based on Kaldi Toolkit and Its Application to Handwritten Chinese Character Recognition.
Proceedings of the Image and Graphics - 11th International Conference, 2021