2025

Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration.

Yicheng Pan

Zhenrong Zhang

CoRR, April, 2025

MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique.

CoRR, April, 2025

PRM-BAS: Enhancing Multimodal Reasoning through PRM-guided Beam Annealing Search.

CoRR, April, 2025

Latent Swap Joint Diffusion for Long-Form Audio Generation.

CoRR, February, 2025

Skeleton and Font Generation Network for Zero-shot Chinese Character Generation.

CoRR, January, 2025

Count, decompose and correct: A new approach to handwritten Chinese character error correction.

Pattern Recognit., 2025

Bidirectional trained tree-structured decoder for Handwritten Mathematical Expression Recognition.

Pattern Recognit., 2025

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking head Video Generation.

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion.

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

DocMamba: Efficient Document Pre-training with State Space Model.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

RFL: Simplifying Chemical Structure Recognition with Ring-Free Language.

Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024

Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement.

Int. J. Document Anal. Recognit., September, 2024

SEMv2: Table separation line detection based on instance segmentation.

Pattern Recognit., 2024

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion.

CoRR, 2024

See then Tell: Enhancing Key Information Extraction with Vision Grounding.

CoRR, 2024

A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition.

CoRR, 2024

SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding.

Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

SEMv3: A Fast and Robust Approach to Table Separation Line Detection.

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024

Maths: Multimodal Transformer-Based Human-Readable Solver.

Proceedings of the IEEE International Conference on Multimedia and Expo, 2024

Radical Similarity Based Model Optimization and Post-correction for Chinese Character Recognition.

Proceedings of the Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30, 2024

ICDAR 2024 Competition on Recognition of Chemical Structures.

Proceedings of the Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30, 2024

Viewing Writing as Video: Optical Flow based Multi-Modal Handwritten Mathematical Expression Recognition.

Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition.

Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024

2023

Multimodal Pre-Training Based on Graph Attention Network for Document Understanding.

IEEE Trans. Multim., 2023

Count, Decode and Fetch: A New Approach to Handwritten Chinese Character Error Correction.

CoRR, 2023

SEMv2: Table Separation Line Detection Based on Conditional Convolution.

CoRR, 2023

Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023.

Proceedings of the 31st ACM International Conference on Multimedia, 2023

Group, Contrast and Recognize: A Self-supervised Method for Chinese Character Recognition.

Proceedings of the Document Analysis and Recognition - ICDAR 2023, 2023

Enhancing Math Word Problem Solving Through Salient Clue Prioritization: A Joint Token-Phrase-Level Feature Integration Approach.

Proceedings of the International Conference on Asian Language Processing, 2023

USTC-iFLYTEK at DocILE: A Multi-modal Approach Using Domain-specific GraphDoc.

Proceedings of the Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), 2023

HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures.

Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023

2022

GMN: Generative Multi-modal Network for Practical Document Information Extraction.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Query-driven Generative Network for Document Information Extraction in the Wild.

Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022

2021

An Open-Source Library of 2D-GMM-HMM Based on Kaldi Toolkit and Its Application to Handwritten Chinese Character Recognition.

Jiefeng Ma

Zirui Wang

Jun Du

Proceedings of the Image and Graphics - 11th International Conference, 2021