2025
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding.
CoRR, January, 2025
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning.
CoRR, January, 2025
2024
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models.
Trans. Assoc. Comput. Linguistics, 2024
ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain.
CoRR, 2024
FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents.
CoRR, 2024
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models.
CoRR, 2024
ReIFE: Re-evaluating Instruction-Following Evaluation.
CoRR, 2024
Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications.
CoRR, 2024
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation.
CoRR, 2024
Step-Back Profiling: Distilling User History for Personalized Scientific Writing.
CoRR, 2024
MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise.
CoRR, 2024
Evaluating LLMs at Detecting Errors in LLM Responses.
CoRR, 2024
Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science.
CoRR, 2024
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models.
CoRR, 2024
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in LLMs.
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Struc-Bench: Are Large Language Models Good at Generating Complex Structured Tabular Data?
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Short Papers, 2024
On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization.
Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2024, 2024
Investigating Data Contamination in Modern Benchmarks for Large Language Models.
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024
Revisiting Automated Evaluation for Long-form Table Question Answering.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: EMNLP 2024, 2024
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, 2024
FOLIO: Natural Language Reasoning with First-Order Logic.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
FinDVer: Explainable Claim Verification over Long and Hybrid-content Financial Documents.
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Unveiling the Spectrum of Data Contamination in Language Model: A Survey from Detection to Remediation.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
KnowledgeFMath: A Knowledge-Intensive Math Reasoning Dataset in Finance Domains.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Financial Documents.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
TaPERA: Enhancing Faithfulness and Interpretability in Long-Form Table QA by Content Planning and Execution-based Reasoning.
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
2023
MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning.
CoRR, 2023
ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks.
CoRR, 2023
DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data.
CoRR, 2023
KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains.
CoRR, 2023
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models.
CoRR, 2023
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?
CoRR, 2023
ODSum: New Benchmarks for Open Domain Multi-Document Summarization.
CoRR, 2023
Large Language Models are Effective Table-to-Text Generators, Evaluators, and Feedback Providers.
CoRR, 2023
QTSumm: A New Benchmark for Query-Focused Table Summarization.
CoRR, 2023
Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.
CoRR, 2023
Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, 2023
Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
Investigating Table-to-Text Generation Capabilities of Large Language Models in Real-World Information Seeking Scenarios.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023, 2023
QTSumm: Query-Focused Summarization over Tabular Data.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control.
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
OpenRT: An Open-source Framework for Reasoning Over Tabular Data.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2023
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023
2022
Apparel-Invariant Feature Learning for Person Re-Identification.
IEEE Trans. Multim., 2022
FOLIO: Natural Language Reasoning with First-Order Logic.
CoRR, 2022
FinMath: Injecting a Tree-structured Solver for Question Answering over Financial Reports.
Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022
ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
R2D2: Robust Data-to-Text with Replacement Detection.
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022
2021
MusiCoder: A Universal Music-Acoustic Encoder Based on Transformer.
Proceedings of the MultiMedia Modeling - 27th International Conference, 2021
2020
LAMP: Label Augmented Multimodal Pretraining.
CoRR, 2020
Apparel-invariant Feature Learning for Apparel-changed Person Re-identification.
CoRR, 2020
MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers.
CoRR, 2020