2025
TesserAct: Learning 4D Embodied World Models.
CoRR, April, 2025
Towards Understanding Camera Motions in Any Video.
CoRR, April, 2025
AdaWorld: Learning Adaptable World Models with Latent Actions.
CoRR, March, 2025
LuciBot: Automated Robot Policy Learning from Generated Videos.
CoRR, March, 2025
MatchMaker: Automated Asset Generation for Robotic Assembly.
CoRR, March, 2025
Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training.
CoRR, February, 2025
Scaling Autonomous Agents via Automatic Reward Modeling And Planning.
CoRR, February, 2025
Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling.
CoRR, February, 2025
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search.
CoRR, February, 2025
TopoGaussian: Inferring Internal Topology Structures from Visual Clues.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Delta: Dense Efficient Long-Range 3D tracking for any video.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Revisiting Network Coding for Warm Blob Storage.
Proceedings of the 23rd USENIX Conference on File and Storage Technologies, 2025
UniMuMo: Unified Text, Music, and Motion Generation.
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025
2024
VCA: Video Curious Agent for Long Video Understanding.
CoRR, 2024
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences.
CoRR, 2024
SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning.
CoRR, 2024
Compositional Physical Reasoning of Objects and Events from Videos.
CoRR, 2024
CoNav: A Benchmark for Human-Centered Collaborative Navigation.
CoRR, 2024
RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text.
CoRR, 2024
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving.
CoRR, 2024
Virtual Foundry Graphnet for Metal Sintering Deformation Prediction.
CoRR, 2024
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation.
CoRR, 2024
Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble.
CoRR, 2024
Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Physically Compatible 3D Object Modeling from a Single Image.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization.
Proceedings of the Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024
Disentangled Acoustic Fields For Multimodal Physical Scene Understanding.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2024
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning.
Proceedings of the IEEE International Conference on Robotics and Automation, 2024
RoboDreamer: Learning Compositional World Models for Robot Imagination.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
ContPhy: Continuum Physical Concept Learning and Reasoning from Videos.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
3D-VLA: A 3D Vision-Language-Action Generative World Model.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Speech Self-Supervised Learning Using Diffusion Model Synthetic Data.
Proceedings of the Forty-first International Conference on Machine Learning, 2024
HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Building Cooperative Embodied Agents Modularly with Large Language Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Thin-Shell Object Manipulations With Differentiable Physics Simulations.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
SALMON: Self-Alignment with Instructable Reward Models.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
DIFFTACTILE: A Physics-based Differentiable Tactile Simulator for Contact-rich Robotic Manipulation.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
GENOME: Generative Neuro-Symbolic Visual Reasoning by Growing and Reusing Modules.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
FlexAttention for Efficient High-Resolution Vision-Language Models.
Proceedings of the Computer Vision - ECCV 2024, 2024
RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
UBSoft: A Simulation Platform for Robotic Skill Learning in Unbounded Soft Environments.
Proceedings of the Conference on Robot Learning, 6-9 November 2024, Munich, Germany, 2024
Multi-Agent Alternate Q-Learning.
Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, 2024
Aligning Large Multimodal Models with Factually Augmented RLHF.
Proceedings of the Findings of the Association for Computational Linguistics, 2024
Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning.
Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024
2023
Bird-Count: a multi-modality benchmark and system for bird population counting in the wild.
Multim. Tools Appl., December, 2023
TransCenter: Transformers With Dense Representations for Multiple-Object Tracking.
IEEE Trans. Pattern Anal. Mach. Intell., June, 2023
Self-supervised audiovisual representation learning for remote sensing data.
Int. J. Appl. Earth Obs. Geoinformation, February, 2023
DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning.
CoRR, 2023
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation.
CoRR, 2023
Autonomous Tree-search Ability of Large Language Models.
CoRR, 2023
SALMON: Self-Alignment with Principle-Following Reward Models.
CoRR, 2023
Generalizable Long-Horizon Manipulations with Large Language Models.
CoRR, 2023
A²Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models.
CoRR, 2023
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training.
CoRR, 2023
ModuleFormer: Learning Modular Large Language Models From Uncurated Data.
CoRR, 2023
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models.
CoRR, 2023
EC²: Emergent Communication for Embodied Control.
CoRR, 2023
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning.
CoRR, 2023
ClawSAT: Towards Both Robust and Accurate Code Models.
Proceedings of the IEEE International Conference on Software Analysis, Evolution and Reengineering, 2023
RoboNinja: Learning an Adaptive Cutting Policy for Multi-Material Objects.
Proceedings of the Robotics: Science and Systems XIX, Daegu, 2023
Adaptive Online Replanning with Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
3D-LLM: Injecting the 3D World into Large Language Models.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023
PockEngine: Sparse and Efficient Fine-tuning in a Pocket.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
Deep Masked Graph Matching for Correspondence Identification in Collaborative Perception.
Proceedings of the IEEE International Conference on Robotics and Automation, 2023
Reparameterized Policy Learning for Multimodal Trajectory Optimization.
Proceedings of the International Conference on Machine Learning, 2023
On the Forward Invariance of Neural ODEs.
Proceedings of the International Conference on Machine Learning, 2023
Learning Neural Constitutive Laws from Motion Observations for Generalizable PDE Dynamics.
Proceedings of the International Conference on Machine Learning, 2023
Planning with Large Language Models for Code Generation.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
Hyper-Decision Transformer for Efficient Online Policy Adaptation.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
SoftZoo: A Soft Robot Co-design Benchmark For Locomotion In Diverse Environments.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
PAC-NeRF: Physics Augmented Continuum Neural Radiance Fields for Geometry-Agnostic System Identification.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
DexDeform: Dexterous Deformable Object Manipulation with Human Demonstrations and Differentiable Physics.
Proceedings of the Eleventh International Conference on Learning Representations, 2023
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
Learning Vision-and-Language Navigation from YouTube Videos.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
Enabling Encrypted Delta Compression for Outsourced Storage Systems via Preserving Similarity.
Proceedings of the 41st IEEE International Conference on Computer Design, 2023
Sparse Universal Transformer.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
Masked Motion Encoding for Self-Supervised Video Representation Learning.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
EC²: Emergent Communication for Embodied Control.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Learning Situation Hyper-Graphs for Video Question Answering.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
3D Concept Learning and Reasoning from Multi-View Images.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions.
Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, 2023
2022
Text-instance graph: Exploring the relational semantics for text-based visual question answering.
Pattern Recognit., 2022
Graph Convolutional Module for Temporal Action Localization in Videos.
IEEE Trans. Pattern Anal. Mach. Intell., 2022
Purely Attention Based Local Feature Integration for Video Classification.
IEEE Trans. Pattern Anal. Mach. Intell., 2022
TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices.
IEEE Trans. Pattern Anal. Mach. Intell., 2022
Mod-Squad: Designing Mixture of Experts As Modular Multi-Task Learners.
CoRR, 2022
Retrospectives on the Embodied AI Workshop.
CoRR, 2022
M³Video: Masked Motion Modeling for Self-Supervised Video Representation Learning.
CoRR, 2022
MA2QL: A Minimalist Approach to Fully Decentralized Multi-Agent Reinforcement Learning.
CoRR, 2022
EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition.
CoRR, 2022
Certifiably robust interpretation via Rényi differential privacy.
Artif. Intell., 2022
SNAKE: Shape-aware Neural 3D Keypoint Field.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Learning Neural Acoustic Fields.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
3D Concept Grounding on Neural Fields.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Learning Physical Dynamics with Subequivariant Graph Neural Networks.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Learning Active Camera for Multi-Object Navigation.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
On-Device Training Under 256KB Memory.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022
Gait Recognition in the Wild with Multi-hop Temporal Switch.
Proceedings of the MM '22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10, 2022
Noisy Agents: Self-supervised Exploration by Predicting Auditory Events.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2022
The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark Towards Physically Realistic Embodied AI.
Proceedings of the 2022 International Conference on Robotics and Automation, 2022
Prompting Decision Transformer for Few-Shot Policy Generalization.
Proceedings of the International Conference on Machine Learning, 2022
Linking Emergent and Natural Languages via Corpus Transfer.
Proceedings of the Tenth International Conference on Learning Representations, 2022
FALCON: Fast Visual Concept Learning by Integrating Images, Linguistic descriptions, and Conceptual Relations.
Proceedings of the Tenth International Conference on Learning Representations, 2022
DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools.
Proceedings of the Tenth International Conference on Learning Representations, 2022
Contact Points Discovery for Soft-Body Manipulations with Differentiable Physics.
Proceedings of the Tenth International Conference on Learning Representations, 2022
ComPhy: Compositional Physical Reasoning of Objects and Events from Videos.
Proceedings of the Tenth International Conference on Learning Representations, 2022
Network Augmentation for Tiny Deep Learning.
Proceedings of the Tenth International Conference on Learning Representations, 2022
RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation.
Proceedings of the Tenth International Conference on Learning Representations, 2022
Revisiting the Roles of "Text" in Text Games.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, 2022
Prototype-Guided Continual Adaptation for Class-Incremental Unsupervised Domain Adaptation.
Proceedings of the Computer Vision - ECCV 2022, 2022
Weakly Supervised Grounding for VQA in Vision-Language Transformers.
Proceedings of the Computer Vision - ECCV 2022, 2022
AutoGPart: Intermediate Supervision Search for Generalizable 3D Part Segmentation.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Finding Fallen Objects Via Asynchronous Audio-Visual Integration.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
Planning with Spatial-Temporal Abstraction from Point Clouds for Deformable Object Manipulation.
Proceedings of the Conference on Robot Learning, 2022
Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following.
Proceedings of the Conference on Robot Learning, 2022
2021
A Real-Time Action Representation With Temporal Encoding and Deep Compression.
IEEE Trans. Circuits Syst. Video Technol., 2021
A novel domain activation mapping-guided network (DA-GNT) for visual tracking.
Neurocomputing, 2021
MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning.
CoRR, 2021
TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device.
CoRR, 2021
Global Rhythm Style Transfer Without Text Transcriptions.
CoRR, 2021
Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning.
CoRR, 2021
TransCenter: Transformers with Dense Queries for Multiple-Object Tracking.
CoRR, 2021
The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI.
CoRR, 2021
The 1st International Workshop on Machine Reasoning: International Machine Reasoning Conference (MRC 2021).
Proceedings of the WSDM '21, 2021
STAR: A Benchmark for Situated Reasoning in Real-World Videos.
Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021
Memory-efficient Patch-based Inference for Tiny Deep Learning.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation.
Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, 2021
When does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning?
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021
STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021
Counterfactual Debiasing Inference for Compositional Action Recognition.
Proceedings of the MM '21: ACM Multimedia Conference, Virtual Event, China, October 20, 2021
OPEn: An Open-ended Physics Environment for Learning Without a Task.
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021
MAGCN: A Multi-Adaptive Graph Convolutional Network for Traffic Forecasting.
Proceedings of the International Joint Conference on Neural Networks, 2021
Temporal and Object Quantification Networks.
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021
AGENT: A Benchmark for Core Psychological Reasoning.
Proceedings of the 38th International Conference on Machine Learning, 2021
Global Prosody Style Transfer Without Text Transcriptions.
Proceedings of the 38th International Conference on Machine Learning, 2021
Adversarial Option-Aware Hierarchical Imitation Learning.
Proceedings of the 38th International Conference on Machine Learning, 2021
Learning Task Decomposition with Ordered Memory Policy Network.
Proceedings of the 9th International Conference on Learning Representations, 2021
PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics.
Proceedings of the 9th International Conference on Learning Representations, 2021
Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning.
Proceedings of the 9th International Conference on Learning Representations, 2021
On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning.
Proceedings of the 9th International Conference on Learning Representations, 2021
Curious Representation Learning for Embodied Intelligence.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021
Augmenting Policy Learning with Routines Discovered from a Single Demonstration.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
MVFNet: Multi-View Fusion Network for Efficient Video Recognition.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning.
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021
2020
Relation Attention for Temporal Action Localization.
IEEE Trans. Multim., 2020
Generating Visually Aligned Sound From Videos.
IEEE Trans. Image Process., 2020
Augmenting Policy Learning with Routines Discovered from a Demonstration.
CoRR, 2020
Object-Centric Diagnosis of Visual Reasoning.
CoRR, 2020
Synthetic Training for Monocular Human Mesh Recovery.
CoRR, 2020
Tiny Transfer Learning: Towards Memory-Efficient On-Device Learning.
CoRR, 2020
ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation.
CoRR, 2020
Language Guided Networks for Cross-modal Moment Retrieval.
CoRR, 2020
MCUNet: Tiny Deep Learning on IoT Devices.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020
TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020
Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization.
Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020
HUMA'20: 1st International Workshop on Human-Centric Multimedia Analysis.
Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020
Deep Concept-wise Temporal Convolutional Networks for Action Localization.
Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, 2020
Look, Listen, and Act: Towards Audio-Visual Embodied Navigation.
Proceedings of the 2020 IEEE International Conference on Robotics and Automation, 2020
Deep Audio Priors Emerge From Harmonic Convolutional Networks.
Proceedings of the 8th International Conference on Learning Representations, 2020
CLEVRER: Collision Events for Video Representation and Reasoning.
Proceedings of the 8th International Conference on Learning Representations, 2020
Once-for-All: Train One Network and Specialize it for Efficient Deployment.
Proceedings of the 8th International Conference on Learning Representations, 2020
Interactive Fiction Game Playing as Multi-Paragraph Reading Comprehension with Reinforcement Learning.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020
DataMix: Efficient Privacy-Preserving Edge-Cloud Inference.
Proceedings of the Computer Vision - ECCV 2020, 2020
Foley Music: Learning to Generate Music from Videos.
Proceedings of the Computer Vision - ECCV 2020, 2020
Dense Regression Network for Video Grounding.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020
Music Gesture for Visual Sound Separation.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
Location-Aware Graph Convolutional Networks for Video Question Answering.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020
2019
Breaking Winner-Takes-All: Iterative-Winners-Out Networks for Weakly Supervised Temporal Action Localization.
IEEE Trans. Image Process., 2019
Toward Efficient Action Recognition: Principal Backpropagation for Training Two-Stream Networks.
IEEE Trans. Image Process., 2019
TruNet: Short Videos Generation from Long Videos via Story-Preserving Truncation.
CoRR, 2019
Training Kinetics in 15 Minutes: Large-scale Distributed Training on Videos.
CoRR, 2019
Once for All: Train One Network and Specialize it for Efficient Deployment.
CoRR, 2019
Interpreting Adversarial Examples by Activation Promotion and Suppression.
CoRR, 2019
Cross-channel Communication Networks.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019
Visual Concept-Metaconcept Learning.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019
Facial Image-to-Video Translation by a Hidden Affine Transformation.
Proceedings of the 27th ACM International Conference on Multimedia, 2019
Watch, Reason and Code: Learning to Represent Videos Using Program.
Proceedings of the 27th ACM International Conference on Multimedia, 2019
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.
Proceedings of the 7th International Conference on Learning Representations, 2019
Defensive Quantization: When Efficiency Meets Robustness.
Proceedings of the 7th International Conference on Learning Representations, 2019
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
Graph Convolutional Networks for Temporal Action Localization.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
TSM: Temporal Shift Module for Efficient Video Understanding.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
Self-Supervised Moving Vehicle Tracking With Stereo Sound.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019
Self-supervised Audio-visual Co-segmentation.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2019
Self-Supervised Segmentation and Source Separation on Videos.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019
Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019
StNet: Local and Global Spatial-Temporal Modeling for Action Recognition.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019
Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019
2018
Video Captioning with Multi-Faceted Attention.
Trans. Assoc. Comput. Linguistics, 2018
Temporal Shift Module for Efficient Video Understanding.
CoRR, 2018
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018
Weakly Supervised Dense Event Captioning in Videos.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018
Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency.
Proceedings of the Computer Vision - ECCV 2018, 2018
Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
End-to-End Learning of Motion Representation for Video Understanding.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
Sparse, Smart Contours to Represent and Edit Images.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
Multimodal Keyless Attention Fusion for Video Classification.
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018
T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition.
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018
2017
A Multisource Domain Generalization Approach to Visual Attribute Detection.
Proceedings of the Domain Adaptation in Computer Vision Applications, 2017
Smart, Sparse Contours to Represent and Edit Images.
CoRR, 2017
Unsupervised Domain Adaptation for 3D Keypoint Prediction from a Single Depth Scan.
CoRR, 2017
Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification.
CoRR, 2017
Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding.
CoRR, 2017
Recurrent Topic-Transition GAN for Visual Paragraph Generation.
Proceedings of the IEEE International Conference on Computer Vision, 2017
VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation.
Proceedings of the IEEE International Conference on Computer Vision, 2017
Semantic Compositional Networks for Visual Captioning.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017
StyleNet: Generating Attractive Visual Captions with Styles.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017
DECK: Discovering Event Composition Knowledge from Web Images for Zero-Shot Event Detection and Recounting in Videos.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017
2016
Recognizing an Action Using Its Name: A Knowledge-Based Approach.
Int. J. Comput. Vis., 2016
Strategies for Searching Video Content with Text Queries or Video Examples.
CoRR, 2016
Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames.
Proceedings of the Computer Vision - ECCV 2016, 2016
You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016
Learning Attributes Equals Multi-Source Domain Generalization.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016
Concepts Not Alone: Exploring Pairwise Relationships for Zero-Shot Video Activity Recognition.
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
2015
CMU Informedia@TRECVID 2015: MED/SIN/LNK/SED.
Proceedings of the 2015 TREC Video Retrieval Evaluation, 2015
Automatic Concept Discovery from Parallel Text and Visual Corpora.
Proceedings of the 2015 IEEE International Conference on Computer Vision, 2015
DevNet: A Deep Event Network for multimedia event detection and evidence recounting.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015
Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition.
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015
2014
Informedia @ TRECVID 2014.
Proceedings of the 2014 TREC Video Retrieval Evaluation, 2014
2013
Salient object detection in image sequences via spatial-temporal cue.
Proceedings of the 2013 Visual Communications and Image Processing, 2013