Chen Sun

Affiliations:
  • Brown University, Department of Computer Science, Providence, RI, USA
  • Google Research, USA (former)
  • Facebook AI Research, USA (former)
  • University of Southern California, Los Angeles, CA, USA (former)


According to our database, Chen Sun authored at least 94 papers between 2011 and 2024.

Bibliography

2024
$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources.
CoRR, 2024

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens.
CoRR, 2024

Do Music Generation Models Encode Music Theory?
CoRR, 2024

Learning Visual Grounding from Generative Vision and Language Model.
CoRR, 2024

Text-Aware Diffusion for Policy Learning.
CoRR, 2024

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts.
CoRR, 2024

Object-centric Video Representation for Long-term Action Anticipation.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024

Potential Based Diffusion Motion Planning.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

Self-Correcting Self-Consuming Loops for Generative Model Training.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Vamos: Versatile Action Models for Video Understanding.
Proceedings of the Computer Vision - ECCV 2024, 2024

Pixel Aligned Language Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

End-to-End Spatio-Temporal Action Localisation with Video Transformers.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2023
Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?
Trans. Mach. Learn. Res., 2023

Towards A Unified Neural Architecture for Visual Recognition and Reasoning.
CoRR, 2023

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
CoRR, 2023

Goal-Conditioned Predictive Coding as an Implicit Planner for Offline Reinforcement Learning.
CoRR, 2023

Dense Video Object Captioning from Disjoint Supervision.
CoRR, 2023

AVIS: Autonomous Visual Information Seeking with Large Language Models.
CoRR, 2023

Comparing Trajectory and Vision Modalities for Verb Representation.
CoRR, 2023

Steerable Equivariant Representation Learning.
CoRR, 2023

Goal-Conditioned Predictive Coding for Offline Reinforcement Learning.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent.
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Does Visual Pretraining Help End-to-End Reasoning?
Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Emergence of Abstract State Representations in Embodied Sequence Modeling.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

Analyzing Modular Approaches for Visual Question Decomposition.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

How can objects help action recognition?
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

Reveal: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2022
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency.
CoRR, 2022

Beyond Transfer Learning: Co-finetuning for Action Localisation.
CoRR, 2022

Do Vision-Language Pretrained Models Learn Primitive Concepts?
CoRR, 2022

Masking Modalities for Cross-modal Video Retrieval.
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

Do Trajectories Encode Verb Meaning?
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

AVATAR: Unconstrained Audiovisual Speech Recognition.
Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022

TL;DW? Summarizing Instructional Videos with Task Relevance and Cross-Modal Saliency.
Proceedings of the Computer Vision - ECCV 2022, 2022

Learning Audio-Video Modalities from Image Captions.
Proceedings of the Computer Vision - ECCV 2022, 2022

Multiview Transformers for Video Recognition.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2021
Local Metrics for Multi-Object Tracking.
CoRR, 2021

Attention Bottlenecks for Multimodal Fusion.
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

Episodic Transformer for Vision-and-Language Navigation.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Learning Temporal Dynamics from Cycles in Narrated Video.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Unified Graph Structured Models for Video Understanding.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

ViViT: A Video Vision Transformer.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Composable Augmentation Encoding for Video Representation Learning.
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, 2021

Does Vision-and-Language Pretraining Improve Lexical Grounding?
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, 2021

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021

2020
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020).
CoRR, 2020

Learning Video Representations from Textual Web Supervision.
CoRR, 2020

D3D: Distilled 3D Networks for Video Action Recognition.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020

What Makes for Good Views for Contrastive Learning?
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

Multi-modal Transformer for Video Retrieval.
Proceedings of the Computer Vision - ECCV 2020, 2020

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos.
Proceedings of the Computer Vision - ECCV 2020, 2020

Speech2Action: Cross-Modal Supervision for Action Recognition.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

TNT: Target-driven Trajectory Prediction.
Proceedings of the 4th Conference on Robot Learning, 2020

2019
Contrastive Bidirectional Transformer for Temporal Representation Learning.
CoRR, 2019

Unsupervised learning of object structure and dynamics from videos.
Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, 2019

Unsupervised Discovery of Parts, Structure, and Dynamics.
Proceedings of the 7th International Conference on Learning Representations, 2019

Stochastic Prediction of Multi-Agent Interactions from Partial Observations.
Proceedings of the 7th International Conference on Learning Representations, 2019

VideoBERT: A Joint Model for Video and Language Representation Learning.
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019

Composing Text and Image for Image Retrieval - an Empirical Odyssey.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Relational Action Forecasting.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Inferring Context from Pixels for Multimodal Image Classification.
Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019

2018
DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks.
CoRR, 2018

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification.
Proceedings of the Computer Vision - ECCV 2018, 2018

Actor-Centric Relation Network.
Proceedings of the Computer Vision - ECCV 2018, 2018

The iNaturalist Species Classification and Detection Dataset.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017
Rethinking Spatiotemporal Feature Learning For Video Understanding.
CoRR, 2017

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions.
CoRR, 2017

Complex Event Recognition from Images with Few Training Examples.
Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision, 2017

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals.
Proceedings of the IEEE International Conference on Computer Vision, 2017

TALL: Temporal Activity Localization via Language Query.
Proceedings of the IEEE International Conference on Computer Vision, 2017

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation.
Proceedings of the IEEE International Conference on Computer Vision, 2017

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors.
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 2017

DECK: Discovering Event Composition Knowledge from Web Images for Zero-Shot Event Detection and Recounting in Videos.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016
ACD: Action Concept Discovery from Image-Sentence Corpora.
Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames.
Proceedings of the Computer Vision - ECCV 2016, 2016

ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

2015
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images.
Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, MM '15, Brisbane, Australia, October 26, 2015

Automatic Concept Discovery from Parallel Text and Visual Corpora.
Proceedings of the 2015 IEEE International Conference on Computer Vision, 2015

2014
Evaluating Multimedia Features and Fusion for Example-Based Event Detection.
Proceedings of the Fusion in Computer Vision - Understanding Complex Visual Content, 2014

Evaluating multimedia features and fusion for example-based event detection.
Mach. Vis. Appl., 2014

ISOMER: Informative Segment Observations for Multimedia Event Recounting.
Proceedings of the International Conference on Multimedia Retrieval, 2014

Late fusion and calibration for multimedia event detection using few examples.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014

Semantic Aware Video Transcription Using Random Forest Classifiers.
Proceedings of the Computer Vision - ECCV 2014, 2014

DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting.
Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014

2013
Large-scale web video event classification by use of Fisher Vectors.
Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision, 2013


ACTIVE: Activity Concept Transitions in Video Event Classification.
Proceedings of the IEEE International Conference on Computer Vision, 2013

2012

2011

