2024
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention.
CoRR, 2024

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models.
CoRR, 2024

Diffusion Model-Based Image Editing: A Survey.
CoRR, 2024

Ferret: Refer and Ground Anything Anywhere at Any Granularity.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

Efficient-3Dim: Learning a Generalizable Single-image Novel-view Synthesizer in One Day.
Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023
Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models.
CoRR, 2023

Instruction-Following Speech Recognition.
CoRR, 2023

RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture.
CoRR, 2023

Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness.
CoRR, 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.
CoRR, 2023

RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture.
Proceedings of the 31st ACM International Conference on Multimedia, 2023

STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2022
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition.
IEEE J. Sel. Top. Signal Process., 2022

Exploiting Category Names for Few-Shot Classification with Vision-Language Models.
CoRR, 2022

PriFit: Learning to Fit Primitives Improves Few Shot Point Cloud Segmentation.
Comput. Graph. Forum, 2022

Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2022

2021
SurFit: Learning to Fit Surfaces Improves Few Shot Learning on Point Clouds.
CoRR, 2021

Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition.
CoRR, 2021

Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models.
CoRR, 2021

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021

Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Residual Energy-Based Models for End-to-End Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021

Learning Word-Level Confidence for Subword End-To-End ASR.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021

2020
Product image recognition with guidance learning and noisy supervision.
Comput. Vis. Image Underst., 2020

Spatial-Temporal Alignment Network for Action Recognition and Detection.
CoRR, 2020

Deep Active Learning for Effective Pulmonary Nodule Detection.
Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2020, 2020

A Large Scale Speech Sentiment Corpus.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020

Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models.
Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Zero-shot Entity Linking with Efficient Long Range Sequence Modeling.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

Label-Efficient Learning on Point Clouds Using Approximate Convex Decompositions.
Proceedings of the Computer Vision - ECCV 2020, 2020

2019
Focal Visual-Text Attention for Memex Question Answering.
IEEE Trans. Pattern Anal. Mach. Intell., 2019

Progressive Learning Algorithm for Efficient Person Re-Identification.
CoRR, 2019

Accurate and Robust Pulmonary Nodule Detection by 3D Feature Pyramid Network with Self-supervised Feature Learning.
CoRR, 2019

Product Image Recognition with Guidance Learning and Noisy Supervision.
CoRR, 2019

3DFPN-HS²: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection.
CoRR, 2019

3DFPN-HS²: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection.
Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2019, 2019

Automatic Adaptation of Object Detectors to New Domains Using Self-Training.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019

Improving Object Detection from Scratch via Gated Feature Reuse.
Proceedings of the 30th British Machine Vision Conference 2019, 2019

2018
Matrix Factorization on GPUs with Memory Optimization and Approximate Computing.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video.
Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, 2018

Learning Deterministic Policy with Target for Power Control in Wireless Networks.
Proceedings of the IEEE Global Communications Conference, 2018

Focal Visual-Text Attention for Visual Question Answering.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018

2017
Image-Based Appraisal of Real Estate Properties.
IEEE Trans. Multim., 2017

Context-Associative Hierarchical Memory Model for Human Activity Recognition and Prediction.
IEEE Trans. Multim., 2017

Mining Fashion Outfit Composition Using an End-to-End Deep Learning Approach on Set Data.
IEEE Trans. Multim., 2017

Guest editorial: mobile visual tagging with mobile context.
Multim. Syst., 2017

Learning Object Detectors from Scratch with Gated Recurrent Feature Pyramids.
CoRR, 2017

MemexQA: Visual Memex Question Answering.
CoRR, 2017

Learning from Noisy Labels with Distillation.
CoRR, 2017

Delving Deep into Personal Photo and Video Search.
Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017

ACM SIGMM Rising Star Award 2017.
Proceedings of the 2017 ACM on Multimedia Conference, 2017

Learning from Noisy Labels with Distillation.
Proceedings of the IEEE International Conference on Computer Vision, 2017

Visual Memory QA: Your Personal Photo and Video Search Agent.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017

2016
Where the Photos Were Taken: Location Prediction by Learning from Flickr Photos.
Proceedings of the Deep Learning and Convolutional Neural Networks for Medical Image Computing, 2016

A hybrid term-term relations analysis approach for topic detection.
Knowl. Based Syst., 2016

Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks.
Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

Detecting Sarcasm in Multimodal Social Platforms.
Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016

GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring.
Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

Incremental Learning for Fine-Grained Image Recognition.
Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016

Building Joint Spaces for Relation Extraction.
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

TGIF: A New Dataset and Benchmark on Animated GIF Description.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

Video2GIF: Automatic Generation of Animated GIFs from Video.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016

Multi-Scale Fully Convolutional Network for Fast Face Detection.
Proceedings of the British Machine Vision Conference 2016, 2016

Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games Using Convolutional Networks.
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016

2015
A Multifaceted Approach to Social Multimedia-Based Prediction of Elections.
IEEE Trans. Multim., 2015

Max-Confidence Boosting With Uncertainty for Visual Tracking.
IEEE Trans. Image Process., 2015

Massive-scale learning of image and video semantic concepts.
IBM J. Res. Dev., 2015

Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games.
CoRR, 2015

LSIF: A System for Large-Scale Information Flow Detection Based on Topic-Related Semantic Similarity Measurement.
Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2015

Automated Axon Segmentation from Highly Noisy Microscopic Videos.
Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, 2015

Understanding the crystallization mechanism of Ge-Te-Ti phase change material.
Proceedings of the 15th Non-Volatile Memory Technology Symposium, 2015

Multi-facet Learning using Deep Convolutional Neural Network for Person-Related Categories in Photos.
Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015

Medical Synonym Extraction with Concept Space Models.
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015

You are what you tweet...pic! gender prediction based on semantic analysis of social media images.
Proceedings of the 2015 IEEE International Conference on Multimedia and Expo, 2015

2014
Large-Scale Geosocial Multimedia [Guest editorial].
IEEE Multim., 2014

Guest Editorial: Special issue on large scale multimedia semantic indexing.
Comput. Vis. Image Underst., 2014

A spatial-color layout feature for representing galaxy images.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2014

Learning mid-level features from object hierarchy for image classification.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2014

Modeling Attributes from Category-Attribute Proportions.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014

The Placing Task: A Large-Scale Geo-Estimation Challenge for Social-Media Videos and Images.
Proceedings of the 3rd ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia, 2014

GeoMM 2014: the third ACM multimedia workshop on geotagging and its applications in multimedia.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014

Cuteness Recognition and Localization in the Photos of Animals.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014

2013
Introduction to the special section of best papers of ACM multimedia 2012.
ACM Trans. Multim. Comput. Commun. Appl., 2013

IBM Research and Columbia University TRECVID-2013 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), Surveillance Event Detection (SED), and Semantic Indexing (SIN) Systems.
Proceedings of the 2013 TREC Video Retrieval Evaluation, 2013

Discovering Latent Clusters from Geotagged Beach Images.
Proceedings of the Advances in Multimedia Modeling, 19th International Conference, 2013

Massive-scale multimedia semantic modeling.
Proceedings of the ACM Multimedia Conference, 2013

Learning latent spatio-temporal compositional model for human action recognition.
Proceedings of the ACM Multimedia Conference, 2013

Second ACM multimedia workshop on geotagging and its applications in multimedia (GeoMM 2013).
Proceedings of the ACM Multimedia Conference, 2013

Learning by focusing: A new framework for concept recognition and feature selection.
Proceedings of the 2013 IEEE International Conference on Multimedia and Expo, 2013

Large-scale video event classification using dynamic temporal pyramid matching of visual semantics.
Proceedings of the IEEE International Conference on Image Processing, 2013

Designing Category-Level Attributes for Discriminative Visual Recognition.
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013

Hierarchical Feature Pooling with Structure Learning: A New Method for Pedestrian Detection.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013

Learning Locally-Adaptive Decision Functions for Person Verification.
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013

Efficient Maximum Appearance Search for Large-Scale Object Detection.
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013

Action Detection by Fusing Hierarchically Filtered Motion with Spatiotemporal Interest Point Features.
Proceedings of the Human Behavior Recognition Technologies, 2013

2012
Hierarchical Filtered Motion for Action Recognition in Crowded Videos.
IEEE Trans. Syst. Man Cybern. Part C, 2012

Latent Community Topic Analysis: Integration of Community Discovery with Topic Modeling.
ACM Trans. Intell. Syst. Technol., 2012

Web-Scale Multimedia Information Networks.
Proc. IEEE, 2012

RankCompete: Simultaneous ranking and clustering of information networks.
Neurocomputing, 2012

BlueFinder: estimate where a beach photo was taken.
Proceedings of the 21st World Wide Web Conference, 2012

IBM Research and Columbia University TRECVID-2012 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), and Semantic Indexing (SIN) Systems.
Proceedings of the 2012 TREC Video Retrieval Evaluation, 2012

MediaCCNY at TRECVID 2012: Surveillance Event Detection.
Proceedings of the 2012 TREC Video Retrieval Evaluation, 2012

Submodular video hashing: a unified framework towards video pooling and indexing.
Proceedings of the 20th ACM Multimedia Conference, MM '12, Nara, Japan, October 29, 2012

GeoMM'12: ACM international workshop on geotagging and its applications in multimedia.
Proceedings of the 20th ACM Multimedia Conference, MM '12, Nara, Japan, October 29, 2012

Delta-SimRank computing on MapReduce.
Proceedings of the 1st International Workshop on Big Data, 2012

How Multimedia in Enterprise Social Networks Matters to People's Performance.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops, 2012

Video Event Detection Using Temporal Pyramids of Visual Semantics with Kernel Optimization and Model Subspace Boosting.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, 2012

Scene Aligned Pooling for Complex Video Recognition.
Proceedings of the Computer Vision - ECCV 2012, 2012

Beyond Mahalanobis distance: Learning second-order discriminant function for people verification.
Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012

IBM T.J. Watson Research Center, Multimedia Analytics: Modality Classification and Case-Based Retrieval Tasks of ImageCLEF2012.
Proceedings of the CLEF 2012 Evaluation Labs and Workshop, 2012

2011
Heterogeneous Feature Fusion for Visual Recognition.
PhD thesis, 2011

A general framework for efficient clustering of large datasets based on activity detection.
Stat. Anal. Data Min., 2011

Geographical topic discovery and comparison.
Proceedings of the 20th International Conference on World Wide Web, 2011

IBM Research and Columbia University TRECVID-2011 Multimedia Event Detection (MED) System.
Proceedings of the 2011 TREC Video Retrieval Evaluation, 2011

Diversified Trajectory Pattern Ranking in Geo-tagged Social Media.
Proceedings of the Eleventh SIAM International Conference on Data Mining, 2011

Learning to Search Efficiently in High Dimensions.
Proceedings of the Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, 2011

Compositional object pattern: a new model for album event recognition.
Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28, 2011

LPTA: A Probabilistic Model for Latent Periodic Topic Analysis.
Proceedings of the 11th IEEE International Conference on Data Mining, 2011

Large-scale image classification: Fast feature extraction and SVM training.
Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, 2011

Multimedia Information Networks in Social Media.
Proceedings of the Social Network Data Analytics, 2011

2010
Image Segmentation by MAP-ML Estimations.
IEEE Trans. Image Process., 2010

RankCompete: simultaneous ranking and clustering of web photos.
Proceedings of the 19th International Conference on World Wide Web, 2010

Videos Semantic Indexing using Image Classification.
Proceedings of the TRECVID 2010 workshop participants notebook papers, 2010

A Study on Sampling Strategies in Space-Time Domain for Recognition Applications.
Proceedings of the Advances in Multimedia Modeling, 2010

The wisdom of social multimedia: using flickr for prediction and forecast.
Proceedings of the 18th International Conference on Multimedia 2010, 2010

Action detection using multiple spatial-temporal interest point features.
Proceedings of the 2010 IEEE International Conference on Multimedia and Expo, 2010

Accurate and efficient reconstruction of 3D faces from stereo images.
Proceedings of the International Conference on Image Processing, 2010

A worldwide tourism recommendation system based on geotagged web photos.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2010

Cross-dataset action detection.
Proceedings of the Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, 2010

Visual cube and on-line analytical processing of images.
Proceedings of the 19th ACM Conference on Information and Knowledge Management, 2010

2009
Image Annotation Within the Context of Personal Photo Collections Using Hierarchical Event and Scene Models.
IEEE Trans. Multim., 2009

Responses to the Comments on "Plane-Based Optimization for 3D Object Reconstruction from Single Line Drawings".
IEEE Trans. Pattern Anal. Mach. Intell., 2009

Responses to the Comments on "What the Back of the Object Looks Like: 3D Reconstruction from Line Drawings without Hidden Lines".
IEEE Trans. Pattern Anal. Mach. Intell., 2009

GAD: General Activity Detection for Fast Clustering on Large Data.
Proceedings of the SIAM International Conference on Data Mining, 2009

Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression.
Proceedings of the 17th International Conference on Multimedia 2009, 2009

Action detection in complex scenes with spatial and temporal ambiguities.
Proceedings of the IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27, 2009

Heterogeneous feature machines for visual recognition.
Proceedings of the IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27, 2009

2008
Plane-Based Optimization for 3D Object Reconstruction from Single Line Drawings.
IEEE Trans. Pattern Anal. Mach. Intell., 2008

What the Back of the Object Looks Like: 3D Reconstruction from Line Drawings without Hidden Lines.
IEEE Trans. Pattern Anal. Mach. Intell., 2008

Surveillance Event Detection.
Proceedings of the TRECVID 2008 workshop participants notebook papers, 2008

Image annotation using personal calendars as context.
Proceedings of the 16th International Conference on Multimedia 2008, 2008

Annotating photo collections by label propagation according to multiple similarity cues.
Proceedings of the 16th International Conference on Multimedia 2008, 2008

Gender recognition from body.
Proceedings of the 16th International Conference on Multimedia 2008, 2008

Annotating collections of photos using hierarchical event and scene models.
Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 2008

Multiple feature fusion by subspace learning.
Proceedings of the 7th ACM International Conference on Image and Video Retrieval, 2008

2007
Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes.
Proceedings of the IEEE 11th International Conference on Computer Vision, 2007

Iterative MAP and ML Estimations for Image Segmentation.
Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 2007

2006
3D object retrieval using 2D line drawing and graph based relevance feedback.
Proceedings of the 14th ACM International Conference on Multimedia, 2006

Automatic Segmentation of Lung Fields from Radiographic Images of SARS Patients Using a New Graph Cuts Algorithm.
Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), 2006

Degen Generalized Cylinders and Their Properties.
Proceedings of the Computer Vision, 2006

2005
3D Object Reconstruction from a Single 2D Line Drawing without Hidden Lines.
Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV 2005), 2005