2024
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention.
CoRR, 2024
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models.
CoRR, 2024
Diffusion Model-Based Image Editing: A Survey.
CoRR, 2024
Ferret: Refer and Ground Anything Anywhere at Any Granularity.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
Efficient-3Dim: Learning a Generalizable Single-image Novel-view Synthesizer in One Day.
Proceedings of the Twelfth International Conference on Learning Representations, 2024
2023
Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models.
CoRR, 2023
Instruction-Following Speech Recognition.
CoRR, 2023
RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture.
CoRR, 2023
Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness.
CoRR, 2023
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.
CoRR, 2023
RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture.
Proceedings of the 31st ACM International Conference on Multimedia, 2023
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens.
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023
2022
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition.
IEEE J. Sel. Top. Signal Process., 2022
Exploiting Category Names for Few-Shot Classification with Vision-Language Models.
CoRR, 2022
PriFit: Learning to Fit Primitives Improves Few Shot Point Cloud Segmentation.
Comput. Graph. Forum, 2022
Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2022
2021
SurFit: Learning to Fit Surfaces Improves Few Shot Learning on Point Clouds.
CoRR, 2021
Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition.
CoRR, 2021
Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models.
CoRR, 2021
RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions.
Proceedings of the IEEE Spoken Language Technology Workshop, 2021
Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021
Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021
Residual Energy-Based Models for End-to-End Speech Recognition.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021
Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models.
Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30, 2021
Learning Word-Level Confidence for Subword End-To-End ASR.
Proceedings of the IEEE International Conference on Acoustics, 2021
Confidence Estimation for Attention-Based Sequence-to-Sequence Models for Speech Recognition.
Proceedings of the IEEE International Conference on Acoustics, 2021
Improving Streaming Automatic Speech Recognition with Non-Streaming Model Distillation on Unsupervised Data.
Proceedings of the IEEE International Conference on Acoustics, 2021
2020
Product image recognition with guidance learning and noisy supervision.
Comput. Vis. Image Underst., 2020
Spatial-Temporal Alignment Network for Action Recognition and Detection.
CoRR, 2020
Deep Active Learning for Effective Pulmonary Nodule Detection.
Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2020, 2020
A Large Scale Speech Sentiment Corpus.
Proceedings of The 12th Language Resources and Evaluation Conference, 2020
Speech Sentiment Analysis via Pre-Trained Features from End-to-End ASR Models.
Proceedings of the 2020 IEEE International Conference on Acoustics, 2020
Zero-shot Entity Linking with Efficient Long Range Sequence Modeling.
Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, 2020
Label-Efficient Learning on Point Clouds Using Approximate Convex Decompositions.
Proceedings of the Computer Vision - ECCV 2020, 2020
2019
Focal Visual-Text Attention for Memex Question Answering.
IEEE Trans. Pattern Anal. Mach. Intell., 2019
Progressive Learning Algorithm for Efficient Person Re-Identification.
CoRR, 2019
Accurate and Robust Pulmonary Nodule Detection by 3D Feature Pyramid Network with Self-supervised Feature Learning.
CoRR, 2019
Product Image Recognition with Guidance Learning and Noisy Supervision.
CoRR, 2019
3DFPN-HS²: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection.
CoRR, 2019
3DFPN-HS²: 3D Feature Pyramid Network Based High Sensitivity and Specificity Pulmonary Nodule Detection.
Proceedings of the Medical Image Computing and Computer Assisted Intervention - MICCAI 2019, 2019
Automatic Adaptation of Object Detectors to New Domains Using Self-Training.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019
Improving Object Detection from Scratch via Gated Feature Reuse.
Proceedings of the 30th British Machine Vision Conference 2019, 2019
2018
Matrix Factorization on GPUs with Memory Optimization and Approximate Computing.
Proceedings of the 47th International Conference on Parallel Processing, 2018
Lip2Audspec: Speech Reconstruction from Silent Lip Movements Video.
Proceedings of the 2018 IEEE International Conference on Acoustics, 2018
Learning Deterministic Policy with Target for Power Control in Wireless Networks.
Proceedings of the IEEE Global Communications Conference, 2018
Focal Visual-Text Attention for Visual Question Answering.
Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018
2017
Image-Based Appraisal of Real Estate Properties.
IEEE Trans. Multim., 2017
Context-Associative Hierarchical Memory Model for Human Activity Recognition and Prediction.
IEEE Trans. Multim., 2017
Mining Fashion Outfit Composition Using an End-to-End Deep Learning Approach on Set Data.
IEEE Trans. Multim., 2017
Guest editorial: mobile visual tagging with mobile context.
Multim. Syst., 2017
Learning Object Detectors from Scratch with Gated Recurrent Feature Pyramids.
CoRR, 2017
MemexQA: Visual Memex Question Answering.
CoRR, 2017
Learning from Noisy Labels with Distillation.
CoRR, 2017
Delving Deep into Personal Photo and Video Search.
Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017
ACM SIGMM Rising Star Award 2017.
Proceedings of the 2017 ACM on Multimedia Conference, 2017
Learning from Noisy Labels with Distillation.
Proceedings of the IEEE International Conference on Computer Vision, 2017
Visual Memory QA: Your Personal Photo and Video Search Agent.
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017
2016
Where the Photos Were Taken: Location Prediction by Learning from Flickr Photos.
Proceedings of the Deep Learning and Convolutional Neural Networks for Medical Image Computing, 2016
A hybrid term-term relations analysis approach for topic detection.
Knowl. Based Syst., 2016
Robust Visual-Textual Sentiment Analysis: When Attention meets Tree-structured Recursive Neural Networks.
Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016
Detecting Sarcasm in Multimodal Social Platforms.
Proceedings of the 2016 ACM Conference on Multimedia Conference, 2016
GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring.
Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016
Incremental Learning for Fine-Grained Image Recognition.
Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016
Building Joint Spaces for Relation Extraction.
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 2016
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016
TGIF: A New Dataset and Benchmark on Animated GIF Description.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016
Video2GIF: Automatic Generation of Animated GIFs from Video.
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016
Multi-Scale Fully Convolutional Network for Fast Face Detection.
Proceedings of the British Machine Vision Conference 2016, 2016
Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games Using Convolutional Networks.
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
2015
A Multifaceted Approach to Social Multimedia-Based Prediction of Elections.
IEEE Trans. Multim., 2015
Max-Confidence Boosting With Uncertainty for Visual Tracking.
IEEE Trans. Image Process., 2015
Massive-scale learning of image and video semantic concepts.
IBM J. Res. Dev., 2015
Poker-CNN: A Pattern Learning Strategy for Making Draws and Bets in Poker Games.
CoRR, 2015
LSIF: A System for Large-Scale Information Flow Detection Based on Topic-Related Semantic Similarity Measurement.
Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2015
Automated Axon Segmentation from Highly Noisy Microscopic Videos.
Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, 2015
Understanding the crystallization mechanism of Ge-Te-Ti phase change material.
Proceedings of the 15th Non-Volatile Memory Technology Symposium, 2015
Multi-facet Learning using Deep Convolutional Neural Network for Person-Related Categories in Photos.
Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015
Medical Synonym Extraction with Concept Space Models.
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015
You are what you tweet...pic! gender prediction based on semantic analysis of social media images.
Proceedings of the 2015 IEEE International Conference on Multimedia and Expo, 2015
2014
Large-Scale Geosocial Multimedia [Guest editorial].
IEEE Multim., 2014
Guest Editorial: Special issue on large scale multimedia semantic indexing.
Comput. Vis. Image Underst., 2014
A spatial-color layout feature for representing galaxy images.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2014
Learning mid-level features from object hierarchy for image classification.
Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2014
Modeling Attributes from Category-Attribute Proportions.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014
The Placing Task: A Large-Scale Geo-Estimation Challenge for Social-Media Videos and Images.
Proceedings of the 3rd ACM Multimedia Workshop on Geotagging and Its Applications in Multimedia, 2014
GeoMM 2014: the third ACM multimedia workshop on geotagging and its applications in multimedia.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014
Cuteness Recognition and Localization in the Photos of Animals.
Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03, 2014
2013
Introduction to the special section of best papers of ACM multimedia 2012.
ACM Trans. Multim. Comput. Commun. Appl., 2013
IBM Research and Columbia University TRECVID-2013 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), Surveillance Event Detection (SED), and Semantic Indexing (SIN) Systems.
Proceedings of the 2013 TREC Video Retrieval Evaluation, 2013
Discovering Latent Clusters from Geotagged Beach Images.
Proceedings of the Advances in Multimedia Modeling, 19th International Conference, 2013
Massive-scale multimedia semantic modeling.
Proceedings of the ACM Multimedia Conference, 2013
Learning latent spatio-temporal compositional model for human action recognition.
Proceedings of the ACM Multimedia Conference, 2013
Second ACM multimedia workshop on geotagging and its applications in multimedia (GeoMM 2013).
Proceedings of the ACM Multimedia Conference, 2013
Learning by focusing: A new framework for concept recognition and feature selection.
Proceedings of the 2013 IEEE International Conference on Multimedia and Expo, 2013
Large-scale video event classification using dynamic temporal pyramid matching of visual semantics.
Proceedings of the IEEE International Conference on Image Processing, 2013
Designing Category-Level Attributes for Discriminative Visual Recognition.
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013
Hierarchical Feature Pooling with Structure Learning: A New Method for Pedestrian Detection.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013
Learning Locally-Adaptive Decision Functions for Person Verification.
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013
Efficient Maximum Appearance Search for Large-Scale Object Detection.
Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013
Action Detection by Fusing Hierarchically Filtered Motion with Spatiotemporal Interest Point Features.
Proceedings of the Human Behavior Recognition Technologies, 2013
2012
Hierarchical Filtered Motion for Action Recognition in Crowded Videos.
IEEE Trans. Syst. Man Cybern. Part C, 2012
Latent Community Topic Analysis: Integration of Community Discovery with Topic Modeling.
ACM Trans. Intell. Syst. Technol., 2012
Web-Scale Multimedia Information Networks.
Proc. IEEE, 2012
RankCompete: Simultaneous ranking and clustering of information networks.
Neurocomputing, 2012
BlueFinder: estimate where a beach photo was taken.
Proceedings of the 21st World Wide Web Conference, 2012
IBM Research and Columbia University TRECVID-2012 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), and Semantic Indexing (SIN) Systems.
Proceedings of the 2012 TREC Video Retrieval Evaluation, 2012
MediaCCNY at TRECVID 2012: Surveillance Event Detection.
Proceedings of the 2012 TREC Video Retrieval Evaluation, 2012
Submodular video hashing: a unified framework towards video pooling and indexing.
Proceedings of the 20th ACM Multimedia Conference, MM '12, Nara, Japan, October 29, 2012
GeoMM'12: ACM international workshop on geotagging and its applications in multimedia.
Proceedings of the 20th ACM Multimedia Conference, MM '12, Nara, Japan, October 29, 2012
Delta-SimRank computing on MapReduce.
Proceedings of the 1st International Workshop on Big Data, 2012
How Multimedia in Enterprise Social Networks Matters to People's Performance.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops, 2012
Video Event Detection Using Temporal Pyramids of Visual Semantics with Kernel Optimization and Model Subspace Boosting.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, 2012
Scene Aligned Pooling for Complex Video Recognition.
Proceedings of the Computer Vision - ECCV 2012, 2012
Beyond Mahalanobis distance: Learning second-order discriminant function for people verification.
Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012
IBM T.J. Watson Research Center, Multimedia Analytics: Modality Classification and Case-Based Retrieval Tasks of ImageCLEF2012.
Proceedings of the CLEF 2012 Evaluation Labs and Workshop, 2012
2011
Heterogeneous Feature Fusion for Visual Recognition.
PhD thesis, 2011
A general framework for efficient clustering of large datasets based on activity detection.
Stat. Anal. Data Min., 2011
Geographical topic discovery and comparison.
Proceedings of the 20th International Conference on World Wide Web, 2011
IBM Research and Columbia University TRECVID-2011 Multimedia Event Detection (MED) System.
Proceedings of the 2011 TREC Video Retrieval Evaluation, 2011
Diversified Trajectory Pattern Ranking in Geo-tagged Social Media.
Proceedings of the Eleventh SIAM International Conference on Data Mining, 2011
Learning to Search Efficiently in High Dimensions.
Proceedings of the Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, 2011
Compositional object pattern: a new model for album event recognition.
Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28, 2011
LPTA: A Probabilistic Model for Latent Periodic Topic Analysis.
Proceedings of the 11th IEEE International Conference on Data Mining, 2011
Large-scale image classification: Fast feature extraction and SVM training.
Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, 2011
Multimedia Information Networks in Social Media.
Proceedings of the Social Network Data Analytics, 2011
2010
Image Segmentation by MAP-ML Estimations.
IEEE Trans. Image Process., 2010
RankCompete: simultaneous ranking and clustering of web photos.
Proceedings of the 19th International Conference on World Wide Web, 2010
Videos Semantic Indexing using Image Classification.
Proceedings of the TRECVID 2010 workshop participants notebook papers, 2010
A Study on Sampling Strategies in Space-Time Domain for Recognition Applications.
Proceedings of the Advances in Multimedia Modeling, 2010
The wisdom of social multimedia: using flickr for prediction and forecast.
Proceedings of the 18th International Conference on Multimedia 2010, 2010
Action detection using multiple spatial-temporal interest point features.
Proceedings of the 2010 IEEE International Conference on Multimedia and Expo, 2010
Accurate and efficient reconstruction of 3D faces from stereo images.
Proceedings of the International Conference on Image Processing, 2010
A worldwide tourism recommendation system based on geotagged web photos.
Proceedings of the IEEE International Conference on Acoustics, 2010
Cross-dataset action detection.
Proceedings of the Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, 2010
Visual cube and on-line analytical processing of images.
Proceedings of the 19th ACM Conference on Information and Knowledge Management, 2010
2009
Image Annotation Within the Context of Personal Photo Collections Using Hierarchical Event and Scene Models.
IEEE Trans. Multim., 2009
Responses to the Comments on "Plane-Based Optimization for 3D Object Reconstruction from Single Line Drawings".
IEEE Trans. Pattern Anal. Mach. Intell., 2009
Responses to the Comments on "What the Back of the Object Looks Like: 3D Reconstruction from Line Drawings without Hidden Lines".
IEEE Trans. Pattern Anal. Mach. Intell., 2009
GAD: General Activity Detection for Fast Clustering on Large Data.
Proceedings of the SIAM International Conference on Data Mining, 2009
Enhancing semantic and geographic annotation of web images via logistic canonical correlation regression.
Proceedings of the 17th International Conference on Multimedia 2009, 2009
Action detection in complex scenes with spatial and temporal ambiguities.
Proceedings of the IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27, 2009
Heterogeneous feature machines for visual recognition.
Proceedings of the IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27, 2009
2008
Plane-Based Optimization for 3D Object Reconstruction from Single Line Drawings.
IEEE Trans. Pattern Anal. Mach. Intell., 2008
What the Back of the Object Looks Like: 3D Reconstruction from Line Drawings without Hidden Lines.
IEEE Trans. Pattern Anal. Mach. Intell., 2008
Surveillance Event Detection.
Proceedings of the TRECVID 2008 workshop participants notebook papers, 2008
Image annotation using personal calendars as context.
Proceedings of the 16th International Conference on Multimedia 2008, 2008
Annotating photo collections by label propagation according to multiple similarity cues.
Proceedings of the 16th International Conference on Multimedia 2008, 2008
Gender recognition from body.
Proceedings of the 16th International Conference on Multimedia 2008, 2008
Annotating collections of photos using hierarchical event and scene models.
Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 2008
Multiple feature fusion by subspace learning.
Proceedings of the 7th ACM International Conference on Image and Video Retrieval, 2008
2007
Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes.
Proceedings of the IEEE 11th International Conference on Computer Vision, 2007
Iterative MAP and ML Estimations for Image Segmentation.
Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 2007
2006
3D object retrieval using 2D line drawing and graph based relevance feedback.
Proceedings of the 14th ACM International Conference on Multimedia, 2006
Automatic Segmentation of Lung Fields from Radiographic Images of SARS Patients Using a New Graph Cuts Algorithm.
Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), 2006
Degen Generalized Cylinders and Their Properties.
Proceedings of the Computer Vision, 2006
2005
3D Object Reconstruction from a Single 2D Line Drawing without Hidden Lines.
Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV 2005), 2005