Fast and Accurate Content-based Semantic Search in 100M Internet Videos. Lu Jiang 1, Shoou-I Yu 1, Deyu Meng 2, Yi Yang 3, Teruko Mitamura 1, Alexander Hauptmann 1. 1 Carnegie Mellon University, 2 Xi'an Jiaotong University, 3 University of Technology Sydney

Acknowledgement: This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

Outline: Introduction, Proposed Approach, Experimental Results, Conclusions. 3

Introduction: We are living in an era of big multimedia data: 300 hours of video are uploaded to YouTube every minute; social media users are posting 12 million videos on Twitter every day; video will account for 80% of all the world's internet traffic by 2019. Video search is becoming a valuable source for

acquiring information and knowledge. Existing large-scale methods are still based on text-to-text matching (user text query to video metadata), which may fail in many scenarios: 66% of videos on the social media site Twitter are not associated with a hashtag or mention [Vandersmissen et al. 2014]. Baptist Vandersmissen, Frederic Godin, Abhineshwar Tomar, Wesley Neve, and Rik Van de Walle. The rise of mobile and social short-form video: an in-depth measurement study of vine. In ICMR Workshop on Social Multimedia and Storytelling, 2014.

4

Introduction: How to acquire information or knowledge in video if there is no way to find it? Existing large-scale methods are still based on text-to-text matching (user query to video metadata), which may fail in many scenarios: 66% of videos on the social media site Twitter are not associated with meaningful metadata (a hashtag or a mention) [Vandersmissen et al. 2014], and much video captured by mobile phones, surveillance cameras and wearable devices does not have any metadata at all. 6

Introduction: We address a content-based video retrieval problem which aims at searching videos solely based on content, without using any user-generated metadata

(e.g. titles or descriptions) or video examples. 7

Example Queries: In response to a query, our system should be able to find simple objects, actions, and speech words, and to search complex activities. Information need: people running away after an explosion in urban areas. Query (Boolean logical operators): urban_scene AND (walking OR running) OR fire OR smoke

OR audio:explosion TBefore(audio:explosion, running) (temporal operator) 8

Introduction: We are interested in searching hundreds of millions of videos within the maximum recommended waiting time for a user, i.e. 2 seconds [Nah, 2004], while maintaining maximum accuracy.
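As an illustration (not the system's actual query engine), the Boolean and temporal operators in the example query above could be evaluated over a single video's concept detection scores with fuzzy logic, where OR is a max and AND is a min over confidences in [0, 1]; the function names, grouping, and scores below are hypothetical.

```python
# Hypothetical sketch: score one video against the example query
# urban_scene AND (walking OR running) OR fire OR smoke OR audio:explosion,
# filtered by TBefore(audio:explosion, running).

def OR(*scores):
    return max(scores)

def AND(*scores):
    return min(scores)

def tbefore(events, a, b):
    """True if some occurrence of concept `a` ends before one of `b` starts.
    `events` maps a concept name to a list of (start_sec, end_sec) spans."""
    return any(end_a <= start_b
               for (_, end_a) in events.get(a, [])
               for (start_b, _) in events.get(b, []))

def score_video(c, events):
    """`c` maps concept name -> detection score for one video."""
    s = OR(AND(c["urban_scene"], OR(c["walking"], c["running"])),
           c["fire"], c["smoke"], c["audio:explosion"])
    # the temporal operator acts as a hard filter on shot-level occurrences
    return s if tbefore(events, "audio:explosion", "running") else 0.0

video = {"urban_scene": 0.9, "walking": 0.2, "running": 0.8,
         "fire": 0.1, "smoke": 0.3, "audio:explosion": 0.7}
spans = {"audio:explosion": [(4.0, 5.0)], "running": [(6.0, 9.0)]}
print(score_video(video, spans))  # 0.8: urban_scene AND running dominates
```

The max/min reading of OR/AND is only one possible semantics; a real engine could also sum or otherwise fuse modality scores.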

Fiona Fui-Hoon Nah. A study on tolerable waiting time: how long are web users willing to wait? Behaviour & Information Technology, 23(3):153-163, 2004. 9

From large-scale (200K videos) to web-scale (100M+ videos)

Let the above represent the upper bound of the current largest dataset for this problem (200K videos). 10

Result Overview: We propose a novel and practical solution that can scale up the search to hundreds of millions of Internet videos: 0.2 seconds to process a semantic query on 100 million videos. Within a system called E-Lamp Lite, we implemented the first of its kind large-scale multimedia search engine for Internet videos. It achieved the best accuracy in TRECVID MED zero-example search 2013 and 2014, the most representative task on this topic.

To the best of our knowledge, it is the first content-based video retrieval system that can search a collection of 100 million videos. 11 Outline Introduction Proposed Approach Experimental Results Conclusions 12 Framework

Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, Alexander Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos. In ACM International Conference on Multimedia Retrieval (ICMR), 2015. 13 Indexing Semantic Features Semantic features include ASR (speech), OCR (visible text), visual concepts and audio concepts. Indexing textual features like ASR and OCR is well studied. Indexing semantic concepts is not well understood. Existing methods index the raw detection score of semantic concepts by dense matrices [Mazloom et al. 2014][Wu et al. 2014][Lee et al. 2014] We propose a scalable semantic concept indexing method. The key is

a novel method called concept adjustment. Masoud Mazloom, Xirong Li, and Cees G. M. Snoek. Few-example video event retrieval using tag propagation. In ICMR, 2014. Shuang Wu, Sravanthi Bondugula, Florian Luisier, Xiaodan Zhuang, and Pradeep Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In CVPR, 2014. Hyungtae Lee. Analyzing complex events and human actions in in-the-wild videos. UMD Ph.D. thesis, 2014. 14

Method Overview (steps in the offline video indexing): represent the raw video (or video clip) by low-level features, then apply semantic concept detectors to obtain the raw score representation. Semantic concept detectors are of limited accuracy: the raw

detections are meaningful but very noisy. 15

Method Overview: The raw score representation has two problems. Distributional inconsistency: every video has every concept in the vocabulary (with a small but nonzero score). Logical inconsistency: a video may contain a terrier but not a dog. To address these problems, we introduce a novel step called concept adjustment which represents a video by a few salient and logically consistent visual/audio concepts. 16

Concept Adjustment Model

The proposed adjustment model is:

min_v ||v - f(M)||^2 + Ω(v)   s.t. v satisfies the logical-consistency constraints,

where v is the adjusted concept score and f(M) is a pooling over the raw detection score matrix M: each row corresponds to a shot and each column corresponds to a concept. The regularizer Ω enforces distributional consistency; the constraints (expressed via normalization and an indicator function) enforce logical consistency. Our goal is to generate video representations that tend to be similar to the underlying concept representation in terms of the distributional and logical consistency. 17

Concept Adjustment Model: Distributional Consistency. A naive ℓ0 regularizer is infeasible to solve. A more general regularizer is

Ω(v) = λ1 ||v||_1 + λ2 Σ_g ||v_g||_2.

When λ2 = 0, this reduces to the lasso (an ℓ1 approximation of the ℓ0 norm); when λ1 = 0, to the group lasso (nonzero entries in a sparse set of groups); when both are nonzero, to the sparse group lasso (a group-wise sparse solution in which only a few coefficients in each selected group are nonzero). 18

Distributional Consistency: A Toy Example. All the adjustment methods above are special cases of our adjustment model. 19

Concept Adjustment Model: Distributional Consistency. In the regularizer above: use the lasso when concepts are independent; use the group lasso when groups of concepts frequently co-occur, e.g. sky/cloud, beach/ocean/waterfront, table/chair, or the multimodal pair baby/baby_crying; use the sparse group lasso when only a few concepts in a co-occurring group are nonzero [Simon et al. 2013]. The choice of the model parameters depends on the underlying distribution of the semantic concepts in the dataset; we can cluster the concepts in their training data to get the co-occurring groups. Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse group lasso. Journal of Computational and Graphical

Statistics, 22(2):231-245, 2013. 20

Concept Adjustment Model: Logical Consistency. Subsumption and exclusion relations [Deng et al. 2014]; exclusion relations only make sense for shot-level features.
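A minimal sketch of the two adjustment ideas just described, assuming a lasso-style regularizer and a single-level concept hierarchy; this is an illustration, not the paper's exact optimization procedure, and `soft_threshold` / `enforce_subsumption` are hypothetical helper names.

```python
# (1) Distributional consistency via the lasso's proximal operator
#     (soft-thresholding), which zeroes out small, noisy detections.
# (2) Logical consistency via a simple subsumption repair: a child concept
#     (terrier) must not score higher than its parent (dog).

def soft_threshold(scores, lam):
    """Proximal operator of lam*||v||_1: shrink scores toward zero."""
    return {c: max(s - lam, 0.0) for c, s in scores.items()}

def enforce_subsumption(scores, parent_of):
    """Raise each parent to at least its child's score (terrier => dog)."""
    adjusted = dict(scores)
    for child, parent in parent_of.items():
        adjusted[parent] = max(adjusted.get(parent, 0.0),
                               adjusted.get(child, 0.0))
    return adjusted

raw = {"dog": 0.30, "terrier": 0.80, "cat": 0.05, "sky": 0.10}
v = soft_threshold(raw, lam=0.08)            # cat drops to exactly 0
v = enforce_subsumption(v, {"terrier": "dog"})
print(v)  # dog is lifted to match terrier's adjusted score
```

For a multi-level hierarchy the repair would need to propagate in topological order, and exclusion relations would additionally forbid two mutually exclusive concepts from both being nonzero in the same shot.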

Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In ECCV, 2014. The resulting integer program is solved by a mixed-integer toolbox or by constraint relaxation. 22

Indexing Semantic Features: Finally, the adjusted concept representation is indexed by an inverted index.
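A minimal sketch of what such an inverted index over adjusted concept scores might look like; the `ConceptIndex` class below is a hypothetical illustration, not the E-Lamp Lite implementation. Because concept adjustment drives most scores to exactly zero, each concept only stores postings for the few videos where it is salient, which is what keeps the index small and the lookup fast.

```python
# Hypothetical inverted index: concept -> posting list of (video_id, score).
from collections import defaultdict

class ConceptIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # concept -> [(video_id, score)]

    def add(self, video_id, adjusted_scores):
        # concept adjustment made most scores exactly zero; index the rest
        for concept, score in adjusted_scores.items():
            if score > 0.0:
                self.postings[concept].append((video_id, score))

    def search(self, concept, min_score=0.0):
        """Score-range search, e.g. score(dog, >=, 0.7)."""
        hits = [p for p in self.postings[concept] if p[1] >= min_score]
        return sorted(hits, key=lambda p: -p[1])

index = ConceptIndex()
index.add("vid1", {"dog": 0.9, "running": 0.4})
index.add("vid2", {"dog": 0.6})
index.add("vid3", {"cat": 0.8})
print(index.search("dog", min_score=0.7))  # [('vid1', 0.9)]
```

Boolean operators would then be posting-list intersections/unions, and temporal operators would require shot-level (video_id, shot, score) postings rather than video-level ones.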

The index structure needs to be modified to account for: indexing real-valued concepts; indexing the shot-level scores; supporting Boolean logical and temporal operators. 24

Indexing Semantic Features: The adjusted concept representation is indexed by the inverted index, including the real-valued scores. Our index supports: modality search: visual:dog, ocr:dog; score range search: score(dog, >=, 0.7); basic temporal search: tbefore(dog, cat), twindow(3s, dog, cat); Boolean logical search: dog AND NOT

score(cat, >=, 0.5) 25

Experiments on MED. Dataset: MED13Test and MED14Test (around 25,000 videos); each set contains 20 events. Official evaluation metric: Mean Average Precision (MAP). Supplementary metrics: Mean Reciprocal Rank (1 / rank of the first relevant sample) [Voorhees, 1999], [email protected], [email protected]. Configurations: NIST's HEX graph is used for IACC; we build the HEX graphs for the other semantic concept features; raw prediction scores of the 3000+ concepts trained in [Jiang et al. 2015] are used. E.M. Voorhees. The TREC-8 Question Answering Track Report. In Proceedings of the 8th Text Retrieval Conference, 1999. Lu Jiang, Shoou-I Yu, Deyu Meng, Teruko Mitamura, Alexander Hauptmann. Bridging the Ultimate Semantic Gap: A Semantic Search Engine for Internet Videos. In ICMR, 2015. 27

Experiments on MED

Comparison of the raw and the adjusted representation (baseline): the adjusted representation yields a 33x smaller index size with comparable performance; the accuracy of the proposed method is comparable to that of the baseline method. 28

Experiments on MED: better performance.

Comparison of the full adjustment model with its special case, top-k thresholding. The MAP is low because here we only use 346 semantic features. 29

Experiments on the SIN dataset: We test the adjustment method on the TRECVID SIN dataset, where ground-truth labels on each video shot are available. We test on 1500 shots in 961 videos, evaluated by Root Mean Squared Error (RMSE). The proposed method is more accurate than

the baseline methods. 30

Experiments on 100M Videos: the scalability and efficiency test on 100 million videos. The baseline method (raw score representation) fails when the data reaches 5 million videos. Our method can scale to 100M videos:

it takes 0.2 s on a single core (on-line search time), creates an on-disk inverted index of 20 GB, and uses 512 MB of memory. The proposed method is scalable and efficient. 31

Experiments on YFCC (Yahoo Flickr Creative Commons)

We manually created queries for 30 products, putting commercials about each product into related videos (in-video ads). We search over the 800K videos in the dataset and place ads in relevant videos on Flickr. Queries and more results are available at: https://sites.google.com/site/videosearch100m/ 32

Experiments on YFCC

We evaluate the relevance of the top 20 returned results (average performance for the 30 commercials on YFCC). Queries and more results are available at: https://sites.google.com/site/videosearch100m/ 33

Experiments on YFCC

Queries and more results are available at: https://sites.google.com/site/videosearch100m/ 34 Outline Introduction Proposed Approach Experimental Results Conclusions 35 Conclusions

We proposed a scalable semantic concept indexing method that extends the current scale of video search by a few orders of magnitude while maintaining state-of-the-art retrieval performance. The key is a novel step called concept adjustment that represents a video by a few salient and consistent concepts which can be efficiently indexed by a modified inverted index. Take home: experimental results show that our system can search 100 million Internet videos within 0.2 seconds. We share our concept features of the 0.8 million videos in the YFCC dataset. 36

THANK YOU. QUESTIONS? 37
