Tutorial on Latent Semantic Indexing - Columbia University

CLiMB: Computational Linguistics for Metadata Building
Center for Research on Information Access
Columbia University Libraries
January 21, 2003 CLiMB - Columbia University

Overall Goals
  Research: Development of richer retrieval through increased numbers of descriptors
  Research and Practice: Creation of enabling technologies for new large digitization projects
  Research and Practice: Expand capability for cross-collection searching
  Practice: Development of suite of CLiMB tools
  Resources: Vocabulary list which can be used by other visual resource professionals

The essence of CLiMB
  Use scholars themselves as catalogers by utilizing scholarly publications
  Enhance existing descriptive metadata

Computational Linguistic Techniques
  What techniques have we tried?
    Goal: Identify high quality metadata terms
    Goal: Use metadata for finding images
  How well have they worked?
  What else do we want to try?

Text about Images
  The Blacker House is known for its porte cochère and adjacent terraces. Samuel Parker Williams, an occasional Greene collaborator, worked on the site, particularly on the sandstone boulder foundation for the sleeping porch.
  -- Based on Bosley

Techniques We Have Tried
  Supervised (using existing resources)
    Matching algorithms - proper names & variants
    Back of book index analysis
    Composite list of terms from authoritative lists
  Unsupervised
    Part of speech tagging
    Noun phrase identification
    Proper noun identification

What about LSI?
  Latent Semantic Indexing
    Builds a representation of a document
    Effective in information retrieval
  Why not for CLiMB?
    LSI is useful for text query and document retrieval
    LSI, a statistical technique, removes phrasal info
    CLiMB needs high quality phrases
    May be useful in later stages

Indexing for What Purpose
  Index = find important terms and phrases
    sleeping porch
    occasional collaborator
    sandstone boulder foundation
  Index = characterize a document with a set of terms that occur in the doc
    sleep*, porch, occas*, collaborat*, foundat*
    enables location of docs with similar profile
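The second sense of indexing above (a document characterized by a set of truncated terms, used to locate documents with a similar profile) can be sketched in a few lines. This is only an illustration: the suffix list, the stopword list, and the function names `crude_stem`, `profile`, and `jaccard` are assumptions made for this sketch, and the stemmer is deliberately crude rather than any established algorithm.

```python
import re

# Hypothetical suffix and stopword lists, chosen only for this illustration.
SUFFIXES = ("ation", "ional", "ing", "ion", "or", "al", "s")
STOPWORDS = {"the", "a", "an", "for", "on", "of", "and", "its", "is"}

def crude_stem(word):
    """Strip the first matching suffix, keeping at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def profile(text):
    """Return the set of stems that characterizes a document."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(w) for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

doc_a = "an occasional Greene collaborator worked on the sleeping porch"
doc_b = "the sleeping porch foundation"
print(sorted(profile(doc_a)))
print(jaccard(profile(doc_a), profile(doc_b)))
```

Two documents that share stems such as `sleep` and `porch` receive a non-zero score, which is the sense in which a stem profile enables finding documents with a similar profile.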

Finding Similar Documents
  Linear Algebra Techniques
    Latent Semantic Indexing
    Singular Value Decomposition (SVD)
    Semidiscrete Decomposition
  Vector Space Models
    Term by Document matrices
    Term Weighting
    Polysemy and Synonymy
  Clustering Techniques
    K-means
    EM Clustering
    Wavelet
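As a rough illustration of the vector space model named above, the sketch below builds a term-by-document matrix of raw counts and compares two document columns by cosine similarity. It stops short of LSI itself (no SVD is computed), uses no term weighting, and the sample documents and function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in doc j."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def column(matrix, j):
    """Extract the vector for document j."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["sleeping porch foundation", "porch and terraces", "kitchen god"]
vocab, matrix = term_document_matrix(docs)
print(vocab)
print(cosine(column(matrix, 0), column(matrix, 1)))
print(cosine(column(matrix, 0), column(matrix, 2)))
```

Documents that share no surface terms score zero here even if they are synonymous, which is the polysemy and synonymy problem that motivates decompositions such as SVD.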

Computational Linguistic Techniques
  What techniques have we tried?
    Goal: Identify high quality metadata terms
    Goal: Load metadata into image search database
    Goal: Use enriched metadata for finding images
  How well have they worked?
  What else do we want to try?

Art Object Identification (AO-ID)
  Need Unique Identifiers
    Key of database records
    Varies from collection to collection
      Greene & Greene: Project Names
      Chinese Paper Gods: God Names
      South Asian Temples: Temple Names

Compile list of subject vocabulary
Find meaningful terms in texts
Segment relevant texts
Collect terms from all sources.
Identify and link AO-ID described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate

Create Composite List of Subject Terms
  Philosophy: Use whatever resources exist
  Catalog records
    Robert R. Blacker house (Pasadena, Calif.)
    Greene, Charles Sumner
    Blacker, Robert R.
  Art and Architecture Thesaurus
    porte cochère
  Back of the book index
    Blacker house

Progress: Composite List
  Greene & Greene
    Extracted back of the book indexes
    Direct matching of index terms to the text
  Terms found - highlighted in yellow
    David Gamble
    Pasadena
    Westmoreland Place
    furniture
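Direct matching of index terms against the text, with the hits highlighted, might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's tool: the function names, the `[[` and `]]` markers standing in for the yellow highlight, and the sample sentence are all invented for the example.

```python
import re

def find_terms(text, terms):
    """Return (start, end, matched_text) for each case-insensitive whole-word hit."""
    if not terms:
        return []
    # Longest terms first, so a longer term wins over a shorter overlapping one.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

def highlight(text, terms, left="[[", right="]]"):
    """Wrap every matched term in marker strings."""
    out, last = [], 0
    for start, end, _ in find_terms(text, terms):
        out.append(text[last:start])
        out.append(left + text[start:end] + right)
        last = end
    out.append(text[last:])
    return "".join(out)

index_terms = ["David Gamble", "Pasadena", "Westmoreland Place", "furniture"]
sample = "David Gamble lived on Westmoreland Place in Pasadena."  # illustrative sentence
print(highlight(sample, index_terms))
```

Exact matching of this kind finds only the surface form listed in the index, so name variants go undetected.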

Three Term Types and Approaches
  1) Art Object ID names and other proper nouns important to the domain (Charles Pratt)
     Named Entity noun phrase finders, POS taggers
  2) Common noun terms, semantically significant to the domain (V-shaped plan)
     List of domain terms from authority sources
  3) Common noun phrases in a generic domain vocabulary (chimney)
     Statistical methods for identifying relevant terms

Part of Speech (POS) taggers
  Why use a part of speech tagger?
    To identify nouns, verbs and proper nouns
    The Blacker House is known for its porte cochère
    The Blacker House is known for its adjacent terraces
  Strength: An essential step that allows the rest of the system to work
  Weakness: The best POS taggers have 95% accuracy
    A typical 20-word sentence is likely to have a mistake!
  But: some errors do not matter much
    E.g. sleeping porch

What We Tried: POS Taggers
  Mitre Alembic WorkBench
    Freeware from Mitre Corporation
    Strong for proper nouns
    Average for common nouns
  IBM's Nominator
    Accurate for both
    Restrictive licensing

Proper Nouns: Alembic WorkBench Results
  91.2% recall
    Misses: The senior Pratt, Hall brothers
  97.5% precision using Alembic
    Successfully finds: William Isaac Ott, University of California
  This is very good!
  Highlighted in light green
    Mary Greene
    Persian
    Etc.
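The earlier remark that a 95%-accurate tagger is likely to make a mistake somewhere in a typical 20-word sentence can be checked with a short calculation. The sketch assumes that each word is tagged correctly independently of the others, which is a simplification not stated on the slides; the function names are invented for the example.

```python
def p_at_least_one_error(token_accuracy, sentence_length):
    """Chance that a sentence contains one or more tagging errors,
    assuming each token is tagged correctly independently."""
    return 1.0 - token_accuracy ** sentence_length

def expected_errors(token_accuracy, sentence_length):
    """Average number of mis-tagged tokens per sentence."""
    return (1.0 - token_accuracy) * sentence_length

print(round(p_at_least_one_error(0.95, 20), 3))  # about 0.64
print(round(expected_errors(0.95, 20), 3))       # about 1.0
```

Under that assumption roughly two sentences in three contain at least one wrong tag, and the average is one error per sentence.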

Noun Phrase Chunking
  [The [Blacker House]] is known for [[its Porte Cochère] and [adjacent terraces]]. [Samuel Parker Williams], [an occasional Greene collaborator], worked on [the site], particularly on [the [[sandstone boulder] foundation]] for [the [sleeping porch]].
  -- Based on Bosley

NP Chunkers
  Columbia's LinkIT
    Regular expression grammar over POS tags
    Improves WorkBench results through finding simplex NPs
  LTChunk
    By LTG Group, University of Edinburgh
    Not as many NPs
  Arizona - commercialized
  IBM - also commercial

Results: Proper Nouns
  Tool                 Precision   Recall
  Alembic WorkBench    97.50       91.20
  LinkIT               68.94       98.81
  LTChunk              68.13       63.48
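To make the phrase "regular expression grammar over POS tags" from the NP chunker slide concrete, here is a minimal sketch of the general idea. It is not LinkIT's grammar: the single pattern, the Penn Treebank style tag names, and the hand-assigned tags in the example are all assumptions made for illustration.

```python
import re

# Simplex NP: optional determiner or possessive pronoun, any adjectives,
# then one or more nouns (common or proper, singular or plural).
NP_PATTERN = re.compile(r"(?:<DT>|<PRP\$>)?(?:<JJ>)*(?:<NNPS?>|<NNS?>)+")

def chunk_noun_phrases(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    phrases = []
    for m in NP_PATTERN.finditer(tag_string):
        # Counting "<" before an offset converts it back to a token index.
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases

# Tags assigned by hand for this example, not produced by a tagger.
sentence = [
    ("The", "DT"), ("Blacker", "NNP"), ("House", "NNP"), ("is", "VBZ"),
    ("known", "VBN"), ("for", "IN"), ("its", "PRP$"),
    ("adjacent", "JJ"), ("terraces", "NNS"),
]
print(chunk_noun_phrases(sentence))
```

Because the grammar sees only tags, a tagging error upstream changes which chunks are found.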

Results: Proper Nouns
  [Chart: proper nouns in Bosley Chapter 5; series LTChunk, WorkBench and LinkIT, WorkBench; categories Precision and Recall; axis 0 to 100]

Results: NP Chunking
  Highlighted in purple:
    The design process
    The southwest
    adobe-stucco
    July 1907

Experiments with Algorithms
  TF/IDF and term frequency ratios
    Filter technical terms from frequent common nouns
    Term frequency ratio algorithm to improve accuracy
  Co-occurrence
    Useful terms may appear near other good ones
  Machine learning
    Use learning algorithms to discover complex associational context
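The slides do not give a formula for the term frequency ratio, so the following is only one common formulation, offered as an assumption: divide a term's relative frequency in the domain text by its relative frequency in a general reference text, so that technical terms rise above frequent common words. The function names, the smoothing constant, and the toy token lists are all hypothetical.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Share of all tokens accounted for by each distinct term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def frequency_ratios(domain_tokens, general_tokens, smoothing=1e-6):
    """Domain relative frequency divided by general relative frequency.
    The smoothing value avoids division by zero for unseen terms."""
    domain = relative_frequencies(domain_tokens)
    general = relative_frequencies(general_tokens)
    return {
        term: freq / (general.get(term, 0.0) + smoothing)
        for term, freq in domain.items()
    }

def top_terms(domain_tokens, general_tokens, k=3):
    """The k terms with the highest ratio, ties broken alphabetically."""
    ratios = frequency_ratios(domain_tokens, general_tokens)
    return sorted(ratios, key=lambda t: (-ratios[t], t))[:k]

domain = "the porch the gable the terrace the porch".split()
general = "the cat sat on the mat near the house by the road".split()
print(top_terms(domain, general, k=2))
```

A word such as "the" is frequent in both texts and so gets a ratio near 1, while domain words absent from the reference text score far higher.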

What is Segmentation?
  Divide texts into cohesive chunks
  Needed for determining associational context
  Needed to determine what terms are related to an art object

Results: Segmentation
  [Chart: "Project People, Frequency"; frequency (0 to 12) by paragraph (1 to 49) for the names Cole, Bolton, Thorsen, Pratt, Gamble, Blacker, Robinson, Ford]
  Use the frequency with which our terms appear within a document to estimate where the document is about that term
  This graph shows where different names are mentioned in Bosley on Greene & Greene Ch. 5

What We've Tried: Segmenters
  Marti Hearst's TextTiling
    Performs well for a general algorithm, but not sufficient for this specialized task
    M. Hearst, ACL, 1993
  F. Choi's C99 segmenter
    Performance comparable to TextTiling
    F. Y. Y. Choi, NAACL, 2000
  Frequency ratio approach outperformed TextTiling
    In-house tool to be tested
    Kan & Klavans, WVLC-6, 1998, Segmenter

Meronymy as Part-Of
  Why is this potentially useful?
    A method for identifying hot paragraphs
    Descriptive text contains part-of relations
    Details that correlate to the whole
    Porch is a part of house
  An early hypothesis in testing stages

Meronymy for Cohesion
  The Spinks house design is an elaboration of the rectangular, large-gabled form of the California House ... has porches and terraces. In front, an expanse of lawn rises nearly to the level of the entry terrace. The front door is approached obliquely in the shaded recess of the terrace.
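The meronymy hypothesis (paragraphs dense in part-of terms are likely to be descriptive "hot paragraphs") can be illustrated with a toy scorer. Everything here is an assumption for the sake of the sketch: the tiny hand-built `PARTS_OF` lexicon, the threshold, and the function names. A lexical resource would normally supply the part-of relations.

```python
import re

# Hypothetical, hand-built part-of lexicon used only for this example.
PARTS_OF = {"house": {"porch", "terrace", "door", "entry", "chimney"}}

def part_mentions(paragraph, whole, parts_of=PARTS_OF):
    """Count words in the paragraph that name a part of `whole`."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    parts = parts_of.get(whole, set())
    # A simple plural is accepted by also checking the word without a final "s".
    return sum(
        1 for w in words
        if w in parts or (w.endswith("s") and w[:-1] in parts)
    )

def hot_paragraphs(paragraphs, whole, threshold=2):
    """Indexes of paragraphs with at least `threshold` part mentions."""
    return [
        i for i, p in enumerate(paragraphs)
        if part_mentions(p, whole) >= threshold
    ]

paragraphs = [
    "Samuel Parker Williams worked on the site.",
    "In front, an expanse of lawn rises nearly to the level of the entry "
    "terrace. The front door is approached obliquely in the shaded recess "
    "of the terrace.",
]
print(hot_paragraphs(paragraphs, "house"))
```

The second paragraph mentions entry, terrace, and door, so it is flagged, while the first, which names a person and a site, is not.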

Meronymy and Other Relations
  [Diagram: The California House; Other Houses; Spinks House; porch; terrace; entry terrace; front entry; front door]

Progress: Project Name Matching
  Finding project names in Greene & Greene
  Challenge: finding variations
    AO-ID: Robert Roe Blacker House
    RRB House
    The house
    1214 Fairlawn Terrace.
  Possible techniques to improve matching
    Developing a semi-automatic technique
    Use existing information to label text
    An iterative platform for manual intervention

Variants of The Culbertson House
  Cordelia A. Culbertson house (Pasadena, Calif.)
  Francis F. Prentiss house (Pasadena, Calif.)
  Culbertson sisters house (Pasadena, Calif.)
  Prentiss, Francis F.
  Culbertson, Cordelia A.
  Allen, Elizabeth S.
  Allen, Mrs. Dudley P.
  House was purchased by Allens, who remarried and became Prentiss!

Zaoshen (Chinese deity)
  USE FOR: Dingfuzhenjun (Chinese deity)
  USE FOR: Kitchen God (Chinese deity)
  USE FOR: Simingzaojun (Chinese deity)
  USE FOR: Simingzaoshen (Chinese deity)
  USE FOR: Ssu-ming-tsao-chun (Chinese deity)
  USE FOR: Ssu-ming-tsao-shen (Chinese deity)
  USE FOR: Ting-fu-chen-chun (Chinese deity)
  USE FOR: Tsao-chun (Chinese deity)
  USE FOR: Tsao-shen (Chinese deity)
  USE FOR: Tsao-wang (Chinese deity)
  USE FOR: Tsao-wang-yeh (Chinese deity)
  USE FOR: Zaojun (Chinese deity)
  USE FOR: Zaowang (Chinese deity)
  REFERENCE: Encyc. Britannica (Tsao Shen, pinyin Zao Shen, in Chinese mythology, the god of the kitchen (god of the hearth), who is believed to report to the celestial gods on family conduct and have it within his power to bestow poverty or riches on individual families; has also been confused with Ho Shen (god of fire) and Tsao Chun (Furnace Prince))
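An authority record of this kind, one preferred heading with many USE FOR variants, lends itself to a simple lookup table that maps any variant back to the preferred form. The sketch below is illustrative only: the function names and the normalization rule are assumptions, only a few of the variants are included, and the "(Chinese deity)" qualifier is omitted for brevity.

```python
def normalize(name):
    """Lower-case and keep only letters and digits, so that
    'Tsao-shen' and 'tsao shen' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def build_variant_index(records):
    """Map every variant, and the preferred heading itself, to the
    preferred heading. `records` maps a heading to its USE FOR list."""
    index = {}
    for preferred, variants in records.items():
        for name in [preferred, *variants]:
            index[normalize(name)] = preferred
    return index

def preferred_term(name, index):
    """Return the preferred heading for a name, or None if unknown."""
    return index.get(normalize(name))

records = {"Zaoshen": ["Kitchen God", "Tsao-shen", "Tsao-wang", "Zaojun"]}
index = build_variant_index(records)
print(preferred_term("kitchen god", index))
print(preferred_term("Tsao Shen", index))
print(preferred_term("Ho Shen", index))
```

Aggressive normalization can merge names that should stay distinct, so a real system would probably want a more conservative rule.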

Some Data to Illustrate
  Unaltered Project Names
    0 matches (both case sensitive and insensitive)
  Case Insensitive Project Name matching
    4 matches
      {Theodore Irwin house} occurs 1 time
      {California Institute of Technology} occurs 1 time
      {William R. Thorsen house} occurs 1 time
      {William T. Bolton house} occurs 1 time
    At least double in the chapter

A Future Solution
  Bootstrapping algorithm
    Seed terms hand labelled
    Terms mapped into multi-dimensional feature space
    Other terms that are close to the seed terms are added to the set
  Features:
    Window size
    Headedness
    Modifier similar to that of a seed term
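The bootstrapping idea outlined above (start from hand-labelled seeds, then add terms that lie close to them in a feature space) can be sketched as follows. This is a simplified illustration under stated assumptions: the only feature is the set of words within a fixed window, closeness is Jaccard overlap, and the threshold, function names, and toy token sequence are invented. Headedness and modifier similarity are not modelled.

```python
from collections import Counter

def context_features(tokens, term, window=2):
    """Count the words seen within `window` positions of each occurrence of `term`."""
    features = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            features.update(
                t for j, t in enumerate(tokens[lo:hi], lo) if j != i
            )
    return features

def overlap(a, b):
    """Jaccard overlap between the sets of context words of two terms."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0

def bootstrap(tokens, seeds, candidates, threshold=0.5, window=2):
    """Grow the seed set with candidates whose contexts resemble an accepted term."""
    accepted = set(seeds)
    features = {
        t: context_features(tokens, t, window)
        for t in accepted | set(candidates)
    }
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for cand in sorted(set(candidates) - accepted):
            if any(overlap(features[cand], features[s]) >= threshold
                   for s in accepted):
                accepted.add(cand)
                changed = True
    return accepted

tokens = ("the gamble house has a porch the blacker house has a terrace "
          "the old lawn rises").split()
print(sorted(bootstrap(tokens, {"gamble"}, ["blacker", "lawn"])))
```

In the toy data "blacker" shares the context of the seed "gamble" (both precede "house has") and is added, while "lawn" appears in a different context and is left out.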

Summary: Research Tools Tested
  Part of Speech Taggers
  Noun Phrase Chunkers
  Merging techniques
  Proper Noun Finders
  Proper Name Variant Finder
  Segmenters

Future: Determine relationships
  The Blacker House is related to Greene
    The Greenes built the house.
  Porte Cochère is related to Blacker House
    because it is directly a part of the house.
  William Isaac Ott is related to
    Blacker House (on which he worked)
    Greene (with whom he worked).
  Detecting these semantic relationships statistically is a challenge for our next steps:
    Co-occurrence
    Use of subject headings
    Meronymy and other relations (WordNet)
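Of the next steps listed, co-occurrence is the simplest to illustrate: count how often two names are mentioned in the same paragraph and treat frequent pairs as candidate relationships. The sketch below is a hedged example, not the project's method; the function name and the sample paragraphs are invented, and names are located by plain substring search.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(paragraphs, names):
    """Count, for each pair of names, the paragraphs that mention both."""
    counts = Counter()
    for paragraph in paragraphs:
        text = paragraph.lower()
        # Plain substring search: simple, but it can over-match short names.
        present = sorted(n for n in names if n.lower() in text)
        counts.update(combinations(present, 2))
    return counts

# Illustrative paragraphs written for this example.
paragraphs = [
    "Ott worked on the Blacker House with Greene.",
    "The Blacker House is known for its adjacent terraces.",
    "The Greenes built the Blacker House.",
]
names = ["Blacker House", "Greene", "Ott"]
for pair, n in sorted(cooccurrence_counts(paragraphs, names).items()):
    print(pair, n)
```

Raw counts say only that two names are associated, not how; labelling the relation (built, worked on, part of) needs additional evidence such as subject headings or part-of relations.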

Thank you! Any questions?
www.columbia.edu/cu/cria/climb
