Semi-automatic methods for WordNet Construction


German Rigau i Claramunt
http://www.lsi.upc.es/~rigau
TALP Research Center, Universitat Politècnica de Catalunya

Eneko Agirre
http://www.ji.si.upc.es/users/eneko
IxA NLP Group, University of the Basque Country

2002 International WordNet Conference

Setting

NLP and the Lexicon
- Theoretical: WG, GPSG, HPSG
- Practical: realistic complexity and coverage
- Lexical bottleneck (Briscoe 91)
- Even worse for languages other than English

Setting
- Which LK (lexical knowledge) is needed by a concrete NLP system?
- Where is this LK located?
- Which procedures can be applied?

Setting: Which LK is needed by a concrete NLP system?
- Phonology: phonemes, stress, etc.
- Morphology: POS, etc.
- Syntactic: category, subcat., etc.
- Semantic: class, SRs, etc.
- Pragmatic: usage, registers, TDs, etc.
- Translations: translation links

Setting: Where is this LK located?
- Human brain
- Structured Lexical Resources: monolingual and bilingual MRDs, thesauri
- Unstructured Lexical Resources: monolingual and bilingual corpora
- Mixing resources

Setting: Which procedures can be applied?
- Prescriptive approach: machine-aided manual construction
- Descriptive approach: automatic acquisition from pre-existing Lexical Resources
- Mixed approach

Outline
- Setting
- Words and Works
- Merge approach
  - Taxonomy construction: monolingual MRDs
  - Mapping taxonomies: bilingual MRDs
- Expand approach
  - Translation of synsets: bilingual MRDs
- Interface for manual revision
- Conclusions

Words and Works

Where is this Lexical Knowledge located?

Human brain:
- Linguistic String Project (Fox et al. 88): lexical information for 10,000 entries
- WordNet (Miller et al. 90): semantic information; v1.6 with 99,642 synsets
- Comlex (Grishman et al. 94): syntactic information for 38,000 English words
- CYC Ontology (Lenat 95): a person-century of effort to produce 100,000 terms
- LDOCE3-NLP: dictionary with 80,000 senses

Structured Lexical Resources

Monolingual MRDs: LDOCE
- learner's dictionary
- 35,956 entries and 76,059 definitions
- 86% semantic and 44% pragmatic codes
- controlled vocabulary of 2,000 words
- (Boguraev & Briscoe 89), (Vossen & Serail 90), (Bruce & Guthrie 92), (Wilks et al. 93), (Dolan et al. 93), (Richardson 97)

Other monolingual MRDs:
- Webster's (Jensen & Ravin 87)
- LPPL (Artola 93)
- DGILE (Castellón 93), (Taulé 95), (Rigau 98)
- CIDE (Harley & Glennon 97)
- AHD (Richardson 97)
- WordNet (Harabagiu 98)

Bilingual MRDs:
- Collins Spanish/English (Knight & Luk 94)
- Vox/Harrap's Spanish/English (Rigau 98)

Thesauri:
- Roget's Thesaurus: 60,071 words in 1,000 categories (Yarowsky 92), (Grefenstette 93), (Resnik 95)
- Roget's II and The New Collins Thesaurus (Byrd 89)
- Macquarie's thesaurus (Grefenstette 93)
- Bunrui Goi Hyou Japanese thesaurus (Utsuro et al. 93)

Encyclopaedia:
- Grolier's Encyclopaedia (Yarowsky 92)
- Encarta (Richardson et al. 98)

Others:
- Telephonic Guides

Mixing structured lexical resources:
- Roget's Thesaurus and Grolier's (Yarowsky 92)
- LDOCE, WN, Collins, ONTOS, UM (Knight & Luk 94)
- Japanese MRD to WN (Okumura & Hovy 94)
- LLOCE, LDOCE (Chen & Chang 98)

Unstructured Lexical Resources

Corpora: WSJ, Brown Corpus (SemCor), Hansard
- Proper Nouns (Hearst & Schütze 95)
- Idiosyncratic Collocations (Church et al. 91)
- Preposition preferences (Resnik and Hearst 93)
- Subcategorization structures (Briscoe and Carroll 97)
- Selectional restrictions (Resnik 93), (Ribas 95)
- Thematic structure (Basili et al. 92)
- Word semantic classes (Dagan et al. 94)
- Bilingual Lexicons for MT (Fung 95)

Using both structured and non-structured Lexical Resources
- MRDs and Corpora (Liddy & Paik 92), (Klavans & Tzoukermann 96)
- WordNet and Corpora (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy 01)

International Projects on Lexical Acquisition

Japanese Projects

EDR (Yokoi 95)
- Nine-year project oriented to MT
- Bilingual corpora with 250,000 words
- Monolingual, bilingual and co-occurrence dictionaries
- 200,000 general vocabulary
- 100,000 technical terminology
- 400,000 concepts

American Projects
- Comlex (Grishman et al. 94): syntactic information for 38,000 words
- WordNet (Miller 90): semantic information; more than 123,000 words organised in 99,000 synsets; more than 116,000 relations between synsets
- Pangloss (Knight & Luk 94): PUM, ONTOS, LDOCE semantic categories, WordNet
- Cyc (Lenat 95): common-sense knowledge; 100,000 concepts and 1,000,000 axioms

European Projects
- Acquilex I and II: LA from monolingual and bilingual MRDs and corpora
- LE-Parole: large-scale harmonised set of corpora and lexicons for all the EU languages
- EuroWordNet: multilingual WordNet for several European languages
- Meaning: large-scale acquisition of LK from the web; large-scale WSD

Words and Works: Lexical Acquisition from MRDs
- Syntactic Disambiguation (Dolan et al. 93)
- Semantic Processing (Vanderwende 95)
- WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98)
- IR (Krovetz & Croft 92)
- MT (Knight and Luk 94), (Tanaka & Umemura 94)
- Semantically enriching MRDs (Yarowsky 92), (Knight 93), (Chen & Chang 98)
- Building LKBs (Bruce & Guthrie 92), (Dolan et al. 93), (Artola 93), (Castellón 93), (Taulé 95), (Rigau 98)

Words and Works: Acquisition of LK from MRDs

This tutorial focuses on:
- the massive acquisition of LK
- from MRDs (conventional, in any language)
- using (semi-)automatic methodologies

Why MRDs?
The conventional dictionaries for human use usually contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propriety; etymological, syntactic and semantic information about the most basic units of the language (Amsler 81)

Main Problems of MRDs
- Conventional dictionaries are not systematic
- Dictionaries are built for human use
- Implicit knowledge
- Words are described/translated in terms of words

MRDs and Semantic Knowledge
- jardín_1_1 Terreno donde se cultivan plantas y flores ornamentales. (land where ornamental plants and flowers are grown)
- florero_1_4 Maceta con flores. (pot with flowers)
- ramo_1_3 Conjunto natural o artificial de flores, ramas o hierbas. (natural or artificial bunch of flowers, branches or herbs)
- pétalo_1_1 Hoja que forma la corola de la flor. (leaf that forms the corolla of the flower)
- tálamo_1_3 Receptáculo de la flor. (receptacle of the flower)
- miel_1_1 Substancia viscosa y muy dulce que elaboran las abejas, en una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales. (viscous, very sweet substance that bees produce, in a distension of the oesophagus, from the juice of flowers and then deposit in the cells of their honeycombs)
- florería_1_1 Floristería; tienda o puesto donde se venden flores. (florist's; shop or stall where flowers are sold)
- florista_1_1 Persona que tiene por oficio hacer o vender flores. (person whose trade is making or selling flowers)
- camelia_1_1 Arbusto cameliáceo de jardín, originario de Oriente, de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica). (garden shrub of the camellia family, native to the Orient, with perennial glossy leaves and large white, red or pink flowers)
- camelia_1_2 Flor de este arbusto. (flower of this shrub)
- rosa_1_1 Flor del rosal. (flower of the rose bush)

Merge approach

Main Methodology
[Diagram with boxes MRD1 ... MRDn, LDB1 ... LDBn, Tax1 ... Taxn, LKB1 ... LKBn and MLKB]

Main Methodology
- Taxonomy construction (Rigau et al. 98, 97): monolingual MRDs
  - Step 1: Selection of the main top beginners for a semantic primitive
  - Step 2: Exploiting genus, construction of taxonomies for each semantic primitive
- Mapping taxonomies (Daudé et al. 99): bilingual MRDs
  - Step 3: Creation of translation links

Merge approach: Taxonomy Construction

Methodology: problems following a pure descriptive approach
- Circularity
- Errors and inconsistencies
- Definitions with omitted genus
- Top dictionary senses do not usually represent useful knowledge for the LKB
  - Too general
  - Too specific

Methodology: Mixed Methodology
- Prescriptive approach: manual construction of the Top Structure
- Descriptive approach: acquiring implicit information from MRDs

Step 1: Selection of the main top beginners

Word sense: zumo_1_1
Attached-to: c_art_subst type.
Definition: líquido que se extrae de las flores, hierbas, frutos, etc. (liquid extracted from flowers, herbs, fruits, etc).

A) Attaching DGILE senses to semantic primitives
  1) First labelling: Conceptual Distance (Rigau 94)
  2) Second labelling: Salient Words (Yarowsky 92)
B) Filtering Process

A.1) First labelling: Conceptual Distance (Agirre et al. 94)
- length of the shortest path
- specificity of the concepts

$$dist(w_1, w_2) = \min_{c_{1i} \in w_1,\; c_{2i} \in w_2} \; \sum_{c_k \in path(c_{1i}, c_{2i})} \frac{1}{depth(c_k)}$$

using:
- WordNet
- Bilingual dictionary
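The conceptual distance formula above can be illustrated with a minimal sketch. The tiny hypernym tree, the function names and the convention that the root has depth 1 are all assumptions made for this example; they are not taken from the slides or from WordNet itself.

```python
from itertools import product

# Hypothetical toy hypernym tree (child -> parent); not real WordNet data.
PARENT = {
    "entity": None,
    "artifact": "entity",
    "building": "artifact",
    "church": "building",
    "monastery": "building",
    "person": "entity",
    "abbot": "person",
}

def ancestors(concept):
    """Chain from a concept up to the root, both included."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def depth(concept):
    """Depth of a concept, counting the root as depth 1 (an assumption)."""
    return len(ancestors(concept))

def path(c1, c2):
    """Concepts on the shortest path from c1 to c2 via their lowest common ancestor."""
    up1, up2 = ancestors(c1), ancestors(c2)
    common = next(c for c in up1 if c in up2)
    left = up1[: up1.index(common) + 1]
    right = up2[: up2.index(common)]
    return left + right[::-1]

def concept_distance(c1, c2):
    """Sum of 1/depth over every concept on the path, so deep (specific) paths are shorter."""
    return sum(1.0 / depth(c) for c in path(c1, c2))

def dist(senses1, senses2):
    """Minimum concept distance over all sense pairs of two words."""
    return min(concept_distance(a, b) for a, b in product(senses1, senses2))
```

With this toy tree, `dist(["church"], ["monastery", "abbot"])` picks the church/monastery pair, whose path stays deep in the hierarchy, over the church/abbot pair, whose path climbs to the root.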

Example:
abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
Label assigned: 06 ARTIFACT

A.1) First labelling (Results)
- 29,205 labelled definitions (31%)
- 61% accuracy at a sense level
- 64% accuracy at a file level

A.2) Second labelling: Salient Words (Yarowsky 92)

$$AR(w, SC) = \Pr(w \mid SC) \, \log_2 \frac{\Pr(w \mid SC)}{\Pr(w)}$$

- Importance
- Local frequency: a salient word appears significantly more often in the corpus of a semantic category than at other points in the whole corpus

A.2) Second labelling (Results)
- biberón_1_1 ARTIFACT 4.8399 Frasco de cristal ... (glass flask ...)
- biberón_1_2 FOOD 7.4443 Leche que contiene este frasco ... (milk contained in that flask ...)
- 86,759 labelled definitions (93%)
- 80% accuracy at a file level
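As a rough illustration of the salient-word labelling, the sketch below computes the association ratio AR(w, SC) from a toy corpus and labels a definition with the category whose words score highest. The corpus, the function names and the choice to sum AR values over the words of a definition are assumptions made for this example.

```python
import math
from collections import Counter

def association_ratios(docs_by_category):
    """AR(w, SC) = Pr(w|SC) * log2(Pr(w|SC) / Pr(w)) for every word seen in a category."""
    per_category = {}
    total = Counter()
    for category, docs in docs_by_category.items():
        counts = Counter(word for doc in docs for word in doc.split())
        per_category[category] = counts
        total.update(counts)
    n_total = sum(total.values())
    ratios = {}
    for category, counts in per_category.items():
        n_category = sum(counts.values())
        for word, count in counts.items():
            p_w_sc = count / n_category
            p_w = total[word] / n_total
            ratios[(word, category)] = p_w_sc * math.log2(p_w_sc / p_w)
    return ratios

def label(definition, ratios, categories):
    """Pick the category with the highest summed AR (summing is an assumption)."""
    words = definition.split()
    scores = {c: sum(ratios.get((w, c), 0.0) for w in words) for c in categories}
    return max(scores, key=scores.get)

# Hypothetical toy corpus: definitions already grouped by semantic category.
CORPUS = {
    "FOOD": ["milk in a flask", "fermented grape juice"],
    "ARTIFACT": ["glass flask", "pot for flowers"],
}
AR = association_ratios(CORPUS)
```

Here "flask" is relatively more frequent in the ARTIFACT definitions than in the corpus as a whole, so its ratio is positive for ARTIFACT and negative for FOOD, mirroring the two senses of biberón in the results.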

B) Filtering process (FOODs): removes all genus terms
- FILTER 1: not FOODs by the bilingual mapping
- FILTER 2: that appear more often as genus in another Semantic Primitive
- FILTER 3: with a low frequency

B) Filtering process (FOOD Results)

                 FILTER 1           FILTER 2
LABEL2           #GT   Accuracy     #GT   Accuracy
LABEL2+F3>9       31   94%           31   100%
LABEL2+F3>8       35   95%           35   100%
LABEL2+F3>7       37   91%           37   95%
LABEL2+F3>6       43   92%           41   94%
LABEL2+F3>5       49   92%           47   92%
LABEL2+F3>4       55   91%           56   91%
LABEL2+F3>3       64   85%           65   87%
LABEL2+F3>2       85   82%           82   83%
LABEL2+F3>1      125   78%          123   82%

Merge approach: Taxonomy Construction

Step 2: Exploiting Genus

Word sense: vino_1_1
Hypernym: zumo_1_1.
Definition: zumo de uvas fermentado. (fermented juice of grapes).

Word sense: rueda_2_1
Hypernym: vino_1_1.
Definition: vino procedente de la región de Rueda (Valladolid). (wine from the region of Rueda).

Step 2: Exploiting Genus
- Genus Sense Identification: 97% accuracy for nouns
- Genus Sense Disambiguation
  - Unrestricted WSD (coverage 100%)
  - Knowledge-based WSD (not supervised)
  - Eight Heuristics (McRoy 92)
  - Combining several lexical resources
  - Combining several methods

Step 2: Exploiting Genus (Results)

                                      Polysemous       Overall
                                      Prec.   Cov.     Prec.   Cov.
Heuristic 1: Monosemous Genus Term    -       -        100%    16%
Heuristic 2: Entry Sense Ordering     70%     100%     75%     100%
Heuristic 3: Explicit Semantic Domain 100%    1%       100%    2%
Heuristic 4: Word Matching            72%     61%      79%     56%
Heuristic 5: Simple Concordance       57%     100%     65%     95%
Heuristic 6: Cooccurrence Vectors     60%     100%     66%     97%
Heuristic 7: Semantic Vectors         58%     99%      63%     94%
Heuristic 8: Conceptual Distance      49%     95%      57%     89%
Sum                                   79%     100%     83%     100%

Step 2: Exploiting Genus (knowledge provided by each heuristic)

                                        Overall
                                        Prec.   Cov.
- Heuristic 1: Monosemous Genus Term    79%     100%
- Heuristic 2: Entry Sense Ordering     72%     100%
- Heuristic 3: Explicit Semantic Domain 82%     98%
- Heuristic 4: Word Matching            81%     100%
- Heuristic 5: Simple Concordance       81%     100%
- Heuristic 6: Cooccurrence Vectors     81%     100%
- Heuristic 7: Semantic Vectors         81%     100%
- Heuristic 8: Conceptual Distance      77%     100%
Sum                                     83%     100%

Step 2: Exploiting Genus (FOOD taxonomies)

FOOD                 [Castellón 93]   F2+F3>9   F2+F3>4
Genus terms          63               33        68
Dictionary senses    392              952       1,242
Levels               6                5         6
Senses in level 1    2                18        48
Senses in level 2    67               490       604
Senses in level 3    88               379       452
Senses in level 4    67               44        65
Senses in level 5    87               21        60
Senses in level 6    6                0         13

- F2+F3>9: 35,099 definitions
- F2+F3>4: 40,754 definitions
- No filters: 111,624 definitions

Step 2: Exploiting Genus (fragment of the resulting taxonomy)

zumo_1_1
  vino_1_1
    ...
    quianti_1_1
    raya_1_8
    requena_1_1
    reserva_1_12
    ribeiro_1_1
    rioja_1_1
    roete_1_1
    rosado_1_3
    rueda_2_1
    sherry_1_1
    tarragona_1_1
    tintilla_1_1
    tintorro_1_1
    toro_3_1
    ...

Merge approach: Mapping Taxonomies

Step 3: Creation of translation links
[Diagram: concepts C1 to C6]

Step 3: Creation of translation links
- Connecting already existing hierarchies
- Relaxation labelling algorithm
- Constraints

Between:
- Spanish taxonomy automatically derived from an MRD (Rigau et al. 98)
- WordNet
using a bilingual MRD

Step 3: Creation of translation links
[Figure: fragment of the Spanish taxonomy (animal, ave, faisán, rapaz) with candidate WordNet synsets from the files Tops, person, artifact, food and animal]

Step 3: Relaxation Labelling algorithm
- Iterative algorithm for function optimisation based on local information
- It can deal with any kind of constraints
- Variables: senses of the taxonomy
- Labels: synsets
- Finds a weight assignment for each possible label for each variable
  - weights for the labels of the same variable add up to one
  - the weight assignment satisfies, to the maximum possible extent, the set of constraints

Step 3: Relaxation Labelling algorithm
1) Start with a random weight assignment
2) Compute the support value for each label of each variable (according to the constraints)
3) Increase the weights of the labels more compatible with the context and decrease those of the less compatible labels
4) If a stopping/convergence criterion is satisfied, stop; otherwise go to step 2

Step 3: Constraints
- Rely on the taxonomy structure
- Coded with three characters XYZ
  - X: Spanish taxonomy, I (immediate) or A (ancestor)
  - Y: English taxonomy, I (immediate) or A (ancestor)
  - Z: Relation, E (hypernym), O (hyponym), B (both)
- Examples: IIE, AAB
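The four steps of the relaxation labelling loop listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the multiplicative update rule, the clamping of support to [-1, 1], the seeded random start, and the toy variables, labels and constraints are all assumptions made for the example.

```python
import random

def relaxation_labelling(labels, constraints, max_iter=100, eps=1e-6, seed=0):
    """labels: {variable: [label, ...]}.
    constraints: list of ((var_a, label_a), (var_b, label_b), compatibility),
    read as "var_a = label_a is supported by var_b = label_b"."""
    rng = random.Random(seed)
    # Step 1: random starting weights, normalised so each variable sums to one.
    weights = {}
    for var, labs in labels.items():
        raw = [rng.random() + 0.1 for _ in labs]
        weights[var] = {lab: r / sum(raw) for lab, r in zip(labs, raw)}
    for _ in range(max_iter):
        # Step 2: support of each label, given the current weights of its context.
        support = {var: dict.fromkeys(labs, 0.0) for var, labs in labels.items()}
        for (va, la), (vb, lb), compatibility in constraints:
            support[va][la] += compatibility * weights[vb][lb]
        # Step 3: raise well-supported labels, lower the others, renormalise.
        new_weights = {}
        for var, labs in labels.items():
            scaled = {
                lab: weights[var][lab] * (1.0 + max(-1.0, min(1.0, support[var][lab])))
                for lab in labs
            }
            total = sum(scaled.values())
            new_weights[var] = (
                {lab: s / total for lab, s in scaled.items()}
                if total > 0 else dict(weights[var])
            )
        change = max(
            abs(new_weights[v][l] - weights[v][l]) for v in labels for l in labels[v]
        )
        weights = new_weights
        # Step 4: stop once the weights no longer move.
        if change < eps:
            break
    return weights

# Hypothetical toy problem: Spanish senses with candidate synsets.
LABELS = {
    "animal": ["animal (Tops)"],
    "ave": ["bird (animal)", "fowl (food)"],
    "rapaz": ["bird_of_prey (animal)", "youngster (person)"],
}
# A hyponym's candidate is supported when its hypernym's candidate fits, and vice versa.
CONSTRAINTS = [
    (("ave", "bird (animal)"), ("animal", "animal (Tops)"), 1.0),
    (("rapaz", "bird_of_prey (animal)"), ("ave", "bird (animal)"), 1.0),
    (("ave", "bird (animal)"), ("rapaz", "bird_of_prey (animal)"), 1.0),
]
RESULT = relaxation_labelling(LABELS, CONSTRAINTS)
```

In this toy setting only the animal readings receive support from the taxonomy structure, so their weights grow at every iteration while the weights of each variable keep adding up to one.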

Merge approach: Mapping Taxonomies

Step 3: Results

Poly             TOK, FOK     TOK, FNOK    total
animal           279 (90%)    30 (91%)     209 (90%)
food             166 (94%)    3 (100%)     169 (94%)
cognition        198 (67%)    27 (90%)     225 (69%)
communication    533 (77%)    40 (97%)     573 (78%)

All              TOK, FOK     TOK, FNOK    total
animal           424 (93%)    62 (95%)     486 (90%)
food             166 (94%)    83 (100%)    249 (96%)
cognition        200 (67%)    245 (90%)    445 (82%)
communication    536 (77%)    234 (97%)    760 (81%)

Step 3: Example
[Figure: piel, marta and visón, each linked to a synset in the substance file]

Expand approach
- Take one WordNet as starting point
- Translate synsets, e.g. English to Basque
- We obtain a structurally similar WordNet in another language, but some of the synsets will be missing
- Use bilingual dictionary:
  maintien n.m. (attitude) bearing; (conservation) maintenance
  1. Keep bilingual senses (Agirre & Rigau 95)
     maintien1: (attitude) bearing
     maintien2: (conservation) maintenance
  2. Produce all translation pairs (Atserias et al. 97)
     maintien - bearing
     maintien - maintenance
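The two ways of reading a bilingual entry can be sketched on the maintien example. The dictionary layout (a headword plus a list of cue/translations pairs) and the function names are assumptions made for this illustration, not the format of any particular MRD.

```python
# Hypothetical in-memory form of the entry
# "maintien n.m. (attitude) bearing; (conservation) maintenance".
MAINTIEN = {
    "headword": "maintien",
    "pos": "n.m.",
    "senses": [
        ("attitude", ["bearing"]),
        ("conservation", ["maintenance"]),
    ],
}

def bilingual_senses(entry):
    """Option 1: keep one numbered sense per subentry, with its cue and translations."""
    head = entry["headword"]
    return [
        (f"{head}{number}", cue, translations)
        for number, (cue, translations) in enumerate(entry["senses"], start=1)
    ]

def translation_pairs(entry):
    """Option 2: flatten the entry into (headword, translation) pairs, dropping the cues."""
    head = entry["headword"]
    return sorted(
        {(head, word) for _, translations in entry["senses"] for word in translations}
    )
```

Option 1 preserves the cue that separates the senses, which later disambiguation can use; option 2 discards it and leaves the choice of synset to the pairing methods.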

Expand approach - produce all pairings
- Used to produce the first version of the nominal part of the Spanish WordNet
- Based on WN 1.5
- Both directions in the bilingual dictionary merged
  - Spanish/English: 19,443 translation pairs
  - English/Spanish: 16,324 translation pairs
  - Harmonized bilingual: 28,131 translation pairs
- Overlap with WordNet: 12,665 nouns (14%)
- Two methods:
  - class methods: consider only pairings
  - conceptual distance methods: consider similarity of synsets

Ten class methods
- Four monosemic criteria
- Four polysemic criteria
- Two hybrid criteria

Three conceptual distance methods
- CD1: using pairwise word co-occurrences
- CD2: using headword and genus
- CD3: using bilingual Spanish entries with multiple translations

Class methods
Four possible configurations for pairs which share either an English word (EW) or a Spanish word (SW): connected graph.
[Figure: the four configurations of links between SW and EW nodes]
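The connected-graph view of the translation pairs can be sketched as below: pairs that share a Spanish or an English word end up in the same group, and each group is then named by its shape. The word pairs are invented, and reading the four configurations as 1:1, 1:M, N:1 and N:M is an assumption, since the slide's diagram is not available here.

```python
from collections import defaultdict

def connected_groups(pairs):
    """Group (SW, EW) pairs into connected components of the bipartite translation graph."""
    neighbours = defaultdict(set)
    for sw, ew in pairs:
        neighbours[("SW", sw)].add(("EW", ew))
        neighbours[("EW", ew)].add(("SW", sw))
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        sws = sorted(word for side, word in component if side == "SW")
        ews = sorted(word for side, word in component if side == "EW")
        groups.append((sws, ews))
    return groups

def configuration(sws, ews):
    """Name a group by how many Spanish and English words it contains."""
    return ("1" if len(sws) == 1 else "N") + ":" + ("1" if len(ews) == 1 else "M")

# Hypothetical translation pairs (Spanish word, English word).
PAIRS = [
    ("vino", "wine"),
    ("banco", "bank"), ("banco", "bench"),
    ("coche", "car"), ("auto", "car"),
    ("casa", "house"), ("casa", "home"), ("hogar", "home"),
]
GROUPS = connected_groups(PAIRS)
```

Each group can then be checked against WordNet: whether the English words in it are monosemous or polysemous decides which of the class criteria would apply.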

Expand approach - produce all pairings

4 monosemous class methods: all English words involved are monosemous in WN
[Figure: configurations M1 to M4, linking Spanish words (SW) to English words (EW) and each EW to its synset]

4 polysemous class methods: at least 1 English word involved is polysemous
[Figure: configurations P1 to P4, linking Spanish words (SW) to English words (EW) and each EW to its synsets (Synset+)]

2 other class methods
- Variant criterion (VC): two synonyms share a single SW
  <..., EW, ..., EW, ...>  SW
- Field criterion (FC): use field indicators in bilingual entry when available
  <..., headword-EW, ..., Ind-EW, ...>  SW

Ten class methods (results)

Criterion   #links   #synsets   #words   %ok
mono1       3697     3583       3697     92
mono2       935      929        661      89
mono3       1863     1158       1863     89
mono4       2688     1328       2063     85
poly1       5121     4887       1992     80
poly2       1450     1426       449      75
poly3       11687    6611       3165     58
poly4       40298    9400       3754     61
Variant     3164     2195       2261     85
Field       510      379        421      78

Conceptual Distance Methods (Agirre et al. 94)
- length of the shortest path
- specificity of the concepts

$$dist(w_1, w_2) = \min_{c_{1i} \in w_1,\; c_{2i} \in w_2} \; \sum_{c_k \in path(c_{1i}, c_{2i})} \frac{1}{depth(c_k)}$$

using:
- WordNet
- Bilingual dictionary

Three conceptual distance methods
- CD1: using pairwise word co-occurrences from monolingual dict.
- CD2: using headword and genus from monolingual def.
- CD3: using bilingual Spanish entries with multiple translations

CD2 example:
abadía_1_2 Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)

Expand approach - produce all pairings

Three conceptual distance methods (results)

Criter.   #links    #synsets   #words    %ok
CD - 1    23,828    11,269     7,283     56
CD - 2    24,739    12,709     10,300    61
CD - 3    4,567     3,089      2,313     75

Keep SW-synset pairs produced by methods with precision above 85%:
- mono1
- mono2
- mono3
- mono4
- variant

But if two different methods propose the same SW-synset pair, it could get better confidence: try pairwise combinations of methods.

Combinations of methods: higher precision in some cases

           method1
           cd1            cd2            cd3            p1             p2
method2    size    %ok    size    %ok    size    %ok    size    %ok    size    %ok
cd2        15736   79     0       0      0       0      0       0      0       0
cd3        1849    85     2401    86     0       0      0       0      0       0
p1         2076    86     2536    88     205     95     0       0      0       0
p2         556     86     592     86     180     95     0       0      0       0
p3         3146    72     3777    75     215     100    77      100    28      77
p4         15105   64     13246   67     3114    77     178     88     78      96

Results
- SpWN v0.1
- BasqueWN v0.1: 2 bilingual dictionaries; apply first 8 class methods only
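The pairwise combination of methods described above amounts to intersecting the sets of SW-synset pairs that two methods propose and measuring precision on the intersection. The sketch below uses invented proposals, invented synset identifiers and an invented set of manually verified pairs; only the idea of intersecting proposals comes from the slides.

```python
from itertools import combinations

def pairwise_intersections(proposals):
    """proposals: {method_name: set of (spanish_word, synset_id) pairs}.
    Returns the non-empty intersections for every pair of methods."""
    shared_by_pair = {}
    for m1, m2 in combinations(sorted(proposals), 2):
        shared = proposals[m1] & proposals[m2]
        if shared:
            shared_by_pair[(m1, m2)] = shared
    return shared_by_pair

def precision(pairs, verified):
    """Percentage of proposed pairs found in a manually verified sample."""
    return 100.0 * len(pairs & verified) / len(pairs) if pairs else 0.0

# Hypothetical proposals from three methods and a hypothetical verified sample.
PROPOSALS = {
    "cd1": {("vino", "n1"), ("banco", "n2"), ("banco", "n3")},
    "cd2": {("vino", "n1"), ("banco", "n3"), ("casa", "n5")},
    "p1": {("banco", "n2")},
}
VERIFIED = {("vino", "n1"), ("banco", "n3")}
COMBINED = pairwise_intersections(PROPOSALS)
```

In this toy data each of cd1 and cd2 alone is right two times out of three, while the pairs they agree on are all correct; the cd1/p1 intersection shows that agreement does not help in every case.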

WNs              #links    #synsets   #word     #CS      #poly links
SpWN v0.0        10,982    7,131      8,396     87.4     1,777
Combination      7,244     5,852      3,939     85.6     2,075
SpWN v0.1        15,535    10,786     9,986     86.4     3,373
BasqueWN v0.1    41,107    23,486     22,166    >80.0    -

Expand approach - bilingual senses
- Smaller experiment with French bilingual dictionary
- Based on WN 1.5
- Keep structure of bilingual dictionary: bilingual senses
  - 21,322 entries, 31,502 subentries (senses)
  - 16,917 nominal subentries
- Disambiguation is possible when:
  1) one of the translation words is monosemous in WordNet
  2) the translation is given by a list of words
  3) a cue in French is provided alongside the translation
  4) a semantic field is provided
- Examples:
  folie 1: n.f. madness
  provision 1: n.f. supply, store
  trésor 2: n.m. (ressources) (comm.) finances

Expand approach - bilingual senses: possible disambiguation case by case

translation not in WordNet              891      5%
unique translation, n senses            6,440    38%
any combination of cases 1,2,3,4        9,586    57%
total                                   16,917   100%

case 1; 1 sense                         5,119    30%
case 2; more than one translation       958      6%
case 3; cue in French                   3,702    22%
case 4; semantic field                  1,365    8%

Expand approach - bilingual senses: disambiguation
- Conceptual Density [Agirre & Rigau 95]: the relatedness of a certain word-sense to the words in the context (cue, other translations and/or semantic field) allows us to select that sense over the others
- Bilingual dictionary + English WordNet

no result                               8,311    53%
result obtained                         7,241    47%
  case 1; 1 sense                       5,119    33%
  case 2; >1 trans                      723      5%
  case 3; cue                           1,399    9%
total                                   15,552   100%

Expand approach - summary
- All pairings
  - coverage and precision produce a good starting point for manual revision
- Bilingual senses
  - keeping bilingual senses might help precision
  - very low coverage

Interface for manual revision
- Client/Server architecture
- Database: EWN design implemented on SQL tables
  - English, Spanish, Catalan and Basque
- Interface: Perl CGIs that access the databases

Conclusions
- Methods to automatically produce preliminary versions
  - methods mainly for nouns
  - need to manually revise
- Merge approach
  - method to produce native hierarchies and word senses
  - trusts lexicographers' hierarchies
  - need to map to ILI in an independent process
- Expand approach
  - method to translate English WN's synsets
  - trusts WN's hierarchies and sense distinctions
  - mapping to ILI for free

Conclusions
- Merge approach
  - manual work:
    - revising and re-organizing the automatic hierarchies (hard)
    - revising automatic mapping (very hard)
  - allows for integration of data from the monolingual dictionary:
    - definition text itself
    - lexico-semantic relations from definitions
- Expand approach
  - manual work:
    - revise proposed translations (fast)
    - review the rest of the synsets (many)
    - include glosses

Conclusions
- Interface to speed up manual work
- Downloadable soon:
  - WN 1.5 in database format
  - Interface
- WordNets can be checked at:
  http://www.lsi.upc.es/~nlp
  http://ixa.si.ehu.es/wei3.html
- These slides will (shortly) be available at:
  http:// ...
  http://www.ji.si.ehu.es/users/eneko
