Morphological and Syntactic Analysis

Computational Morphology and Syntax of Natural Languages
Daniel Zeman
http://ufal.mff.cuni.cz/course/npfl094

NPFL094
  Presentations and talks will be in English
    Unless all students understand Czech
  Questions welcome in both Czech and English
  And I have many examples from Czech

8.10.2010 http://ufal.mff.cuni.cz/~zeman/

Caution
  No class on
    October 28
    November 4
    November 25

22.10.2010 http://ufal.mff.cuni.cz/course/npfl094

Getting Credits
  2-3 smaller tasks
    homework style
    less flexible deadlines
  Alternatively: one larger project
    ask me if interested
    can be combined with your Mgr. (or Bc.) thesis

An Unbalanced Course
  1/3 linguistics, 2/3 tools
  1/3 lab work, 2/3 lectures
  morphology, syntax
  Mostly rule-based
    almost no machine learning
    no neural networks

Outline: Morphology
  Morphemic segmentation
    un + beat + able
  Phonology (morphonology) and orthography
    baby + s = babies
  Inflectional vs. derivational morphology
  Morphological analysis: word form → lemma + morphosyntactic features (tag)
  Tagging (context-aware disambiguation)
  Unsupervised affix detection in corpus
  Mining of word forms from corpus

Morphological Analysis
  Input: word form (token)
  Output: set (possibly empty) of analyses
    an analysis:
      lemma (base form of the lexeme)
      tag (morphological, POS)
        part of speech
        features and their values

MA Example
  Language: Czech
  Input: malými
  Output (only one selected analysis here):
    lemma = malý ("small")
    tag = AAFP71A
      part of speech = AA (adjective / přídavné jméno)
      gender = F (feminine / ženský)
      number = P (plural / množné)
      case = 7 (instrumental / 7. pád)
      degree of comparison = 1 (positive / 1. stupeň)
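To make the tag layout in the Czech example concrete, here is a minimal sketch that splits the compact tag AAFP71A into the fields the slide lists. The table FIELDS and the function decode_tag are names invented for this illustration. The value tables contain only what the slide spells out, and the final character, which the slide does not explain, is returned under a hypothetical key "unlabeled".

```python
# Illustrative decoder for the compact tag shown on the slide (AAFP71A).
# Each entry: (feature name, number of characters, known values).
# Only the values listed on the slide are included; this is an assumption,
# not a full description of any real tagset.
FIELDS = [
    ("part of speech", 2, {"AA": "adjective"}),
    ("gender", 1, {"F": "feminine"}),
    ("number", 1, {"P": "plural"}),
    ("case", 1, {"7": "instrumental"}),
    ("degree of comparison", 1, {"1": "positive"}),
]


def decode_tag(tag):
    """Split a compact tag into named features, left to right."""
    features = {}
    position = 0
    for name, width, values in FIELDS:
        code = tag[position:position + width]
        # Unknown codes are passed through unchanged rather than guessed.
        features[name] = values.get(code, code)
        position += width
    if position < len(tag):
        # Characters the slide does not explain are kept, but not interpreted.
        features["unlabeled"] = tag[position:]
    return features


if __name__ == "__main__":
    for feature, value in decode_tag("AAFP71A").items():
        print(f"{feature} = {value}")
```

Keeping unknown codes as raw strings means the sketch never invents an interpretation for a value it has not been told about.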

MA Example
  Language: English
  Input: flies
  Output:
    lemma 1 = fly-1 (to move in the air)
    tag 1 = VBZ (verb, present tense, 3rd person singular)
    lemma 2 = fly-2 (an insect)
    tag 2 = NNS (noun, plural)
  Output is not disambiguated with respect to context

MA versus Tagging
  By tagging we usually mean context-based disambiguation
  Most taggers employ statistical methods
  Taggers may or may not work on top of MA
    MA may provide readings not known from training
    If a tagged corpus is available but MA is not, a tagger can still be trained on the corpus
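The difference between analysis and tagging can be sketched in a few lines. The example below is only an illustration under stated assumptions: ANALYSES is a toy full-form lexicon holding just the "flies" entry from the slide, and tag_tokens uses a single hand-written rule (prefer a noun reading after a determiner) as a stand-in for the statistical model a real tagger would use. All names are hypothetical.

```python
# Toy full-form lexicon: word form -> list of (lemma, tag) analyses.
# Only the entry from the slide is included.
ANALYSES = {
    "flies": [("fly-1", "VBZ"), ("fly-2", "NNS")],
}

# Assumption: a tiny closed list of determiners used by the toy rule below.
DETERMINERS = {"the", "a", "an"}


def analyze(form):
    """Morphological analysis: return ALL readings, ignoring context."""
    return list(ANALYSES.get(form.lower(), []))


def tag_tokens(tokens):
    """Tagging: pick ONE reading per token using (very crude) context."""
    tagged = []
    for i, token in enumerate(tokens):
        readings = analyze(token)
        if not readings:
            # Unknown to the analyzer: the set of analyses is empty.
            tagged.append((token, None, None))
            continue
        previous = tokens[i - 1].lower() if i > 0 else ""
        nouns = [r for r in readings if r[1].startswith("NN")]
        if previous in DETERMINERS and nouns:
            lemma, tag = nouns[0]
        else:
            lemma, tag = readings[0]
        tagged.append((token, lemma, tag))
    return tagged


if __name__ == "__main__":
    print(analyze("flies"))
    print(tag_tokens(["the", "flies"]))
    print(tag_tokens(["he", "flies"]))
```

The analyzer returns both readings of "flies" every time; only the tagging step commits to one of them.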

Morphemic Segmentation
  Morpheme is the smallest unit of language that conveys some meaning
  Morphemic segmentation = finding morpheme boundaries within words
  Typically part of MA:
    input: closed
    identify the morphemes: close + d
    interpret them: verb (close) + past tense
    output: close + VBD

Morphemic Segmentation
  Sometimes it is useful to know the morphemes even if we cannot interpret them
  Data sparseness, e.g. in machine translation:
    en: city
    cs alignments in parallel corpus: město (nom/acc/voc sg, 42), města (gen sg, nom/acc/voc pl, 40), městě (loc sg, 32), měst (gen pl, 9), městsk- (adj, 7), městem (ins sg, 7), městských (adj, 4), městsk- (adj, 4), městsk- (adj, 2), městu (dat sg, 2), městech (loc pl, 2)
    missing cs: městům (dat pl), městy (ins pl), městského, městskému, městském, městským, městští, městskými, městskou (adj remaining forms)
  Stemming = stripping all morphemes but the stem
    IN: The British players were unbeatable.
    OUT: the Brit play were beat .
  Lemmatization = replacing all words with their lemmas (as with tagging, disambiguation may be assumed)
    OUT: the British player be (un)beatable .

Inflection vs. Derivation
  Derivational morphology:
    New lemma!
    Often (but not always) new part of speech.
  Inflectional morphology:
    Set of forms of one lemma (lexeme)
    The set is called paradigm
  The borderline is sometimes quite fuzzy

Outline: Syntax
  Constituency vs. dependency
  Context-free grammars
  Transition network grammars
  Shallow parsing (chunking)
  Chart parsers
  Dependency parsers (transition-, graph-based)
  Clause boundaries

Dependency Tree [figure]

Phrasal Tree (Penn Treebank) [figure]

Applications of Morphology
  First step before broader NLP applications:
    (Input for (syntactic) parsing)
    (Machine translation)
      Rule-based MT: full-fledged analysis and generation
      Statistical MT: fighting data sparseness
  Finding word boundaries (Chinese, Japanese)
  Dictionaries

Applications of Morphology
  Text-to-speech systems (speech synthesis)
    Morphology affects pronunciation
    English th is normally pronounced /θ/ or /ð/
      However, not in boathouse (boat + house)
    Czech proudit =
      proud + it (stream + INF = "flow")
      pro + ud + it (through + smoke + INF = "smoke thoroughly")
  (Speech recognition)
    Morphology allows for smaller dictionaries
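The boathouse case can be made concrete with a small sketch. It assumes the word has already been segmented into morphemes, and it only decides whether each letter pair "th" lies inside one morpheme (a candidate digraph) or straddles a morpheme boundary (two separate sounds). The function name th_digraphs is invented for this illustration; choosing between /θ/ and /ð/ is left out entirely.

```python
def th_digraphs(morphemes):
    """For every 'th' in the word, report whether it is a true digraph.

    Returns a list of (index, is_digraph) pairs, where index is the
    position of the letter 't' in the concatenated word.
    """
    word = "".join(morphemes)

    # Collect the positions at which a new morpheme starts.
    boundaries = set()
    offset = 0
    for morpheme in morphemes[:-1]:
        offset += len(morpheme)
        boundaries.add(offset)

    result = []
    for i in range(len(word) - 1):
        if word[i:i + 2] == "th":
            # If the 'h' starts a new morpheme, 't' and 'h' belong to
            # different morphemes and are pronounced separately.
            result.append((i, (i + 1) not in boundaries))
    return result


if __name__ == "__main__":
    print(th_digraphs(["boat", "house"]))   # 'th' straddles the boundary
    print(th_digraphs(["bath", "house"]))   # 'th' is inside 'bath'
    print(th_digraphs(["mother"]))          # single morpheme
```

Without the segmentation, "boathouse" and "bathhouse" both contain the same letter pair and a letter-to-sound rule alone cannot tell them apart.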

Applications of Morphology
  Word processing
    Spell checking dictionaries
    Inputting Japanese text
      Two kana syllabic scripts and kanji (Chinese characters)
      Typically, people type in kana and the system converts to kanji whenever necessary
      Disambiguation needed!
      Bound morphemes remain in kana (morpho rules)

Applications of Morphology
  Word processing: find & replace terms
    Czech: kniha (book) → dílo (work)
    knihy → díla, knize → dílu, knihu → dílo, kniho → dílo, knihou → dílem, knih → děl, knihám → dílům, knihách → dílech, knihami → díly
  Document retrieval
    Keywords in query are typically base forms
    The forms in documents are inflected
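Both applications on this slide can be sketched with one small table. The example below is a minimal illustration, not a real morphological tool: FORM_MAP simply lists the form pairs given on the slide, LEMMA_OF maps each inflected form of kniha back to its lemma, and the function names replace_inflected and contains_lemma are invented here. A realistic system would generate these tables from a morphological analyzer and generator instead of listing them by hand.

```python
import re

# Form-to-form pairs taken from the slide (kniha "book" -> dílo "work").
FORM_MAP = {
    "kniha": "dílo", "knihy": "díla", "knize": "dílu", "knihu": "dílo",
    "kniho": "dílo", "knihou": "dílem", "knih": "děl", "knihám": "dílům",
    "knihách": "dílech", "knihami": "díly",
}

# Every listed form is an inflected form of the lemma "kniha".
LEMMA_OF = {form: "kniha" for form in FORM_MAP}


def replace_inflected(text, form_map):
    """Replace each listed word form by the matching form of the new term."""
    # Longest alternatives first; \b keeps us from matching inside words.
    forms = sorted(form_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(f) for f in forms) + r")\b")
    return pattern.sub(lambda match: form_map[match.group(1)], text)


def contains_lemma(text, lemma):
    """Document retrieval: does any token in the text belong to the lemma?"""
    tokens = re.findall(r"\w+", text.lower())
    return any(LEMMA_OF.get(token) == lemma for token in tokens)


if __name__ == "__main__":
    print(replace_inflected("Čtu knihu o knihách.", FORM_MAP))
    print(contains_lemma("Mluvili o knihách.", "kniha"))
    print(contains_lemma("Mluvili o filmech.", "kniha"))
```

The sketch is case-sensitive in replace_inflected and ignores agreement: kniha is feminine and dílo is neuter, so adjectives modifying the replaced noun would also have to change, which is exactly why this task needs morphology and not plain string substitution.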

Morphology-Based Typology
  Isolating languages
    Chinese: gǒu bù ài chī qīngcài = dog not like eat vegetable
  Inflectional languages
    Romance and Slavic languages: Spanish pued+es = poder + present indicative, 2nd person, singular
  Agglutinative languages
    Turkish: çöplüklerimizdekilerdenmiydi = çöp + lük + ler + imiz + de + ki + ler + den + mi + y + di = "was it from those that were in our garbage cans?"
  Polysynthetic languages
    Eskimo languages

Polysynthetic Languages
  Found in Siberia and the Americas
  Intricately compose words of many lexical morphemes that are not easily told apart
    Typically include both subject- and object-verb agreement.
    That's why linguists decided not to separate them orthographically
  Nevertheless, words usually are separated. They are just long
  One long word may cover a whole sentence in other languages
  Chukchi example (Skorik 1962: 102):
    T-ə-meyŋ-ə-levt-pəγt-ə-rkən.
    1.SG.SUBJ-great-head-hurt-PRES.1
    "I have a fierce headache."

Morphological Devices (Overview)
  Affixes (prefixes and suffixes): concatenative morphology
  Compounding
  Infixation
  Circumfixation
  Root and pattern (templatic) morphology
  Reduplication
  Subsegmental morphology
  Zero morphology
  Subtractive morphology

Affixation
  Most common way of inflection and derivation
  Three morpheme types: prefix + radix (stem) + suffix
  en: dog + s = dogs
    plural suffix -s
  de: mach + st = machst
    suffix -st marks present indicative 2nd person singular
  en: un + beat + able
    prefix un- negates the meaning
    suffix -able converts verb to adjective, expressing applicability of the action of the verb to something

Infixation
  Languages of the Philippines, e.g. Bontoc:
    fikas "strong" → f-um+ikas "be strong"
    kilad "red" → k-um+ilad "be red"
  Could be analyzed as prefix to (stem minus the initial consonant)

Circumfixation
  Prefix + suffix act together as one morpheme
  German: legen "lay down" → ge+leg+t "laid down"
  Indonesian: besar "big" → ke+besar+an "bigness"
  Similar, but not the same as Czech superlatives
    nej + mlad + š + í "youngest"
    superlative + stem + comparative + singular nominative

Templatic Morphology
  Semitic languages (Arabic, Hebrew, Amharic)
  Arabic:
    root (usually 3 consonants): ktb "write"
    vowel pattern: aa = active, ui = passive
    template: CVCVC = first verb derivational class (binyan)
    result: katab "write", kutib "be written"
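The way root, vowel pattern and template combine in the Arabic example can be sketched as a simple slot-filling procedure. This is a minimal illustration under the assumptions of the slide (one consonant per C slot, one vowel per V slot); the function name interdigitate is invented here, and real Semitic morphology involves much more than this (gemination, affixes, phonological adjustments).

```python
def interdigitate(root, vowels, template="CVCVC"):
    """Fill C slots with root consonants and V slots with pattern vowels."""
    if template.count("C") != len(root):
        raise ValueError("root length does not match the C slots of the template")
    if template.count("V") != len(vowels):
        raise ValueError("vowel pattern does not match the V slots of the template")

    consonants = iter(root)
    vocalism = iter(vowels)
    segments = []
    for slot in template:
        if slot == "C":
            segments.append(next(consonants))
        elif slot == "V":
            segments.append(next(vocalism))
        else:
            raise ValueError(f"unknown template symbol: {slot!r}")
    return "".join(segments)


if __name__ == "__main__":
    print(interdigitate("ktb", "aa"))  # active
    print(interdigitate("ktb", "ui"))  # passive
```

The root and the vowel pattern are never adjacent substrings of the result, which is what sets this device apart from the concatenative affixation shown earlier.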

Reduplication
  Copy whole stem or part of it
  Indonesian plural:
    orang "man" → orang+orang "men"
  Javanese habitual-repetitive:
    adus → odas+adus "take a bath"
    bali → bola+bali "return"
  Yidin (an Australian language):
    gindalba → gindal+gindalba "lizard"
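Full reduplication, as in the Indonesian plural, is trivial to generate and to check in a general-purpose language. The sketch below covers only exact whole-stem copying; the function names are invented for this illustration, and the partial copy in Yidin and the vowel changes in Javanese are deliberately not handled.

```python
def full_reduplication(stem, separator="+"):
    """Generate a fully reduplicated form, e.g. orang -> orang+orang."""
    return stem + separator + stem


def is_full_reduplication(form):
    """Check whether a form consists of some non-empty stem written twice."""
    half, remainder = divmod(len(form), 2)
    return remainder == 0 and half > 0 and form[:half] == form[half:]


if __name__ == "__main__":
    print(full_reduplication("orang"))
    print(is_full_reduplication("orangorang"))
    print(is_full_reduplication("orangutan"))
```

The check has to compare the second half of the word against the first half, so it must remember a stem of arbitrary length while reading the copy.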

  Productive reduplication (copying an arbitrary stem) cannot be modeled by finite-state automata!

Subsegmental Morphology
  Irish:
    cat (/kat/) = cat (singular)
    cait (/katʲ/) = cats (plural)
  The plural morpheme consists just of one phonological feature (high), resulting in palatalization.

Zero Morphology
  Zero (empty) morpheme, marked sometimes as 0 or ∅
  Czech feminine plural case endings for žena "woman":
    nom: žen+y = ženy
    gen: žen+∅ = žen
    dat: žen+ám = ženám
    acc: žen+y = ženy
    voc: žen+y = ženy
    loc: žen+ách = ženách
    ins: žen+ami = ženami

Subtractive Morphology
  Koasati (a Muskogean language, southeast US):
    singular verb: pitaf+fi+n, plural: pit+li+n
    singular verb: lasap+li+n, plural: las+li+n
  Such examples are rare
  Moreover, one might argue that plural is the base form here

Compounding
  English: maximally two stems written together
  Germanic languages in general favor compounds
  de: Hottentottenpotentatentantenattentäter
    Hottentott + en + Potentat + en + Tante + n + Attentäter
    Hottentot + potentate + aunt + assassin
    "assassin of aunt of potentate of Hottentots"

Recommended Further Reading
  These books may be difficult to obtain from the MFF library. Reading them is not required.
  James Allen: Natural Language Understanding. Benjamin/Cummings, USA, 1995
  Richard Sproat: Morphology and Computation. MIT Press, USA, 1992
  Kenneth R. Beesley, Lauri Karttunen: Finite State Morphology. CSLI Publications, 2003
  Anna Feldman, Jirka Hana: A Resource-Light Approach to Morpho-syntactic Tagging. Rodopi, The Netherlands, 2009
  Daniel Zeman: The World of Tokens, Tags and Trees. ÚFAL, Czechia, 2018
