Computational Morphology and Syntax of Natural Languages
Daniel Zeman
http://ufal.mff.cuni.cz/course/npfl094
[email protected]

NPFL094
  Presentations and talks will be in English
    Unless all students understand Czech
  Questions welcome in both Czech and English
  And I have many examples from Czech

8.10.2010 http://ufal.mff.cuni.cz/~zeman/

Caution
  No class on October 28, November 4, November 25

22.10.2010 http://ufal.mff.cuni.cz/course/npfl094

Getting Credits
  2-3 smaller tasks
    homework style
    less flexible deadlines
  Alternatively: one larger project
    ask me if interested
    can be combined with your Mgr. (or Bc.) thesis

An Unbalanced Course
  1/3 linguistics, 2/3 tools
  1/3 lab work, 2/3 lectures
  morphology, syntax
  Mostly rule-based
    almost no machine learning
    no neural networks

Outline: Morphology
  Morphemic segmentation
    un + beat + able
  Phonology (morphonology) and orthography
    baby + s = babies
  Inflectional vs. derivational morphology
  Morphological analysis: word form → lemma + morphosyntactic features (tag)
  Tagging (context-aware disambiguation)
  Unsupervised affix detection in corpus
  Mining of word forms from corpus

Morphological Analysis
  Input: word form (token)
  Output: set (possibly empty) of analyses
  An analysis:
    lemma (base form of the lexeme)
    tag (morphological, POS)
      part of speech
      features and their values

MA Example
  Language: Czech
  Input: malými
  Output (only one selected analysis here):
    lemma = malý (small)
    tag = AAFP71A
      part of speech = AA (adjective / přídavné jméno)
      gender = F (feminine / ženský)
      number = P (plural / množné)
      case = 7 (instrumental / 7. pád)
      degree of comparison = 1 (positive / 1. stupeň)

MA Example
  Language: English
  Input: flies
  Output:
    lemma 1 = fly-1 (to move in the air)
    tag 1 = VBZ (verb, present tense, 3rd person singular)
    lemma 2 = fly-2 (an insect)
    tag 2 = NNS (noun, plural)
  Output is not disambiguated with respect to context

MA versus Tagging
  By tagging we usually mean context-based disambiguation
  Most taggers employ statistical methods
  Taggers may or may not work on top of MA
    MA may provide readings not known from training
    If a tagged corpus is available but MA is not, a tagger can still be trained on the corpus
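
The input/output contract of morphological analysis can be sketched in a few lines of Python. This is only a minimal illustration, assuming a hand-written lookup table: the names Analysis, LEXICON and analyze are hypothetical, and a real analyzer would use rules or finite-state machinery rather than a list of full forms. The single lexicon entry is the "flies" example from the slide.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    """One reading of a word form: a lemma plus a morphological tag."""
    lemma: str
    tag: str


# Hypothetical toy lexicon; the entry is the English example from the slides.
LEXICON = {
    "flies": [Analysis("fly-1", "VBZ"), Analysis("fly-2", "NNS")],
}


def analyze(word_form: str) -> List[Analysis]:
    """Return all known analyses of a word form, ignoring context.

    The result may be empty (unknown word) or contain several readings;
    choosing among them in context is the job of a tagger.
    """
    return list(LEXICON.get(word_form.lower(), []))


if __name__ == "__main__":
    for reading in analyze("flies"):
        print(reading.lemma, reading.tag)
    print(analyze("blorf"))  # unknown form: empty set of analyses
```

The analyzer returns both readings of "flies" because it sees the word in isolation; a tagger would keep only one of them for a given sentence.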

Morphemic Segmentation
  A morpheme is the smallest unit of language that conveys some meaning
  Morphemic segmentation = finding morpheme boundaries within words
  Typically part of MA:
    input: closed
    identify the morphemes: close + d
    interpret them: verb (close) + past tense
    output: close + VBD

Morphemic Segmentation
  Sometimes it is useful to know the morphemes even if we cannot interpret them
  Data sparseness, e.g. in machine translation:
    en: city
    cs alignments in parallel corpus (form, analysis, count):
      město (nom/acc/voc sg, 42)
      města (gen sg, nom/acc/voc pl, 40)
      městě (loc sg, 32)
      měst (gen pl, 9)
      městem (ins sg, 7)
      městských (adj, 4)
      městsk-ý/-á/-é (adj; three distinct forms with counts 7, 4, 2)
      městu (dat sg, 2)
      městech (loc pl, 2)
    missing cs:
      městům (dat pl), městy (ins pl)
      městského, městskému, městském, městským, městští, městskými, městskou (adj remaining forms)
  Stemming = stripping all morphemes but the stem
    IN: The British players were unbeatable.
    OUT: the Brit play were beat .
  Lemmatization = replacing all words with their lemmas (as with tagging, disambiguation may be assumed)
    OUT: the British player be (un)beatable .

Inflection vs. Derivation
  Derivational morphology:
    New lemma!
    Often (but not always) new part of speech.
  Inflectional morphology:
    Set of forms of one lemma (lexeme)
    The set is called a paradigm
  The borderline is sometimes quite fuzzy

Outline: Syntax

Dependency Tree

Phrasal Tree (Penn Treebank)

Applications of Morphology
  First step before broader NLP applications:
    (Input for (syntactic) parsing)
    (Machine translation)
      Rule-based MT: full-fledged analysis and generation
      Statistical MT: fighting data sparseness
  Finding word boundaries (Chinese, Japanese)
  Dictionaries

Applications of Morphology
  Text-to-speech systems (speech synthesis)
    Morphology affects pronunciation
    English th is normally pronounced /θ/ or /ð/
      However, not in boathouse (boat + house)
    Czech proudit =
      proud + it (stream + INF = flow)
      pro + ud + it (through + smoke + INF = smoke thoroughly)
  (Speech recognition)
    Morphology allows for smaller dictionaries
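
The boathouse example can be made concrete with a small Python sketch. It assumes the word has already been segmented into morphemes and only checks whether each written "th" falls inside one morpheme or straddles a boundary. The function name th_occurrences is hypothetical, and this is not a pronunciation model: it merely flags where the usual digraph reading is plausible.

```python
from typing import List, Tuple


def th_occurrences(morphemes: List[str]) -> List[Tuple[int, bool]]:
    """Locate every written 'th' in a segmented word.

    Returns (position, is_digraph) pairs. is_digraph is False when the
    't' ends one morpheme and the 'h' starts the next one.
    """
    word = "".join(morphemes)

    # Character offsets at which a morpheme starts.
    starts = set()
    offset = 0
    for morpheme in morphemes:
        starts.add(offset)
        offset += len(morpheme)

    result = []
    for i in range(len(word) - 1):
        if word[i:i + 2] == "th":
            # The 'h' sits at i + 1; if a morpheme starts there,
            # the two letters belong to different morphemes.
            result.append((i, (i + 1) not in starts))
    return result


if __name__ == "__main__":
    print(th_occurrences(["boat", "house"]))  # 'th' straddles the boundary
    print(th_occurrences(["bath", "s"]))      # 'th' inside one morpheme
```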

Applications of Morphology
  Word processing
    Spell checking dictionaries
    Inputting Japanese text
      Two kana syllabic scripts and kanji (Chinese characters)
      Typically, people type in kana and the system converts to kanji whenever necessary
      Disambiguation needed!
      Bound morphemes remain in kana (morpho rules)

Applications of Morphology
  Word processing: find & replace terms
    Czech: kniha (book) → dílo (work)
      knihy → díla, knize → dílu, knihu → dílo, kniho → dílo, knihou → dílem,
      knih → děl, knihám → dílům, knihách → dílech, knihami → díly
  Document retrieval
    Keywords in query are typically base forms
    The forms in documents are inflected
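
A morphology-aware find & replace can be sketched as follows. This is a minimal illustration that simply hard-codes the form-to-form table from the slide; the names FORM_MAP and replace_inflected are hypothetical. A real tool would generate the table from the paradigms of both lemmas, and it would also have to adjust agreeing words (kniha is feminine, dílo is neuter), which this sketch does not attempt.

```python
import re

# Form-to-form table copied from the slide: kniha (book) -> dílo (work).
FORM_MAP = {
    "kniha": "dílo",
    "knihy": "díla",
    "knize": "dílu",
    "knihu": "dílo",
    "kniho": "dílo",
    "knihou": "dílem",
    "knih": "děl",
    "knihám": "dílům",
    "knihách": "dílech",
    "knihami": "díly",
}

# Match whole words only; longer forms are listed first for readability.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(FORM_MAP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def replace_inflected(text: str) -> str:
    """Replace every inflected form of 'kniha' by the matching form of 'dílo'."""

    def substitute(match):
        source = match.group(0)
        target = FORM_MAP[source.lower()]
        # Keep sentence-initial capitalization.
        return target.capitalize() if source[0].isupper() else target

    return _PATTERN.sub(substitute, text)


if __name__ == "__main__":
    print(replace_inflected("Čtu knihu o knihách."))
```

A plain string replace of "kniha" by "dílo" would miss every inflected form except the nominative singular, which is the point the slide makes.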

Morphology-Based Typology
  Isolating languages
    Chinese: gǒu bú ài chī qīngcài = dog not like eat vegetable
  Inflectional languages
    Romance and Slavic languages: Spanish pued+es = poder + present indicative, 2nd person, singular
  Agglutinative languages
    Turkish: çöplüklerimizdekilerdenmiydi = çöp + lük + ler + imiz + de + ki + ler + den + mi + y + di
      = was it from those that were in our garbage cans?
  Polysynthetic languages
    Eskimo languages

Polysynthetic Languages
  Found in Siberia and the Americas
  Intricately compose words of many lexical morphemes that are not easily told apart
    That's why linguists decided not to separate them orthographically
  Typically include both subject- and object-verb agreement
  Nevertheless, words usually are separated. They are just long
    One long word may cover a whole sentence in other languages
  Chukchi example (Skorik 1962: 102):
    T-ə-meyŋ-ə-levt-pəγt-ə-rkən.
    1.SG.SUBJ-great-head-hurt-PRES.1
    I have a fierce headache.

Morphological Devices (Overview)
  Affixes (prefixes and suffixes): concatenative morphology
  Compounding
  Infixation
  Circumfixation
  Root and pattern (templatic) morphology
  Reduplication
  Subsegmental morphology
  Zero morphology
  Subtractive morphology

Affixation
  Most common way of inflection and derivation
  Three morpheme types: prefix + radix (stem) + suffix
  en: dog + s = dogs
    plural suffix -s
  de: mach + st = machst
    suffix -st marks present indicative 2nd person singular
  en: un + beat + able
    prefix un- negates the meaning
    suffix -able converts verb to adjective, expressing applicability of the action of the verb to something

Infixation
  Languages of the Philippines, e.g. Bontoc:
    fikas (strong) → f-um+ikas (be strong)
    kilad (red) → k-um+ilad (be red)
  Could be analyzed as a prefix to (stem minus the initial consonant)

Circumfixation
  Prefix + suffix act together as one morpheme
  German: legen (lay down) → ge+leg+t (laid down)
  Indonesian: besar (big) → ke+besar+an (bigness)
  Similar, but not the same as Czech superlatives
    nej + mlad + š + í (youngest)
    superlative + stem + comparative + singular nominative

Templatic Morphology
  Semitic languages (Arabic, Hebrew, Amharic)
  Arabic:
    root (usually 3 consonants): ktb (write)
    vowel pattern: aa = active, ui = passive
    template: CVCVC = first verb derivational class (binyan)
    result: katab (write), kutib (be written)
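
The root-and-pattern mechanism can be illustrated with a short Python sketch that fills the C slots of a template with root consonants and the V slots with pattern vowels, left to right. The function name interdigitate is hypothetical, and the sketch covers only the simple one-to-one case shown on the slide, not spreading or gemination found in other binyanim.

```python
def interdigitate(root: str, vowels: str, template: str = "CVCVC") -> str:
    """Combine a consonantal root and a vowel pattern according to a CV template.

    Each 'C' in the template takes the next root consonant and each 'V'
    takes the next pattern vowel.
    """
    if template.count("C") != len(root) or template.count("V") != len(vowels):
        raise ValueError("root or vowel pattern does not fit the template")

    consonant_iter = iter(root)
    vowel_iter = iter(vowels)
    segments = []
    for slot in template:
        if slot == "C":
            segments.append(next(consonant_iter))
        elif slot == "V":
            segments.append(next(vowel_iter))
        else:
            raise ValueError("template may contain only 'C' and 'V'")
    return "".join(segments)


if __name__ == "__main__":
    print(interdigitate("ktb", "aa"))  # active
    print(interdigitate("ktb", "ui"))  # passive
```

With the root ktb this reproduces the two forms from the slide, katab and kutib.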

Reduplication
  Copy whole stem or part of it
  Indonesian plural:
    orang (man) → orang+orang (men)
  Javanese habitual-repetitive:
    adus → odas+adus (take a bath)
    bali → bola+bali (return)
  Yidin (an Australian language):
    gindalba → gindal+gindalba (lizard)
  Unbounded reduplication (copying stems of arbitrary length) cannot be modeled by finite-state automata!

Subsegmental Morphology
  Irish:
    cat (/kat/) = cat (singular)
    cait (/katʲ/) = cats (plural)
  The plural morpheme consists of just one phonological feature (high), resulting in palatalization.

Zero Morphology
  Zero (empty) morpheme, sometimes marked as 0 or ∅
  Czech feminine plural case endings for žena (woman):
    nom žen+y, gen žen+∅, dat žen+ám, acc žen+y, voc žen+y, loc žen+ách, ins žen+ami

Subtractive Morphology
  singular verb: pitaf+fi+n → plural: pit+li+n
  singular verb: lasap+li+n → plural: las+li+n
  Such examples are rare
  Moreover, one might argue that plural is the base form here

Compounding
  English: maximally two stems written together
  Germanic languages in general favor compounds
  de: Hottentottenpotentatentantenattentäter
    Hottentott + en + Potentat + en + Tante + n + Attentäter
    Hottentot + potentate + aunt + assassin
    assassin of aunt of potentate of Hottentots

Recommended Further Reading
  These books may be difficult to obtain from the MFF library. Reading them is not required.
  James Allen: Natural Language Understanding. Benjamin/Cummings, USA, 1995
  Richard Sproat: Morphology and Computation. MIT Press, USA, 1992
  Kenneth R. Beesley, Lauri Karttunen: Finite State Morphology. CSLI Publications, 2003
  Anna Feldman, Jirka Hana: A Resource-Light Approach to Morpho-syntactic Tagging. Rodopi, The Netherlands, 2009
  Daniel Zeman: The World of Tokens, Tags and Trees. ÚFAL, Czechia, 2018