Parts of Speech Part 2 - KFUPM

ICS 482 Natural Language Processing
Lecture 10: Parts of Speech Part 2
Husni Al-Muhtaseb

NLP Credits and Acknowledgment
These slides were adapted from presentations by the authors of the book

SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with some modifications from presentations found on the Web by several scholars, including the following. If your name is missing, please contact me: muhtaseb at kfupm.edu.sa

Husni Al-Muhtaseb, James Martin, Jim Martin, Dan Jurafsky, Sandiway Fong

Song Young In, Paula Matuszek, Mary-Angela Papalaskari, Dick Crouch, Tracy Kin, L. Venkata Subramaniam, Martin Volk, Bruce R. Maxim, Jan Hajič, Srinath Srinivasa, Simeon Ntafos, Paolo Pirjanian, Ricardo Vilalta, Tom Lenaerts, Heshaam Feili, Björn Gambäck, Christian Korthals, Thomas G. Dietterich, Devika Subramanian, Duminda Wijesekera, Lee McCluskey, David J. Kriegman

Kathleen McKeown, Michael J. Ciaraldi, David Finkel, Min-Yen Kan, Andreas Geyer-Schulz, Franz J. Kurfess, Tim Finin, Nadjet Bouayad, Kathy McCoy, Hans Uszkoreit, Azadeh Maghsoodi, Khurshid Ahmad, Martha Palmer, Julia Hirschberg, Staffan Larsson, Elaine Rich, Robert Wilensky, Christof Monz, Feiyu Xu, Bonnie J. Dorr, Nizar Habash, Jakub Piskorski

Massimo Poesio, Rohini Srihari, David Goss-Grubbs, Mark Sanderson, Thomas K. Harris, John Hutchins, Andrew Elks, Alexandros Potamianos, Marc Davis, Ray Larson, Mike Rosner, Latifa Al-Sulaiti, Jimmy Lin, Giorgio Satta, Marti Hearst, Jerry R. Hobbs, Andrew McCallum, Christopher Manning, Hinrich Schütze, Nick Kushmerick, Mark Craven, Alexander Gelbukh, Gina-Anne Levow, Chia-Hui Chang

Guitao Gao, Diana Maynard, Qing Ma, James Allan, Zeynep Altan

Previous Lectures
Pre-start questionnaire

Introduction and phases of an NLP system
NLP applications: chatting with Alice
Finite state automata, regular expressions, and regular languages
Deterministic and non-deterministic FSAs
Morphology: inflectional and derivational
Parsing and finite state transducers
Stemming and the Porter stemmer
20-minute quiz
Statistical NLP: language modeling and N-grams
Smoothing N-grams: add-one and Witten-Bell
Return of Quiz 1
Parts of speech

Today's Lecture
Continue with parts of speech
Arabic parts of speech

Parts of Speech
Start with eight basic categories: noun, verb, preposition, pronoun, adjective, adverb, article, conjunction.
These categories are based on morphological and distributional properties (not semantics).
Some cases are easy; others are not.

Parts of Speech: Closed Classes
Prepositions: on, under, over, near, by, at, from, to, with, etc.
Determiners: a, an, the, etc.
Pronouns: she, who, I, others, etc.
Conjunctions: and, but, or, as, if, when, etc.
Auxiliary verbs: can, may, should, are, etc.
Particles: up, down, on, off, in, out, at, by, etc.

Open classes: nouns, verbs, adjectives, adverbs.

Sets of Parts of Speech: Tagsets

There are various standard tagsets to choose from; some have many more tags than others.
The choice of tagset depends on the application.
Accurate tagging can be done even with large tagsets.

Some Known Tagsets (English)
Brown corpus: 87 tags
Penn Treebank: 45 tags
Lancaster UCREL C5: 61 tags
Lancaster C7: 145 tags

Some of Penn Treebank tags

Verb inflection tags
The entire Penn Treebank tagset
UCREL C5

Tagging
Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence. Assume we have:
A tagset
A dictionary that gives the possible set of tags for each entry
A text to be tagged
A reason?

POS Tagging: Definition
The process of assigning a part-of-speech or lexical-class marker to each word in a corpus:
the/DET driver/N put/V the/DET keys/N on/P the/DET table/N

Tag Ambiguity (updated)
Brown corpus word types by degree of ambiguity:

                        87-tagset        45-tagset
Unambiguous (1 tag)     44,019           38,857
Ambiguous (2+ tags)      5,490            8,844
  2 tags                 4,967            6,731
  3 tags                   411            1,621
  4 tags                    91              357
  5 tags                    17               90
  6 tags          2 (well, beat)             32
  7 tags        2 (still, down)   6 (well, set, round, open, fit, down)
  8 tags                     -    4 ('s, half, back, a)
  9 tags                     -    3 (that, more, in)

Most words are unambiguous, but many of the most common English words are ambiguous.

Tagging: Three Methods
Rules
Probabilities (stochastic)
Transformation-based: sort of both

Rule-based Tagging

Use a dictionary (lexicon) to assign each word a list of potential parts of speech.
Use large lists of hand-written disambiguation rules to identify a single POS for each word.
Example rule: NP -> Det (Adj*) N, as in "the clever student".

Probabilities: Tagging with Lexical Frequencies
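A lexical-frequency baseline simply gives each word its most frequent tag. A minimal sketch, using the Switchboard estimates for "race" quoted below; the other entries and the fallback are invented for illustration:

```python
# Baseline: tag each word with its most frequent tag, ignoring context.
# The "race" figures are the Switchboard estimates quoted in these slides;
# the rest of this toy frequency table is invented.
FREQ = {
    "race": {"NN": 0.00041, "VB": 0.00003},
    "to": {"TO": 1.0},
    "the": {"DT": 1.0},
}

def most_frequent_tag(word):
    tags = FREQ.get(word, {"NN": 1.0})  # unknown words default to noun
    return max(tags, key=tags.get)

print(most_frequent_tag("race"))  # NN
```

Note that this baseline tags "race" as NN everywhere; the bigram approach below corrects it to VB after "to".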

Sami is expected to race tomorrow.
Sami/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People continue to inquire the reason for the race for outer space.
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to "race" given its lexical frequency.
Solution: choose the tag with the greater probability, comparing P(race|VB) and P(race|NN).
Actual estimates from the Switchboard corpus: P(race|NN) = .00041, P(race|VB) = .00003.

Transformation-based: The Brill Tagger
An example of transformation-based learning. Very popular (freely available, works fairly well). A SUPERVISED method: it requires a tagged corpus.
Basic idea: do a quick job first (using frequency), then revise it using contextual rules.

An Example
Examples: "It is expected to race tomorrow." "The race for outer space."
Tagging algorithm:
1. Tag all uses of "race" as NN (the most likely tag in the Brown corpus):
It is expected to race/NN tomorrow; the race/NN for outer space.
2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO:
It is expected to race/VB tomorrow; the race/NN for outer space.

Stochastic (Probabilities)
Simple approach

N-gram approach: the best tag for a given word is determined by the probability that it occurs with the n previous tags.
Viterbi algorithm: trim the search for the most probable tag sequence, keeping the best N paths (N is the number of tags of the following word).
Maximum likelihood estimates: disambiguate words based on the probability that a word occurs with a particular tag.
Hidden Markov Model: combines the above approaches.

Viterbi Maximum Likelihood Estimates
We want the most likely path through a graph of candidate tags.
[Figure: tag lattice for "the can will rust"; each word has candidate tags among DT, noun, aux, verb]
[Figure: tag lattice with states S1-S5 for "promised to back the bill"; candidate tags include NNP, VBN, VBD, TO, VB, RB, NN, JJ, DT]

Viterbi Maximum Likelihood Estimates
We want the best sequence of tags for a sequence of words (a sentence).
W is a sequence of words: W = w1 w2 w3 ... wn
T is a sequence of tags: T = t1 t2 t3 ... tn
We seek argmax_T P(T|W) = argmax_T P(W|T) P(T) / P(W).
Since P(W) is common to all candidate tag sequences, this equals argmax_T P(W|T) P(T).

Stochastic POS Tagging: Example
1) Sami is expected to race tomorrow.
2) People continue to inquire the reason for the race for outer space.

Example: suppose wi = race; is it a verb (VB) or a noun (NN)?
Assume some other mechanism has already tagged the surrounding words, leaving only "race" untagged:
1) Sami/NNP is/VBZ expected/VBN to/TO race/? tomorrow/NN
2) People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/? for/IN outer/JJ space/NN
Bigram simplification of the problem:

to/TO race/??? vs. the/DT race/???
ti = argmax_j P(tj | ti-1) P(wi | tj)
Compare P(VB|TO) P(race|VB) with P(NN|TO) P(race|NN).

Where is the data? Look at the Brown and Switchboard corpora:
P(NN|TO) = 0.021, P(VB|TO) = 0.34
If we are expecting a verb, how likely would it be "race"?
P(race|NN) = 0.00041, P(race|VB) = 0.00003
Finally:
P(NN|TO) P(race|NN) = 0.000007
P(VB|TO) P(race|VB) = 0.00001
So the verb reading wins after "to".

Example: Bigram of Tags from a Corpus
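Both the race comparison above and the estimates in the table below reduce to a few lines of arithmetic. A minimal sketch, using only the counts and probabilities quoted in these slides:

```python
# Contextual (bigram) estimates from pair counts, as in the table below:
# Prob(t2 | t1) = count(t1, t2) / count(t1).
count_tag = {"N": 833, "V": 300, "ART": 558, "P": 307}
count_pair = {("N", "V"): 358, ("N", "N"): 108, ("N", "P"): 366,
              ("V", "N"): 75, ("V", "ART"): 194,
              ("P", "ART"): 226, ("P", "N"): 81}

def contextual(t2, t1):
    """P(t2 | t1) estimated as count(t1, t2) / count(t1)."""
    return count_pair.get((t1, t2), 0) / count_tag[t1]

print(round(contextual("V", "N"), 2))    # 0.43
print(round(contextual("ART", "V"), 2))  # 0.65

# The race/TO comparison above: choose argmax_j P(t_j | TO) * P(race | t_j).
p_tag_given_to = {"NN": 0.021, "VB": 0.34}
p_race_given_tag = {"NN": 0.00041, "VB": 0.00003}
best = max(p_tag_given_to, key=lambda t: p_tag_given_to[t] * p_race_given_tag[t])
print(best)  # VB
```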

Cat    Count at i    Pair       Count at i,i+1    Bigram estimate
0      300           0, ART     213               Prob(ART|0) = 0.71
0      300           0, N        87               Prob(N|0)   = 0.29
ART    558           ART, N     558               Prob(N|ART) = 1
N      833           N, V       358               Prob(V|N)   = 0.43
N      833           N, N       108               Prob(N|N)   = 0.13
N      833           N, P       366               Prob(P|N)   = 0.44
V      300           V, N        75               Prob(N|V)   = 0.35
V      300           V, ART     194               Prob(ART|V) = 0.65
P      307           P, ART     226               Prob(ART|P) = 0.74
P      307           P, N        81               Prob(N|P)   = 0.26

A Markov Chain
[Figure: Markov chain over the states ART, N, V, P with the transition probabilities above; assume 0.0001 for any unseen bigram]

Word Counts
Word       N     V     ART   P     Total
flies      21    23    0     0     44
fruit      49    5     1     0     55
like       10    30    0     21    61
a          1     0     201   0     202
the        1     0     300   2     303
flower     53    15    0     0     68
flowers    42    16    0     0     58
birds      64    1     0     0     65
others     592   210   56    284   1142
Total      833   300   558   307   1998

Computing Probabilities using the previous Tables
P(the|ART) = 300/558 = 0.54

P(flies|N) = 0.025, P(flies|V) = 0.076
P(like|V) = 0.1, P(like|P) = 0.068, P(like|N) = 0.012
P(a|ART) = 0.360, P(a|N) = 0.001
P(flower|N) = 0.063, P(flower|V) = 0.05
P(birds|N) = 0.076

Viterbi Algorithm - Example
Sentence: "Flies like a flower". Assume 0.0001 for any unseen bigram.
Iteration 1 (Flies): flies/N = 0.00725, flies/V = 7.6*10^-6, flies/P = 0, flies/ART = 0
Iteration 2 (like): like/V = 0.00031, like/P = 0.00022, like/N = 1.3*10^-5, like/ART = 0
Iteration 3 (a): a/ART = 7.2*10^-5, a/N = 1.2*10^-7, a/V = 0, a/P = 0
Iteration 4 (flower): flower/N = 4.3*10^-6, flower/V = 2.6*10^-9, flower/P = 0, flower/ART = 0
Best path: Flies/N like/V a/ART flower/N.

Performance
This method has achieved 95-96% accuracy with reasonably complex English tagsets and reasonable amounts of hand-tagged training data.
Forward pointer: it is also possible to train a system without hand-labeled training data.

How accurate are they?
POS taggers boast accuracy rates of 95-99%, varying with the text type/genre of the pre-tagged corpus and of the text to be tagged.
Worst-case scenario: assume a per-word success rate of 95%.
Prob(one-word sentence correct) = .95
Prob(two-word sentence correct) = .95 * .95 = 90.25%
Prob(ten-word sentence correct) = 59% approx

End of Part 1

Natural Language Processing Lecture 10: Parts of Speech 2-2
Morphosyntactic Tagset of Arabic
Husni Al-Muhtaseb

Shereen Khoja's tagset: 177 tags
103 Nouns
57 Verbs
9 Particles
7 Residual
1 Punctuation

Three genders:

Masculine, Feminine, Neuter

Three persons: the speaker (first), the person being addressed (second), the person not present (third)
Three numbers: Singular, Dual, Plural

Three moods of the verb: Indicative, Subjunctive, Jussive
Three case forms of the noun: Nominative, Accusative, Genitive

The tagset hierarchy:
Word: Noun, Verb, Particle, Residual, Punctuation
Noun: Common, Proper, Pronoun, Numeral, Adjective
Pronoun: Personal, Relative, Demonstrative
Relative pronoun: Specific, Common
Numeral: Cardinal, Ordinal, Numerical Adjective
Verb: Perfect, Imperfect, Imperative
Particle: Prepositions, Adverbial, Conjunctions, Interjections, Exceptions, Negatives, Answers, Explanations, Subordinates
Residual: Foreign, Mathematical Formulae, Numerals
Punctuation: Question Mark, Exclamation Mark, Comma, etc.

Arabic POS Tagger
[Figure: system overview. An untagged Arabic corpus is hand-tagged (ManTag) to produce a training corpus; DataExtract derives lexicons and a probability matrix from the tagged corpus; APT uses them to tag plain Arabic text]

DataExtract Process
Takes in a tagged corpus and extracts various lexicons and the probability matrix:
a lexicon that includes all clitics (Sproat (1992) defines a clitic as a syntactically separate word that functions phonologically as an affix), and
a lexicon that removes all clitics before adding the word.
DataExtract produces a probability matrix for various levels of the tagset:
Lexical probability: the probability of a word having a certain tag
Contextual probability: the probability of a tag following another tag
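A sketch of what such an extraction step computes. The toy tagged corpus and all names here are hypothetical, not Khoja's actual code or data:

```python
from collections import Counter

# Toy tagged corpus: (word, tag) pairs per sentence. Hypothetical data,
# only to illustrate the two probability types DataExtract produces.
corpus = [[("kataba", "V"), ("alwaladu", "N"), ("darsan", "N")],
          [("fi", "P"), ("albayti", "N"), ("rajulun", "N")]]

tag_count, word_tag, tag_bigram = Counter(), Counter(), Counter()
for sent in corpus:
    prev = None
    for word, tag in sent:
        tag_count[tag] += 1
        word_tag[(word, tag)] += 1
        if prev is not None:
            tag_bigram[(prev, tag)] += 1
        prev = tag

def lexical(word, tag):
    """Lexical probability: P(word | tag)."""
    return word_tag[(word, tag)] / tag_count[tag]

def contextual(tag, prev_tag):
    """Contextual probability: P(tag | previous tag)."""
    total = sum(c for (p, _), c in tag_bigram.items() if p == prev_tag)
    return tag_bigram[(prev_tag, tag)] / total if total else 0.0

print(contextual("N", "V"))  # 1.0: in this toy corpus, V is always followed by N
print(lexical("fi", "P"))    # 1.0
```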

DataExtract Process: contextual probability matrix over the five top-level tags (rows: preceding tag; columns: following tag)

       N       V       P       No.     Pu.
N      0.711   0.065   0.143   0.010   0.071
V      0.926   0.037   0.0     0.008   0.029
P      0.689   0.199   0.085   0.016   0.011
No.    0.509   0.06    0.098   0.009   0.324
Pu.    0.492   0.159   0.152   0.046   0.151

Arabic Corpora
59,040 words of the Saudi al-Jazirah newspaper, dated 03/03/1999
3,104 words of the Egyptian al-Ahram newspaper, dated 25/01/2000
5,811 words of the Qatari al-Bayan newspaper, dated 25/01/2000
17,204 words of al-Mishkat, an Egyptian published paper in social science, April 1999

APT: Arabic Part-of-speech Tagger
[Figure: pipeline. Arabic words go through LexiconLookup; words with multiple tags pass to the Stemmer and then a StatisticalComponent, yielding words with a unique tag]
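A hedged sketch of that pipeline: lexicon lookup first, and only ambiguous words go on to the stemmer and the statistical component. The lexicons, the vowel-stripping "stemmer", and the probabilities are toy placeholders, not APT's actual implementation:

```python
# Toy APT-style flow. Every name and number here is illustrative only.
LEXICON = {"kitab": ["N"], "min": ["P", "N"]}   # toy full-form lexicon
STEM_LEXICON = {"ktb": ["N", "V"]}              # toy stem lexicon
CONTEXTUAL = {("N", "P"): 0.6, ("N", "N"): 0.4} # toy P(tag | previous tag)

def apt_tag(words):
    tags, prev = [], "N"  # assume a noun-like start context
    for w in words:
        candidates = LEXICON.get(w)
        if candidates is None:
            # Fallback "stemmer": strip short vowels, look up the stem.
            stem = "".join(c for c in w if c not in "aiu")
            candidates = STEM_LEXICON.get(stem, ["N"])
        if len(candidates) > 1:
            # Statistical component: keep the tag most likely after prev.
            candidates = [max(candidates,
                              key=lambda t: CONTEXTUAL.get((prev, t), 0.0))]
        prev = candidates[0]
        tags.append(prev)
    return tags

print(apt_tag(["kitab", "min"]))  # ['N', 'P']
```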

The five main categories:
1. N [noun]
2. V [verb]
3. P [particle]
4. R [residual]
5. PU [punctuation]: all punctuation marks

Noun subcategories:
1.1 C [common]
1.2 P [proper]
1.3 Pr [pronoun]
1.4 Nu [numeral]
1.5 A [adjective]

Examples:
Singular, masculine, accusative, common noun
Singular, masculine, genitive, common noun
Singular, feminine, nominative, common noun

Pronoun subcategories:
1.3.1 P [personal]: detached words, or attached to a word (to nouns to indicate possession, to verbs as a direct object, and to prepositions)
1.3.2 R [relative]
1.3.3 D [demonstrative]

Examples:
Third person, singular, masculine, personal pronoun
Singular, feminine, demonstrative pronoun

Relative pronoun subcategories:
1.3.2.1 S [specific]
1.3.2.2 C [common]

Examples:
Dual, feminine, specific, relative pronoun
Plural, masculine, specific, relative pronoun
Common, relative pronoun

Numeral subcategories:
1.4.1 Ca [cardinal]
1.4.2 O [ordinal]
1.4.3 Na [numerical adjective]

Examples:
Singular, masculine, nominative, indefinite cardinal number
Singular, masculine, nominative, indefinite ordinal number
Singular, masculine, numerical adjective

Nominal attributes used:
Gender: M [masculine], F [feminine], N [neuter]
Person: 1 [first], 2 [second], 3 [third]
Number: Sg [singular], Du [dual], Pl [plural]
Case: N [nominative], A [accusative], G [genitive]
Definiteness: D [definite], I [indefinite]

Verbs:
1. P [perfect]
2. I [imperfect]
3. Iv [imperative]

Examples:
First person, singular, neuter, perfect verb
First person, singular, neuter, indicative, imperfect verb
Second person, singular, masculine, imperative verb

Verbal Attributes Used
Gender: M [masculine], F [feminine], N [neuter]
Number: Sg [singular], Du [dual], Pl [plural]
Person: 1 [first], 2 [second], 3 [third]
Mood: I [indicative], S [subjunctive], J [jussive]

Particle subcategories:
Pr [prepositions], A [adverbial], C [conjunctions], I [interjections], E [exceptions], N [negatives], A [answers], X [explanations], S [subordinates]

Examples (English gloss of the Arabic particle):
Prepositions: in
Adverbial particles: shall
Conjunctions: and
Interjections: you
Exceptions: except
Negatives: not
Answers: yes
Explanations: that is
Subordinates: if

Parts of Speech

1. Nouns (N)
Attributes: I. Type, II. Definiteness, III. Gender, IV. Number, V. Case, VI. Followship, VII. Variability, VIII. Soundness

I. Type: Common C, Proper P, Adjective J, Personal Pronoun S, Numeral N, Relative Pronoun R, Demonstrative Pronoun D

II. Definiteness: Definite D, Indefinite I

III. Gender: Masculine M, Feminine F, Unmarked U

IV. Number: Singular 1, Dual 2, Plural 3 (Sound S, Broken B, Mass M), Unmarked 4, Singular & Dual & Plural (man) A, Dual & Plural (na, nahno) T

V. Case:
Nominative N: Agent A, Subject S, Subject of cana C, Predicative of inn I, Duty Agent D, Predicative of subject P, Subject of cada K
Accusative A: Patient P, Predicative of cada K, Predicative of cana C, State (manner) S, Infinitive F, Subject of inn I, Distinguative D, Cause U
Genitive G: Adjunct (post noun) A, Post preposition P
Vocative V

VI. Followship: Assertion A, Coordinated C, Attributive T, Substitute S

VII. Variability: Variable V (Vowels W, Letters L), Semi-Variable S, Invariable (static) I

VIII. Soundness: Defective D (Ending with ya Y, Ending with alif + hamza H, Ending with alif A), Sound S

Type: Adjective (J). Degree: Positive P, Comparative C, Superlative S

Type: Numeral (N). Function: Cardinal R, Ordinal O, Numerical adjective A

Type: Personal Pronoun (S). Person: First 1, Second 2, Third 3. Attachment: Attached T, Detached D

Type: Relative Pronoun (R). Type: Common M, Specific F

Example: < Noun, Common, Definite, Feminine, Singular, Nominative (Agent), Variable (Vowels), Sound >
Tag: < N C- I F- 3B AP- - V W- S >

Example: < Noun, Personal Pronoun, Definite, Feminine, Singular, Genitive post noun (Adjunct), Invariable (static), Third, Attached >
Tag: < N S D F 1 GA I 3 T >

2. Adverbs (D)
Aspect: Time T, Place P
Case: Nominative, Accusative, Genitive
Example: < Adverb, Place, Genitive >

3. Verbs (V)
Attributes: I. Tense (Aspect), II. Gender, III. Number, IV. Person, V. Case, VI. Conditional, VII. Voice, VIII. Variability, IX. Perfectness, X. Augmentation, XI. Amount, XII. Soundness, XIII. Transitivity

I. Tense: Past (P), Present (Durative/Future) (R), Imperative (I)
II. Gender: Masculine (M), Feminine (F), Unmarked (U)
III. Number: Singular (1), Dual (2), Plural (3), Unmarked (4), Singular & Dual & Plural: verb of (man) (A), Dual & Plural: verb of (ma, nahno) (T)
IV. Person: First (1), Second (2), Third (3)
V. Case: Indicative (N); Subjunctive (A): Infinitive (F) or Non-Infinitive (N); Jussive (G)
VI. Conditional: The condition (C), The answer (A)

VII. Voice: Active (A), Passive (P)
VIII. Variability: Invariable (Static) (I), Variable (V): Vowels (W), Letters (L)
IX. Perfectness: Perfect (P), Imperfect (can and cada) (I)
X. Augmentation: Augmented (A), Non-Augmented (N)
XI. Amount: Trilateral (T), Quadriliteral (Q), Pentaliteral (P)
XII. Soundness: Defective (D): Initial (I), Hollow (Middle) (H), Last (L), Initial + Last (T), Hollow + Last (O); Sound (S)
XIII. Transitivity: Transitive: One Patient (O), Two Patients (T); Intransitive (I): Agent only (A), Agent + State or Distinguative (S), Nominal Sentence (N)

Example: < Verb, Past, Feminine, Singular, Third, Subjunctive non-infinitive, Active, Invariable (static), Perfect, Augmented, Trilateral, Sound, Intransitive Agent-only >

4. Particles (P)
1. Coordinating
2. Subordinating: Contrast (C), Exception (E)
3. Interrogative: Initial (I)
4. Preposition
5. Possibility
6. Protection
7. Future
8. Conditional
9. Answer
10. Exclamation
11. Interjection/Interrogative
12. Negative
13. Imperative (Order)
14. Cause
15. Gerund
16. Deporticle
17. -
18. Explanation
19. Assertion
20. Wishing
21. Swearness

Example: < Particle, ta of Femininity >

5. Unique (U)
Past P
Present R
Imperative (Order) I
Denominal D
The letters at the beginning of some of the suras of Al-Quran

Example: < Unique, The letters at the beginning of some of the suras of Al-Quran >

6. Residual (R)
Type: Foreign F, Acronym A, Abbreviation B, Feminine Formula
Gender: Masculine M, Feminine F
Number: Singular 1, Dual 2, Plural 3, Unmarked U
Case: Nominative N, Accusative A, Genitive G, Vocative V
Followship: Assertion A, Coordinated C, Attributive T, Substitute S

Example: < Singular, Genitive Adjunct (post noun) >

Punctuation:
? Question Mark (Q)
! Exclamation Mark (X)
... Ellipsis (E)
. Full Stop (F)
, Comma (C)
; Dotted Comma (D)
- Hyphen (H)
.. Interspersion Marks (I)
, The English Comma (G)
,, Interspersion Marks (R)
( ) Brackets (B)
" Quotation Marks (U)
: Colon (O)
[ ] Square Brackets (S)
{ } / Slash (L)
