Linguistic and Knowledge Resources
Vincenzo Maltese, University of Trento
LDKR course 2014
01/28/2020

Roadmap
o Introduction
o Linguistic resources
o Knowledge resources
o Capturing diversity with the UKC and Entitypedia
o The DERA methodology

Introduction

Roadmap
o Problem: the semantic heterogeneity problem
o Solution: current approaches to interoperability
o Ontologies
o Linguistic and knowledge resources: what and why
o Exercises

The semantic heterogeneity problem
The difficulty of establishing a certain level of connectivity between people, software agents or IT systems [Uschold & Gruninger, 2004] for the purpose of enabling each of the parties to appropriately understand the exchanged information [Pollock, 2002].

Early solutions
o Physical connectivity relies on the presence of a stable communication channel between the parties, for instance ODBC data gateways and software adapters.
o Syntactic connectivity is established by instituting a common vocabulary of terms to be used by the parties, or by point-to-point bridges that translate messages written in one vocabulary into messages in the other vocabulary.

This rigidity and lack of explicit meaning cause very high maintenance costs (up to 95% of the overall ownership costs) as well as integration failures (up to 88% of the projects) [Pollock, 2002].

The semantic interoperability solution
The solution in three points:
o Semantic mediation: the usage of an ontology, providing a shared vocabulary of terms with explicit meaning.
o Semantic mapping: using the ontology, the establishment of a mapping constituted by a set of correspondences between semantically similar data elements independently maintained by the parties.
o Context sensitivity: the mapping has contextual validity, i.e. it has to be used by taking into account the conditions and the purposes for which it was generated.

Ontologies
An explicit specification of a shared conceptualization [Gruber, 1993]
o Directed graphs
o Nodes represent concepts
o Edges represent relations between concepts
o They provide a common (formal) terminology and understanding of a given domain of interest
o They allow for automation (logical inference), support reuse and favor ...
[Figure: example ontology graph over the concepts Animal, Bird, Mammal, Head, Body, Chicken, Predator, Herbivore, Cat, Tiger and Goat, connected by Is-a, Part-of and Eats relations]

Concepts and relations (I)
o CONCEPT: it represents a set of objects or individuals
o EXTENSION: the set of individuals is called the concept extension or the concept interpretation
o RELATION: a link from the source concept to the target concept (e.g. DOG is-a ANIMAL)
Concepts are often lexically defined, i.e. they have natural language labels which are used to describe the concept extensions, often with an additional description or gloss.

Concepts and relations (II)
The backbone structure of an ontology graph is a taxonomy in which the ontological relations are genus-species (is-a and instance-of) and whole-part (part-of).

Concepts and relations (III)
The remaining structure of the graph supplies auxiliary information about the modeled domain and may include relations of any kind.
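The view of an ontology as a directed graph, with a taxonomic backbone that supports logical inference, can be made concrete with a minimal Python sketch. Only the DOG is-a ANIMAL edge comes from the slides; the other edges and the function names (`build_index`, `ancestors`) are illustrative assumptions.

```python
from collections import defaultdict

# Toy ontology as (source, relation, target) edges.
# Assumption: apart from DOG/ANIMAL, these edges are made up for illustration.
EDGES = [
    ("DOG", "is-a", "MAMMAL"),
    ("MAMMAL", "is-a", "ANIMAL"),
    ("CAT", "is-a", "MAMMAL"),
    ("HEAD", "part-of", "ANIMAL"),
    ("CAT", "eats", "BIRD"),
]

def build_index(edges):
    """Index targets by (source, relation) for fast lookup."""
    index = defaultdict(set)
    for source, relation, target in edges:
        index[(source, relation)].add(target)
    return index

def ancestors(concept, index, relation="is-a"):
    """Return every concept reachable from `concept` by following `relation` edges."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in index.get((current, relation), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

INDEX = build_index(EDGES)
print(sorted(ancestors("DOG", INDEX)))  # ['ANIMAL', 'MAMMAL']
```

Following is-a edges transitively is the simplest form of the inference mentioned above: DOG is-a ANIMAL is derived even though only DOG is-a MAMMAL and MAMMAL is-a ANIMAL are stated.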

Conceptualization
An abstract model of how people theorize (part of) the world in terms of basic cognitive units called concepts. Concepts represent the intension, i.e. the set of properties that distinguish the concept from others, and summarize the extension, i.e. the set of objects having such properties.

Explicit specification
The abstract model is made explicit by providing names and definitions for the concepts, i.e. the name and the definition of the concept provide a specification of its meaning in relation to other concepts.
Example: DOG, a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds.

Formal specification
The abstract model is formal when it is written in a language with formal syntax and formal semantics, i.e. in a logic-based language.

Shared conceptualization
It captures knowledge which is common to a community of people and therefore concretely represents the level of agreement reached in that community.

Kinds of ontologies
o Ontologies differ according to the purpose, the syntax and the semantics.
o There is also a tension between expressivity and effectiveness [Uschold and Gruninger, 2004].

Informal ontologies
o User classifications
o Folders in a file system
o Web directories
o Business catalogs

Semi-formal ontologies (I)
Knowledge Organization Systems: library classifications, thesauri.

Semi-formal ontologies (II)
In Knowledge Organization Systems (KOS) there are two main kinds of relations: hierarchical (BT/NT, broader term/narrower term) and associative (RT, related term) relations.

Formal ontologies
Formal ontologies are expressed in a formal logic language (formal in both syntax and semantics) and represented via formal specifications (e.g. OWL).

Descriptive ontologies [Giunchiglia et al., 2009]
o Used to describe objects in a domain
o Real world semantics: the extension of a concept is the set of real world entities described by the label of the concept
o We need to distinguish between classes (Animals) and individuals (Italy)
o Is-a relations are translated into DL subsumption (⊑)

Classification ontologies [Giunchiglia et al., 2009]
o Used to categorize objects
o Classification semantics: the extension of a concept is the set of documents about the entities or individual objects described by the label of the concept. The semantics of the links is subset.
o No distinction between classes (Animals) and individuals (Italy)
o Subset relations are translated into DL subsumption (⊑)

Converting ontologies
FROM DESCRIPTIVE TO CLASSIFICATION ONTOLOGY
o convert instances into classes
o convert instance-of, is-a and transitive part-of into NT/BT relations
o convert other relations into RT relations
The translation process can be easily automated. However, with the translation we have a clear loss of information.

FROM CLASSIFICATION TO DESCRIPTIVE ONTOLOGY
o each class is mapped to either a real world class or instance
o each NT/BT relation (assuming them to be transitive) has to be converted to either an instance-of, is-a or transitive part-of
o each RT relation has to be codified into an appropriate real world associative relation
The translation process cannot be automated. It needs significant manual work to reconstruct implicit information.
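The automatable direction, from a descriptive to a classification ontology, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample triples are made up, every part-of is assumed to be transitive, and `to_classification` is a hypothetical helper name.

```python
# Descriptive relations that become hierarchical (BT/NT) relations.
# Assumption: every part-of in the input is transitive.
HIERARCHICAL = {"instance-of", "is-a", "part-of"}

def to_classification(triples):
    """Rewrite each (source, relation, target) so that the relation is
    BT (target is the broader term of source) or RT (associative)."""
    converted = []
    for source, relation, target in triples:
        kos_relation = "BT" if relation in HIERARCHICAL else "RT"
        converted.append((source, kos_relation, target))
    return converted

# Illustrative descriptive ontology (sample data, not from the slides).
DESCRIPTIVE = [
    ("Italy", "instance-of", "Country"),
    ("Cat", "is-a", "Animal"),
    ("Head", "part-of", "Body"),
    ("Cat", "eats", "Bird"),
]

CLASSIFICATION = to_classification(DESCRIPTIVE)
for triple in CLASSIFICATION:
    print(triple)
```

The loss of information is visible in the output: four distinct relations collapse into two, and nothing in the result says whether a BT link was originally instance-of, is-a or part-of. That is why the reverse direction needs manual work.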

What is a linguistic and knowledge resource?

Why do we need linguistic and knowledge resources?
o NLP, e.g. to interpret "The banks of the river Nile":
  bank: sloping land (especially the slope beside a body of water)
  river: a large natural stream of water (larger than a creek)
  Nile: a major north-flowing river in northeastern Africa
o SEMANTIC MATCHING
o SEMANTIC SEARCH, e.g. the query "automobile" against documents such as:
  "1957 Ferrari 625 TRC Spider. This two-of-a-kind classic Ferrari is lauded by historians as one of the prettiest Ferraris ever built. The 1957 Ferrari 625 TRC Spider is an absolutely stunning automobile, one as dashing in the garage as it is at 120 mph."
  "Back in the Saddle: Presenting our Porsche 911 (997) Carrera S Cabriolet. There's a reason the Porsche 911 is one of the most popular sports cars ever, and after a few minutes behind the wheel of one you'll understand why."
o DATA INTEGRATION

Exercises
1. Is an ER diagram a formal ontology? Explain why or why not.
2. Is a database schema a formal ontology? Explain why or why not.
3. Create an ontology to describe your family in terms of general classes, relations between them and actual individuals
4. Identify on the web two thesauri in the agricultural domain
5. Identify on the web an OWL ontology
6. Identify a sub-tree in your file system and convert it into a descriptive ontology where each node label is given a definition

Linguistic resources

Roadmap
o WordNet
o MultiWordNet
o Weaknesses of existing linguistic resources
o Exercises

WordNet (1985)
[Figure: two synsets linked by a hyponym-of relation. The synset {stream, watercourse} has the gloss "A natural body of running water flowing on or under the earth"; the synset {river} has the gloss "A large natural stream of water (larger than a creek)". Figure labels: word, sense, synset, relation.]

Words
o Words are the basic constituents of a language.
o WordNet focuses on lemmas, i.e. the canonical form of a set of words in a language. In English, for example, run, runs, ran and running are forms of the same lexeme, with the verb run as the lemma.
o WordNet also accounts for exceptional forms. For nouns, they are usually the irregular plural forms; for adjectives and adverbs, irregular superlatives; for verbs, irregular conjugations. For instance, the noun wives is an exceptional form of the noun wife.

Senses and synsets
o A (word) sense is a word in a language (e.g. English) having a distinct meaning. Senses for each word are ranked.
o Words having the same sense are grouped together into a synset.
o Each synset is associated with a part of speech (POS) in the set {noun, adjective, verb, adverb} and a gloss.
For instance, in English the word good:
  (noun) good: an article for commerce
  (adjective) good: having positive qualities

Lexical relations
Lexical relations hold between word senses.
o Synonymy is a symmetric relation connecting two senses of two different words with the same POS and the same meaning. WordNet implements synonymy through the notion of synset. Example: stream and watercourse are synonyms.
o Antonymy is a symmetric relation connecting two senses of two different words with the same POS and opposite meaning. Example: black is an antonym of white.

Semantic relations
Semantic relations hold between synsets.
o Y is a hypernym of X (and X is a hyponym of Y) if every X is a (kind of) Y. Example: canine is a hypernym of dog.
o Y is a meronym of X (and X is a holonym of Y) if Y is a part of X. Example: window is a meronym of building.
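The notions of synset, synonymy and hypernymy can be sketched with plain Python data structures. This is not the WordNet API; the `Synset` class and the helper functions are hypothetical names, and only the two synsets shown in the WordNet figure are modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Synset:
    """A group of lemmas sharing one sense, with a part of speech and a gloss."""
    lemmas: tuple
    pos: str
    gloss: str
    hypernyms: list = field(default_factory=list)  # semantic relation between synsets

def are_synonyms(word_a, word_b, synsets):
    """Two different words are synonyms if some synset contains both."""
    return word_a != word_b and any(
        word_a in s.lemmas and word_b in s.lemmas for s in synsets
    )

def all_hypernyms(synset):
    """Collect hypernyms transitively, nearest first."""
    result = []
    frontier = list(synset.hypernyms)
    while frontier:
        current = frontier.pop(0)
        if current not in result:
            result.append(current)
            frontier.extend(current.hypernyms)
    return result

stream = Synset(
    ("stream", "watercourse"), "noun",
    "a natural body of running water flowing on or under the earth",
)
river = Synset(
    ("river",), "noun",
    "a large natural stream of water (larger than a creek)",
    hypernyms=[stream],
)

print(are_synonyms("stream", "watercourse", [stream, river]))  # True
print([s.lemmas for s in all_hypernyms(river)])  # [('stream', 'watercourse')]
```

Storing synonymy implicitly, as membership in the same synset, mirrors the design choice described above: lexical relations attach to words, while hypernymy attaches to the synset as a whole.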

MultiWordNet (2002)
[Figure: the English synset {stream, watercourse}, "A natural body of running water flowing on or under the earth", mapped via synset IDs to the Italian synset {corso d'acqua}.]
Strengths
o Mapping with 6 languages
o Lexical GAPs can be defined
Weaknesses
o Only a partial coverage
o A few glosses available
o Biased towards English

Lexical GAPs and phrasets
A lexical GAP arises when a language (e.g. English) expresses in a lexical unit what the other language (e.g. Italian) expresses with a free combination of words (e.g. borrower = chi prende in prestito).

Problems with WordNet-like resources (I)
o Nodes in similar positions do not share the same ontological properties
o Glosses exhibit space and time bias
o Some concepts are too similar in meaning
o Some concepts are actually individuals

Problems with WordNet-like resources (II)
o Polysemy: too fine-grained distinctions in meaning

Exercises
1. Identify in WordNet two synsets denoting individuals
2. Identify in WordNet two equivalent synsets, i.e. two synsets having the same meaning
3. Identify in WordNet a word with a polysemy > 10
4. Identify in WordNet the direct hypernym of museum
5. Identify in WordNet a word with an antonym
6. Identify in WordNet three cases of space bias and three cases of time bias
7. Identify in MultiWordNet three words having a GAP in another language

Knowledge resources

Roadmap
o Renowned knowledge resources
o The (open) linked data initiative
o Applications
o Exercises

Example of knowledge content
[Figure: a knowledge graph about Albert Einstein (SCIENTIST, PERSON): date of birth March 14, 1879; birthplace Ulm (CITY), which is part-of Germany (COUNTRY); spouse Mileva Maric; affiliation ETH Zurich (UNIVERSITY).]

CYC ontology (1984)
o A general-purpose common sense knowledge base
o Hand-crafted
o It contains around 2.2 million assertions and more than 250,000 terms
o Content is organized into three levels, from broader and abstract knowledge (the upper ontology) and widely used knowledge (the middle ontology) to domain specific knowledge (the lower ontology)
o Triples such as:
  #$isa #$BillClinton #$UnitedStatesPresident
  #$capitalCity #$France #$Paris

SUMO ontology (2001)
Suggested Upper Merged Ontology
o A general-purpose common sense knowledge base
o Hand-crafted
o It contains around 1,000 terms and 4,000 definitional statements
o Its extension, called MILO (Mid-Level Ontology), covers individual domains

DBPedia (2007)
o It is automatically built by extracting semi-structured content from Wikipedia
o Text is not semantically analyzed

YAGO ontology (2008)
[Figure: the word physicist, the class "a scientist trained in physics", and the instance Max Planck linked to the class by instance-of.]
o Concepts are taken from noun synsets of WordNet
o Instances and their properties are automatically extracted from Wikipedia
o The linking of concepts with instances is done via NLP techniques
o Accuracy is claimed to be ~95%
o It is available in triple (RDF) format

Freebase (2010)
o Semi-automatically built
o It contains data harvested from several sources such as Wikipedia, NNDB, FMD and MusicBrainz, as well as individually contributed data from its users.
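Several of the resources above expose their content as triples. As a rough sketch of what that means in practice, the facts from the Albert Einstein example can be stored as (subject, predicate, object) tuples and queried by pattern. The predicate spellings, the ISO date format and the `match` helper are assumptions made for this illustration, not the format of any specific knowledge base.

```python
# Facts from the Albert Einstein example, written as triples.
# Assumption: predicate names and the ISO date are illustrative choices.
TRIPLES = [
    ("Albert Einstein", "instance-of", "SCIENTIST"),
    ("Albert Einstein", "date of birth", "1879-03-14"),
    ("Albert Einstein", "birthplace", "Ulm"),
    ("Albert Einstein", "spouse", "Mileva Maric"),
    ("Albert Einstein", "affiliation", "ETH Zurich"),
    ("Ulm", "instance-of", "CITY"),
    ("Ulm", "part-of", "Germany"),
    ("Germany", "instance-of", "COUNTRY"),
    ("ETH Zurich", "instance-of", "UNIVERSITY"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# In which country was Einstein born? Follow birthplace, then part-of.
city = match(TRIPLES, "Albert Einstein", "birthplace")[0][2]
country = match(TRIPLES, city, "part-of")[0][2]
print(country)  # Germany
```

Chaining two pattern matches answers a question that no single triple states, which is the kind of navigation a graph of entities and relations is meant to support.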

The Schema.org initiative

Linked Data Cloud (since 2007)

Linked Data
The Linked Data approach forms the basis of data publishing guidelines pinpointing how data from government, public and private sectors can be made more valuable for the consumers.
Principles
o the use of HTTP URIs as the identifiers of things (concepts, entities and attributes)
o the provision of meaningful content published in an open format (RDF) for each URI reference
o the production of navigable content via links

Linked Open Data
Levels of openness, from the most basic to the most advanced:
1. publishing on the Web with an open license, regardless of format
2. structured format
3. non-proprietary format (e.g. CSV)
4. W3C open format (e.g. RDF)
5. links to other RDF open datasets

the PAT

Open Data Trentino portal
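The three Linked Data principles (HTTP URIs as identifiers, content in RDF, links to other resources) can be illustrated with a small sketch that serializes facts as N-Triples lines. The namespace `http://example.org/resource/`, the sample facts and the helper names are hypothetical, and literal escaping is deliberately left out.

```python
BASE = "http://example.org/resource/"  # hypothetical namespace, not a real dataset

def uri(name):
    """Mint an HTTP URI for a thing; spaces become underscores."""
    return "<" + BASE + name.replace(" ", "_") + ">"

def to_ntriples(facts):
    """Serialize (subject, predicate, object, is_link) facts, one N-Triples line each.
    Objects flagged as links become URIs, so the data stays navigable;
    other objects become plain string literals (no escaping in this sketch)."""
    lines = []
    for subject, predicate, obj, is_link in facts:
        rendered = uri(obj) if is_link else '"' + obj + '"'
        lines.append(f"{uri(subject)} {uri(predicate)} {rendered} .")
    return "\n".join(lines)

# Sample facts chosen for illustration.
FACTS = [
    ("Ulm", "part-of", "Germany", True),   # link to another resource
    ("Ulm", "label", "Ulm", False),        # literal value
]

print(to_ntriples(FACTS))
```

A real publication would use an RDF library and established vocabularies rather than hand-built strings; the sketch only shows how each thing, including the relation itself, gets its own URI.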

Open Government Data in UK

Exercises
1. Design two small knowledge graphs about a famous person, taking information from Wikipedia and YAGO (use the YAGO browser)
2. Explore Freebase and find information about Trento
3. Explore http://data.gov.uk/ and find useful information about museums
4. Search for the linked data cloud and check how many datasets it currently contains

Capturing diversity with the UKC and Entitypedia

Roadmap
o Diversity and diversity dimensions
o The entity-centric approach
o The UKC and Entitypedia
o Exercises

The inherent diversity of the world

What does "bug" mean?
o ENTOMOLOGY
o COMPUTER SCIENCE
o FOOD
Sources of diversity: goals, culture, belief, personal experience

Diversity is pervasive in world descriptions
Within a natural language
o bug as malfunction vs. bug as food (homonymy)
o stream and watercourse have the same meaning (synonymy)
Across natural languages
o watercourse in English is the same as corso d'acqua in Italian (concepts)
o There is no lemma in Italian for biking (lexical GAP)
In formal language
o There are several types of bodies of water (semantic relations)
o Rivers have a length, lakes have a depth (schematic knowledge)
In data (ground knowledge)
o The Adige river is 410 km long; Lake Garda is 136 m deep
o Bugs are great food vs. how can you eat bugs? (the role of culture)
o Climate is/is not an important issue (the role of schools of thought)

Diversity in language

Diversity in knowledge
o Billions of locations
o Billions of people
o Millions of organizations and events, artifacts, creative works, ...

Terminological and ground knowledge
o Terminological knowledge: Actor acted in Movie (Film)
o Ground knowledge: Michael J. Fox acted in Back to the Future II

An entity-centric vision of the world (I)
o Entities are objects which are so important in our everyday life that they are referred to by a name (e.g. Eiffel Tower)
o Each entity has its own attributes (e.g. latitude, longitude, height)
o Each entity is in relation with other entities (e.g. Eiffel Tower is located in Paris, France)
o Each entity has a reference class (e.g. monument) which determines its entity type (e.g. location)

An entity-centric vision of the world (II)
Entities are not all the same; they have different metadata according to the type of entity (location, event, organization, person).

What do we aim for? How do we achieve that?

[Figure: example entity descriptions]
o Name: Coliseum; Class: Amphitheatre; Height: 48.5 m; Latitude: 41.89; Longitude: 12.49; Location: Rome
o Name: Fori Imperiali; Class: Bus Stop; Company: ATAC
o Name: Arch of Constantine; Class: Triumphal arch; Latitude: 41.88; Longitude: 12.49; Location: Rome; Customer: Constantine I
o Name: John Doe; Class: Person; Date of Birth: 1960-05-12

The UKC and Entitypedia (since 2010)
[Figure: the three layers of the UKC and Entitypedia]
o NATURAL LANGUAGE (EN): {stream, watercourse}, "A natural body of running water flowing on or under the earth"; {river}, "A large natural stream of water (larger than a creek)"
o NATURAL LANGUAGE (IT): {corso d'acqua}, "Uno specchio d'acqua che scorre sulla terra o al di sotto di essa"; {fiume}, "Un grande corso d'acqua di origine naturale (più grande di un ruscello)"
o FORMAL LANGUAGE: concept #456 (river) is-a concept #123 (stream)
o GROUND KNOWLEDGE: Mississippi River, an instance of #456

o Manually built via collaborative development [Tawfik et al., 2014], bootstrapped from WordNet, MultiWordNet, GeoNames
o Split between natural language, formal language and ground knowledge [Giunchiglia et al., 2012b]
o Domain knowledge is created following the DERA methodology [Giunchiglia et al., 2012a] and principles [Giunchiglia et al., 2009], with a distinction between entities, classes, relations, attributes and values
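The split between natural language, formal language and ground knowledge can be sketched as three separate Python structures that only share concept identifiers. The identifiers #123 and #456 and the labels come from the example above; the class names, the dictionary layout and the `describe` function are assumptions for illustration and do not reflect the actual UKC implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Concept:
    """Formal language: a language-independent concept."""
    concept_id: int
    is_a: Optional[int] = None  # id of the broader concept, if any

@dataclass
class Entity:
    """Ground knowledge: a named instance of a concept."""
    name: str
    class_id: int
    attributes: dict = field(default_factory=dict)

# Formal language: river (#456) is-a stream (#123).
CONCEPTS = {123: Concept(123), 456: Concept(456, is_a=123)}

# Natural language: one vocabulary per language, mapping concept ids to synsets.
VOCABULARY = {
    "en": {123: ["stream", "watercourse"], 456: ["river"]},
    "it": {123: ["corso d'acqua"], 456: ["fiume"]},
}

# Ground knowledge.
MISSISSIPPI = Entity("Mississippi River", class_id=456)

def describe(entity, language):
    """Render the entity's class chain using the labels of the given language."""
    labels = []
    concept_id = entity.class_id
    while concept_id is not None:
        labels.append(VOCABULARY[language][concept_id][0])
        concept_id = CONCEPTS[concept_id].is_a
    return entity.name + ": " + " > ".join(labels)

print(describe(MISSISSIPPI, "en"))  # Mississippi River: river > stream
print(describe(MISSISSIPPI, "it"))  # Mississippi River: fiume > corso d'acqua
```

Because the entity points to a concept identifier rather than to a word, adding a new language only requires a new vocabulary; the formal language and the ground knowledge stay untouched.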

The UKC components
o The natural language: our vocabulary in multiple languages. Natural Language Core (NLC)
o The formal language: our graph of language-independent notions. Concept Core (CC)
o Schematic knowledge: our schema of basic entity types. EType Core (ETC)
o Domain knowledge: domain-specific partition of the language above. Domain Core (DC)

Concept Core

Natural Language Core
Language | Synset | Gloss
en | canal | long and narrow strip of water made for boats or for irrigation
it | canale; naviglio | corso d'acqua artificiale, costruito per l'irrigazione o la navigazione
mn, bn, zh, hi | [entries lost in extraction] | [entries lost in extraction]

Language | Synset | Gloss
en | rivulet | a small stream
mn | GAP |

Etype Core: lattice (sample)
[Figure: lattice of entity types with the nodes Entity, Abstract Entity, Physical Entity, Mind Product, Information Object, Artifact, Movie, Song, Document, Paper, Proceedings, Presentation, Organization, Event, Conference, Session, Seminar, Person and Location. Legend: CORE, Extended.]

Domain Core: the DERA methodology
o To capture terminology relevant to a specific domain
o Based on the faceted approach from Library and Information Science
o Terminology can be directly codified into Description Logic
[Figure: Domain (D), Entity Classes (E), Relations (R), Attributes (A); structural levels: ARRAY, CATEGORY, FACET, CONCEPT]

Entitypedia compared with existing knowledge bases
KB | #entities | #facts | Domains | Distinction classes and instances | Distinction NL/FL | Manual
CYC | 250K | 2.2 M | Yes | No | No | Yes
OpenCYC | 47k | 306k | Yes | No | No | Yes
SUMO | 1k | 4k | No | Yes | Yes | Yes
MILO | 21k | 74k | Yes | Yes | Yes | Yes
DBPedia | 3.5 M | 500 M | No | No | No | No
YAGO | 2.5 M | 20 M | No | No | No | No
Freebase | 22 M | ? | Yes | Yes | No | Yes
Entitypedia | 10 M | 80 M | Yes | Yes | Yes | Yes

Exercises
1. Search the Web for information about how many languages are spoken in Europe and in the whole world.
2. What is the most widely spoken language in the world?
3. Provide an example of a concept which is heavily culturally dependent.
4. What are the top level entity types (up to 10) that you consider necessary to codify the whole world knowledge?
5. What are the main novelties introduced by the UKC and Entitypedia w.r.t. previous approaches?

Methodologies for content generation

Roadmap
o Introduction
o Motivation
o The original faceted approach
o Primitive notions in DERA
o Steps in the methodology
o Guiding principles
o Converting DERA ontologies into DL
o Applications
o Exercises

METHODOLOGY? BECAUSE SMALL DIFFERENCES MATTER
Humans and chimps share a surprising 98.8 percent of their DNA.
How to build ontologies which are of the highest quality possible?

Methodologies to ontology development
o Several methodologies have been developed for the construction and maintenance of ontologies (KR) or controlled vocabularies (KO)
o The faceted approach [Ranganathan, 1967] from library science is known to have great benefits in terms of quality and scalability
o It is based on the fundamental notions of domain and facets, which allow capturing the different aspects of a domain and allow for an incremental growth. Originally facets were of 5 types (PMEST): Personality, Matter, Energy, Space, Time.
o A key feature is compositionality (meccano property), i.e. the system ...
Example:
[D] Medicine
[E] Body Part
. Digestive System
. . Stomach
[P] Disease
. Cancer
. . Carcinoma
. . . Adenocarcinoma
[A] Action
. Treatment
[M] Kind (to be applied to [A] Action)
. Chemotherapy

The DERA framework
o To capture terminology relevant to a specific domain
o DERA is faceted as it is inspired by the faceted approach
o DERA is a KR approach as it models entities of a domain (D) by their entity classes (E), relations (R) and attributes (A)
o Terminology can be directly codified into Description Logic
[Figure: Domain (D), Entity Classes (E), Relations (R), Attributes (A); structural levels: ARRAY, CATEGORY, FACET, CONCEPT]

Domains
Any area of knowledge or field of study that we are interested in or that we are communicating about that deals with specific kinds of entities.
Domains are the main means by which the diversity of the world is captured, in terms of language, knowledge and personal experience.

Primitive notions
o Entity: a (digital) description of any real world physical or abstract object so important as to be denoted with a proper name. A single person, a place or an organization are all examples of entities.
o Entity Class: any set of objects with common characteristics.
o Relation: any object property used to connect two entities. Typical examples of relations include part-of, friend-of and affiliated-to.
o Attribute: any data property of an entity. Each attribute has a name and one or more values taken from a range of possible values.

Elements of DERA
A DERA domain is a triple D = <E, R, A> where:
o E (for Entity) is a set of facets grouping terms denoting entity classes, whose instances (the entities) have either perceptual or conceptual existence. Terms in these hierarchies are explicitly connected by is-a or part-of relations.
o R (for Relation) is a set of facets grouping terms denoting relations between entities. Terms in these hierarchies are connected by is-a relations.
o A (for Attribute) is a set of facets grouping terms denoting qualitative/quantitative or descriptive attributes of the entities. We differentiate between attribute names and attribute values such that each attribute name is associated with corresponding values. Attribute names are connected by is-a relations, while attribute values are connected to corresponding ...
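The definition of a DERA domain as a triple of facet sets can be written down as a small data structure. The sketch below is one possible encoding, not the authors' implementation: the class names `Facet` and `Domain`, the domain name and the sample terms are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Facet:
    """A hierarchy of terms; `parents` maps each term to its broader term
    (None for the root of the facet)."""
    name: str
    parents: dict = field(default_factory=dict)

    def path(self, term):
        """Return the chain from `term` up to the root of the facet."""
        chain = []
        while term is not None:
            chain.append(term)
            term = self.parents[term]
        return chain

@dataclass
class Domain:
    """A DERA domain D = <E, R, A>."""
    name: str
    entity_classes: list   # E: facets of entity classes
    relations: list        # R: facets of relations
    attributes: list       # A: facets of attribute names
    attribute_values: dict = field(default_factory=dict)  # attribute name -> values

# Illustrative content (sample terms, hypothetical domain name).
body_of_water = Facet("Body of water", {
    "Body of water": None,
    "Flowing body of water": "Body of water",
    "Stream": "Flowing body of water",
})
direction = Facet("Direction", {"Direction": None, "East": "Direction", "North": "Direction"})
depth = Facet("Depth", {"Depth": None})

geography = Domain(
    name="Geography (hypothetical)",
    entity_classes=[body_of_water],
    relations=[direction],
    attributes=[depth],
    attribute_values={"Depth": ["deep", "shallow"]},
)

print(body_of_water.path("Stream"))
# ['Stream', 'Flowing body of water', 'Body of water']
```

Keeping E, R and A as separate lists of facets reflects the idea that each facet is a homogeneous hierarchy, so a new facet can be added without touching the existing ones.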

DERA facets
o DERA provides the language required to describe entities of a certain entity type in a given domain (D)
o Language comprises entity classes (E), relations (R) and attributes (A), names and values.
o Concepts and semantic relations between them form hierarchies of homogeneous nature called facets, each of them codifying a different aspect of the domain.
o Each facet is a descriptive ontology

ENTITY CLASS
Location
Landform
  (is-a) Natural elevation
    (is-a) Continental elevation
      (is-a) Mountain
      (is-a) Hill
    (is-a) Oceanic elevation
      (is-a) Seamount
      (is-a) Submarine hill
  (is-a) Natural depression
    (is-a) Continental depression
      (is-a) Valley
      (is-a) Trough
    (is-a) Oceanic depression
      (is-a) Oceanic valley
      (is-a) Oceanic trough
Body of water
  (is-a) Flowing body of water
    (is-a) Stream, Watercourse

RELATION
Direction
  (is-a) East
  (is-a) North
  (is-a) South
  (is-a) West
Relative level
  (is-a) Above
  (is-a) Below
Containment
  (is-a) part-of

ATTRIBUTE
Name
Latitude
Longitude
Altitude
Area
Population
Depth
  (value-of) deep
  (value-of) shallow
Length
  (value-of) long
  (value-of) short

Analysis of the term school
Term: School
Source | Definition | Genus | Differentia
WordNet | an educational institution | institution | educational
Oxford dictionary | an institution for educating children | institution | for educating children
Merriam-Webster | an institution for the teaching of children | institution | for the teaching of children
Wikipedia | an institution designed for the teaching of students (or "pupils") under the direction of teachers | institution | for the teaching of students under the direction of teachers

The term school is in general highly polysemous. Among others, school may denote a building. In the context of educational organizations, as from above, there seems to be quite an agreement about the fact that it indicates a kind of educational institution, but in some cases (such as for WordNet) the meaning is left very generic. We coined the following definition: an educational institution designed for the teaching of students under the direction of teachers.

Synthesis of educational organizations
Educational Institution
  Preschool
  School
    Primary school
    Secondary school
    Post-secondary school
      Training school
      Vocational school
      Technical school
      Graduate school
  College
  University

Synthesis of educational organizations
o Educational Institution (an institution dedicated to education)
o Preschool (an educational institution for children too young for primary school)
o School (an educational institution designed for the teaching of students under the direction of teachers)
o Primary school (a school for children where they receive the first stage of basic education)
o Secondary school (a school for students intermediate between primary school and tertiary school)
o Tertiary school (a school where programmes are largely theory based and designed to provide sufficient qualification for entry to advanced research programmes or professions with high skill requirements and leading to a degree)
o Training school (a tertiary school providing theoretical and practical training on a specific topic or leading to a certain degree)
o Vocational school (a tertiary school where students are given education and training which prepares for direct entry, without further training, into a specific occupation)
o Technical school (a tertiary school where students learn about technical skills required for a certain job)
o Graduate school (a tertiary school in a university or independent institution offering study leading to degrees beyond the bachelor's degree)
o College (an educational institution or a constituent part of a university or independent institution, providing higher education or specialized professional ...

Guiding principles

Principle          Example
Relevance          it is more realistic to classify the universe of cows by breed than by grade
Ascertainability   flowing body of water
Permanence         spring as a natural flow of ground water
Exhaustiveness     to classify the universe of people, we need both male and female
Exclusiveness      age and date of birth, both produce the same divisions
Context            bank, a bank of a river, OR, a building of a financial institution
Currency           metro station vs. subway station
Reticence          minority author, black man
Ordering           stream preferred to watercourse

Guidelines for the formal language

Concepts: facets in UKC are descriptive ontologies where each concept denotes a set of real world entities (classes) or a property of real world entities (relations and attributes).
Look for essential concepts: a property of an entity (that we codify as a concept) is essential (as opposed to accidental) to that entity if it must hold for it. As a special form of essence, a property is rigid if it is essential to all its instances [Guarino and Welty, 2002].
Avoid complex concepts: e.g. red car.
Avoid redundancies: e.g. nursery school and kindergarten are synonyms.
Avoid individuals: e.g. United States Military Academy.
Pay attention to meronymy relations: while part-of is assumed to be transitive in general, substance-of and member-of are not. Therefore, the latter two cannot be [...]

Guidelines for the natural language (I)

Terms and synsets: terms are grouped into synsets. In UKC multiple languages are accounted for by developing multiple dictionaries, i.e. by assigning either a synset or a GAP to every concept.
Lemmas: for the selection of terms we focus on lemmas. We do not accept in UKC:
o articles (e.g. the) and plural forms;
o capitalization, except for cases such as acronyms and abbreviations;
o punctuation characters and parentheses.
The following are instead accepted, but not recommended:
o loan terms, i.e. terms borrowed from other languages, if widely used. For instance, the term kindergarten in English is typically well accepted.
o transliterations, i.e. when a term is a transcript from [...]

Guidelines for the natural language (II)

Parts of speech: noun, adjective, adverb and verb. A lemma can be a single word (e.g. bank), a multi-word (e.g. traffic light) or a prepositional phrase (e.g. place of worship).
Homographs: terms which are spelled the same, but have different meanings. The same term can be associated with multiple concepts.
Glosses: in line with the principle of reticence, a gloss should not convey any cultural, temporal or regional bias.

NO
Primary school: a school for young children; usually the first 6 or 8 grades
Infant school: British school for children aged 5-7
Junior school: British school for children aged 7-11

YES
Primary school: a school for children where they receive the first stage of basic education
Infant school: a primary school for very young children where they learn basic reading and writing skills
Junior school: a primary school for young children where they learn basic notions of core subjects such as math, history and other social sciences
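The lemma guidelines above lend themselves to a simple automatic check. The following is a minimal sketch in Python, not the UKC tooling: the function name lemma_issues, the English-only article list and the choice to tolerate hyphens (as in post-secondary school) are all assumptions made for illustration. Plural forms are not checked, since that would require a morphological analyser.

```python
import string

# Illustrative, English-only list of articles (an assumption, not a UKC resource).
ARTICLES = {"a", "an", "the"}
# Assumption: hyphens are tolerated, as in "post-secondary school".
FORBIDDEN = set(string.punctuation) - {"-"}


def lemma_issues(term, known_acronyms=frozenset()):
    """Return the list of guideline violations found in a candidate term."""
    issues = []
    words = term.split()
    # Articles are not accepted.
    if words and words[0].lower() in ARTICLES:
        issues.append("article")
    # Capitalization is accepted only for acronyms and abbreviations.
    if any(w != w.lower() and w not in known_acronyms for w in words):
        issues.append("capitalization")
    # Punctuation characters and parentheses are not accepted.
    if any(ch in FORBIDDEN for ch in term):
        issues.append("punctuation")
    return issues


for candidate in ["traffic light", "the river", "Primary school", "school (British)"]:
    print(candidate, "->", lemma_issues(candidate))
```

A term that returns an empty list passes these three checks only; it may still violate the guidelines in ways the sketch does not cover.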

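Returning to the meronymy guideline for the formal language, the difference between part-of and the other two relations can be made concrete with a small sketch. It computes the transitive closure only for relations declared transitive. The function names and the example pairs (Trento, Italy, Europe, player, team, league) are hypothetical and chosen only for illustration.

```python
def transitive_closure(pairs):
    """Return every (x, z) reachable by chaining the given (x, y) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Per the guideline: only part-of is treated as transitive.
TRANSITIVE = {"part-of"}


def infer(assertions):
    """assertions maps a relation name to a set of (x, y) pairs."""
    return {
        rel: transitive_closure(pairs) if rel in TRANSITIVE else set(pairs)
        for rel, pairs in assertions.items()
    }


# Hypothetical example assertions.
asserted = {
    "part-of": {("Trento", "Italy"), ("Italy", "Europe")},
    "member-of": {("player", "team"), ("team", "league")},
}
inferred = infer(asserted)
print(("Trento", "Europe") in inferred["part-of"])     # True
print(("player", "league") in inferred["member-of"])   # False
```

Keeping the set of transitive relations explicit means a reasoner never derives a chained member-of or substance-of fact by accident.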
Back to entities

Thames
Entity class   Class: River
Attributes     Name: Thames
               Latitude: 51.50
               Longitude: 0.61
               Length: 346 km (long)
Relations      Part-of: UK

Each of the terms above comes from a DERA ontology in the KB.

Localization [Ganbold et al., 2014]

[Figure: translation from English to Mongolian of an ontology fragment with the concepts transportation facility, road, highway and track, connected by part-of and is-a relations. English synset: {highway, main road}; gloss: a major road for any form of motor transport; each shown with its Mongolian translation.]

Formalizing DERA into DL (I)

With the formalization, DL concepts denote either sets of entities or sets of attribute values. DL roles denote either relations or attributes.
A DL interpretation $\mathcal{I} = \langle \Delta, \cdot^{\mathcal{I}} \rangle$ consists of the domain of interpretation $\Delta = F \cup G$ where:
o $F$ is a set of individuals denoting real world entities
o $G$ is a set of attribute values
and of an interpretation function $\cdot^{\mathcal{I}}$ where:
$E_i^{\mathcal{I}} \subseteq F$, $R_j^{\mathcal{I}} \subseteq F \times F$, $A_k^{\mathcal{I}} \subseteq F \times G$, $v_r^{\mathcal{I}} \in G$

Formalizing DERA into DL (II)

                            Object                         DL formalization
TBox   E1, ..., Ep          entity classes                 concepts
       R1, ..., Rq          relations between classes      roles
       A1, ..., As          attributes                     roles
       value-of             hierarchical relation          role restrictions
       is-a                 hierarchical relation          subsumption (⊑)
       part-of              hierarchical relation          roles
       any other relation   associative relations          roles
ABox   e1, ..., en          entities (instances)           individuals in F (entities)
       v1, ..., vr          attribute values               individuals in G (values)
       r1, ..., rm          relations between entities     role assertions
       a1, ..., at          attributes of entities         role assertions
       instance-of          hierarchical relation          concept assertions

Advantages of DERA

DERA facets have explicit semantics and are modeled as descriptive ontologies
DERA facets inherit all the important properties of the faceted approach, such as robustness and scalability
DERA allows for automated reasoning via the formalization into Description Logics ontologies. In particular, DERA allows for a very expressive search by any entity property

The space ontology [Giunchiglia et al., 2012a]

Knowledge is extracted from GeoNames and the Getty Thesaurus of Geographic Names
Terms are collected, categorized into classes, entities, relations and attributes, and synsets are generated
Synsets are mapped to and integrated with WordNet
Synsets are analyzed and arranged into facets
Terms are standardized and ordered

Objects              Quantity
Entity classes (E)   845
Entities (e)         6,907,417
Relations (R)        70
Attributes (A)       31

Facet terms:
Landform: Natural depression, Oceanic depression, Oceanic valley, Oceanic trough, Continental depression, Trough, Valley, Natural elevation, Oceanic elevation, Seamount, Submarine hill, Continental elevation, Hill, Mountain
Body of water: Flowing body of water, Stream, River, Brook, Stagnant body of water, Lake, Pond

The semantic-geo catalogue

[Farazi et al., 2012]

Knowledge is extracted from the geographical dataset of the Province of Trento
The faceted ontology was built in English and Italian

Usage of the ontology
The ontology is used in combination with S-Match within the search component of the geo-catalogue to improve search
The evaluation shows that, at the price of a drop in precision of 0.16%, we double recall

Objects              Quantity
Facets               5
Entity classes (E)   39
Entities (e)         20,162
part-of relations    20,161
Alternative names    7,929

Facet terms:
Body of water: Lake, Group of lakes, Stream, River, Rivulet, Spring, Waterfall, Cascade, Canal
Natural elevation: Highland, Hill, Mountain, Mountain range, Peak, Chain of peaks, Glacier
Natural depression: Valley, Mountain pass
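One plausible reason why a faceted ontology raises recall in search is that a query for a general class can also return entities of its more specific classes. The sketch below illustrates this with a plain is-a lookup; it is not S-Match and not the geo-catalogue implementation. The nesting of classes in IS_A and the three example entities are assumptions made for illustration and are not taken from the dataset.

```python
# Hypothetical is-a edges (child -> parent) built from class names in the facets;
# the exact nesting is an assumption.
IS_A = {
    "lake": "body of water",
    "stream": "body of water",
    "river": "stream",
    "rivulet": "stream",
    "hill": "natural elevation",
    "mountain": "natural elevation",
}

# Illustrative entities (entity -> class), not drawn from the actual dataset.
ENTITIES = {"Garda": "lake", "Adige": "river", "Bondone": "mountain"}


def is_subsumed(cls, ancestor):
    """True if cls equals ancestor or reaches it by following is-a edges."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = IS_A.get(cls)
    return False


def search(query_class, expand=True):
    """Return entities of the query class, optionally including subclasses."""
    if expand:
        return sorted(e for e, c in ENTITIES.items() if is_subsumed(c, query_class))
    return sorted(e for e, c in ENTITIES.items() if c == query_class)


print(search("body of water", expand=False))  # []
print(search("body of water"))                # ['Adige', 'Garda']
```

With exact matching the query finds nothing, because no entity is classified directly as a body of water; with subsumption the lake and the river are both retrieved.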

Exercises

1. Analyse the following terms:
o (geography) river, lake, salt lake, depth
o (business) organization, company, business
o (literature) newspaper, newsletter, book, archive, author, publisher, format, frequency
2. Take one domain of your choice, identify the entity types which are relevant and define corresponding terminology using DERA (concentrate on a few classes, relations and attributes).
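As a possible starting point for exercise 2, the Thames example from "Back to entities" can be written down following the TBox/ABox split of the DL formalization. This is a minimal sketch under stated assumptions: the KnowledgeBase class and its field names are hypothetical, and the direct is-a edge from river to flowing body of water is a simplification. The attribute values are those given on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # TBox: class -> parent class (is-a, i.e. subsumption)
    is_a: dict = field(default_factory=dict)
    # ABox: entity -> class (instance-of, i.e. concept assertions)
    instance_of: dict = field(default_factory=dict)
    # ABox: (relation, entity, entity) role assertions
    relations: set = field(default_factory=set)
    # ABox: (attribute, entity, value) role assertions
    attributes: set = field(default_factory=set)

    def classes_of(self, entity):
        """Return the asserted class of an entity followed by its ancestors."""
        result = []
        cls = self.instance_of.get(entity)
        while cls is not None:
            result.append(cls)
            cls = self.is_a.get(cls)
        return result

    def find(self, attribute, value):
        """Search entities by attribute value."""
        return sorted(e for (a, e, v) in self.attributes if a == attribute and v == value)


kb = KnowledgeBase()
# Assumed nesting of classes, for illustration only.
kb.is_a.update({"river": "flowing body of water",
                "flowing body of water": "body of water"})
kb.instance_of["Thames"] = "river"
kb.relations.add(("part-of", "Thames", "UK"))
kb.attributes.update({
    ("name", "Thames", "Thames"),
    ("latitude", "Thames", 51.50),
    ("longitude", "Thames", 0.61),
    ("length_km", "Thames", 346),
})

print(kb.classes_of("Thames"))     # ['river', 'flowing body of water', 'body of water']
print(kb.find("length_km", 346))   # ['Thames']
```

The same structure can be reused for another domain by replacing the classes, relations and attributes with the terminology identified in the exercise.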

Some reference material

[Ranganathan, 1967] S. R. Ranganathan, Prolegomena to library classification, Asia Publishing House.
[Gruber, 1993] A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.
[Pollock, 2002] Integration's Dirty Little Secret: It's a Matter of Semantics. Whitepaper, The Interoperability Company.
[Guarino and Welty, 2002] Guarino, N., Welty, C. (2002). Evaluating ontological decisions with OntoClean. Communications of the ACM, 45(2), 61-65.
[Uschold and Gruninger, 2004] Ontologies and semantics for seamless connectivity. SIGMOD Rec., 33(4), 58-64.
[Varzi, 2006] Varzi, A. (2006). A note on the transitivity of parthood. Applied Ontology, 1(2), 141-146.
[Giunchiglia et al., 2009] Faceted Lightweight Ontologies. In: Conceptual Modeling: Foundations and Applications, LNCS Springer.
[Giunchiglia et al., 2012a] A facet-based methodology for the construction of a large-scale geospatial ontology. Journal on Data Semantics, 1(1), pp. 57-73.
[Giunchiglia et al., 2012b] Domains and context: first steps towards managing diversity in knowledge. Journal of Web Semantics, special issue on Reasoning with Context in the Semantic Web.
[Giunchiglia et al., 2014] From Knowledge Organization to Knowledge Representation. Knowledge Organization, 41(1), 44-56.
[Tawfik et al., 2014] A Collaborative Platform for Multilingual Ontology Development. International Conference on Knowledge Engineering and Ontology.
[Ganbold et al., 2014] An Experiment in Managing Language Diversity Across Cultures.
