
AI supported Topic Modeling using KNIME Workflows*

Jamal Al Qundus, Silvio Peikert and Adrian Paschke

Fraunhofer-Institut FOKUS, Berlin, Germany
{jamal.al.qundus, silvio.peikert, adrian.paschke}@fokus.fraunhofer.de

Abstract. Topic modeling algorithms traditionally model topics as lists of weighted terms. These topic models can be used effectively to classify texts or to support text mining tasks such as text summarization or fact extraction. The general procedure relies on statistical analysis of term frequencies. The focus of this work is on the implementation of knowledge-based topic modeling services in a KNIME (https://www.knime.com/) workflow. A brief description and evaluation of the DBpedia (https://www.dbpedia.org/) based enrichment approach and a comparative evaluation of enriched topic models are outlined based on our previous work. DBpedia Spotlight (https://www.dbpedia-spotlight.org/) is used to identify entities in the input text, and information from DBpedia is used to extend these entities. We provide a workflow developed in KNIME that implements this approach and compare the results of topic modeling supported by knowledge base information to traditional LDA. This topic modeling approach allows semantic interpretation both by algorithms and by humans.

Keywords: Topic Modeling, Workflow, Text Enrichment, Knowledge Base.

1 Introduction

Recent developments related to the Semantic Web have made knowledge from the web available as machine-readable ontologies. Links and vocabulary mappings between public ontologies enable algorithms to make use of knowledge from the web as linked open data. One of the most popular public knowledge repositories is DBpedia. The DBpedia project extracts structured data from Wikipedia and makes it accessible as a knowledge base via a SPARQL interface [1].

Topic modeling performs analysis on texts to identify topics. These topic models are used to classify documents and to support further algorithms in performing context-adaptive feature, fact and relation extraction.

* This work has been partially supported by the "Wachstumskern Qurator – Corporate Smart Insights" project (03WKDA1F) funded by the German Federal Ministry of Education and Research.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

While Latent Dirichlet Allocation (LDA) [2], Pachinko Allocation [3] and Probabilistic Latent Semantic Analysis (PLSA) [4] traditionally perform topic modeling by statistical analysis of co-occurring words, the approaches in [1], [5] and [6] integrate semantics into LDA. These approaches improve word-based topic modeling by introducing semantics from knowledge bases. This reduces perplexity issues arising from ambiguous terms and produces topic models that link directly to the knowledge base. Topic models created using a knowledge base are easier for humans to understand than topic models created exclusively by means of statistics.

This proof-of-concept work applies the method from [6] to perform knowledge base supported topic modeling using DBpedia. The presented approach to topic modeling is based on the semantics of entities identified in the document. The basic idea of LDA, to perform analysis based on term frequency, is maintained. The extension of [6] is to enrich the input using a knowledge base so that LDA operates on semantics. Therefore, the DBpedia Spotlight API is used to recognize entities, and additional information on these entities is retrieved via the DBpedia API endpoint. During a preprocessing stage the text is tagged with semantic annotations from the knowledge base, and the tagged text is used as input to the LDA algorithm. This results in improved topic models due to more context and fewer ambiguities in the input.

2 Architecture

The text to be examined is transferred to the DBpedia Spotlight API. Spotlight returns a JSON object containing all entities recognized in the text. Additional information on these entities is retrieved using the DBpedia API. The response for each entity is a set of properties, e.g. tags, URI, type and hypernym (see Section 4 for details). A tagger combines these sets with the corresponding entities in the text. The result is processed by LDA, which performs topic modeling and provides the result in two formats: as a table with weights and as an image visualization. The architecture of the processing pipeline is illustrated in Fig. 1, and a code sketch of the same data flow follows below.

Fig. 1. Architecture of the topic modeling pipeline using DBpedia Spotlight, DBpedia and the LDA algorithm.
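As a minimal illustration of this architecture, the following Python sketch mirrors the four processing steps outside of KNIME. The stage functions annotate_with_spotlight, fetch_entity_properties, tag_text and run_lda are hypothetical names, not part of the workflow itself; candidate implementations are sketched stage by stage in Section 4.

```python
# Minimal sketch of the pipeline in Fig. 1. The four stage functions are
# hypothetical names; possible implementations are sketched in Section 4.

def topic_model_pipeline(text: str):
    # Stage 1: entity recognition via DBpedia Spotlight (JSON response).
    entities = annotate_with_spotlight(text, confidence=0.5, support=0)

    # Stage 2: retrieve tags (types, hypernym, ...) per entity from DBpedia.
    properties = {e["surfaceForm"]: fetch_entity_properties(e["uri"])
                  for e in entities}

    # Stage 3: combine the property sets with the entity mentions in the text.
    tagged_text = tag_text(text, properties)

    # Stage 4: run LDA on the enriched text; the result is a topic model
    # given as a list of weighted terms.
    return run_lda(tagged_text)
```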

3 KNIME

The KNIME information miner is an open-source modular platform for visualization and selective execution of data pipelines. KNIME is a powerful data analysis tool that enables simple integration of algorithms, data manipulation and visualization methods in the form of modules or nodes [7].

There are a number of additional services in the KNIME ecosystem, e.g. KNIME Server, which connects the different actors (services, teams and individuals) in a central place and thus offers a platform for collaboration. The KNIME Workflow Hub makes workflows publicly available on the KNIME Examples Server. Members of the user community can share workflows and receive ratings and comments from other users. In our work with the KNIME analytics platform we have implemented and performed various modeling methods to offer complete services around semantic analysis.

4 Workflow for Topic Modeling

The workflow developed in this work consists of four stages: (1) reading the text in consideration and entity recognition using the DBpedia Spotlight API, (2) getting the properties of the entities included in the JSON, (3) tagging the text by combining the entities and the related properties gained from the previous phase, and (4) text cleaning and topic modeling using the LDA algorithm. Fig. 2 gives an overview of the developed workflow and its modularization into four stages.

Fig. 2. The workflow developed for topic modeling.

The Reading stage includes a Table Creator node, which provides the settings of the parameters used to request entities from DBpedia Spotlight. We use confidence 0.5 and support 0 to get as many entities as possible from the text. A File Reader node reads the text from a path, and a String Manipulation node repairs the text, e.g. by replacing double spaces. A Column Appender node combines the data provided by the Table Creator and File Reader nodes. A second String Manipulation node prepares the URL request to DBpedia Spotlight, which is then sent by the Get Request node. The output of this stage is a table of the text entities recognized by DBpedia Spotlight, as shown in Fig. 3. A sketch of the equivalent request is given below.

Fig. 3. The response JSON object of DBpedia Spotlight.
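For readers without KNIME, a minimal Python equivalent of this stage could look as follows. It assumes the public DBpedia Spotlight endpoint (https://api.dbpedia-spotlight.org/en/annotate); the confidence and support values mirror the Table Creator settings above.

```python
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # public endpoint

def annotate_with_spotlight(text: str, confidence: float = 0.5, support: int = 0):
    """Recognize DBpedia entities in `text` via DBpedia Spotlight.

    confidence=0.5 and support=0 mirror the Table Creator settings,
    i.e. return as many entities as possible.
    """
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence, "support": support},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # The "Resources" array holds one object per recognized entity,
    # with keys such as "@URI", "@surfaceForm" and "@types".
    resources = response.json().get("Resources", [])
    return [{"uri": r["@URI"], "surfaceForm": r["@surfaceForm"]}
            for r in resources]
```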

The Get properties stage contains a Column Filter node to extract the entities column from the table and a Java Snippet node to filter resources from the JSON. The String to JSON, JSON to Table, Transpose and JSON to Table nodes put the column content into the format required for further processing, and a Column Filter node filters types and surface forms. A Java Snippet node sends an HTTP request containing a SPARQL query to the DBpedia API and retrieves the entities and the related tags. A Missing Value node deletes null values, and a Column Filter node filters surface forms (entities) and tags from the created table, which form the output of this stage as illustrated in Fig. 4. A sketch of such a query follows below.

Fig. 4. The table filtered by surface form and tags.
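The paper does not list the exact SPARQL query used by the Java Snippet node, so the following is only an assumed sketch: it retrieves the rdf:type values of an entity, plus the hypernym via the gold:hypernym predicate, from the public DBpedia endpoint (https://dbpedia.org/sparql).

```python
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"  # public DBpedia endpoint

def fetch_entity_properties(uri: str):
    """Retrieve tags (types and hypernym) for one DBpedia entity.

    The query is an assumption; the gold:hypernym predicate is one way
    to obtain the hypernym mentioned in Section 2.
    """
    query = f"""
        SELECT DISTINCT ?tag WHERE {{
          {{ <{uri}> a ?tag }}
          UNION
          {{ <{uri}> <http://purl.org/linguistics/gold/hypernym> ?tag }}
        }}
    """
    response = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=30,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    # Keep only local names, e.g. ".../ontology/Politician" -> "Politician",
    # and drop empty values (the Missing Value node's role above).
    tags = [b["tag"]["value"].rsplit("/", 1)[-1].rsplit("#", 1)[-1]
            for b in bindings]
    return [t for t in tags if t]
```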

The Tagging stage implements a loop that takes the original input text and the recognized entities with their tags, matches the entities with their mentions in the original text, and enriches the text with the tags as shown in Fig. 5. The loop consists of a Recursive Loop Start node to begin the loop, a Row Filter node to get the rows one by one, a Row Table to Variable node as a converter, a String Manipulation node as the tagger, and a Recursive Loop End node to return into the loop as long as there are still entries in the table. At the end of the loop a String to Document node converts the text into a document format and forwards it to the next stage.

Fig. 5. The tagged text.

In the text cleaning and topic modeling stage, the produced document is cleaned by Column Filter (a Column Filter is commonly needed because the output of most nodes includes, in addition to the result, its input, which is usually not needed any more), Number Filter, Punctuation Erasure, Stop Word Filter, Case Converter and Snowball Stemmer nodes. The preprocessed text is passed to the Topic Extractor node, which implements the LDA algorithm. LDA creates the topic model as a list of weighted terms, which are then visualized using the Color Manager and Tag Cloud nodes. A sketch of the tagging and modeling steps follows below.
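As a rough stand-in for these two stages, the following sketch appends the tags after each entity mention with a simple string replacement (the KNIME workflow does this with a recursive loop) and runs gensim's LDA in place of the Topic Extractor node. The cleaning nodes are only approximated by the tokenizer, so this is a simplified sketch rather than the workflow's exact behavior.

```python
import re
from gensim import corpora
from gensim.models import LdaModel

def tag_text(text: str, properties: dict) -> str:
    """Append the DBpedia tags after each entity mention, as in Fig. 5,
    e.g. 'Barack Obama' -> 'Barack Obama [Politician, Agent, ...]'.
    A plain replacement; the workflow uses a recursive loop instead."""
    for surface_form, tags in properties.items():
        text = text.replace(surface_form, f"{surface_form} [{', '.join(tags)}]")
    return text

def run_lda(tagged_text: str, num_topics: int = 1, num_words: int = 15):
    """Stand-in for the Topic Extractor node, using gensim's LDA.

    The tokenizer below (lowercasing, dropping non-letters and short
    words) only approximates the cleaning nodes listed above.
    """
    tokens = [t for t in re.findall(r"[a-z]+", tagged_text.lower())
              if len(t) > 2]
    dictionary = corpora.Dictionary([tokens])
    corpus = [dictionary.doc2bow(tokens)]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # Each topic is a list of weighted terms, as in Figs. 6 and 7.
    return lda.show_topics(num_topics=num_topics, num_words=num_words,
                           formatted=False)
```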

5 Evaluation

The focus of this work was on the implementation of the knowledge-based topic modeling services in a KNIME workflow. For a detailed description and evaluation of the DBpedia-based enrichment approach and the comparative assessment of enriched topic models we refer to our earlier work in [6] and [8]. In this paper we demonstrate the KNIME-based proof-of-concept implementation by comparing the results of topic modeling supported by knowledge base information to traditional LDA using the following text:

  Barack Obama is only passing through Germany on his trip to Europe later this week and does not plan to hold substantial talks with Angela Merkel. The White House views the chancellor as difficult and Germany is increasingly being left out of the loop.

This text is expanded with annotations from DBpedia as follows:

  Barack Obama [Barack Obama, Politician, Agent, President, Person, Politician] is only passing through Germany [Germany, Republic, Place, Country, Person, PopulatedPlace, Location], on his trip to Europe [Europe, Continent, Location, PopulatedPlace, Place, Continent] later this week and does not plan to hold substantial talks with Angela Merkel [Angela Merkel, Politician, Agent, Person, OfficeHolder]. The White House [White House, Residence, Location, Building, Place ...]

Fig. 6 shows the image visualization of weighted, normalized terms created by LDA without semantic annotations, and Fig. 7 the weighted, normalized terms obtained with the support of a knowledge base.

Fig. 6. The traditional LDA topic model.

Fig. 7. The knowledge base enriched topic model.

The results reflect the expectations. LDA provides a naive topic model for the original text comprising weighted, lemmatized terms from the input text, with only one term having a significantly higher weight than the other terms of the model. The knowledge base supported method creates a superior topic model that also contains weighted, lemmatized terms from the knowledge base which are not present in the input text. This topic model enables semantic interpretation by algorithms as well as by humans.

In particular, the enriched topic model enables algorithms to infer from a topic model linked to a knowledge base that the input text contains information about politics and actions of relevant officials, while a classification based on the traditional LDA topic model might result in a false classification as a geographical text.

6 Summary and Future Prospects

This proof-of-concept work developed a KNIME workflow to perform comprehensive topic modeling using a knowledge base. The use of information from a knowledge base is achieved by using the DBpedia Spotlight API for entity recognition and the DBpedia API to retrieve entity properties. The presented results show that the developed approach is applicable and delivers results containing more comprehensive insights into a text than statistical topic models based on words only. The created topic models can improve the results of various methods used for text mining tasks such as text classification or fact and relation extraction.

Topic modeling using knowledge bases is a step towards improved automated methods for knowledge base population.

Other methods in natural language processing might also be extendable by applying the idea of annotating text with information from knowledge bases. We expect improved results over word-based approaches for these tasks in future work, especially when analyzing small corpora.

References

[1] M. Allahyari and K. Kochut, 'Automatic topic labeling using ontology-based topic models', in 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), 2015, pp. 259–264.
[2] D. M. Blei, A. Y. Ng and M. I. Jordan, 'Latent Dirichlet allocation', J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[3] W. Li and A. McCallum, 'Pachinko allocation: DAG-structured mixture models of topic correlations', in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 577–584.
[4] T. Hofmann, 'Probabilistic latent semantic analysis', in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289–296.
[5] I. Hulpus, C. Hayes, M. Karnstedt and D. Greene, 'Unsupervised graph-based topic labelling using DBpedia', in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 2013, pp. 465–474.
[6] A. Todor, W. Lukasiewicz, T. Athan and A. Paschke, 'Enriching topic models with DBpedia', in OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", 2016, pp. 735–751.
[7] M. R. Berthold et al., 'KNIME – the Konstanz Information Miner: version 2.0 and beyond', ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 26–31, 2009.
[8] W. Lukasiewicz, A. Todor and A. Paschke, 'Human perception of enriched topic models', in Business Information Systems – 21st International Conference, BIS 2018, Berlin, Germany, 2018, pp. 15–29.