
Analyzing the Web from Start to Finish
Knowledge Extraction from a Web Forum using KNIME

Bernd Wiswedel
Tobias Kötter
Rosaria [email protected]

Copyright 2013 by KNIME.com AG all rights reserved

Table of Contents

Analyzing the Web from Start to Finish: Knowledge Extraction from a Web Forum using KNIME
Summary
Web Analytics and the Desire for Extra-Knowledge
The KNIME Forum
The Data
The Analysis
The “WebCrawler” Workflow
The HtmlParser Node from Palladian
The XML Parser Node
Web Analytics and Simple Statistics
Forum Growth over the Years
Forum Efficiency and Scalability
Forum Ownership
Topic Classification
Unzipping Files and the “drop” folder
Text Pre-processing
Model Training and Evaluation: the Decision Tree Ensemble
Topic Shift Detection
Identify Contents and Users
The Tag Cloud
Build the User Network
Build the Summary Report
Moving into Production
Selective Data Reading
Controlling the KNIME Workflows via Quickform Nodes
Exploiting the KNIME Server to Share Common Metanodes
Running the Workflows from the Web Portal
Conclusions
References

Summary

In recent years, the web has become a source for almost every kind of information. Web content is used more and more frequently to learn about the preferences of customers and of people in general. This whitepaper implements all parts of the process of extracting information from the web using KNIME. In order to produce a practical example and at the same time to learn more about the KNIME community users, this analysis is focused on data from the KNIME Forum. The analysis is divided into four parts.

The first workflow is a web crawler. It is dedicated to web content extraction and to reorganizing the data into a form suitable for the subsequent analysis. It has often been said that one of the best features of KNIME is the community behind it. Indeed, the download of the forum content is outsourced to one of the community nodes. The subsequent steps for data extraction and reorganization, by contrast, rely mainly on the KNIME XML node collection.

A few basic statistical measures are calculated to gain insights into the forum performance as an indirect measure of the KNIME community performance. Here users can be posters and commenters at the same time. While the total number of users and of posts over time gives a measure of the community growth, the average number of comments for each post can be considered a measure of forum answer efficiency.

The topics discussed in the KNIME Forum represent another big source of information: they clearly describe the evolution of the users’ interests and wishes over time. A full workflow has been implemented to classify topics and detect topic shifts in time, using text mining techniques to describe the post contents and predictive analytics to classify them. The results show that popular topics have always been “data manipulation” and “data mining”, i.e. tasks where KNIME quality is well known.
They also show that the interest in the “flow control” category has been growing strongly and steadily over time, reinforcing the choice to keep improving the nodes in this category even further.

Finally, a fourth workflow examines how the forum users interact with each other in different discussion groups. Depending on the discussed topics, experts emerge quickly here from the user network graph.

These are four very popular areas in the analysis of web contents which are easily exportable to other business contexts: web crawling, web analytics, topic detection, and description of user interaction. All workflows are available on the KNIME public server and the KNIME software can be downloaded from www.knime.com.

Web Analytics and the Desire for Extra-Knowledge

The analysis of web sites and especially of forums and social networks has become necessary for many companies, in order to know what their customers wish, need, and use. However, the extraction of knowledge from web sources is a complex task and involves many different steps in a wide variety of disciplines. Often, employees as well as tools are specialized in only one of those disciplines and therefore can carry out only one of the tasks required for the knowledge extraction from the web, be this web crawling, database storage, text analytics for topic detection, sentiment analysis, user network representation, statistics, web analytics, or other such tasks.

The ambitious goal of this whitepaper is to implement and describe a full approach to the analysis of web forums and social media from the beginning to the end, through:

- data collection via a web crawling algorithm and XML parsing functions
- simple statistics to see the evolution of the forum in time
- a topic detection application using text mining techniques
- a representation of the user network via network analytics
- the productization of all these steps on a KNIME Server

KNIME (www.knime.com) has been selected as the tool for the implementation of this project. The choice fell on KNIME, not only because KNIME has a graphical interface, is easy to use, and is reliable, but also because it contains all the necessary tools for all of the steps required by this project. Indeed, many nodes are available in the KNIME Core for basic statistics. A full suite can be installed from the KNIME extensions for text mining as well as for network analytics, containing all of the nodes you need to perform sentiment analysis, topic detection, and network graph representations. Finally, a web crawler node has been made available via a community extension provided by the Palladian project (http://www.palladian.ws/).

All these features are powerful already by themselves, but combined with all the other data processing, data mining, and reporting functionalities, they turn KNIME into a very powerful tool for any analytics task, and in particular for extracting data and knowledge from the web communities.

For this whitepaper we focused on the KNIME Forum (http://tech.knime.org/forum) where community users exchange questions and answers about technical KNIME topics. The goal was to gain insights about the evolution in time of the KNIME software and about the user wishes and needs for the future.

The KNIME open source platform can be downloaded from the KNIME site at www.knime.com. All workflows reported in this whitepaper were developed using KNIME 2.7.
However, an updated version for KNIME 2.8, including the forum data and the metanode template, is available from the KNIME public server at knime://publicserver/050 Applications/050007 ForumAnalysis.

The KNIME Forum

The goal of this whitepaper is to show how to extract data and therefore knowledge from web sites. As a data mining company, we decided it would be best if we actually learned something from our own data. So, we centered the project on the KNIME Forum data.

The KNIME Forum (http://tech.knime.org/forum) is the main community place for KNIME users. Here KNIME users post questions to get help from other KNIME users to solve some particular problem. Usually, at least one KNIME user quickly knows the answer and is willing to help with the problem at hand.

You need to register to take part in the KNIME Forum. After logging in, you can place your own posts or comment on other users’ posts. The Forum is divided into 5 topic-dependent parts, with posts about:

- KNIME Core and Extensions
- KNIME Labs
- Development of new KNIME nodes
- KNIME Server
- Community Contributions

Each one of these sub-forums consists of many categories, like KNIME General, Text Processing, Third Party nodes, and so on (see Fig. 1).

Figure 1. The KNIME Forum

The Data

In order to retrieve all posts and comments from the KNIME Forum we could have accessed the underlying database storing the forum content. However, this would have been a very specialized approach, which is not universally applicable to other systems and hence not of interest in the scope of this article. To be able to extract data from any web site, with no access to the underlying database, we need to crawl and read all (linked) pages.

There are a number of open source as well as commercial web crawling tools available. One option could have been to start and run one of those web crawling tools via the “External Tool” node from inside a KNIME workflow. However, as often happens with KNIME, after browsing around the KNIME community a bit, we discovered that a web reader node had already been developed and made available to KNIME users for free by Palladian (http://www.palladian.ws/). The node reads a given online HTML web site and extracts all information into an XML format suitable for further processing.

The KNIME Forum actually hosts five discussion groups: KNIME, KNIME Labs, Community Contributions, Enterprise Tools, and KNIME Development.

The most visited place is of course KNIME, since this is where all general posts about KNIME usage are.

The KNIME Labs forum page contains all questions about nodes in the KNIME Labs category. Just as a reminder, KNIME Labs contains a number of new node extensions, which are not yet part of the standard KNIME distribution but are still made available as a sneak preview for the KNIME users.

Community Contributions hosts all discussions about community developed nodes.

Recently a full discussion group has been opened for the KNIME Enterprise Tools. Indeed, as the KNIME Enterprise Tools become more widely adopted, a category for discussions on this topic has become necessary.

The last discussion group, KNIME Development, hosts all posts about standards and guidelines in node development and the KNIME tools available to make the development of custom nodes easier.

Each discussion group separates its posts into a number of categories. The KNIME discussion group, for example, contains 5 categories: KNIME Users, KNIME General, R Statistics nodes and Integration, KNIME Reporting, and Third Party. In total, the KNIME Forum hosts around 20 categories.

An individual forum discussion thread consists of an initial post and a number of comments (Fig. 2); a thread can span several pages. The biggest challenge, after pulling down all the web pages, is to resynchronize each post with all its comments.

Figure 2. An example of a forum thread

The Analysis

The analysis of the content of the KNIME Forum has been split into 4 parts. The corresponding workflows can be downloaded from the KNIME public server named EXAMPLES, which is available in the KNIME Explorer panel of the KNIME Workbench.
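The resynchronization of posts and comments described in The Data can be sketched in a few lines of Python. This is only an illustration, not the KNIME workflow itself: the tuple layout (thread_id, page, position, text), the function names, and the sample entries are all hypothetical. The sketch assumes that each crawled entry carries its thread identifier, the page it was found on, and its position on that page, and that the first entry of a thread is the initial post. The same structure also yields the average number of comments per post, which the Summary proposes as a measure of forum answer efficiency.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Thread:
    """One forum thread: the initial post plus its comments, in order."""
    post: str = ""
    comments: list = field(default_factory=list)


def rebuild_threads(entries):
    """Group crawled entries by thread and restore their order.

    Each entry is a hypothetical (thread_id, page, position, text) tuple.
    The first entry of a thread is assumed to be the initial post.
    """
    grouped = defaultdict(list)
    for thread_id, page, position, text in entries:
        grouped[thread_id].append((page, position, text))

    threads = {}
    for thread_id, items in grouped.items():
        # Order by page number first, then by position on the page.
        items.sort(key=lambda item: (item[0], item[1]))
        texts = [text for _, _, text in items]
        threads[thread_id] = Thread(post=texts[0], comments=texts[1:])
    return threads


def average_comments_per_post(threads):
    """Mean number of comments per initial post."""
    return mean(len(thread.comments) for thread in threads.values())


# Entries arrive in crawl order, which need not match thread order.
crawled = [
    ("t1", 2, 0, "comment C"),
    ("t2", 1, 0, "question 2"),
    ("t1", 1, 0, "question 1"),
    ("t1", 1, 1, "comment A"),
    ("t1", 1, 2, "comment B"),
]
threads = rebuild_threads(crawled)
```

Sorting on the (page, position) pair is what reattaches a comment found on a later page to the post that opened the thread, regardless of the order in which the pages were downloaded.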