Introduction to Python - Kansas State University

Parsing HTML
Topic 3, Chapter 7
Network Programming
Kansas State University at Salina

Picking information from an HTML page

  • A difficult problem
  • HTML defines page layout, not content (advantage: XML)
  • Very useful because of the volume of data available
  • If the format of the page changes, your program is broken.

HTML Tokens

  • Definition: a token is one piece of information in an HTML-formatted page
      • HTML tag: usually relates only to formatting
      • URL or image reference
      • Textual information
  • Must look at several tokens to determine the context of the data
  • The start-tag, end-tag structure leads parsing code to use finite state machines and stacks.
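The remark that the start-tag, end-tag structure pushes parsing code toward stacks can be illustrated with a short sketch. It assumes Python 3, where the standard-library parser lives in html.parser; the class name ContextParser and the sample page are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class ContextParser(HTMLParser):
    """Record each piece of text together with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open tags
        self.found = []    # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start-tag; ignore stray end-tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


# Hypothetical sample page, loosely modeled on the token listing in these notes.
page = """<html><head><title> Tim Bower </title></head>
<body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>"""

parser = ContextParser()
parser.feed(page)
parser.close()
for path, text in parser.found:
    print(path, "->", text)
```

The same text, "Tim Bower", is reported once under html/head/title and once under html/body/table/tr/td/h1, which is the context a single token cannot supply. One known limitation of this sketch: tags that never receive an end-tag (such as a bare br) stay on the stack until an enclosing tag closes.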

<html><head>
 <title> Tim Bower </title>
</head>

<body bgcolor="lightyellow">

<table> <tr>
<td>
<h1>Tim Bower</h1>

{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}

{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}

Two main programming strategies

  • The call-back approach (HTMLParser, shown in the text book)
      • Define your own class that extends the HTMLParser class
      • Nice use of inheritance and polymorphism
      • Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
  • The document tree approach
      • Parser builds a tree (data structure object) based on the page contents
      • You iterate through the tree, or a list of tokens taken from the tree, looking for the desired data.

HTMLParser

    import sys
    from HTMLParser import HTMLParser   # Python 2 module; Python 3 renamed it html.parser

    class TitleParser(HTMLParser):
        def __init__(self):
            self.title = ''
            self.readingtitle = 0        # nonzero while inside the title element
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            if tag == 'title':
                self.readingtitle = 1

        def handle_data(self, data):
            if self.readingtitle:
                self.title += data

        def handle_endtag(self, tag):
            if tag == 'title':
                print "*** %s ***" % self.title
                self.readingtitle = 0

    # The page to parse is named on the command line.
    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())
    fd.close()

Argh! HTMLParser is fragile and hard to debug.

    Traceback (most recent call last):
      File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\ Topic 3 - Web\weatherParser.py", line 258, in <module>
        parser.feed(data)
      File "C:\Python25\lib\HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "C:\Python25\lib\HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParseError: malformed start tag, at line 120, column 477

html5lib

  • Found on the Python Package Index
  • Install setuptools, then use Python to install html5lib (see the README file). Both are on K-State Online.
  • Advantages:
      • Robust, standards-based parser
      • Filtering data after the page is parsed is easier to follow and debug than the call-back approach
  • Disadvantage:
      • Documentation of the API for traversing the tree is lacking

html5lib Usage

Build the tree:

    import html5lib
    from html5lib import treebuilders

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    f = open("weather.html", "r")
    dom_tree = p.parse(f)
    f.close()
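For readers without html5lib installed, the document tree approach can be tried with the standard library alone. The sketch below is not the html5lib API: it assumes Python 3 and uses xml.dom.minidom, which is an XML parser and therefore only accepts well-formed markup. The sample page and the helper name text_of are hypothetical.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample page. minidom is an XML parser, so it
# rejects the sloppy markup that html5lib is designed to tolerate.
page = """<html><head><title>Tim Bower</title></head>
<body bgcolor="lightyellow"><table><tr><td><h1>Tim Bower</h1>
</td></tr></table></body></html>"""


def text_of(node):
    """Concatenate the text found beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return "".join(parts)


dom_tree = minidom.parseString(page)

# Knowing the page structure, jump to the elements of interest by tag name.
title = text_of(dom_tree.getElementsByTagName("title")[0]).strip()
heading = text_of(dom_tree.getElementsByTagName("h1")[0]).strip()
bgcolor = dom_tree.getElementsByTagName("body")[0].getAttribute("bgcolor")

print(title, "|", heading, "|", bgcolor)
```

The tree is built once and then queried, so the filtering logic sits in ordinary code after parsing rather than being spread across call-back methods.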

Loop through tokens:

    from html5lib import treewalkers

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)
    passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
                u'strong', u'br', u'img',
                u'dl', u'dt', u'dd']
    for token in stream:
        # Don't show uninteresting tags
        if token.has_key('name'):
            if token['name'] in passtags:
                continue
        print token

The DOM tree alternative

  • The DOM tree may be used directly.
  • Not documented with html5lib, but the xml.dom package is standard with Python.
  • DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML.
  • Walk through the tree by examining the child nodes of each node.
  • With knowledge of the page structure, you may be able to go almost directly to the desired information.
  • See chapter 8 and the posted file DOMtry.py.

html5lib tokens

  • The stream of tokens can be looped over like a list
  • Each token is a dictionary
      • token['data']
          • String (Unicode)
          • Empty list
          • List of tuples for formatting attributes
      • token['type'] (StartTag, EndTag, Characters, SpaceCharacters)
      • token['name']: name of the start or end tag (table, tr, td, h1, br, ul, li, ...)
  • See the example of tokens shown earlier

html5lib token parsing

    doingTitle = False
    for token in stream:
        if token.has_key('name'):
            if token['name'] in passtags:
                continue
            else:
                tName = token['name']
        tType = token['type']
        if tType == 'StartTag':
            if tName == u'title':
                title = ''
                doingTitle = True      # entering the title element
        if tType == 'EndTag':
            if tName == u'title':
                print "*** %s ***\n" % title
                doingTitle = False     # leaving the title element
        if tType == 'Characters':
            if doingTitle:
                title += token['data']
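The loop above needs html5lib and Python 2 to run. As a minimal sketch of the same two-state machine in Python 3 syntax, the version below runs over a hand-copied subset of the token listing from these notes, so no parser is required. The function name extract_title is hypothetical.

```python
# Tokens hand-copied from the listing in these notes (a subset), written as
# plain Python 3 dictionaries so that no parser is needed to run the loop.
tokens = [
    {'data': [], 'type': 'StartTag', 'name': 'html'},
    {'data': [], 'type': 'StartTag', 'name': 'head'},
    {'data': '\n ', 'type': 'SpaceCharacters'},
    {'data': [], 'type': 'StartTag', 'name': 'title'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
    {'data': [], 'type': 'EndTag', 'name': 'head'},
    {'data': [], 'type': 'StartTag', 'name': 'h1'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': [], 'type': 'EndTag', 'name': 'h1'},
]


def extract_title(stream):
    """Two-state machine: outside or inside the title element."""
    title = ''
    doing_title = False
    for token in stream:
        kind = token['type']
        name = token.get('name')      # only tag tokens carry a name
        if kind == 'StartTag' and name == 'title':
            title = ''
            doing_title = True
        elif kind == 'EndTag' and name == 'title':
            doing_title = False
        elif kind == 'Characters' and doing_title:
            title += token['data']
    return title


print("*** %s ***" % extract_title(tokens))
```

Because the spaces around the title arrive as SpaceCharacters tokens rather than Characters tokens, they are skipped, and the h1 text is ignored because the machine has already left the title state by then.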
