Processing XML data

This page follows the ElementTree tutorial in the Python documentation: https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
The examples below are written for Python 3.

A five-minute XML introduction for readers who are not familiar with XML: http://www.diveintopython3.net/xml.html#xml-intro

The examples use the following sample document, saved as country_data.xml:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Python has several XML parsers. This page uses ElementTree (the xml.etree.ElementTree module from the standard library).

>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('country_data.xml')
>>> root = tree.getroot()
>>> root.tag
'data'
>>> root.attrib
{}
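
If the sample has not been saved to a file, the same kind of tree can be built from a string. The sketch below uses fromstring(), which returns the root element directly. XML_TEXT is an illustrative name, and the document is a shortened copy of the sample above.

```python
import xml.etree.ElementTree as etree

# Shortened copy of the sample document; XML_TEXT is an illustrative name.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>"""

# fromstring() returns the root Element, so no getroot() call is needed.
root = etree.fromstring(XML_TEXT)
print(root.tag)    # data
print(len(root))   # 2 (number of direct children)
```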

>>> for ch in root:
...     print(ch.tag)
...
country
country
country

>>> for ch in root:
...     print(ch.attrib)
...
{'name': 'Liechtenstein'}
{'name': 'Singapore'}
{'name': 'Panama'}

Child elements can be accessed by index, because an element behaves like a list of its children:

>>> root[0][1].text
'2008'
>>> root[2][2].text
'13600'
>>> root[2][2].tag
'gdppc'
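
Index access depends on the order of the child elements in the file. When working out which index to use, it can help to print each position next to its tag. A minimal sketch, using only the Panama entry from the sample (XML_TEXT is an illustrative name):

```python
import xml.etree.ElementTree as etree

# Only the Panama entry from the sample; XML_TEXT is an illustrative name.
XML_TEXT = """<data>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
    </country>
</data>"""

root = etree.fromstring(XML_TEXT)
country = root[0]

# enumerate() pairs each child with the index that reaches it.
for index, child in enumerate(country):
    print(index, child.tag, child.text)
# 0 rank 68
# 1 year 2011
# 2 gdppc 13600
```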

You can iterate over elements by tag:

>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
... 
{'direction': 'E', 'name': 'Austria'}
{'direction': 'W', 'name': 'Switzerland'}
{'direction': 'N', 'name': 'Malaysia'}
{'direction': 'W', 'name': 'Costa Rica'}
{'direction': 'E', 'name': 'Colombia'}


>>> for cntry in root.iter('country'):
...     print(cntry.attrib["name"])
... 
Liechtenstein
Singapore
Panama

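The iter() method searches the element itself and all of its descendants, recursively, for the given tag. By contrast, findall() matches only direct children of the current element, find() returns the first matching child, and get() reads an attribute value.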
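
To make the recursive search concrete, the sketch below compares iter() with findall(), which only looks at direct children when given a plain tag name. It uses a shortened copy of the sample (XML_TEXT is an illustrative name).

```python
import xml.etree.ElementTree as etree

# Shortened copy of the sample document; XML_TEXT is an illustrative name.
XML_TEXT = """<data>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = etree.fromstring(XML_TEXT)

# iter() descends through every level of the tree.
nested = [n.get("name") for n in root.iter("neighbor")]
print(nested)   # ['Malaysia', 'Costa Rica', 'Colombia']

# findall() with a plain tag only checks direct children of root,
# and <neighbor> elements sit one level deeper.
direct = root.findall("neighbor")
print(direct)   # []

# A path expression lets findall() search all levels as well.
deep = root.findall(".//neighbor")
print(len(deep))   # 3
```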

>>> for cntry in root.findall("country"):
...     print(cntry.find("rank").text)
... 
1
4
68
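
Note that find() returns None when no child matches, so calling .text on the result would raise an AttributeError. The sketch below shows two ways to guard against that. The second country is made up for illustration and is not part of the sample file.

```python
import xml.etree.ElementTree as etree

# The "Example" country has no <rank>; it is hypothetical, added only
# to show the missing-element case.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
    </country>
    <country name="Example"/>
</data>"""

root = etree.fromstring(XML_TEXT)
results = []
for country in root.findall("country"):
    rank = country.find("rank")
    # Compare with None explicitly: an element without children is falsy.
    checked = rank.text if rank is not None else "n/a"
    # findtext() combines find() and .text, with a default for misses.
    shortcut = country.findtext("rank", default="n/a")
    results.append((country.get("name"), checked, shortcut))
    print(country.get("name"), checked, shortcut)
# Liechtenstein 1 1
# Example n/a n/a
```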

>>> for cntry in root.findall("country"):
...     print(cntry.get("name"))
... 
Liechtenstein
Singapore
Panama
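
The calls above can be combined to turn each country into a plain Python dictionary. This is one possible sketch, not the only way to structure the data. The names XML_TEXT and records are illustrative, and the document is a shortened copy of the sample.

```python
import xml.etree.ElementTree as etree

# First two countries of the sample; XML_TEXT is an illustrative name.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = etree.fromstring(XML_TEXT)

records = []
for country in root.findall("country"):
    records.append({
        "name": country.get("name"),              # attribute value
        "rank": int(country.findtext("rank")),    # element text, as a number
        "year": int(country.findtext("year")),
        "gdppc": int(country.findtext("gdppc")),
        "neighbors": [n.get("name") for n in country.findall("neighbor")],
    })

for record in records:
    print(record["name"], record["gdppc"], record["neighbors"])
# Liechtenstein 141100 ['Austria', 'Switzerland']
# Singapore 59900 ['Malaysia']
```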

Processing XML pages from the web

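import urllib.request
import xml.etree.ElementTree as etree

# Download the XML document; read() returns bytes
page = urllib.request.urlopen("http://www.thomas-bayer.com/sqlrest/CUSTOMER/3/")
content = page.read()
# Decode the bytes to text before parsing
content_string = content.decode("utf-8")
# fromstring() returns the root element directly
root = etree.fromstring(content_string)
for child in root:
    print(child.tag)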

For this URL the response looks like this:

<CUSTOMER xmlns:xlink="http://www.w3.org/1999/xlink">
<ID>3</ID>
<FIRSTNAME>Michael</FIRSTNAME>
<LASTNAME>Clancy</LASTNAME>
<STREET>542 Upland Pl.</STREET>
<CITY>San Francisco</CITY>
</CUSTOMER>

The script prints the tag of each child element:

ID
FIRSTNAME
LASTNAME
STREET
CITY
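
Once the response is parsed, the child tags and their text can be collected into a dictionary. The sketch below runs offline on a copy of the response shown above, so it does not depend on the service being reachable. The helper names element_to_dict and fetch_xml are hypothetical, and the timeout value is an arbitrary choice.

```python
import urllib.request
import xml.etree.ElementTree as etree


def element_to_dict(element):
    """Map each direct child's tag to its text (hypothetical helper)."""
    return {child.tag: child.text for child in element}


def fetch_xml(url, timeout=10):
    """Download a URL and parse the body as XML (hypothetical helper).

    fromstring() accepts bytes, so no explicit decode step is needed.
    """
    with urllib.request.urlopen(url, timeout=timeout) as page:
        return etree.fromstring(page.read())


# Copy of the response shown above, used here instead of a live request.
SAMPLE = b"""<CUSTOMER xmlns:xlink="http://www.w3.org/1999/xlink">
<ID>3</ID>
<FIRSTNAME>Michael</FIRSTNAME>
<LASTNAME>Clancy</LASTNAME>
<STREET>542 Upland Pl.</STREET>
<CITY>San Francisco</CITY>
</CUSTOMER>"""

customer = element_to_dict(etree.fromstring(SAMPLE))
print(customer["FIRSTNAME"], customer["LASTNAME"])   # Michael Clancy
print(customer["CITY"])                              # San Francisco
```

With network access, element_to_dict(fetch_xml(url)) would follow the same path for a live page, assuming the service returns a document of this shape.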
