How to Parse XML in Python? Multiple Methods Covered
Extensible Markup Language (XML) is a widely used format for storing and exchanging structured data. XML files are commonly used to represent hierarchical data, such as configuration files, data interchange formats, web service responses, and web sitemaps.
Parsing XML files in Python is a common task, especially when you automate manual processes such as handling data retrieved from web APIs or web scraping.
In this article, you'll learn about some of the libraries that you can use to parse XML in Python, including the ElementTree module, the lxml library, minidom, Simple API for XML (SAX), and untangle.
Key Concepts of an XML File
Before you learn how to parse XML in Python, you must understand what XML Schema Definition (XSD) is and what elements make up an XML file. This understanding can help you select the appropriate Python library for your parsing task.
XSD is a schema specification that defines the structure, content, and data types allowed in an XML document. It serves as a syntax for validating the structure and content of XML files against a predefined set of rules.
An XML file usually includes a namespace, a root element, attributes, elements, and text content, which collectively represent structured data.
For example, the demlon sitemap has the following XML structure:
urlset is the root element.
<urlset xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"> ties the urlset element to the sitemap schema. Strictly speaking, the namespace itself is declared with the xmlns attribute, while xsi:schemaLocation points to the XSD that describes it. All elements under urlset must conform to the schema outlined by this namespace.
url is the first child of the root element.
loc is the child element of the url element.
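To make these roles concrete, here is a minimal sketch that parses a small sitemap-like document with the built-in ElementTree module. The sample document and its example.com URLs are hypothetical stand-ins, not the actual demlon sitemap.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/first-page</loc>
  </url>
  <url>
    <loc>https://example.com/second-page</loc>
  </url>
</urlset>"""

root = ET.fromstring(SAMPLE_SITEMAP)

# ElementTree stores the namespace inside the tag as "{uri}localname".
root_tag = root.tag
first_url = root[0]       # first child of the root element
first_loc = first_url[0]  # child element of the url element

print(root_tag)
print(first_url.tag)
print(first_loc.tag, first_loc.text)
```

Because the namespace becomes part of every tag name, the examples in the rest of this article spell out the full namespace URI when searching for elements.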
Now that you know a little more about XSD and XML file elements, let’s use that information to help parse an XML file with a few libraries.
Various Ways to Parse XML in Python
For demonstration purposes, you'll use the demlon sitemap for this tutorial, which is available in XML format. In the following examples, the demlon sitemap content is fetched using the Python requests library.
The Python requests library is not built-in, so you need to install it before proceeding. You can do so using the following command:
pip install requests
ElementTree
The ElementTree XML API provides a simple and intuitive API for parsing and creating XML data in Python. It’s a built-in module in Python’s standard library, which means you don’t need to install anything explicitly.
For example, you can use the findall() method to find all the url elements from the root and print the text value of the loc element, like this:
import xml.etree.ElementTree as ET
import requests

url = 'https://demlon.com/post-sitemap.xml'
response = requests.get(url)
if response.status_code == 200:
    root = ET.fromstring(response.content)
    for url_element in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
        loc_element = url_element.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
        if loc_element is not None:
            print(loc_element.text)
else:
    print("Failed to retrieve XML file from the URL.")
All the URLs in the sitemap are printed in the output:
https://demlon.com/case-studies/powerdrop-case-study
https://demlon.com/case-studies/addressing-brand-protection-from-every-angle
https://demlon.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://demlon.com/case-studies/the-seo-transformation
https://demlon.com/case-studies/data-driven-automated-e-commerce-tools
https://demlon.com/case-studies/highly-targeted-influencer-marketing
https://demlon.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://demlon.com/case-studies/workplace-diversity-facilitated-by-online-data
https://demlon.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://demlon.com/case-studies/data-intensive-analytical-solutions
https://demlon.com/case-studies/canopy-advantage-solutions
https://demlon.com/case-studies/seamless-digital-automations
ElementTree is a user-friendly way to parse XML data in Python, featuring a straightforward API that makes it easy to navigate and manipulate XML structures. However, ElementTree does have its limitations; it lacks robust support for schema validation and is not ideal if you need to ensure strict adherence to a schema specification before parsing.
If you have a small script that reads an RSS feed, the user-friendly API of ElementTree would be a useful tool for extracting titles, descriptions, and links from each feed item. However, if you have a use case with complex validation or massive files, it would be better to consider another library like lxml.
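As a rough illustration of the RSS use case, the following sketch pulls the title, link, and description out of each feed item with ElementTree. The feed content and the extract_items helper are hypothetical; a real script would first download the feed, for example with requests.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Intro text</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>More text</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return a list of dicts with the title, link, and description of each item."""
    root = ET.fromstring(rss_text)
    items = []
    # iter() walks the whole tree, so items are found at any depth.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


feed_items = extract_items(SAMPLE_RSS)
for entry in feed_items:
    print(entry["title"], "->", entry["link"])
```

Using findtext() with a default keeps the loop from failing when an item omits one of the optional child elements.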
lxml
lxml is a fast, easy-to-use, and feature-rich library for parsing XML files in Python; however, it's not part of Python's standard library. While some Linux and Mac platforms have the lxml package already installed, other platforms need manual installation.
lxml is distributed via PyPI, and you can install lxml using the following pip command:
pip install lxml
Once installed, you can use lxml to parse XML files using various API methods, such as find(), findall(), findtext(), get(), and get_element_by_id().
For instance, you can use the findall() method to iterate over the url elements, find their loc elements (which are child elements of the url element), and then print the location text using the following code:
from lxml import etree
import requests

url = "https://demlon.com/post-sitemap.xml"
response = requests.get(url)
if response.status_code == 200:
    root = etree.fromstring(response.content)
    # Tags are qualified with the sitemap namespace in {uri}name form.
    for url_element in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
        loc_element = url_element.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
        # Skip entries without a loc child or with empty text.
        if loc_element is not None and loc_element.text:
            print(loc_element.text.strip())
else:
    print("Failed to retrieve XML file from the URL.")
The output displays all the URLs found in the sitemap:
https://demlon.com/case-studies/powerdrop-case-study
https://demlon.com/case-studies/addressing-brand-protection-from-every-angle
https://demlon.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://demlon.com/case-studies/the-seo-transformation
https://demlon.com/case-studies/data-driven-automated-e-commerce-tools
https://demlon.com/case-studies/highly-targeted-influencer-marketing
https://demlon.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://demlon.com/case-studies/workplace-diversity-facilitated-by-online-data
https://demlon.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://demlon.com/case-studies/data-intensive-analytical-solutions
https://demlon.com/case-studies/canopy-advantage-solutions
https://demlon.com/case-studies/seamless-digital-automations
So far, you’ve learned how to find elements and print their value. Now, let’s explore schema validation before parsing the XML. This process ensures that the file conforms to the specified structure defined by the schema.
The XSD for the sitemap looks like this:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    elementFormDefault="qualified"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xs:element name="urlset">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="url" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="url">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="loc" type="xs:anyURI"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
To use this XSD for schema validation, copy it manually into a file named schema.xsd.
To validate the XML file using this XSD, use the following code:
from lxml import etree
import requests

url = "https://demlon.com/post-sitemap.xml"
response = requests.get(url)
if response.status_code == 200:
    root = etree.fromstring(response.content)
    try:
        print("Schema Validation:")
        schema_doc = etree.parse("schema.xsd")
        schema = etree.XMLSchema(schema_doc)
        schema.assertValid(root)
        print("XML is valid according to the schema.")
    except etree.DocumentInvalid as e:
        print("XML validation error:", e)
Here, you parse the XSD file using the etree.parse() method. Then you create an XML Schema using the parsed XSD doc content. Finally, you validate the XML root document against the XML schema using the assertValid() method. If the schema validation passes, your output includes a message that says something like "XML is valid according to the schema." Otherwise, the DocumentInvalid exception is raised.
Your output should look like this:
Schema Validation:
XML is valid according to the schema.
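To see what a failed validation looks like without depending on a live sitemap, the sketch below checks two small in-memory documents against a copy of the XSD above. It uses the validate() method, which returns a Boolean instead of raising an exception, and reads the messages from the schema's error log. The sample documents and the check helper are hypothetical.

```python
from lxml import etree

# In-memory copy of the simplified sitemap XSD shown above.
XSD_TEXT = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    elementFormDefault="qualified">
  <xs:element name="urlset">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="url" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="url">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="loc" type="xs:anyURI"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical documents: the second one omits the required loc element.
VALID_XML = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page</loc></url>
</urlset>"""

INVALID_XML = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url></url>
</urlset>"""

schema = etree.XMLSchema(etree.fromstring(XSD_TEXT))


def check(xml_bytes):
    """Return (is_valid, error_messages) for the given XML document."""
    doc = etree.fromstring(xml_bytes)
    is_valid = schema.validate(doc)
    # error_log holds the problems found by the most recent validation.
    errors = [entry.message for entry in schema.error_log]
    return is_valid, errors


valid_result = check(VALID_XML)
invalid_result = check(INVALID_XML)
print("Valid document:", valid_result[0])
print("Invalid document:", invalid_result[0], invalid_result[1])
```

Choosing validate() over assertValid() is a matter of taste: the former suits code that wants to report every problem, while the latter fits code that should stop at the first invalid document.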
Now, let's read the XML file again, this time using the xpath() method to find elements by their path. To do so, use the following code:
from lxml import etree
import requests

url = "https://demlon.com/post-sitemap.xml"
response = requests.get(url)
if response.status_code == 200:
    print("XPath Support:")
    root = etree.fromstring(response.content)
    # Map a prefix of your choice to the sitemap namespace URI.
    namespaces = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for loc in root.xpath(".//ns:url/ns:loc", namespaces=namespaces):
        print(loc.text.strip())
In this code, you register the namespace prefix ns and map it to the namespace URI http://www.sitemaps.org/schemas/sitemap/0.9. In the XPath expression, you use the ns prefix to specify elements in the namespace. Finally, the expression .//ns:url/ns:loc selects all loc elements that are children of url elements in the namespace.
Your output will look like this:
XPath Support:
https://demlon.com/case-studies/powerdrop-case-study
https://demlon.com/case-studies/addressing-brand-protection-from-every-angle
https://demlon.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://demlon.com/case-studies/the-seo-transformation
https://demlon.com/case-studies/data-driven-automated-e-commerce-tools
https://demlon.com/case-studies/highly-targeted-influencer-marketing
https://demlon.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://demlon.com/case-studies/workplace-diversity-facilitated-by-online-data
https://demlon.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://demlon.com/case-studies/data-intensive-analytical-solutions
https://demlon.com/case-studies/canopy-advantage-solutions
https://demlon.com/case-studies/seamless-digital-automations
In general, the find() and findall() methods tend to be faster than the xpath() method, partly because xpath() collects all the results in memory before returning them. It's recommended that you use the find() method unless there is a specific reason for using XPath queries.
lxml offers powerful features for parsing and manipulating XML and HTML. It supports complex queries using XPath expressions, validates documents against schemas, and even allows for eXtensible Stylesheet Language Transformations (XSLT). This makes it ideal for scenarios where performance and advanced functionality are crucial. However, keep in mind that lxml requires a separate installation as it's not part of the core Python package.
If you’re dealing with large or complex XML data that requires both high performance and advanced manipulation, you should consider using lxml. For instance, if you’re processing financial data feeds in XML format, you might need to use XPath expressions to extract specific elements like stock prices, validate the data against a financial schema to ensure accuracy, and potentially transform the data using XSLT for further analysis.
minidom
minidom is a lightweight and simple XML parsing library that's included in Python's standard library. While it's not as feature-rich or efficient as lxml, it offers a straightforward way to parse and manipulate XML data in Python.
You can use the various methods available in the DOM object to access elements. For example, you can use the getElementsByTagName() method to retrieve elements by their tag name.
The following example demonstrates how to use the minidom library to parse an XML file and fetch the elements using their tag names:
import requests
import xml.dom.minidom

url = "https://demlon.com/post-sitemap.xml"
response = requests.get(url)
if response.status_code == 200:
    dom = xml.dom.minidom.parseString(response.content)
    urlset = dom.getElementsByTagName("urlset")[0]
    for url in urlset.getElementsByTagName("url"):
        loc = url.getElementsByTagName("loc")[0].firstChild.nodeValue.strip()
        print(loc)
else:
    print("Failed to retrieve XML file from the URL.")
Your output would look like this:
https://demlon.com/case-studies/powerdrop-case-study
https://demlon.com/case-studies/addressing-brand-protection-from-every-angle
https://demlon.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://demlon.com/case-studies/the-seo-transformation
https://demlon.com/case-studies/data-driven-automated-e-commerce-tools
https://demlon.com/case-studies/highly-targeted-influencer-marketing
https://demlon.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://demlon.com/case-studies/workplace-diversity-facilitated-by-online-data
https://demlon.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://demlon.com/case-studies/data-intensive-analytical-solutions
https://demlon.com/case-studies/canopy-advantage-solutions
https://demlon.com/case-studies/seamless-digital-automations
minidom works with XML data by representing it as a DOM tree. This tree structure makes it easy to navigate and manipulate data, and it's best suited for basic tasks such as reading, changing, or building simple XML structures.
If your program reads default settings from an XML file, the DOM approach of minidom lets you locate specific settings by child node or attribute. For example, you can retrieve the font-size node and use its value within your application.
SAX Parser
The SAX parser is an event-driven XML parsing approach in Python that processes XML documents sequentially and generates events as it encounters various parts of the document. Unlike DOM-based parsers that construct a tree structure representing the entire XML document in memory, SAX parsers do not build a complete representation of the document. Instead, they emit events such as start tags, end tags, and text content as they parse through the document.
SAX parsers are good for processing large XML files or streams where memory efficiency is a concern as they operate on XML data incrementally without loading the entire document into memory.
When using the SAX parser, you need to define the event handlers that respond to specific XML events, such as the startElement and endElement emitted by the parser. These event handlers can be customized to perform actions based on the structure and content of the XML document.
The following example demonstrates how to parse an XML file using the SAX parser by defining the startElement and endElement events and retrieving the URL information from the sitemap file:
import requests
import xml.sax
import xml.sax.handler
from io import BytesIO


class MyContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_url = False
        self.in_loc = False
        self.url = ""

    def startElement(self, name, attrs):
        if name == "url":
            self.in_url = True
        elif name == "loc" and self.in_url:
            self.in_loc = True

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_loc:
            self.url += content

    def endElement(self, name):
        if name == "url":
            print(self.url.strip())
            self.url = ""
            self.in_url = False
        elif name == "loc":
            self.in_loc = False


url = "https://demlon.com/post-sitemap.xml"
response = requests.get(url)
if response.status_code == 200:
    xml_content = BytesIO(response.content)
    content_handler = MyContentHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(content_handler)
    parser.parse(xml_content)
else:
    print("Failed to retrieve XML file from the URL.")
Your output would look like this:
https://demlon.com/case-studies/powerdrop-case-study
https://demlon.com/case-studies/addressing-brand-protection-from-every-angle
https://demlon.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://demlon.com/case-studies/the-seo-transformation
https://demlon.com/case-studies/data-driven-automated-e-commerce-tools
https://demlon.com/case-studies/highly-targeted-influencer-marketing
https://demlon.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://demlon.com/case-studies/workplace-diversity-facilitated-by-online-data
https://demlon.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://demlon.com/case-studies/data-intensive-analytical-solutions
https://demlon.com/case-studies/canopy-advantage-solutions
https://demlon.com/case-studies/seamless-digital-automations
Unlike other parsers that load the entire file into memory, SAX processes files incrementally, conserving memory and enhancing performance. However, SAX necessitates writing more code to manage each data segment dynamically. Additionally, it cannot revisit and analyze specific parts of the data later on.
If you need to scan a large XML file (e.g., a log file containing various events) to extract specific information (e.g., error messages), SAX can help you efficiently navigate through the file. However, if your analysis requires understanding the relationships between different data segments, SAX may not be the best choice.
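The log-scanning scenario might look like the sketch below. The log format (event elements with a level attribute) and the ErrorCollector class are assumptions made for illustration; a small in-memory document stands in for what would normally be a large file.

```python
import xml.sax

# Hypothetical log format; the element and attribute names are assumptions.
SAMPLE_LOG = b"""<log>
  <event level="info">Service started</event>
  <event level="error">Disk quota exceeded</event>
  <event level="info">Heartbeat</event>
  <event level="error">Connection timed out</event>
</log>"""


class ErrorCollector(xml.sax.ContentHandler):
    """Collect the text of every event element whose level is "error"."""

    def __init__(self):
        super().__init__()
        self.errors = []
        self._in_error = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "event" and attrs.get("level") == "error":
            self._in_error = True
            self._buffer = []

    def characters(self, content):
        # Text can be delivered in several chunks, so buffer it.
        if self._in_error:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "event" and self._in_error:
            self.errors.append("".join(self._buffer).strip())
            self._in_error = False


collector = ErrorCollector()
xml.sax.parseString(SAMPLE_LOG, collector)
print(collector.errors)
```

For a file on disk, xml.sax.parse() accepts a filename or file object in place of the in-memory bytes, so the handler only ever holds the error messages rather than the whole document.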
untangle
untangle is a lightweight XML parsing library for Python that simplifies the process of extracting data from XML documents. Unlike traditional XML parsers that require navigating through hierarchical structures, untangle lets you access XML elements and attributes directly as Python objects.
With untangle, an XML document is converted into nested Python objects: child elements are exposed as object attributes, XML attributes are read with subscript syntax, and text content is available through the cdata attribute. This approach makes it easy to access XML data using plain Python attribute access.
untangle is not available by default in Python and needs to be installed from PyPI using the following pip command:
pip install untangle
The following example demonstrates how to parse the XML file using the untangle library and access the XML elements:
import untangle
import requests

url = "https://demlon.com/post-sitemap.xml"
response = requests.get(url)
if response.status_code == 200:
    obj = untangle.parse(response.text)
    for url in obj.urlset.url:
        print(url.loc.cdata.strip())
else:
    print("Failed to retrieve XML file from the URL.")
Your output will look like this:
https://demlon.com/case-studies/powerdrop-case-study
https://demlon.com/case-studies/addressing-brand-protection-from-every-angle
https://demlon.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://demlon.com/case-studies/the-seo-transformation
https://demlon.com/case-studies/data-driven-automated-e-commerce-tools
https://demlon.com/case-studies/highly-targeted-influencer-marketing
https://demlon.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://demlon.com/case-studies/workplace-diversity-facilitated-by-online-data
https://demlon.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://demlon.com/case-studies/data-intensive-analytical-solutions
https://demlon.com/case-studies/canopy-advantage-solutions
https://demlon.com/case-studies/seamless-digital-automations
untangle offers a user-friendly approach to working with XML data in Python. It simplifies the parsing process with clear syntax and automatically converts the XML structure into easy-to-use Python objects, eliminating the need for complex navigation techniques. However, keep in mind that untangle requires separate installation as it’s not part of the core Python package.
You should consider using untangle if you have a well-formed XML file and need to quickly convert it into Python objects for further processing. For example, if you have a program that downloads weather data in XML format, untangle could be a good fit to parse the XML and create Python objects representing the current temperature, humidity, and forecast. These objects could then be easily manipulated and displayed within your application.
Conclusion
In this article, you learned about the structure of XML files and several methods for parsing them in Python.
Whether you’re working with small configuration files, parsing large web service responses, or extracting data from extensive sitemaps, Python offers versatile libraries to automate and streamline your XML parsing tasks. However, when accessing files from the web using the requests library without proxy management, you may encounter quota exceptions and throttling issues. demlon is an award-winning proxy network that provides reliable and efficient proxy solutions to ensure seamless data retrieval and parsing. With demlon, you can tackle XML parsing tasks without worrying about limitations or disruptions. Contact our sales team to learn more.