TDM 20200: Project 1 — 2024
Motivation: Extensible Markup Language (XML) is an important file format for storing structured data. Even though formats like JSON and CSV tend to be more prevalent, many legacy systems still use XML, and it remains an appropriate format for storing complex data. JSON and CSV are themselves losing ground in some settings as newer formats and serialization methods like Parquet and Protocol Buffers (protobufs) become more common.
Context: In this project we will use the lxml package in Python. This is a first project focusing on web scraping.
Scope: Python, XML
Readings and Resources
This link will show you more information about lxml.
Check out this old project that uses a different dataset; you may find it useful for this project.
Dataset(s)
The questions below use the following dataset(s):
- /anvil/projects/tdm/data/otc/valu.xml
Questions
Question 1 (2 points)
- Please use the lxml package to find and print the name of the root element’s tag for the XML file valu.xml.
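As a minimal sketch of the idea, the snippet below parses a small inline document and prints its root tag. The `SAMPLE` string and the `urn:example:ns` namespace are made up for illustration and are not taken from valu.xml; the real file's root tag and namespace may differ.

```python
from lxml import etree

# Hypothetical stand-in for valu.xml; tag names and namespace are assumptions.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="urn:example:ns">
  <title>Sample label</title>
</document>"""

root = etree.fromstring(SAMPLE)

# When the document declares a default namespace, lxml reports the tag in
# Clark notation: {namespace-uri}localname
print(root.tag)

# QName separates the namespace URI from the local name.
qname = etree.QName(root)
print(qname.localname)
print(qname.namespace)
```

For a file on disk, `etree.parse(path).getroot()` returns the same kind of element object, so the rest of the code would be unchanged.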
Question 2 (2 points)
- Please use XPath with a namespace to find the 'title' element.
You will need to pass a namespace to xpath; otherwise, you won’t get the element. This link will show you more information about using XPath with lxml.
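The namespace requirement can be illustrated on a toy document. This is only a sketch: the `SAMPLE` string, the `urn:example:ns` URI, and the `ns` prefix are hypothetical choices, and the real file will have its own namespace URI.

```python
from lxml import etree

# Hypothetical sample; the namespace URI is an assumption for illustration.
SAMPLE = b"""<document xmlns="urn:example:ns">
  <title>Sample label</title>
  <author><name>Example Labs</name></author>
</document>"""

root = etree.fromstring(SAMPLE)

# root.nsmap maps prefixes to URIs. A default namespace is stored under the
# key None, which XPath cannot use, so bind the URI to a prefix of our choosing.
ns = {"ns": root.nsmap[None]}

# Without a prefix, "title" means "title in no namespace" and matches nothing.
print(root.xpath("//title"))  # empty list

# With the prefix and the namespaces argument, the element is found.
titles = root.xpath("//ns:title", namespaces=ns)
for title in titles:
    print(etree.QName(title).localname, "->", title.text)
```

The prefix name is arbitrary; only the URI it maps to has to match the one declared in the document.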
Question 3 (2 points)
- Please use XPath with a namespace to find and list all child elements directly under the 'document' element in the XML file.
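One way to select only direct children is the `*` step one level below the parent. The sketch below assumes a hypothetical sample document and namespace (`urn:example:ns`); the element names in the real file may differ.

```python
from lxml import etree

# Hypothetical sample; element names and namespace are assumptions.
SAMPLE = b"""<document xmlns="urn:example:ns">
  <id root="abc"/>
  <title>Sample label</title>
  <author><name>Example Labs</name></author>
</document>"""

root = etree.fromstring(SAMPLE)
ns = {"ns": "urn:example:ns"}

# "/ns:document/*" selects element nodes exactly one level below the root.
# Using "//" instead would descend through the whole tree.
children = root.xpath("/ns:document/*", namespaces=ns)

# Strip the {namespace} part so the names are easier to read.
child_names = [etree.QName(child).localname for child in children]
print(child_names)
```

Note that the nested `name` element is not listed, because it is a grandchild of `document` rather than a direct child.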
Question 4 (2 points)
- Please get and list all author elements, including their child elements and attributes.
To print an
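A possible approach is to walk each matched element recursively and report its name, attributes, and text. The helper `describe`, the sample document, and the `typeCode` attribute below are hypothetical and only illustrate the pattern; `etree.tostring` is shown as one option for printing the raw markup of an element.

```python
from lxml import etree

# Hypothetical sample; structure and attribute names are assumptions.
SAMPLE = b"""<document xmlns="urn:example:ns">
  <author typeCode="AUT">
    <time value="20240101"/>
    <name>Example Labs</name>
  </author>
</document>"""

root = etree.fromstring(SAMPLE)
ns = {"ns": "urn:example:ns"}


def describe(element, depth=0):
    """Return one line per element: indented local name, attributes, text."""
    name = etree.QName(element).localname
    text = (element.text or "").strip()
    lines = [f"{'  ' * depth}{name} {dict(element.attrib)} {text!r}"]
    for child in element:
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(child.tag, str):
            lines.extend(describe(child, depth + 1))
    return lines


authors = root.xpath("//ns:author", namespaces=ns)
for author in authors:
    print("\n".join(describe(author)))
    # Raw markup of the element and everything beneath it.
    print(etree.tostring(author, pretty_print=True, encoding="unicode"))
```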
Question 5 (2 points)
- Please list all codeSystem attribute values from the file.
- Please list the codeSystem value for which the 'displayName' attribute contains the string 'DOSAGE'.
You can use the
This link may help you when figuring out how to select the right elements.
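Both parts can be sketched with attribute selectors in XPath. The sample document and its code values below are invented for illustration. Unprefixed attributes belong to no namespace, so this particular query needs no namespace map, assuming the attributes in the real file are also unprefixed.

```python
from lxml import etree

# Hypothetical sample; attribute values are made up.
SAMPLE = b"""<document xmlns="urn:example:ns">
  <code code="1" codeSystem="1.2.3" displayName="HUMAN OTC DRUG LABEL"/>
  <section>
    <code code="2" codeSystem="4.5.6" displayName="DOSAGE &amp; ADMINISTRATION SECTION"/>
  </section>
</document>"""

root = etree.fromstring(SAMPLE)

# "//@codeSystem" returns the attribute values as strings, in document order.
code_systems = root.xpath("//@codeSystem")
print(code_systems)

# contains() filters on a substring of another attribute of the same element.
dosage_systems = root.xpath("//*[contains(@displayName, 'DOSAGE')]/@codeSystem")
print(dosage_systems)
```

If the real file repeats the same value many times, wrapping the result in `set(...)` or `sorted(set(...))` is one way to list each value once.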
Project 01 Assignment Checklist
- Jupyter Lab notebook with your code, comments and output for the assignment
  - firstname-lastname-project01.ipynb
- Python file with code and comments for the assignment
  - firstname-lastname-project01.py
- Submit files through Gradescope
Please double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.