How do I load .xml files from disk as an element tree using lxml?

Date: 2016-02-01 16:21:56

Tags: python xml xpath xml-parsing lxml

I have a series of XML files on my drive, and I want to do the following:

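  • Load each file into lxml as an element tree and parse it with XPath
  • Load another XML file as an element tree and parse it with XPath to find the correct place to append the information
  • Store the information parsed from the series of XML files in variables, so that I can run some logic on the results before writing them back to the large .xml file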

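I'm having trouble with file types and with loading the XML files correctly as element trees so that lxml can operate on them. I've tried several different approaches but keep running into various problems. The current one is: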

  

TypeError: Argument '_parent' has incorrect type (expected lxml.etree._Element, got list)

from lxml import etree
from lxml import html
import requests

file = 'bgg.xml'
# parse the xml file from disk as an element tree in lxml?
treebgg = etree.parse(file)

# create a list of IDs to iterate through from the bgg.xml file
gameList = treebgg.xpath("//root/BGG/@ID")

# iterate through the IDs
for x in reversed(gameList):
    url = 'https://somewhere.com/xmlapi/' + str(x)
    page = requests.get(url)
    # pull an xml file from a web url and turn it into an element tree in lxml
    tree = html.fromstring(page.content)
    # set my root variable so I can append children to this location
    root = tree.xpath("//root/BGG[@ID=x]")
    name = tree.xpath("//somewhere/name[@primary='true']")
    # append child info into bgg.xml
    child = etree.SubElement(root, "Name")
    child.text = name

# write bgg.xml back to file

1 Answer:

Answer 0 (score: 1)

Get the root element of the bgg.xml tree:

rootbgg = treebgg.getroot()

and use it as the parent to append the child to:

child = etree.SubElement(rootbgg, "Name")
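
To illustrate why the original call fails while passing an element works, here is a minimal sketch. The inline XML is a made-up stand-in for bgg.xml (its layout is only assumed from the XPath in the question), and "Example Game" is placeholder text.

```python
from lxml import etree

# Hypothetical stand-in for bgg.xml; the layout is assumed from the
# question's XPath ("//root/BGG/@ID").
rootbgg = etree.fromstring('<root><BGG ID="1"/><BGG ID="2"/></root>')

# xpath() returns a list of matches, even when only one element matches.
matches = rootbgg.xpath("//root/BGG[@ID='1']")

error_message = None
try:
    etree.SubElement(matches, "Name")  # a list is not a valid parent
except TypeError as exc:
    error_message = str(exc)

# Passing a single element (the root, or one item from the list) works.
child = etree.SubElement(matches[0], "Name")
child.text = "Example Game"  # placeholder value

serialized = etree.tostring(rootbgg, encoding="unicode")
print(error_message)
print(serialized)
```

So the TypeError most likely comes from handing the list returned by xpath() to SubElement, which expects a single element as the parent.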
  

I have another question... how do I select the correct element? I don't want to append to the root of the XML file itself.

You now need to rework how you iterate, looping over the elements themselves rather than their IDs:

gameList = treebgg.xpath("//root/BGG")

# iterate through the IDs
for game in reversed(gameList):
    url = 'https://somewhere.com/xmlapi/' + game.attrib["ID"]
    page = requests.get(url)
    tree = html.fromstring(page.content)
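    # get the primary name (XPath taken from the question; adjust it to the real response)
    names = tree.xpath("//somewhere/name[@primary='true']/text()")
    name = names[0] if names else None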

    # append child info into bgg.xml
    child = etree.SubElement(game, "Name")
    child.text = name
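
Putting the pieces together, the whole round trip (load from disk, look up each game, append, write back) might look like the sketch below. Several things here are assumptions for illustration: the sample file contents, the response layout, and the `fetch_game_xml` helper, which is a hypothetical offline stand-in for `requests.get(url).content`. The response is parsed with `etree.fromstring` rather than `html.fromstring`, on the assumption that the API returns XML.

```python
import os
import tempfile
from lxml import etree

# Made-up stand-in for bgg.xml, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "bgg.xml")
with open(path, "w", encoding="utf-8") as fh:
    fh.write('<root><BGG ID="1"/><BGG ID="2"/></root>')


def fetch_game_xml(game_id):
    """Hypothetical replacement for requests.get(url).content."""
    return (
        '<items><item><name primary="true">Game %s</name>'
        '<name>Alt %s</name></item></items>' % (game_id, game_id)
    ).encode("utf-8")


treebgg = etree.parse(path)  # load the file from disk as an element tree

for game in treebgg.xpath("//root/BGG"):
    # Parse the response as XML and pull out the primary name's text.
    tree = etree.fromstring(fetch_game_xml(game.get("ID")))
    names = tree.xpath("//name[@primary='true']/text()")
    if names:
        child = etree.SubElement(game, "Name")
        child.text = names[0]

# A single game can also be looked up by ID with an XPath variable.
second = treebgg.xpath("//root/BGG[@ID=$gid]", gid="2")[0]

# Write the modified tree back to disk.
treebgg.write(path, encoding="utf-8", xml_declaration=True, pretty_print=True)

reloaded = etree.parse(path)
print(reloaded.xpath("//BGG/Name/text()"))
```

The `$gid` variable form avoids the problem with `[@ID=x]` in the question, where `x` is read by XPath as a child element named `x` rather than as the Python variable.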