Question

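I am working with the XML file at this link (a 40 MB download). From this file I want to extract data from two kinds of tags.

These are: OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0.

I wrote the following code for this: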

ARTICLE_TAGS = ['OpportunitySynopsisDetail_1_0', 'OpportunityForecastDetail_1_0']

for _tag in ARTICLE_TAGS:
    f = open(xml_f)
    context = etree.iterparse(f, tag = _tag)

    for _, e in context:
        _id = e.xpath('.//OpportunityID/text()')
        text = e.xpath('.//OpportunityTitle/text()')
    f.close()

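With this code, etree.iterparse(f, tag = _tag) returns an object that cannot be iterated. I think this happens when the tag is not found in the XML file.

So I added the namespace to the tag passed to iterparse, like this: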

context = etree.iterparse(f, tag='{http://apply.grants.gov/system/OpportunityDetail-V1.0}'+_tag)

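Now it creates an iterable object, but I don't get any text back. I also tried the other namespaces that appear in the file, but that did not work either.

Please tell me how to solve this. Here is a sample snippet of the XML file. The OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0 tags are repeated many times in the XML file.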

<?xml version="1.0" encoding="UTF-8"?>
<Grants xsi:schemaLocation="http://apply.grants.gov/system/OpportunityDetail-V1.0 https://apply07.grants.gov/apply/system/schemas/OppotunityDetail-V1.0.xsd" xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instace">
<OpportunitySynopsisDetail_1_0>
<OpportunityID>262148</OpportunityID>
<OpportunityTitle>Establishment of the Edmund S. Muskie Graduate Internship Program</OpportunityTitle>
</OpportunitySynopsisDetail_1_0>
<OpportunityForecastDetail_1_0>
<OpportunityID>284765</OpportunityID>
<OpportunityTitle>PPHF 2015: Immunization Grants-CDC Partnership: Strengthening Public Health Laboratories-financed in part by 2015 Prevention and Public Health Funds</OpportunityTitle>
</OpportunityForecastDetail_1_0>
</Grants>

Answer 1

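First, when parsing XML that contains namespaces, you must use those namespaces when looking at tag names. Second, iterparse in the standard library's xml.etree.ElementTree does not take an argument named tag (that keyword is an lxml extension), so I don't know how your code works as posted. Finally, the elements returned by that iterparse have no member function named xpath, so that cannot work either.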
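
To make the first point concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree on a trimmed-down copy of the sample from the question. The variable names (SAMPLE, record, bare_lookup, and so on) are made up for illustration. It shows that the parser folds the namespace URI into each tag name, so a lookup by bare name finds nothing.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample from the question (illustrative only).
SAMPLE = """<Grants xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0">
<OpportunitySynopsisDetail_1_0>
<OpportunityID>262148</OpportunityID>
</OpportunitySynopsisDetail_1_0>
</Grants>"""

NS = '{http://apply.grants.gov/system/OpportunityDetail-V1.0}'

root = ET.fromstring(SAMPLE)
record = root[0]

# The parser stores the namespace URI as part of the tag name.
qualified_tag = record.tag

# A bare name does not match a namespaced element ...
bare_lookup = record.find('OpportunityID')
# ... but the namespace-qualified name does.
qualified_lookup = record.find(NS + 'OpportunityID')

print(qualified_tag)
print(bare_lookup)
print(qualified_lookup.text)
```

The same rule is the likely reason the namespace-free lookups in the question come back empty: the child elements inherit the default namespace declared on Grants.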

Here is an example of how to parse the XML with iterparse:

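import xml.etree.ElementTree as etree

NS = '{http://apply.grants.gov/system/OpportunityDetail-V1.0}'
ARTICLE_TAGS = [NS+'OpportunitySynopsisDetail_1_0', NS+'OpportunityForecastDetail_1_0']

# Binary mode lets the parser honor the encoding declared in the XML header.
with open(xml_f, 'rb') as f:
    context = etree.iterparse(f)

    for _, e in context:
        # Tags are namespace-qualified, so compare against the full names.
        if e.tag in ARTICLE_TAGS:
            _id = e.find(NS+'OpportunityID')
            text = e.find(NS+'OpportunityTitle')
            print(_id.text, text.text)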
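
The iterparse loop in this answer can be tried without the 40 MB file. The sketch below is one way to do that, under a few assumptions: the helper name iter_opportunities is hypothetical, the in-memory sample uses shortened placeholder titles, and elem.clear() is an optional addition to keep memory use down on a large file.

```python
import io
import xml.etree.ElementTree as ET

NS = '{http://apply.grants.gov/system/OpportunityDetail-V1.0}'
ARTICLE_TAGS = {NS + 'OpportunitySynopsisDetail_1_0',
                NS + 'OpportunityForecastDetail_1_0'}


def iter_opportunities(source):
    """Yield (id, title) pairs from a file path or a binary file object.

    Hypothetical helper for illustration; not part of any library.
    """
    # With no events argument, iterparse reports only 'end' events,
    # so each element is complete by the time it is seen here.
    for _, elem in ET.iterparse(source):
        if elem.tag in ARTICLE_TAGS:
            yield (elem.findtext(NS + 'OpportunityID'),
                   elem.findtext(NS + 'OpportunityTitle'))
            # Drop the record's children once they have been read.
            elem.clear()


# Placeholder data shaped like the sample in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<Grants xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0">
<OpportunitySynopsisDetail_1_0>
<OpportunityID>262148</OpportunityID>
<OpportunityTitle>Synopsis title</OpportunityTitle>
</OpportunitySynopsisDetail_1_0>
<OpportunityForecastDetail_1_0>
<OpportunityID>284765</OpportunityID>
<OpportunityTitle>Forecast title</OpportunityTitle>
</OpportunityForecastDetail_1_0>
</Grants>"""

results = list(iter_opportunities(io.BytesIO(SAMPLE)))
print(results)
```

For the real file, passing the path (or a file opened in 'rb' mode) in place of the BytesIO object should work the same way.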

As I said in the comments, the Python documentation is useful, as is the Effbot page on ElementTree. There are plenty of other resources available; put xml.etree.elementtree into Google and start reading!

Iterparse object in Python does not return an iterable object
