Creating an array of values from a specific element in XML using Python

Asked: 2017-07-10 18:08:58

Tags: python xml anaconda

I have an XML file containing many elements. I want to create a list/array of all the values for one specific element name, in my case "pair:ApplicationNumber".

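I have been through a lot of other questions but could not find an answer. I know I could do this by loading the file as text and then using pandas, but I am sure there is a better way.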

Using minidom

My attempts with ElementTree and xml.dom were unsuccessful.

My code currently looks like this:

import os
from xml.dom import minidom
WindowsUser = os.getenv('username')
XMLPath = os.path.join('C:\\Users', WindowsUser, 'Downloads', 'ApplicationsByCustomerNumber.xml')
xmldoc = minidom.parse(XMLPath)
itemlist = xmldoc.getElementsByTagName('pair:ApplicationNumber')
for s in itemlist:
    print(s.attributes['pair:ApplicationNumber'].value)

A sample XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<pair:PatentApplicationList xsi:schemaLocation="urn:us:gov:uspto:pair PatentApplicationList.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:pair="urn:us:gov:uspto:pair">
    <pair:FileHeader>
            <pair:FileCreationTimeStamp>2017-07-10T10:52:12.12</pair:FileCreationTimeStamp>
    </pair:FileHeader>
    <pair:ApplicationStatusData>
        <pair:ApplicationNumber>62383607</pair:ApplicationNumber>
        <pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>
        <pair:ApplicationStatusText>Application Dispatched from Preexam, Not Yet Docketed</pair:ApplicationStatusText>
        <pair:ApplicationStatusDate>2016-09-16</pair:ApplicationStatusDate>
        <pair:AttorneyDocketNumber>1354-T-02-US</pair:AttorneyDocketNumber>
        <pair:FilingDate>2016-09-06</pair:FilingDate>
        <pair:LastModifiedTimestamp>2017-05-30T21:40:37.37</pair:LastModifiedTimestamp>
        <pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
            <pair:LastTransactionDate>2017-05-30</pair:LastTransactionDate>
            <pair:LastTransactionDescription>Email Notification</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction> 
        <pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator> 
    </pair:ApplicationStatusData>
    <pair:ApplicationStatusData>
        <pair:ApplicationNumber>62292372</pair:ApplicationNumber>
        <pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
        <pair:ApplicationStatusText>Abandoned  --  Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
        <pair:ApplicationStatusDate>2016-11-01</pair:ApplicationStatusDate>
        <pair:AttorneyDocketNumber>681-S-23-US</pair:AttorneyDocketNumber>
        <pair:FilingDate>2016-02-08</pair:FilingDate>
        <pair:LastModifiedTimestamp>2017-06-20T21:59:26.26</pair:LastModifiedTimestamp>
        <pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
            <pair:LastTransactionDate>2017-06-20</pair:LastTransactionDate>
            <pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction> 
        <pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator> 
    </pair:ApplicationStatusData>
    <pair:ApplicationStatusData>
        <pair:ApplicationNumber>62289245</pair:ApplicationNumber>
        <pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>
        <pair:ApplicationStatusText>Abandoned  --  Incomplete Application (Pre-examination)</pair:ApplicationStatusText>
        <pair:ApplicationStatusDate>2016-10-26</pair:ApplicationStatusDate>
        <pair:AttorneyDocketNumber>1526-P-01-US</pair:AttorneyDocketNumber>
        <pair:FilingDate>2016-01-31</pair:FilingDate>
        <pair:LastModifiedTimestamp>2017-06-15T21:24:13.13</pair:LastModifiedTimestamp>
        <pair:CustomerNumber>122761</pair:CustomerNumber><pair:LastFileHistoryTransaction>
            <pair:LastTransactionDate>2017-06-15</pair:LastTransactionDate>
            <pair:LastTransactionDescription>Petition Entered</pair:LastTransactionDescription> </pair:LastFileHistoryTransaction> 
        <pair:ImageAvailabilityIndicator>true</pair:ImageAvailabilityIndicator> 
    </pair:ApplicationStatusData>
</pair:PatentApplicationList>

1 Answer:

Answer 0: (score: 1)

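When the file is parsed with ElementTree, the 'pair:' prefix in each tag is expanded to the namespace declared on the root element (urn:us:gov:uspto:pair), so a tag does not match 'pair:ApplicationNumber' even though it looks as if it should.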
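To see that expansion directly, here is a minimal sketch that prints the tag names as ElementTree stores them. It assumes a small inline fragment (the `SAMPLE` string, modelled on the sample file) instead of the real download, so only the namespace URI is taken from the question.

```python
from xml.etree import ElementTree

# Inline fragment modelled on the sample file (an assumption for illustration);
# the namespace URI is the one declared in the question's XML.
SAMPLE = (
    '<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">'
    '<pair:ApplicationStatusData>'
    '<pair:ApplicationNumber>62383607</pair:ApplicationNumber>'
    '</pair:ApplicationStatusData>'
    '</pair:PatentApplicationList>'
)

root = ElementTree.fromstring(SAMPLE)

# iter() with no argument yields the root and every descendant in document order.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

Each printed tag has the form `{urn:us:gov:uspto:pair}LocalName`, which is why a lookup for the literal string 'pair:ApplicationNumber' finds nothing in ElementTree.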

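I have used ElementTree to extract the application numbers as shown below (in my examples I just use a local XML file rather than the full path from your code).

Example 1: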

from xml.etree import ElementTree

tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()

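# Walk the root's direct children and keep only the ApplicationStatusData records.
for item in root:
    if 'ApplicationStatusData' in item.tag:
        for child in item:
            # Tags carry the namespace URI, so test for the local name as a substring.
            if 'ApplicationNumber' in child.tag:
                print(child.text)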

Example 2:

from xml.etree import ElementTree

tree = ElementTree.parse('ApplicationsByCustomerNumber.xml')
root = tree.getroot()

# Fully qualified tag names use the {namespace-uri}localname form.
for item in root.iter('{urn:us:gov:uspto:pair}ApplicationStatusData'):
    for child in item.iter('{urn:us:gov:uspto:pair}ApplicationNumber'):
        print(child.text)
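Since the question asks for a list rather than printed output, the same lookup could be sketched with `findall` and a prefix-to-URI mapping. This is only one possible approach: the helper name `application_numbers`, the `NS` dictionary, and the inline `SAMPLE` fragment are assumptions made for illustration, not part of the original code.

```python
from xml.etree import ElementTree

# Maps the prefix used in the search path to the namespace URI from the XML file.
NS = {'pair': 'urn:us:gov:uspto:pair'}


def application_numbers(root):
    """Return the text of every pair:ApplicationNumber element below root."""
    return [el.text for el in root.findall('.//pair:ApplicationNumber', NS)]


# Inline fragment modelled on the sample file (an assumption for illustration).
SAMPLE = (
    '<pair:PatentApplicationList xmlns:pair="urn:us:gov:uspto:pair">'
    '<pair:ApplicationStatusData>'
    '<pair:ApplicationNumber>62383607</pair:ApplicationNumber>'
    '<pair:ApplicationStatusCode>20</pair:ApplicationStatusCode>'
    '</pair:ApplicationStatusData>'
    '<pair:ApplicationStatusData>'
    '<pair:ApplicationNumber>62292372</pair:ApplicationNumber>'
    '<pair:ApplicationStatusCode>160</pair:ApplicationStatusCode>'
    '</pair:ApplicationStatusData>'
    '</pair:PatentApplicationList>'
)

root = ElementTree.fromstring(SAMPLE)
numbers = application_numbers(root)
print(numbers)
```

For the real file, `ElementTree.parse(path).getroot()` would take the place of `ElementTree.fromstring(SAMPLE)`. The values come back as strings; convert them with `int()` if numeric values are needed.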

Hope this is useful.