Parsing an XML file that contains multiple concatenated XML documents

Time: 2018-08-28 09:46:09

Tags: python xml python-3.x pandas parsing

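I am trying to parse an XML file that contains multiple XML documents concatenated into a single file. I want to parse the main file (6+ million lines; don't worry, I won't post it here!) and return certain pieces of information from each embedded XML document.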

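The file looks like this (I have copied the first XML document three times to give you an idea of what the file looks like; my main file contains a total of 6000 of these XML documents, and I need to parse it to extract the fields below from each of them):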

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
    <us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180213710A1-20180802.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20180718" date-publ="20180802">
    <us-bibliographic-data-application lang="EN" country="US">
    <publication-reference>
    <document-id>
    <doc-number>20180213710</doc-number>
    <kind>A1</kind>
    <date>20180802</date>
    </document-id>
    </publication-reference>
    <application-reference appl-type="utility">
    <document-id>
    <country>US</country>
    <doc-number>15419046</doc-number>
    <date>20170130</date>
    </document-id>
    </application-reference>
    <invention-title id="d2e43">AGRICULTURAL IMPLEMENT WITH RELEASABLE TOOLS</invention-title>
    <assignees>
    <assignee>
    <addressbook>
    <orgname>CNH Industrial Canada, Ltd.</orgname>
    <role>03</role>
    <address>
    <city>Saskatoon</city>
    <country>CA</country>
    </address>
    </addressbook>
    </assignee>
    </assignees>
    </us-bibliographic-data-application>
    <abstract id="abstract">
    <p id="p-0001" num="0000">An agricultural implement includes a frame; a shank mounted to the frame and including at least one clip edge; a retaining clip press fitted to the at least one clip edge and including a movable lock biased in a locking direction; and a ground working tool with a locking opening at least partially filled by the movable lock to resist the tool being removed from the shank</p>
    </abstract>
    <us-claim-statement>What is claimed is:</us-claim-statement>
    </us-patent-application>
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
    <us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180213710A1-20180802.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20180718" date-publ="20180802">
    <us-bibliographic-data-application lang="EN" country="US">
    <publication-reference>
    <document-id>
    <doc-number>20180213710</doc-number>
    <kind>A1</kind>
    <date>20180802</date>
    </document-id>
    </publication-reference>
    <application-reference appl-type="utility">
    <document-id>
    <country>US</country>
    <doc-number>15419046</doc-number>
    <date>20170130</date>
    </document-id>
    </application-reference>
    <invention-title id="d2e43">AGRICULTURAL IMPLEMENT WITH RELEASABLE TOOLS</invention-title>
    <assignees>
    <assignee>
    <addressbook>
    <orgname>CNH Industrial Canada, Ltd.</orgname>
    <role>03</role>
    <address>
    <city>Saskatoon</city>
    <country>CA</country>
    </address>
    </addressbook>
    </assignee>
    </assignees>
    </us-bibliographic-data-application>
    <abstract id="abstract">
    <p id="p-0001" num="0000">An agricultural implement includes a frame; a shank mounted to the frame and including at least one clip edge; a retaining clip press fitted to the at least one clip edge and including a movable lock biased in a locking direction; and a ground working tool with a locking opening at least partially filled by the movable lock to resist the tool being removed from the shank</p>
    </abstract>
    <us-claim-statement>What is claimed is:</us-claim-statement>
    </us-patent-application>
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
    <us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180213710A1-20180802.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20180718" date-publ="20180802">
    <us-bibliographic-data-application lang="EN" country="US">
    <publication-reference>
    <document-id>
    <doc-number>20180213710</doc-number>
    <kind>A1</kind>
    <date>20180802</date>
    </document-id>
    </publication-reference>
    <application-reference appl-type="utility">
    <document-id>
    <country>US</country>
    <doc-number>15419046</doc-number>
    <date>20170130</date>
    </document-id>
    </application-reference>
    <invention-title id="d2e43">AGRICULTURAL IMPLEMENT WITH RELEASABLE TOOLS</invention-title>
    <assignees>
    <assignee>
    <addressbook>
    <orgname>CNH Industrial Canada, Ltd.</orgname>
    <role>03</role>
    <address>
    <city>Saskatoon</city>
    <country>CA</country>
    </address>
    </addressbook>
    </assignee>
    </assignees>
    </us-bibliographic-data-application>
    <abstract id="abstract">
    <p id="p-0001" num="0000">An agricultural implement includes a frame; a shank mounted to the frame and including at least one clip edge; a retaining clip press fitted to the at least one clip edge and including a movable lock biased in a locking direction; and a ground working tool with a locking opening at least partially filled by the movable lock to resist the tool being removed from the shank</p>
    </abstract>
    <us-claim-statement>What is claimed is:</us-claim-statement>
    </us-patent-application>   

My code:

    # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
    import xml.etree.ElementTree as et
    import pandas as pd


    parsed_xml = et.parse("25_pto_test.xml")
    dfcols = ['us_patent_app_number', 'kind_code', 'date', 'invention_title', 'assignee', 'abstract']
    df_xml = pd.DataFrame(columns=dfcols)
    pto = {}


    us_patent_app_number = parsed_xml.find('us-bibliographic-data-application/publication-reference/document-id/doc-number')
    kind_code = parsed_xml.find('us-bibliographic-data-application/publication-reference/document-id/kind')
    date = parsed_xml.find('us-bibliographic-data-application/publication-reference/document-id/date')
    invention_title = parsed_xml.find('us-bibliographic-data-application/invention-title')
    assignee = parsed_xml.find('us-bibliographic-data-application/assignees/assignee/addressbook/orgname')
    abstract = parsed_xml.find('abstract/p')

    pto.update({'US_Patent_App_Number': us_patent_app_number.text, 'Kind_Code': kind_code.text, 'Date': date.text,
        'Invention_Title': invention_title.text, 'Company_Name': assignee.text, 'Abstract': abstract.text})

    pto_data = pd.DataFrame.from_dict(pto, orient='index')
    pto_data = pto_data.transpose()

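I can get the code above to parse the first XML section in the main file, but it then crashes at the end of the first XML document, reporting "junk after document element" at line x.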

Can anyone help?
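The "junk after document element" failure described above is what an XML parser reports when a second root element (or a second `<?xml ...?>` declaration) follows the first one, since a well-formed XML document has exactly one root. A minimal sketch of one possible workaround follows. It assumes every embedded document begins with its own `<?xml` declaration at the start of a line and is well-formed on its own. The helper names `iter_documents`, `extract_records`, and `parse_file` are hypothetical, and the element paths are simply the ones used in the question's code.

```python
import xml.etree.ElementTree as ET

# Column name -> element path, copied from the paths used in the question.
BIB = "us-bibliographic-data-application"
FIELDS = {
    "US_Patent_App_Number": BIB + "/publication-reference/document-id/doc-number",
    "Kind_Code": BIB + "/publication-reference/document-id/kind",
    "Date": BIB + "/publication-reference/document-id/date",
    "Invention_Title": BIB + "/invention-title",
    "Company_Name": BIB + "/assignees/assignee/addressbook/orgname",
    "Abstract": "abstract/p",
}


def iter_documents(lines):
    """Yield each embedded XML document as bytes.

    `lines` is any iterable of byte strings (for example a file opened
    in binary mode). A new document is assumed to start at every line
    that begins with an XML declaration.
    """
    chunk = []
    for line in lines:
        if line.lstrip().startswith(b"<?xml") and chunk:
            yield b"".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield b"".join(chunk)


def extract_records(lines):
    """Parse every embedded document and collect the wanted fields."""
    records = []
    for doc in iter_documents(lines):
        doc = doc.strip()  # the declaration must be the very first bytes
        if not doc:
            continue
        root = ET.fromstring(doc)
        # findtext returns None for a missing element instead of raising
        records.append({name: root.findtext(path) for name, path in FIELDS.items()})
    return records


def parse_file(path):
    """Stream the large file line by line so it is never fully in memory."""
    with open(path, "rb") as fh:
        return extract_records(fh)
```

The list of dicts returned by `parse_file("25_pto_test.xml")` can be passed directly to `pd.DataFrame(...)` to get one row per embedded document. Only one embedded document is held in memory at a time, which matters for a file of several million lines.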

0 answers:

No answers yet