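I am trying to parse an XML file that is really many separate XML documents concatenated into a single file. I want to parse the main file (6+ million lines; don't worry, I won't post it here!) and return certain information from each embedded XML document.
The file looks like this (I have copied the first XML document three times to give you an idea of what the file looks like; my real file contains a total of 6,000 of these XML documents, and I need to extract the fields listed in my code from each of them):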
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180213710A1-20180802.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20180718" date-publ="20180802">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<doc-number>20180213710</doc-number>
<kind>A1</kind>
<date>20180802</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>15419046</doc-number>
<date>20170130</date>
</document-id>
</application-reference>
<invention-title id="d2e43">AGRICULTURAL IMPLEMENT WITH RELEASABLE TOOLS</invention-title>
<assignees>
<assignee>
<addressbook>
<orgname>CNH Industrial Canada, Ltd.</orgname>
<role>03</role>
<address>
<city>Saskatoon</city>
<country>CA</country>
</address>
</addressbook>
</assignee>
</assignees>
</us-bibliographic-data-application>
<abstract id="abstract">
<p id="p-0001" num="0000">An agricultural implement includes a frame; a shank mounted to the frame and including at least one clip edge; a retaining clip press fitted to the at least one clip edge and including a movable lock biased in a locking direction; and a ground working tool with a locking opening at least partially filled by the movable lock to resist the tool being removed from the shank</p>
</abstract>
<us-claim-statement>What is claimed is:</us-claim-statement>
</us-patent-application>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180213710A1-20180802.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20180718" date-publ="20180802">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<doc-number>20180213710</doc-number>
<kind>A1</kind>
<date>20180802</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>15419046</doc-number>
<date>20170130</date>
</document-id>
</application-reference>
<invention-title id="d2e43">AGRICULTURAL IMPLEMENT WITH RELEASABLE TOOLS</invention-title>
<assignees>
<assignee>
<addressbook>
<orgname>CNH Industrial Canada, Ltd.</orgname>
<role>03</role>
<address>
<city>Saskatoon</city>
<country>CA</country>
</address>
</addressbook>
</assignee>
</assignees>
</us-bibliographic-data-application>
<abstract id="abstract">
<p id="p-0001" num="0000">An agricultural implement includes a frame; a shank mounted to the frame and including at least one clip edge; a retaining clip press fitted to the at least one clip edge and including a movable lock biased in a locking direction; and a ground working tool with a locking opening at least partially filled by the movable lock to resist the tool being removed from the shank</p>
</abstract>
<us-claim-statement>What is claimed is:</us-claim-statement>
</us-patent-application>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v44-2014-04-03.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.4 2014-04-03" file="US20180213710A1-20180802.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20180718" date-publ="20180802">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<doc-number>20180213710</doc-number>
<kind>A1</kind>
<date>20180802</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>15419046</doc-number>
<date>20170130</date>
</document-id>
</application-reference>
<invention-title id="d2e43">AGRICULTURAL IMPLEMENT WITH RELEASABLE TOOLS</invention-title>
<assignees>
<assignee>
<addressbook>
<orgname>CNH Industrial Canada, Ltd.</orgname>
<role>03</role>
<address>
<city>Saskatoon</city>
<country>CA</country>
</address>
</addressbook>
</assignee>
</assignees>
</us-bibliographic-data-application>
<abstract id="abstract">
<p id="p-0001" num="0000">An agricultural implement includes a frame; a shank mounted to the frame and including at least one clip edge; a retaining clip press fitted to the at least one clip edge and including a movable lock biased in a locking direction; and a ground working tool with a locking opening at least partially filled by the movable lock to resist the tool being removed from the shank</p>
</abstract>
<us-claim-statement>What is claimed is:</us-claim-statement>
</us-patent-application>
My code:
import xml.etree.ElementTree as et  # cElementTree is deprecated and was removed in Python 3.9
import pandas as pd
parsed_xml = et.parse("25_pto_test.xml")
dfcols = ['us_patent_app_number', 'kind_code', 'date', 'invention_title', 'assignee', 'abstract']
df_xml = pd.DataFrame(columns=dfcols)
pto = {}
us_patent_app_number = parsed_xml.find('us-bibliographic-data-application/publication-reference/document-id/doc-number')
kind_code = parsed_xml.find('us-bibliographic-data-application/publication-reference/document-id/kind')
date = parsed_xml.find('us-bibliographic-data-application/publication-reference/document-id/date')
invention_title = parsed_xml.find('us-bibliographic-data-application/invention-title')
assignee = parsed_xml.find('us-bibliographic-data-application/assignees/assignee/addressbook/orgname')
abstract = parsed_xml.find('abstract/p')
pto.update({'US_Patent_App_Number': us_patent_app_number.text, 'Kind_Code': kind_code.text, 'Date': date.text,
'Invention_Title': invention_title.text, 'Company_Name': assignee.text, 'Abstract': abstract.text})
pto_data = pd.DataFrame.from_dict(pto, orient='index')
pto_data = pto_data.transpose()
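The code above parses the first XML section of the main file, but it then crashes at the end of the first XML document with a "junk after document" error at line x.
Can anyone help?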
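One possible way around the error, sketched under the assumption that every embedded document begins with its own `<?xml` declaration at the start of a line (as in the sample above): split the file into one chunk per document and parse each chunk separately, since an XML parser accepts only one root element per document. The names `iter_documents`, `extract_record`, `parse_file` and `FIELDS` are made up for this illustration; the element paths are the ones from the code above.

```python
import xml.etree.ElementTree as ET

# Output column -> element path relative to <us-patent-application>.
BIB = "us-bibliographic-data-application"
FIELDS = {
    "US_Patent_App_Number": BIB + "/publication-reference/document-id/doc-number",
    "Kind_Code": BIB + "/publication-reference/document-id/kind",
    "Date": BIB + "/publication-reference/document-id/date",
    "Invention_Title": BIB + "/invention-title",
    "Company_Name": BIB + "/assignees/assignee/addressbook/orgname",
    "Abstract": "abstract/p",
}


def iter_documents(lines):
    """Yield each embedded XML document as a single bytes object.

    Assumption: a line starting with b'<?xml' marks the start of a new document.
    """
    chunk = []
    for line in lines:
        if line.lstrip().startswith(b"<?xml") and chunk:
            yield b"".join(chunk)
            chunk = []
        # Skip blank lines that appear before a document has started.
        if chunk or line.strip():
            chunk.append(line)
    if chunk:
        yield b"".join(chunk)


def extract_record(xml_bytes):
    """Parse one document and return the wanted fields (None if absent)."""
    root = ET.fromstring(xml_bytes)
    return {name: root.findtext(path) for name, path in FIELDS.items()}


def parse_file(path):
    """Read the concatenated file line by line and return one dict per document."""
    with open(path, "rb") as fh:
        return [extract_record(doc) for doc in iter_documents(fh)]
```

The list returned by `parse_file("25_pto_test.xml")` can be passed straight to `pd.DataFrame(...)` to get one row per patent application. Only one document is held in memory at a time while parsing, which should matter for a 6+ million line file. `findtext` returns `None` for a missing element instead of raising, so a document without, say, an assignee does not stop the run.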