import xml.etree.ElementTree as ET
import csv
import re
import codecs
import io

xml = open('ipa110106.xml')
line_num = 0
count = 0   # counters initialized before use
count2 = 0
f = open('workfile.xml', 'w')
for line in xml:
    line_num += 1
    if line_num == 1:
        print(line)
    if '<?xml version="1.0" encoding="UTF-8"?>' in line and line_num != 1:
        count = count + 1
        line = line.replace('<?xml version="1.0" encoding="UTF-8"?>', '')
    if '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>' in line:
        line = line.replace('<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>', '')
        count2 += 1
    if "!DOCTYPE" in line:
        line = line.replace('<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>', '')
    f.write(line)
f.close()

with open("workfile.xml") as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
root = tree.getroot()
Result:
<?xml version="1.0" encoding="UTF-8"?>
0
Traceback (most recent call last):
File "<ipython-input-164-4d6fc9ea9aac>", line 1, in <module>
runfile('C:/Users/Harshit/Downloads/ipa110106 (1)/parsing_test5.py', wdir='C:/Users/Harshit/Downloads/ipa110106 (1)')
File "C:\Users\Harshit\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "C:\Users\Harshit\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Harshit/Downloads/ipa110106 (1)/parsing_test5.py", line 41, in <module>
root= tree.getroot()
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getroot'
I am trying to parse USPTO XML files to extract the relevant information. These files are a concatenation of multiple XML documents, and following the standard advice given on this forum, I removed the repeated instances of:

<?xml version="1.0" encoding="UTF-8"?>

and

<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>

because they were also causing errors:

ParseError: not well-formed (invalid token): line 2, column 2.

Finally, after removing these troublesome elements from the XML, I created a synthetic parent root to turn the file into well-formed XML. However, when I try to parse this file and access its root, I get the error shown above. The code is attached in the post.
Also, the XML file is very large, so I can only share a link to it - enter link description here
A small sample of the XML file, as-is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.2 2006-08-23" file="US20110000001A1-20110106.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20101222" date-publ="20110106">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<country>US</country>
<doc-number>20110000001</doc-number>
<kind>A1</kind>
<date>20110106</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>12838840</doc-number>
<date>20100719</date>
</document-id>
</application-reference>
<us-application-series-code>12</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>IL</country>
<doc-number>189088</doc-number>
<date>20080128</date>
</priority-claim>
</priority-claims>
<classifications-ipcr>
<classification-ipcr>
Answer 0 (score 0):
Current PTO XML files are valid XML if you split them at the XML declarations and process each publication separately. I would expect trying to process them all at once to use a very large amount of memory. Either way, the replacements you are making are not needed.
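For reference, here is a minimal sketch of that splitting approach (the function name and the inline sample data are illustrative, not from the original post). One detail worth noting: `ET.fromstring()` returns an `Element`, not an `ElementTree`, which is exactly why the `getroot()` call in the question raises `AttributeError` — the returned object already is the root.

```python
import xml.etree.ElementTree as ET

def iter_documents(lines):
    """Yield each document from a stream of concatenated XML,
    splitting whenever a new XML declaration line appears."""
    buffer = []
    for line in lines:
        if line.lstrip().startswith('<?xml') and buffer:
            yield ''.join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield ''.join(buffer)

# Demo on a tiny concatenated sample (two documents glued together):
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application><doc-number>1</doc-number></us-patent-application>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application><doc-number>2</doc-number></us-patent-application>\n'
)
roots = [ET.fromstring(doc)
         for doc in iter_documents(sample.splitlines(keepends=True))]
# fromstring() returns the root Element directly, so no getroot() is needed.
print(len(roots), roots[0].find('doc-number').text)  # prints: 2 1
```

For the real data you would pass `open('ipa110106.xml', encoding='utf-8')` as `lines`, so each publication is parsed on its own without loading the whole file into memory at once.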
My solution was to create a class that holds the zipfile (for others who may not know: the data is distributed as a zip archive containing a single file of concatenated XML documents) and that has a function which yields each XML document in turn. I then use ET.XML() to process those documents.
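A sketch of what such a class could look like (the class name `PatentArchive` and the in-memory demo zip are my own illustration; the answer only describes the approach in outline):

```python
import io
import xml.etree.ElementTree as ET
import zipfile

class PatentArchive:
    """Wrap a PTO zip archive and yield each concatenated
    XML document from its single member file in turn."""

    def __init__(self, zip_path_or_file):
        self.zf = zipfile.ZipFile(zip_path_or_file)

    def documents(self):
        # The archive holds one member: a big concatenated XML file.
        member = self.zf.namelist()[0]
        buffer = []
        with self.zf.open(member) as f:
            for raw in f:
                line = raw.decode('utf-8')
                if line.lstrip().startswith('<?xml') and buffer:
                    yield ''.join(buffer)
                    buffer = []
                buffer.append(line)
        if buffer:
            yield ''.join(buffer)

# Demo with an in-memory zip containing two concatenated documents:
payload = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application><kind>A1</kind></us-patent-application>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application><kind>A2</kind></us-patent-application>\n'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('ipa110106.xml', payload)

archive = PatentArchive(buf)
# ET.XML() (an alias for fromstring) parses each document separately.
kinds = [ET.XML(doc).find('kind').text for doc in archive.documents()]
print(kinds)  # prints: ['A1', 'A2']
```

Reading line by line from the zip member this way means no declaration-stripping, no synthetic root, and only one publication is held in memory at a time.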