I have a very large XML file (> 1.5 GB) that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<events version="1.0">
<event time="0.0" type="actend" person="94324001" link="119380" actType="home_94200.0" />
<event time="0.0" type="departure" person="94324001" link="119380" legMode="bicycle" />
<event time="0.0" type="actend" person="93120501" link="116274" actType="home_94800.0" />
<event time="0.0" type="departure" person="93120501" link="116274" legMode="bicycle" />
<event time="0.0" type="actend" person="84637601" link="72152" actType="home_90600.0" />
<event time="0.0" type="departure" person="84637601" link="72152" legMode="ride" />
<event time="0.0" type="actend" person="78914201" link="49600" actType="home_91800.0" />
<event time="0.0" type="departure" person="78914201" link="49600" legMode="access_walk" />
<event time="0.0" type="actend" person="74265301" link="48593" actType="home_96000.0" />
....
</events>
When I try to parse it with the following code:
import xml.etree.ElementTree as ET
import gzip
# Parsing Event XML and saving in a list
def gzipedXMLparser(filename):
    vehicleIDs = []
    data = gzip.open(filename, mode="rb")
    datatoparse = ET.iterparse(filename, events = ("start", "end"), parser = ET.XMLParser(encoding = 'utf-8'))
    datatoparse = iter(datatoparse)
    event, root = datatoparse.__next__()
    for event, elem in datatoparse:
        if event == "end" and elem.tag == "event":
            if elem.attrib["type"] == "vehicle enters traffic":
                if elem.attrib["vehicle"] in vehicleIDs:
                    pass
                else:
                    vehicleIDs.append(elem.attrib)
            elem.clear
            root.clear()
    print(vehicleIDs)
    return vehicleIDs
I get the following error message:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
Can someone explain what the problem is and how to fix it?
Answer 0 (score: 0)
It looks like your XML may contain some invalid characters. In any case, you can check the related question "ParseError: not well-formed (invalid token) using cElementTree".
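To follow up on the invalid-character idea, it may help to ask the parser where exactly it gave up. The sketch below is only an illustration: `locate_parse_error` is a hypothetical helper name, and it relies on the `position` attribute that `ET.ParseError` exposes as a `(line, column)` tuple. It streams through the document, discards everything it reads, and reports the first error it hits.

```python
import xml.etree.ElementTree as ET


def locate_parse_error(source):
    """Stream through `source` (a path or a binary file object).

    Returns None if the document is well-formed, otherwise a tuple
    (line, column, message) describing the first parse error.
    `locate_parse_error` is a hypothetical helper, not a library function.
    """
    try:
        context = ET.iterparse(source, events=("start", "end"))
        # The first event is the "start" of the root element.
        _event, root = next(context)
        for event, _elem in context:
            if event == "end":
                # Drop already-parsed children so memory stays flat
                # even for multi-gigabyte files.
                root.clear()
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None
```

If this reports an error on line 1 at the very start of the file, the file probably does not begin with XML text at all (for example, it could be compressed or truncated). An error deep inside the file points more toward a damaged or incomplete download.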
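Answer 1 (score: 0)
The problem was the XML file itself: it was corrupted somewhere. I downloaded it again from another location, and now it works fine.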
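For anyone who hits the same message, it may be worth checking one more thing. The code in the question opens the file with `gzip.open` but then passes `filename` rather than the opened stream to `ET.iterparse`, so the decompressed data is never read. If the file really is gzip-compressed, the parser would see compressed bytes, which is consistent with a failure at line 1, column 0. The following is a minimal sketch under a few assumptions: `open_maybe_gzipped` and `collect_vehicle_ids` are hypothetical names, and I am assuming the goal is the set of distinct `vehicle` attribute values for events of type "vehicle enters traffic".

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip file


def open_maybe_gzipped(filename):
    """Open `filename` for binary reading, decompressing if it is gzip.

    Hypothetical helper: it sniffs the magic bytes instead of trusting
    the file extension.
    """
    with open(filename, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    return gzip.open(filename, "rb") if is_gzip else open(filename, "rb")


def collect_vehicle_ids(filename):
    """Return the distinct vehicle IDs that enter traffic (assumed goal)."""
    vehicle_ids = set()
    with open_maybe_gzipped(filename) as stream:
        # Pass the opened stream, not the filename, to the parser.
        context = ET.iterparse(stream, events=("start", "end"))
        _event, root = next(context)  # first event is the root's "start"
        for event, elem in context:
            if event == "end" and elem.tag == "event":
                if elem.attrib.get("type") == "vehicle enters traffic":
                    vehicle = elem.attrib.get("vehicle")
                    if vehicle is not None:
                        vehicle_ids.add(vehicle)
                # Free the processed element so memory does not grow.
                root.clear()
    return vehicle_ids
```

A set keeps the membership check cheap on a file with millions of events. Reading through the gzip stream also doubles as a rough integrity check, since a truncated archive typically raises `EOFError` partway through.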