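Please help. I am trying to parse a large XML file and write the data out to a CSV file. I keep losing a large amount of the data between the tags and cannot figure out why.
Here is some of the XML: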
<testcase internalid="1256092" name="hls_vtt_single_default_diable_vtt">
<node_order><![CDATA[7]]></node_order>
<externalid><![CDATA[6121]]></externalid>
<version><![CDATA[2]]></version>
<summary><![CDATA[<p>condition: single subtitle track is available in stream and it is default set the vtt track to diable status before playing stream.</p>
<p> </p>
<div>play stream no subtitle is rendered along with A/V<span class="Apple-tab-span" style="white-space:pre"> </span></div>
<div> </div>]]></summary>
<preconditions><![CDATA[]]></preconditions>
<execution_type><![CDATA[1]]></execution_type>
<importance><![CDATA[2]]></importance>
</testcase>
Here is my Python code:
class CaseHandler( xml.sax.ContentHandler ):
    def __init__(self):
        self.CurrentData = ""
        self.externalid = ""
        self.version = ""
        self.summary = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "testcase":
            name = attributes["name"]
            outfile.write("\n" + name + " ,")

    def endElement(self, tag):
        if self.CurrentData == "externalid":
            outfile.write("OTV52-" + self.externalid + ",")
        elif self.CurrentData == "version":
            outfile.write("Version: " + self.version + ",")
        elif self.CurrentData == "summary":
            outfile.write("Summary: " + self.summary + ",")

    def characters(self, content):
        if self.CurrentData == "externalid":
            self.externalid = content
        elif self.CurrentData == "version":
            self.version = content
        elif self.CurrentData == "summary":
            self.summary = content

if ( __name__ == "__main__"):
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    Handler = CaseHandler()
    parser.setContentHandler( Handler )
    parser.parse("OTV52.xml")
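The problem is that it does not return any of the information inside the "summary" tag. The externalid and version data come through fine, but all I get back from the summary tag is the div tags.
I need it to return:
"condition: single subtitle track is available in stream and it is default set the vtt track to diable status before playing stream. play stream no subtitle is rendered along with A/V"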
Answer 0 (score: 0)
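As this answer shows, you should append to the parsed value with += content on every characters() call, because the parser may deliver the text of a single element in several chunks. However, to strip the markup inside the parsed CDATA (including line breaks and spaces), consider a regex substitution: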
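import re
import xml.sax

class CaseHandler( xml.sax.ContentHandler ):
    # outfile is the already-open output file from the question's script.
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.externalid = ""
        self.version = ""
        self.summary = ""

    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "testcase":
            # characters() appends, so start every test case with empty buffers.
            self.externalid = self.version = self.summary = ""
            name = attributes["name"]
            outfile.write("\n" + name + " ,")

    def endElement(self, tag):
        if self.CurrentData == "externalid":
            outfile.write("OTV52-" + self.externalid + ",")
        elif self.CurrentData == "version":
            outfile.write("Version: " + self.version + ",")
        elif self.CurrentData == "summary":
            # Drop the HTML tags embedded in the CDATA.
            self.summary = re.sub(r"<[^>]+>", "", self.summary)
            # Drop line breaks together with any whitespace around them.
            self.summary = re.sub(r"\s*\n\s*", "", self.summary).strip()
            outfile.write("Summary: " + self.summary + ",")
        # Stop collecting so whitespace between elements is not appended.
        self.CurrentData = ""

    def characters(self, content):
        # One element's text may arrive over several calls, so append each chunk.
        if self.CurrentData == "externalid":
            self.externalid += content
        elif self.CurrentData == "version":
            self.version += content
        elif self.CurrentData == "summary":
            self.summary += content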
Output (all on one line):
#
# hls_vtt_single_default_diable_vtt ,OTV52-6121,Version: 2,Summary: \
# condition: single subtitle track is available in stream and it is \
# default set the vtt track to diable status before playing \
# stream.play stream no subtitle is rendered along with A/V, \
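
For anyone who wants to try the chunk-accumulation idea without the original file, here is a minimal self-contained sketch. It is not the code from the answer above: `RowHandler`, `convert`, and the shortened `SAMPLE` document (including its `<testcases>` wrapper element) are hypothetical names and data made up for illustration. It also makes two different design choices, both assumptions on my part: rows are written with the standard `csv` module so that commas inside the summary get quoted, and each embedded tag is replaced by a space before whitespace is collapsed, so adjacent paragraphs stay separated by one space.

```python
import csv
import io
import re
import xml.sax

# Hypothetical, shortened sample in the same shape as the question's XML.
SAMPLE = """<testcases>
<testcase internalid="1256092" name="hls_vtt_single_default_diable_vtt">
<externalid><![CDATA[6121]]></externalid>
<version><![CDATA[2]]></version>
<summary><![CDATA[<p>condition: first line.</p>
<p> </p>
<div>play stream, no subtitle</div>]]></summary>
</testcase>
</testcases>"""


class RowHandler(xml.sax.ContentHandler):
    """Collects one CSV row per <testcase> element (illustrative only)."""

    FIELDS = ("externalid", "version", "summary")

    def __init__(self, writer):
        super().__init__()
        self.writer = writer
        self.row = {}
        self.current = None   # field currently being collected, if any
        self.chunks = []      # text pieces received for that field

    def startElement(self, tag, attributes):
        if tag == "testcase":
            self.row = {"name": attributes["name"]}
        elif tag in self.FIELDS:
            self.current = tag
            self.chunks = []

    def characters(self, content):
        # The parser may call this several times per element, so keep every piece.
        if self.current is not None:
            self.chunks.append(content)

    def endElement(self, tag):
        if tag == self.current:
            text = "".join(self.chunks)
            text = re.sub(r"<[^>]+>", " ", text)    # replace embedded HTML tags
            self.row[tag] = " ".join(text.split())  # collapse runs of whitespace
            self.current = None
        elif tag == "testcase":
            fields = [self.row.get(f, "") for f in self.FIELDS]
            self.writer.writerow([self.row.get("name", "")] + fields)


def convert(xml_text):
    """Parse xml_text and return the resulting CSV as a string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    xml.sax.parseString(xml_text.encode("utf-8"), RowHandler(writer))
    return out.getvalue()


if __name__ == "__main__":
    print(convert(SAMPLE), end="")
```

For a real file you would pass an open file object to `csv.writer` and call `xml.sax.parse` with the file name instead of `parseString`; the handler itself would stay the same.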