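I have an XML file (its contents are at the bottom of this post) that I want to parse. I would like to produce CSV output whose columns are 'forum title'; 'thread title'; 'user'; '{all w elements in the sentence}'.
I have this code: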
from lxml import etree

xmL = 'huge-xml.xml'

# Parse the XML file in chunks at a time and output info at every step of the way
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
    text = elem.text
    print(event, elem, text)
However, this does not find the contents of all tags; somehow it only finds the text of the w elements.
XML to parse:
<corpus id="politics">
<forum id="14" title="something &amp; something" url="https://www.at.net/1">
<thread id="108" title="a title" postcount="87" lastpost="2005-03-31 06:35" url="https://www.at.net/111/222">
<text datefrom="20020526" dateto="20020526" timefrom="230000" timeto="230059" id="1185" username="user123" userid="46" date="2002-03-22 23:00" url="https://www.at.net/111/333">
<sentence id="776550f8f2-7765cba9fe">
<w pos="NN" msd="NN.UTR.SIN.DEF.NOM" lemma="|gräns|" lex="|gräns..nn.1|" saldo="|gräns..1|" prefix="|grän..nn.1|" suffix="|s..nn.1|" ref="1" dephead="6" deprel="AA">Gränsen</w>
<w pos="PP" msd="PP" lemma="|mellan|" lex="|mellan..pp.1|" saldo="|mellan..1|" prefix="|" suffix="|" ref="2" dephead="4" deprel="DT">mellan</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|lycka|" lex="|lycka..nn.2|lycka..nn.1|" saldo="|lycka..2|lycka..1|lycka..3|" prefix="|" suffix="|" ref="3" dephead="4" deprel="CJ">lycka</w>
<w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="4" dephead="1" deprel="ET">och</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|död|" lex="|död..nn.1|" saldo="|död..2|" prefix="|" suffix="|" ref="5" dephead="4" deprel="CJ">död</w>
<w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|snäv|" lex="|snäv..av.1|" saldo="|snäv..1|" prefix="|" suffix="|" ref="6" dephead="" deprel="ROOT">snäv</w>
<w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="7" dephead="6" deprel="I?">?</w>
</sentence>
</text>
</thread>
... and so on ...
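Answer 0 (score: 1)
The following code extracts the forum title, the thread title, the user name, and the text of each w element. For every sentence it builds a list of these values and writes that list as one row of the CSV file.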
import csv
from lxml import etree

def readXML(xml_file):
    forum, thread, user = [''] * 3
    ws = []
    for event, elem in etree.iterparse(xml_file, events=('start', 'end')):
        # Attributes are already available at the 'start' event.
        if elem.tag == 'forum' and event == 'start':
            forum = elem.attrib['title']
        if elem.tag == 'thread' and event == 'start':
            thread = elem.attrib['title']
        if elem.tag == 'text' and event == 'start':
            user = elem.attrib['username']
        if elem.tag == 'sentence':
            if event == 'start':
                ws.clear()
            else:
                yield [forum, thread, user] + ws
                elem.clear()  # free the finished sentence; the input file is huge
        # Element text is only guaranteed to be complete at the 'end' event.
        if elem.tag == 'w' and event == 'end':
            ws.append(elem.text)

# newline='' is what the csv module expects; utf-8 keeps characters such as 'ä'.
with open('huge-csv.csv', 'w', newline='', encoding='utf-8') as fd:
    w = csv.writer(fd)
    w.writerows(readXML('huge-xml.xml'))
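As a side note on the two event types used above: attributes can be read as soon as an element's 'start' event arrives, while lxml does not guarantee that an element's text is present until its 'end' event. The sketch below is only an illustration of that behaviour. The `SAMPLE` snippet and the `trace_events` helper are made up for this example, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption so that the sketch also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib module offers a compatible iterparse for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical two-word sentence, just large enough to show the event order.
SAMPLE = b'<sentence id="s1"><w pos="NN">lycka</w><w pos="KN">och</w></sentence>'

def trace_events(xml_bytes):
    """Return (event, tag, attributes, text) tuples in parse order.

    Text is read only at 'end', because it may be incomplete at 'start'.
    """
    seen = []
    for event, elem in etree.iterparse(io.BytesIO(xml_bytes), events=('start', 'end')):
        text = elem.text if event == 'end' else None
        seen.append((event, elem.tag, dict(elem.attrib), text))
    return seen

for row in trace_events(SAMPLE):
    print(row)
```

Each element shows up twice, once per event, which is why the generator checks both `elem.tag` and `event` before acting.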
I tested the code with the following input file:
<?xml version="1.0" encoding="UTF-8"?>
<corpus id="politics">
<forum id="14" title="something &amp; something" url="https://www.at.net/1">
<thread id="108" title="a title" postcount="87" lastpost="2005-03-31 06:35" url="https://www.at.net/111/222">
<text datefrom="20020526" dateto="20020526" timefrom="230000" timeto="230059" id="1185" username="user123" userid="46" date="2002-03-22 23:00" url="https://www.at.net/111/333">
<sentence id="776550f8f2-7765cba9fe">
<w pos="NN" msd="NN.UTR.SIN.DEF.NOM" lemma="|gräns|" lex="|gräns..nn.1|" saldo="|gräns..1|" prefix="|grän..nn.1|" suffix="|s..nn.1|" ref="1" dephead="6" deprel="AA">Gränsen</w>
<w pos="PP" msd="PP" lemma="|mellan|" lex="|mellan..pp.1|" saldo="|mellan..1|" prefix="|" suffix="|" ref="2" dephead="4" deprel="DT">mellan</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|lycka|" lex="|lycka..nn.2|lycka..nn.1|" saldo="|lycka..2|lycka..1|lycka..3|" prefix="|" suffix="|" ref="3" dephead="4" deprel="CJ">lycka</w>
<w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="4" dephead="1" deprel="ET">och</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|död|" lex="|död..nn.1|" saldo="|död..2|" prefix="|" suffix="|" ref="5" dephead="4" deprel="CJ">död</w>
<w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|snäv|" lex="|snäv..av.1|" saldo="|snäv..1|" prefix="|" suffix="|" ref="6" dephead="" deprel="ROOT">snäv</w>
<w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="7" dephead="6" deprel="I?">?</w>
</sentence>
</text>
</thread>
</forum>
</corpus>
The resulting CSV file:
something & something,a title,user123,Gränsen,mellan,lycka,och,död,snäv,?
I'm not sure whether you want the w elements joined into one field. If so, replace yield [forum, thread, user] + ws
with yield [forum, thread, user, ' '.join(ws)].
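For reference, here is a minimal, self-contained sketch of that joined variant, so it can be tried without the large file. It is a sketch under stated assumptions, not a definitive implementation: the `SAMPLE` document is a shortened, made-up version of the input, `sentence_rows` is a hypothetical name, rows are written to an in-memory `io.StringIO` instead of `huge-csv.csv`, and the standard library's `xml.etree.ElementTree` is used as a fallback when lxml is not installed.

```python
import csv
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib module offers a compatible iterparse for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical, shortened input with two sentences in one post.
SAMPLE = '''<corpus id="politics">
<forum id="14" title="something &amp; something">
<thread id="108" title="a title">
<text id="1185" username="user123">
<sentence id="s1"><w>lycka</w><w>och</w><w>död</w></sentence>
<sentence id="s2"><w>snäv</w><w>?</w></sentence>
</text>
</thread>
</forum>
</corpus>'''.encode('utf-8')

def sentence_rows(source):
    """Yield [forum, thread, user, sentence] with the words joined by spaces."""
    forum = thread = user = ''
    words = []
    for event, elem in etree.iterparse(source, events=('start', 'end')):
        if event == 'start':
            # Attributes are available as soon as the element opens.
            if elem.tag == 'forum':
                forum = elem.get('title', '')
            elif elem.tag == 'thread':
                thread = elem.get('title', '')
            elif elem.tag == 'text':
                user = elem.get('username', '')
            elif elem.tag == 'sentence':
                words = []
        elif elem.tag == 'w':
            # 'end' event: the text of the word is complete here.
            words.append(elem.text or '')
        elif elem.tag == 'sentence':
            yield [forum, thread, user, ' '.join(words)]

out = io.StringIO()
csv.writer(out).writerows(sentence_rows(io.BytesIO(SAMPLE)))
print(out.getvalue())
```

With this variant every row has exactly four columns, which tends to be easier to load into a spreadsheet than rows whose length depends on the number of words.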