I am trying to parse a very large XML file (about 1 GB) in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE candidates SYSTEM "dtd/mwetoolkit-candidates.dtd">
<!-- MWETOOLKIT: filetype="XML" -->
<candidates>
<meta>
<corpussize name="ukwac-01" value="38224449" />
<corpussize name="sum" value="38224449" />
</meta>
<cand candid="2">
<ngram><w lemma="executive" pos="JJ" ><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS" ><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
<occurs>
<ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
<ngram><w surface="executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="8" /></ngram>
</occurs>
</cand>
<cand candid="5">
<ngram><w lemma="bad" pos="JJ" ><freq name="ukwac-01" value="4094" /><freq name="sum" value="4094" /></w> <w lemma="thing" pos="NN" ><freq name="ukwac-01" value="6609" /><freq name="sum" value="6609" /></w> <freq name="ukwac-01" value="119" /><freq name="sum" value="119" /></ngram>
<occurs>
<ngram><w surface="bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="115" /></ngram>
<ngram><w surface="Bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="4" /></ngram>
</occurs>
</cand>
</candidates>
So far I have this code:
from lxml import etree
import sys

def fast_iter(context, func):
    # http://www.ibm.com.br/developerworks/xml/library/x-hiperfparse/
    # Author = Liza Daly
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def print_csv(element):
    if element.tag == 'cand':
        lemmas = []
        compound_freqs = []
        mweval = 0
        for f in element.xpath('ngram/freq'):
            if f.attrib['name'] == 'ukwac':
                mweval = int(f.attrib['value'])
        for w in element.xpath('ngram/w'):
            lemmas.append(w.attrib['lemma'])
        for freq in element.xpath('ngram/w/freq'):
            if freq.attrib['name'] == 'ukwac':
                compound_freqs.append(int(freq.attrib['value']))
        print(' '.join(lemmas),mweval,sep='\t',end='\t')
        [print(l,f,sep=":",end='') for l,f in zip(lemmas,compound_freqs)]
        print()

if __name__ == '__main__':
    args = sys.argv
    context = etree.iterparse(args[1], events=("start", "end"))
    print("mwe","mwe_freq","compounds",sep='\t')
    for event, element in context:
        if element.tag == "candidates":
            fast_iter(context, print_csv)
The desired output is a CSV file in this format:
mwe mwe_freq compounds
executive box 9 executive:600,box:1006
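The exact print format may (and will) change, but for some reason, once execution reaches the print function and gets past the element.tag check, the freq elements are empty and all I print is their addresses. I know I should be checking for the end event somewhere, as the iterparse documentation does, but I tried putting such a check in fast_iter and it definitely did not work.
My current output: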
mwe mwe_freq compounds
<Element freq at 0x7f8735342c48>
<Element freq at 0x7f8735342c88>
executive box 0
0
<Element freq at 0x7f8735346708>
<Element freq at 0x7f87353467c8>
bad thing 0
0
Any help is greatly appreciated.
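For reference, the end-event check mentioned above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a confirmed fix: it swaps in the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies (lxml's etree.iterparse accepts the same events argument), the names SAMPLE, iter_cand_rows and corpus are hypothetical, and the inline sample is a trimmed copy of the data above. It acts on a cand element only once its end event arrives, and it matches the name attribute exactly as it appears in the sample (ukwac-01).

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the real 1 GB file.
SAMPLE = b"""<candidates>
<cand candid="2">
<ngram><w lemma="executive" pos="JJ"><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS"><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
<occurs>
<ngram><w surface="executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="9" /></ngram>
</occurs>
</cand>
<cand candid="5">
<ngram><w lemma="bad" pos="JJ"><freq name="ukwac-01" value="4094" /><freq name="sum" value="4094" /></w> <w lemma="thing" pos="NN"><freq name="ukwac-01" value="6609" /><freq name="sum" value="6609" /></w> <freq name="ukwac-01" value="119" /><freq name="sum" value="119" /></ngram>
</cand>
</candidates>"""


def iter_cand_rows(source, corpus="ukwac-01"):
    """Yield (lemmas, mwe_freq, word_freqs) for every completed <cand>."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it
    # so that processed children can be released later.
    _event, root = next(context)
    for event, elem in context:
        # Children are only guaranteed to be complete at the "end" event.
        if event != "end" or elem.tag != "cand":
            continue
        lemmas = [w.get("lemma") for w in elem.findall("ngram/w")]
        word_freqs = [
            int(f.get("value"))
            for f in elem.findall(f"ngram/w/freq[@name='{corpus}']")
        ]
        mwe = elem.find(f"ngram/freq[@name='{corpus}']")
        mwe_freq = int(mwe.get("value")) if mwe is not None else 0
        yield lemmas, mwe_freq, word_freqs
        # Drop the children parsed so far to keep memory usage bounded.
        root.clear()


print("mwe", "mwe_freq", "compounds", sep="\t")
for lemmas, mwe_freq, word_freqs in iter_cand_rows(io.BytesIO(SAMPLE)):
    compounds = ",".join(f"{l}:{f}" for l, f in zip(lemmas, word_freqs))
    print(" ".join(lemmas), mwe_freq, compounds, sep="\t")
```

With the real file, the io.BytesIO(SAMPLE) argument would be replaced by the file path taken from sys.argv. Yielding plain lists and integers rather than elements keeps the row data valid after the tree has been cleared.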