How to use lxml iterparse

Date: 2017-01-17 21:54:26

Tags: python xml lxml python-3.6

I am trying to parse a very large XML file (about 1 GB) in the following format:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE candidates SYSTEM "dtd/mwetoolkit-candidates.dtd">
<!-- MWETOOLKIT: filetype="XML" -->
<candidates>
<meta>
    <corpussize name="ukwac-01" value="38224449" />
    <corpussize name="sum" value="38224449" />
</meta>
<cand candid="2">
    <ngram><w lemma="executive" pos="JJ" ><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS" ><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
    <occurs>
    <ngram><w surface="Executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="1" /></ngram>
    <ngram><w surface="executive" lemma="executive" pos="JJ" /> <w surface="boxes" lemma="box" pos="NNS" /> <freq name="ukwac-01" value="8" /></ngram>
    </occurs>
</cand>
<cand candid="5">
    <ngram><w lemma="bad" pos="JJ" ><freq name="ukwac-01" value="4094" /><freq name="sum" value="4094" /></w> <w lemma="thing" pos="NN" ><freq name="ukwac-01" value="6609" /><freq name="sum" value="6609" /></w> <freq name="ukwac-01" value="119" /><freq name="sum" value="119" /></ngram>
    <occurs>
    <ngram><w surface="bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="115" /></ngram>
    <ngram><w surface="Bad" lemma="bad" pos="JJ" /> <w surface="thing" lemma="thing" pos="NN" /> <freq name="ukwac-01" value="4" /></ngram>
    </occurs>
</cand>
</candidates>

So far, I have this code:

from lxml import etree
import sys

def fast_iter(context, func):
    #http://www.ibm.com.br/developerworks/xml/library/x-hiperfparse/
    #Author = Liza Daly
    for event, elem in context:       
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def print_csv(element):
    if element.tag == 'cand':
        lemmas = []
        compound_freqs = []
        mweval = 0
        for f in element.xpath('ngram/freq'):
            if f.attrib['name'] == 'ukwac':
                mweval = int(f.attrib['value'])
        for w in element.xpath('ngram/w'):
            lemmas.append(w.attrib['lemma'])
        for freq in element.xpath('ngram/w/freq'):
            if freq.attrib['name'] == 'ukwac':
                compound_freqs.append(int(freq.attrib['value']))
        print(' '.join(lemmas),mweval,sep='\t',end='\t')
        [print(l,f,sep=":",end='') for l,f in zip(lemmas,compound_freqs)]
        print()


if __name__ == '__main__':
    args = sys.argv
    context = etree.iterparse(args[1], events=("start", "end"))
    print("mwe","mwe_freq","compounds",sep='\t')
    for event, element in context:
        if element.tag == "candidates":
            fast_iter(context, print_csv)

The desired output is a CSV file of the form:

mwe        mwe_freq    compounds
executive box    9    executive:600,box:1006

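The exact print format may (and will) change, but for some reason, once I reach the print function and get past the element.tag check, the freq elements are empty and all I print is their addresses. I know I should be checking for the end event somewhere, as the iterparse documentation does, but I tried putting such a check in fast_iter and it did not help.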
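One thing that may be worth checking here, offered as a small sketch rather than a confirmed diagnosis: with events=("start", "end"), iterparse reports every element twice, and fast_iter calls func and then elem.clear() on both events. At the "start" event the children of an element are not guaranteed to be parsed yet, and clearing at that point discards whatever was already there. The snippet below only shows which events arrive for cand elements; the XML string and the cand_events helper are made up for illustration.

```python
import io
from lxml import etree

# Trimmed, hypothetical stand-in for the real file: two nearly empty <cand> elements.
XML = (b'<candidates><cand candid="2"><ngram/></cand>'
       b'<cand candid="5"><ngram/></cand></candidates>')


def cand_events(events):
    """Return the event names iterparse reports for <cand> elements."""
    context = etree.iterparse(io.BytesIO(XML), events=events)
    return [event for event, elem in context if elem.tag == "cand"]


both = cand_events(("start", "end"))   # each <cand> shows up twice
end_only = cand_events(("end",))       # each <cand> shows up once, fully parsed
print(both)
print(end_only)
```

If each cand is handled on both events, that would be consistent with every row being followed by a second, nearly empty row.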

My current output:

mwe     mwe_freq        compounds
<Element freq at 0x7f8735342c48>
<Element freq at 0x7f8735342c88>
executive box   0
        0
<Element freq at 0x7f8735346708>
<Element freq at 0x7f87353467c8>
bad thing       0
        0

Any help is greatly appreciated.

0 answers:

No answers yet.
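A possible direction, sketched under stated assumptions and not tested against the full 1 GB file: handle only "end" events for cand elements, and compare the freq name against the value that actually appears in the sample ("ukwac-01", whereas the posted code compares against 'ukwac', which never matches). The names SAMPLE, CORPUS, cand_to_row and iter_rows are hypothetical; the tag= keyword of etree.iterparse is lxml-specific, and the cleanup loop follows the fast_iter pattern from the question.

```python
import io
from lxml import etree

# Trimmed copy of the sample from the question, used here instead of a file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<candidates>
<meta><corpussize name="ukwac-01" value="38224449" /></meta>
<cand candid="2">
    <ngram><w lemma="executive" pos="JJ"><freq name="ukwac-01" value="600" /><freq name="sum" value="600" /></w> <w lemma="box" pos="NNS"><freq name="ukwac-01" value="1006" /><freq name="sum" value="1006" /></w> <freq name="ukwac-01" value="9" /><freq name="sum" value="9" /></ngram>
    <occurs>
    <ngram><w surface="Executive" lemma="executive" pos="JJ" /> <freq name="ukwac-01" value="1" /></ngram>
    </occurs>
</cand>
<cand candid="5">
    <ngram><w lemma="bad" pos="JJ"><freq name="ukwac-01" value="4094" /></w> <w lemma="thing" pos="NN"><freq name="ukwac-01" value="6609" /></w> <freq name="ukwac-01" value="119" /></ngram>
</cand>
</candidates>"""

CORPUS = "ukwac-01"  # assumption: the freq name to report, as spelled in the sample


def cand_to_row(cand):
    """Build one tab-separated row from a fully parsed <cand> element."""
    lemmas, compounds = [], []
    # "ngram/w" is relative to <cand>, so the <ngram> elements under <occurs> are skipped.
    for w in cand.findall("ngram/w"):
        lemmas.append(w.get("lemma"))
        for f in w.findall("freq"):
            if f.get("name") == CORPUS:
                compounds.append(f"{w.get('lemma')}:{f.get('value')}")
    # Frequency of the whole n-gram: <freq> directly under the top-level <ngram>.
    mwe_freq = 0
    for f in cand.findall("ngram/freq"):
        if f.get("name") == CORPUS:
            mwe_freq = int(f.get("value"))
    return "\t".join([" ".join(lemmas), str(mwe_freq), ",".join(compounds)])


def iter_rows(source):
    """Yield one row per <cand>, freeing each element once it has been used."""
    # Only "end" events, only <cand>: the element is complete when it arrives.
    context = etree.iterparse(source, events=("end",), tag="cand")
    for _event, cand in context:
        yield cand_to_row(cand)
        cand.clear()
        # Drop already-processed siblings so the tree does not keep growing.
        while cand.getprevious() is not None:
            del cand.getparent()[0]


rows = list(iter_rows(io.BytesIO(SAMPLE)))
print("mwe", "mwe_freq", "compounds", sep="\t")
for row in rows:
    print(row)
```

For the real data, iter_rows should also accept a file path (for example sys.argv[1]) in place of the io.BytesIO object, since etree.iterparse takes either a filename or a file-like object.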