Extract field tags and their content from XML using lxml

Time: 2017-03-14 07:07:00

Tags: python xml lxml

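I have an XML file (its contents are at the bottom of this post) that I want to parse. I would like to produce CSV output whose columns are 'forum title'; 'thread title'; 'user'; '{all w in the sentence}'.

I have this code: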

from lxml import etree
xmL = 'huge-xml.xml'

# Parse the XML file in chunks at a time and output info at every step of the way

for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
    text = elem.text
    print(event, elem, text)

But this does not find the content of all the tags; for some reason it only finds the content of w.

XML to parse:

<corpus id="politics">
<forum id="14" title="something &amp; something" url="https://www.at.net/1">
<thread id="108" title="a title" postcount="87" lastpost="2005-03-31 06:35" url="https://www.at.net/111/222">
<text datefrom="20020526" dateto="20020526" timefrom="230000" timeto="230059" id="1185" username="user123" userid="46" date="2002-03-22 23:00" url="https://www.at.net/111/333">
<sentence id="776550f8f2-7765cba9fe">
<w pos="NN" msd="NN.UTR.SIN.DEF.NOM" lemma="|gräns|" lex="|gräns..nn.1|" saldo="|gräns..1|" prefix="|grän..nn.1|" suffix="|s..nn.1|" ref="1" dephead="6" deprel="AA">Gränsen</w>
<w pos="PP" msd="PP" lemma="|mellan|" lex="|mellan..pp.1|" saldo="|mellan..1|" prefix="|" suffix="|" ref="2" dephead="4" deprel="DT">mellan</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|lycka|" lex="|lycka..nn.2|lycka..nn.1|" saldo="|lycka..2|lycka..1|lycka..3|" prefix="|" suffix="|" ref="3" dephead="4" deprel="CJ">lycka</w>
<w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="4" dephead="1" deprel="ET">och</w>
<w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|död|" lex="|död..nn.1|" saldo="|död..2|" prefix="|" suffix="|" ref="5" dephead="4" deprel="CJ">död</w>
<w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|snäv|" lex="|snäv..av.1|" saldo="|snäv..1|" prefix="|" suffix="|" ref="6" dephead="" deprel="ROOT">snäv</w>
<w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="7" dephead="6" deprel="I?">?</w>
</sentence>
</text>
</thread>

... and so on ...

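1 answer:

Answer 0 (score: 1)

The following code extracts the forum title, the thread title, the username and the text of each w. For every sentence it builds a list of these values and writes that list as one row of the CSV file.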

import csv
from lxml import etree


def readXML(xml_file):
    # Title of the current forum and thread, and the current post's username
    forum, thread, user = [''] * 3
    ws = []

    for event, elem in etree.iterparse(xml_file, events=('start', 'end')):
        # Attributes are available as soon as the opening tag has been read
        if elem.tag == 'forum' and event == 'start':
            forum = elem.attrib['title']
        if elem.tag == 'thread' and event == 'start':
            thread = elem.attrib['title']
        if elem.tag == 'text' and event == 'start':
            user = elem.attrib['username']
        if elem.tag == 'sentence':
            if event == 'start':
                ws.clear()
            else:
                yield [forum, thread, user] + ws
        # Element text is only guaranteed to be complete at the 'end' event
        if elem.tag == 'w' and event == 'end':
            ws.append(elem.text)


# newline='' lets the csv module control the line endings itself
with open('huge-csv.csv', 'w', newline='', encoding='utf-8') as fd:
    w = csv.writer(fd)
    w.writerows(readXML('huge-xml.xml'))
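
If the file is as large as the name huge-xml.xml suggests, memory use may matter: lxml's iterparse keeps the elements it has already parsed in the tree unless they are cleared. The following is only a sketch of one way to deal with that, not something taken from the code above. It collects the words when a sentence ends and then calls elem.clear() on it. The function name iter_sentence_rows and the shortened inline sample are made up for illustration.

```python
import io

from lxml import etree

# Shortened, hypothetical sample with the same structure as the file above.
SAMPLE = b"""<corpus id="politics">
<forum id="14" title="something &amp; something">
<thread id="108" title="a title">
<text id="1185" username="user123">
<sentence id="s1"><w>lycka</w><w>och</w><w>mellan</w></sentence>
<sentence id="s2"><w>lycka</w><w>?</w></sentence>
</text>
</thread>
</forum>
</corpus>"""


def iter_sentence_rows(source):
    """Yield [forum title, thread title, username, w1, w2, ...] per sentence."""
    forum = thread = user = ''
    for event, elem in etree.iterparse(source, events=('start', 'end')):
        if event == 'start':
            # Attributes can be read as soon as the start tag is seen.
            if elem.tag == 'forum':
                forum = elem.get('title', '')
            elif elem.tag == 'thread':
                thread = elem.get('title', '')
            elif elem.tag == 'text':
                user = elem.get('username', '')
        elif elem.tag == 'sentence':
            # At the end of a sentence all of its <w> children are complete.
            words = [w.text or '' for w in elem.iter('w')]
            yield [forum, thread, user] + words
            # Release the children that are no longer needed.
            elem.clear()


rows = list(iter_sentence_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

The emptied sentence elements still stay attached to their parent, so this reduces memory use rather than eliminating it.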

I tested the code with the following input file:

<?xml version="1.0" encoding="UTF-8"?>
<corpus id="politics">
    <forum id="14" title="something &amp; something" url="https://www.at.net/1">
        <thread id="108" title="a title" postcount="87" lastpost="2005-03-31 06:35" url="https://www.at.net/111/222">
            <text datefrom="20020526" dateto="20020526" timefrom="230000" timeto="230059" id="1185" username="user123" userid="46" date="2002-03-22 23:00" url="https://www.at.net/111/333">
                <sentence id="776550f8f2-7765cba9fe">
                    <w pos="NN" msd="NN.UTR.SIN.DEF.NOM" lemma="|gräns|" lex="|gräns..nn.1|" saldo="|gräns..1|" prefix="|grän..nn.1|" suffix="|s..nn.1|" ref="1" dephead="6" deprel="AA">Gränsen</w>
                    <w pos="PP" msd="PP" lemma="|mellan|" lex="|mellan..pp.1|" saldo="|mellan..1|" prefix="|" suffix="|" ref="2" dephead="4" deprel="DT">mellan</w>
                    <w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|lycka|" lex="|lycka..nn.2|lycka..nn.1|" saldo="|lycka..2|lycka..1|lycka..3|" prefix="|" suffix="|" ref="3" dephead="4" deprel="CJ">lycka</w>
                    <w pos="KN" msd="KN" lemma="|och|" lex="|och..kn.1|" saldo="|och..1|" prefix="|" suffix="|" ref="4" dephead="1" deprel="ET">och</w>
                    <w pos="NN" msd="NN.UTR.SIN.IND.NOM" lemma="|död|" lex="|död..nn.1|" saldo="|död..2|" prefix="|" suffix="|" ref="5" dephead="4" deprel="CJ">död</w>
                    <w pos="JJ" msd="JJ.POS.UTR.SIN.IND.NOM" lemma="|snäv|" lex="|snäv..av.1|" saldo="|snäv..1|" prefix="|" suffix="|" ref="6" dephead="" deprel="ROOT">snäv</w>
                    <w pos="MAD" msd="MAD" lemma="|" lex="|" saldo="|" prefix="|" suffix="|" ref="7" dephead="6" deprel="I?">?</w>
                </sentence>
            </text>
        </thread>
    </forum>
</corpus>

The generated CSV file:

something & something,a title,user123,Gränsen,mellan,lycka,och,död,snäv,?

I am not sure whether you want to join the w values. If so, replace yield [forum, thread, user] + ws with yield [forum, thread, user, ' '.join(ws)].
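
To make the joined variant concrete, here is a small self-contained sketch, assuming that a single space-separated sentence column is what is wanted. The helper name sentence_rows and the shortened inline sample are hypothetical. The CSV is written to an in-memory buffer so the result can be inspected without touching a file.

```python
import csv
import io

from lxml import etree

# Shortened, hypothetical sample with the same structure as the test file.
SAMPLE = b"""<corpus id="politics">
<forum id="14" title="something &amp; something">
<thread id="108" title="a title">
<text id="1185" username="user123">
<sentence id="s1"><w>lycka</w><w>och</w><w>mellan</w><w>?</w></sentence>
</text>
</thread>
</forum>
</corpus>"""


def sentence_rows(source):
    """Yield [forum title, thread title, username, joined sentence]."""
    forum = thread = user = ''
    ws = []
    for event, elem in etree.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if elem.tag == 'forum':
                forum = elem.attrib['title']
            elif elem.tag == 'thread':
                thread = elem.attrib['title']
            elif elem.tag == 'text':
                user = elem.attrib['username']
            elif elem.tag == 'sentence':
                ws = []  # start collecting a new sentence
        elif elem.tag == 'w':
            # 'end' event: the text of the word is complete here.
            ws.append(elem.text or '')
        elif elem.tag == 'sentence':
            # 'end' event: emit the words as one space-separated field.
            yield [forum, thread, user, ' '.join(ws)]


buffer = io.StringIO()
csv.writer(buffer).writerows(sentence_rows(io.BytesIO(SAMPLE)))
csv_text = buffer.getvalue()
print(csv_text)
```

With this layout every row has exactly four columns, whatever the sentence length, which may be easier to load into other tools than a variable number of word columns.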