Question

Given an input file such as

<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
</srcset>

The desired result is a nested dictionary with the following layout:

/setid
    /docid
        /segid
            text

I have been using a defaultdict and reading the XML file with BeautifulSoup and nested loops, i.e.

from io import StringIO
from collections import defaultdict

from bs4 import BeautifulSoup

srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
</srcset>"""

#ntok = NISTTokenizer()

eval_docs = defaultdict(lambda: defaultdict(dict))

with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text

[OUT]:

>>> eval_docs

defaultdict(<function __main__.<lambda>>,
            {'newstest2015': defaultdict(dict,
                         {'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
                           '2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
                           '3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
                           '4': 'High on the agenda are plans for greater nuclear co-operation.',
                           '5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
                          '1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
                           '2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
                           '3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
                           '4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})

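Is there a simpler way to read the file and get the same eval_docs nested dictionary?

Can it be done easily without BeautifulSoup?

Note that the example has a single setid and only two docids; the actual files contain more.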

Answer 1

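Since what you have is HTML that merely looks like XML, you cannot use XML-based tools. In most cases, your options are:

Implement a SAX parser
Use BS4 (which you are already doing)
Use lxml

With any of these you will end up spending more time and effort, and writing more code, to handle this. What you already have is concise and simple. If I were you, I would not look for another solution.

PS: What could be simpler than 10 lines of code!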
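For comparison, the event-driven (SAX-style) option from the list above could look roughly like the sketch below. It uses the standard library's lenient html.parser.HTMLParser rather than a true SAX parser, which is a substitution made here for illustration. The class name SegCollector and the shortened sample string are hypothetical and not part of the question.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SegCollector(HTMLParser):
    """Hypothetical helper: collects <seg> text into setid -> docid -> segid."""

    def __init__(self):
        super().__init__()
        self.docs = defaultdict(lambda: defaultdict(dict))
        self.setid = None
        self.docid = None
        self.segid = None   # not None only while inside a <seg> element
        self.chunks = []    # text pieces of the current <seg>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'srcset' and 'setid' in attrs:
            self.setid = attrs['setid']
        elif tag == 'doc':
            self.docid = attrs['docid']
        elif tag == 'seg':
            self.segid = attrs['id']
            self.chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a <seg> element.
        if self.segid is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'seg' and self.segid is not None:
            self.docs[self.setid][self.docid][self.segid] = ''.join(self.chunks)
            self.segid = None


# Shortened, made-up sample in the same shape as the question's file.
sample = """<srcset setid="demo" srclang="any">
<doc docid="d1"><p>
<seg id="1">first</seg>
<seg id="2">second</seg>
</p></doc>
<doc docid="d2"><p>
<seg id="1">third</seg>
</p></doc>
</srcset>"""

parser = SegCollector()
parser.feed(sample)
parser.close()
result = parser.docs
```

This is noticeably longer than the BeautifulSoup loop in the question, which is consistent with the point made above.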

Answer 2

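I don't know whether you will find this any simpler, but as others have suggested, lxml can be used as an alternative.

Step 1: Convert the XML data into a normalized table (a list of lists)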

from lxml import etree

tree = etree.parse('source.xml')
segs = tree.xpath('//seg')

normalized_list = []
for seg in segs:
    srcset = seg.getparent().getparent().getparent().attrib['setid']
    doc = seg.getparent().getparent().attrib['docid']
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])

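Step 2: Use a defaultdict, as in your original code

from collections import defaultdict

# Each row is [setid, docid, segid, text].
d = defaultdict(lambda: defaultdict(dict))
for i in normalized_list:
    d[i[0]][i[1]][i[2]] = i[3]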
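If a plain dict is preferred over a defaultdict, the same step can be sketched with dict.setdefault. The rows below are hard-coded stand-ins for normalized_list, using a few segments from the question's sample; the names rows and nested are chosen here for illustration.

```python
# Stand-in for normalized_list: each row is [setid, docid, segid, text].
rows = [
    ['newstest2015', '1012-bbc', '1',
     'India and Japan prime ministers meet in Tokyo'],
    ['newstest2015', '1012-bbc', '4',
     'High on the agenda are plans for greater nuclear co-operation.'],
    ['newstest2015', '1018-lenta.ru', '1',
     'FANO Russia will hold a final Expert Session'],
]

nested = {}
for setid, docid, segid, text in rows:
    # setdefault creates the inner dicts on first use and reuses them afterwards.
    nested.setdefault(setid, {}).setdefault(docid, {})[segid] = text
```

Unlike a defaultdict, the resulting nested dict raises KeyError on a missing key instead of silently creating an empty entry.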

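Depending on how your source is stored, you will have to use one of the following to parse the XML:

tree = etree.parse('source.xml'): when you want to parse a file directly. You won't need StringIO, and the file is closed automatically by etree.
tree = etree.fromstring(source): where source is a string object, as in your question. (fromstring returns the root element rather than an ElementTree, but xpath works on either.)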
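If installing lxml is not an option, a similar result can be sketched with the standard library's xml.etree.ElementTree. This assumes the input is well-formed XML with a single closed srcset root element, which is an assumption rather than something the question guarantees. The sample string is an abbreviated version of the question's data.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample; assumes the root element is properly closed.
source = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
</p>
</doc>
</srcset>"""

root = ET.fromstring(source)  # root is the <srcset> element

# setid -> docid -> segid -> text, built as plain nested dicts.
eval_docs = {
    root.get('setid'): {
        doc.get('docid'): {seg.get('id'): seg.text for seg in doc.iter('seg')}
        for doc in root.iter('doc')
    }
}
```

Because the nesting of the loops mirrors the nesting of the elements, no getparent() chain is needed. A file that is not well-formed would raise ET.ParseError here, in which case a lenient parser such as BeautifulSoup remains the safer choice.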
