
Time: 2018-04-20 02:02:16

Tags: python xml multidimensional-array beautifulsoup


<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</doc>
</srcset>




from io import StringIO
from collections import defaultdict

from bs4 import BeautifulSoup

srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</doc>
</srcset>"""

#ntok = NISTTokenizer()

eval_docs = defaultdict(lambda: defaultdict(dict))

with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text


>>> eval_docs

defaultdict(<function __main__.<lambda>>,
            {'newstest2015': defaultdict(dict,
                         {'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
                           '2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
                           '3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
                           '4': 'High on the agenda are plans for greater nuclear co-operation.',
                           '5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
                          '1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
                           '2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
                           '3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
                           '4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})





2 answers:

Answer 0 (score: 1)


  • Implement a SAX parser
  • Use BS4 (which you are already doing)
  • Use lxml
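
The first option above can be sketched roughly as follows, using only the standard library's xml.sax. This is a minimal illustration, not a definitive implementation: the SegHandler class, its attribute names, and the shortened sample string are assumptions made for this example, and it assumes well-formed input with closed doc and srcset elements.

```python
import xml.sax
from collections import defaultdict


class SegHandler(xml.sax.ContentHandler):
    """Hypothetical handler: fills eval_docs[setid][docid][segid] = text."""

    def __init__(self):
        super().__init__()
        self.eval_docs = defaultdict(lambda: defaultdict(dict))
        self.setid = None
        self.docid = None
        self.segid = None
        self.chunks = []

    def startElement(self, name, attrs):
        # Remember where we are in the srcset > doc > seg hierarchy.
        if name == 'srcset':
            self.setid = attrs['setid']
        elif name == 'doc':
            self.docid = attrs['docid']
        elif name == 'seg':
            self.segid = attrs['id']
            self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so buffer it until </seg>.
        if self.segid is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == 'seg':
            text = ''.join(self.chunks)
            self.eval_docs[self.setid][self.docid][self.segid] = text
            self.segid = None


# Assumed, shortened sample in the same shape as the question's file.
sample = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<seg id="1">FANO Russia will hold a final Expert Session</seg>
</doc>
</srcset>"""

handler = SegHandler()
xml.sax.parseString(sample.encode('utf-8'), handler)
eval_docs = handler.eval_docs
```

Because SAX streams events instead of building a tree, this approach keeps memory use low on large files, at the cost of tracking the current srcset, doc, and seg by hand.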



Answer 1 (score: 1)



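from collections import defaultdict

from lxml import etree

tree = etree.parse('source.xml')
segs = tree.xpath('//seg')

# Flatten every <seg> into a [setid, docid, segid, text] row.
normalized_list = []
for seg in segs:
    # Find the ancestors by tag name instead of chaining getparent() calls,
    # so the lookup does not depend on how deeply <seg> is nested.
    srcset = next(seg.iterancestors('srcset')).attrib['setid']
    doc = next(seg.iterancestors('doc')).attrib['docid']
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])


# Rebuild the nested setid -> docid -> segid -> text mapping.
d = defaultdict(lambda: defaultdict(dict))
for i in normalized_list:
    d[i[0]][i[1]][i[2]] = i[3]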


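  • tree = etree.parse('source.xml'): use this when you want to parse a file directly; you don't need StringIO. When given a filename, etree opens and closes the file itself.

  • tree = etree.fromstring(source): use this when source is a string object, as in your question. Note that fromstring returns the root element rather than an ElementTree; xpath('//seg') works on either.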
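
The string-based variant can be sketched as below. To keep the example free of third-party dependencies it swaps lxml for the standard library's xml.etree.ElementTree, whose fromstring likewise returns the root element; the shortened sample string and the variable names are assumptions for illustration. It walks top-down from srcset to doc to seg, so no parent lookups are needed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed, shortened sample in the same shape as the question's data.
source = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<seg id="1">FANO Russia will hold a final Expert Session</seg>
</doc>
</srcset>"""

root = ET.fromstring(source)  # root is the <srcset> element itself

d = defaultdict(lambda: defaultdict(dict))
setid = root.attrib['setid']
for doc in root.iter('doc'):
    docid = doc.attrib['docid']
    for seg in doc.iter('seg'):
        d[setid][docid][seg.attrib['id']] = seg.text
```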