Question

My XML document has the following structure:

root > document > sentences > sentence id > tokens > token id > word, lemma, POS, NER

Here is an example of the children of a token id:

        <word>Denmark</word>
        <lemma>denmark</lemma>
        <CharacterOffsetBegin>0</CharacterOffsetBegin>
        <CharacterOffsetEnd>7</CharacterOffsetEnd>
        <POS>NN</POS>
        <NER>LOCATION</NER>

I want to extract the details of only those words whose NER tag is "LOCATION". I tried this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "lxml-xml")  # markup holds the XML text
print(soup.find_all('NER'))

But this gives me:

[<NER>LOCATION</NER>, <NER>O</NER>, <NER>NUMBER</NER>, <NER>O</NER>]

I want:

denmark, LOCATION

How can I do this? I looked through the documentation but couldn't find a way.

Answer 1

One option is to find the NER tags whose text is LOCATION and go up to their parent:

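for ner in soup('NER', string='LOCATION'):  # string= was called text= before bs4 4.4.0
    token = ner.parent  # the enclosing token element

    # Tag names are case-sensitive with the XML parser, so use token.NER, not token.ner.
    # Use token.lemma instead of token.word to get "denmark" rather than "Denmark".
    print(token.word.get_text(), token.NER.get_text())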
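
For reference, here is a minimal end-to-end sketch of this approach. The sample markup and the helper name lemmas_with_ner are made up for illustration, and it assumes bs4 4.4.0 or later (for the string= argument) with lxml installed. It reads lemma rather than word, since the output asked for in the question is lowercase:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the structure described in the question.
markup = """<root><document><sentences><sentence id="1"><tokens>
<token id="1">
  <word>Denmark</word>
  <lemma>denmark</lemma>
  <CharacterOffsetBegin>0</CharacterOffsetBegin>
  <CharacterOffsetEnd>7</CharacterOffsetEnd>
  <POS>NN</POS>
  <NER>LOCATION</NER>
</token>
<token id="2">
  <word>has</word>
  <lemma>have</lemma>
  <POS>VBZ</POS>
  <NER>O</NER>
</token>
</tokens></sentence></sentences></document></root>"""


def lemmas_with_ner(markup, label):
    """Return (lemma, NER) pairs for tokens whose NER text equals label."""
    soup = BeautifulSoup(markup, "lxml-xml")
    pairs = []
    # Find NER tags with the wanted text, then step up to the enclosing token.
    for ner in soup.find_all("NER", string=label):
        token = ner.parent
        lemma = token.find("lemma")
        if lemma is not None:  # skip tokens that have no lemma child
            pairs.append((lemma.get_text(), ner.get_text()))
    return pairs


for lemma, ner in lemmas_with_ner(markup, "LOCATION"):
    print(f"{lemma}, {ner}")
```

With the sample markup above, this prints denmark, LOCATION.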
