Merging POS tags by noun phrase chunks

Time: 2017-04-05 17:42:17

Tags: python nlp spacy

My question is similar to this question. In spaCy, I can do part-of-speech tagging and noun phrase extraction separately, for example:

import spacy
nlp = spacy.load('en')
sentence = ('For instance , consider one simple phenomena : '
            'a question is typically followed by an answer , '
            'or some explicit statement of an inability or refusal to answer .')
token = nlp(sentence)
token_tag = [(word.text, word.pos_) for word in token]

The output looks like this:

[('For', 'ADP'),
 ('instance', 'NOUN'),
 (',', 'PUNCT'),
 ('consider', 'VERB'),
 ('one', 'NUM'),
 ('simple', 'ADJ'),
 ('phenomena', 'NOUN'), 
 ...]

For noun phrases (chunks), I can get noun_chunks, which are spans of words, as follows:

[nc for nc in token.noun_chunks] # [instance, one simple phenomena, an answer, ...]

I wonder whether there is a way to group the POS tags according to noun_chunks, so that I get output like

[('For', 'ADP'),
 ('instance', 'NOUN'), # or NOUN_CHUNKS
 (',', 'PUNCT'),
 ('one simple phenomena', 'NOUN_CHUNKS'), 
 ...]

1 Answer:

Answer 0 (score: 1)

I figured out how to do it. Basically, we can get the start and end token positions of each noun phrase as follows:

noun_phrase_position = [(s.start, s.end) for s in token.noun_chunks]
noun_phrase_text = dict([(s.start, s.text) for s in token.noun_chunks])
token_pos = [(i, t.text, t.pos_) for i, t in enumerate(token)]

Then I adapted this solution to merge token_pos based on the start and end positions of each chunk:

index = 0
result = []
for start, end in noun_phrase_position:
    result += token_pos[index:start]     # tokens before the chunk
    result.append(token_pos[start:end])  # the chunk itself, as a sub-list
    index = end
result += token_pos[index:]              # tokens after the last chunk

result_merge = []
for i, r in enumerate(result):
    if len(r) > 0 and isinstance(r, list):
        # a noun chunk: collapse it into a single (index, text, tag) tuple
        result_merge.append((r[0][0], noun_phrase_text.get(r[0][0]), 'NOUN_PHRASE'))
    else:
        result_merge.append(r)

Output:

[(0, 'For', 'ADP'),
 (1, 'instance', 'NOUN_PHRASE'),
 (2, ',', 'PUNCT'),
 (3, 'consider', 'VERB'),
 (4, 'one simple phenomena', 'NOUN_PHRASE'),
 (7, ':', 'PUNCT'),
 (8, 'a', 'DET'), ...]
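The index bookkeeping above can also be done in a single pass keyed on each chunk's start token. Below is a minimal spaCy-free sketch of the same merge, run directly on per-token tuples and chunk spans copied from the example (merge_chunks is a helper name introduced here, not part of spaCy):

```python
def merge_chunks(token_pos, chunks):
    """token_pos: list of (i, text, tag); chunks: list of (start, end, text)."""
    # map each chunk's start index to its merged tuple
    chunk_starts = {start: (start, text, 'NOUN_PHRASE') for start, end, text in chunks}
    # all token indices covered by some chunk
    covered = {i for start, end, _ in chunks for i in range(start, end)}
    merged = []
    for i, text, tag in token_pos:
        if i in chunk_starts:
            merged.append(chunk_starts[i])   # emit the whole chunk once
        elif i not in covered:
            merged.append((i, text, tag))    # plain token outside any chunk
    return merged

token_pos = [(0, 'For', 'ADP'), (1, 'instance', 'NOUN'), (2, ',', 'PUNCT'),
             (3, 'consider', 'VERB'), (4, 'one', 'NUM'), (5, 'simple', 'ADJ'),
             (6, 'phenomena', 'NOUN'), (7, ':', 'PUNCT')]
chunks = [(1, 2, 'instance'), (4, 7, 'one simple phenomena')]

print(merge_chunks(token_pos, chunks))
# [(0, 'For', 'ADP'), (1, 'instance', 'NOUN_PHRASE'), (2, ',', 'PUNCT'),
#  (3, 'consider', 'VERB'), (4, 'one simple phenomena', 'NOUN_PHRASE'), (7, ':', 'PUNCT')]
```

With real spaCy objects, token_pos and chunks would be built exactly as in the answer: enumerate(token) for the tuples and (s.start, s.end, s.text) for each span in token.noun_chunks.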