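I am trying to find the spans (start index, end index) of the noun phrases in a given sentence. Here is the code that extracts the noun phrases: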
import nltk

sent = nltk.word_tokenize(a)
sent_pos = nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
# Join the words of every NP subtree into a single string.
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    np = ''
    for x in subtree.leaves():
        np = np + ' ' + x[0]
    nounPhrases.append(np.strip())
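For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the extracted noun phrases are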
['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].
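Now I need to find the spans of the noun phrases (the start and end position of each phrase). For example, the spans of the noun phrases above would be
[(1,3),(9,9),(12,12),(16,17),(21,23),....].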
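I am quite new to NLTK and have looked at http://www.nltk.org/_modules/nltk/tree.html. I tried using Tree.treepositions(), but I could not extract absolute positions from those indices. Any help would be greatly appreciated. Thanks!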
Answer 0 (score: 3)
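There isn't any implicit function that returns the offsets of strings/tokens, as highlighted in https://github.com/nltk/nltk/issues/1214, but you can use the ngram searcher used by the RIBES score (https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123):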
>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43
(It returns the starting position of the query ngram.)
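Note that position_of_ngram only gives the start index, and it stops at the first match, so a repeated phrase such as 'War' would resolve to its first occurrence (inside 'American Civil War'). Here is a minimal sketch of one way around that, assuming the noun phrases are listed in sentence order and do not overlap. find_spans is a hypothetical helper, not part of NLTK, and it works on a plain list of tokens:

```python
def find_spans(tokens, phrases):
    """Return inclusive (start, end) token spans for phrases given in sentence order."""
    spans = []
    cursor = 0  # only search to the right of the previous match
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        for start in range(cursor, len(tokens) - n + 1):
            if tokens[start:start + n] == words:
                spans.append((start, start + n - 1))
                cursor = start + n
                break
        else:
            spans.append(None)  # phrase not found after the cursor
    return spans


# Tokens written out by hand here; in practice they would come from word_tokenize.
tokens = ("The American Civil War , also known as the War between the States "
          "or simply the Civil War , was a civil war fought").split()
phrases = ['American Civil War', 'War', 'States', 'Civil War', 'civil war fought']
print(find_spans(tokens, phrases))
# [(1, 3), (9, 9), (12, 12), (16, 17), (21, 23)]
```

The moving cursor is what makes the second 'War' map to index 9 instead of 3. If the phrases can overlap or arrive out of order, this assumption no longer holds.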
Answer 1 (score: 0)
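Here is another approach: augment each token in the tree string with its absolute position. The absolute positions can then be read off the leaves of any subtree.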
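from nltk.tree import ParentedTree

def add_indices_to_terminals(treestring):
    tree = ParentedTree.fromstring(treestring)
    for idx, _ in enumerate(tree.leaves()):
        # Position of the idx-th leaf; dropping the last step gives its preterminal.
        tree_location = tree.leaf_treeposition(idx)
        non_terminal = tree[tree_location[:-1]]
        # Append the absolute token index to the word.
        non_terminal[0] = non_terminal[0] + "_" + str(idx)
    return str(tree)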
Example usage:
>>> treestring = "(S (NP (NNP John)) (VP (V runs)))"
>>> add_indices_to_terminals(treestring)
'(S (NP (NNP John_0)) (VP (V runs_1)))'
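Once the leaves carry their index, the span of any subtree can be read back from its first and last leaf. This is a small sketch that assumes the word_index suffix format shown above. span_from_indexed_leaves is a hypothetical helper name:

```python
def span_from_indexed_leaves(leaves):
    """Return the inclusive (start, end) token span of a subtree's leaves.

    Each leaf is expected to look like 'word_idx', e.g. 'John_0'.
    """
    # rsplit keeps tokens that themselves contain underscores intact.
    indices = [int(leaf.rsplit("_", 1)[1]) for leaf in leaves]
    return indices[0], indices[-1]


print(span_from_indexed_leaves(["John_0"]))           # (0, 0)
print(span_from_indexed_leaves(["New_3", "York_4"]))  # (3, 4)
```

With NLTK, this would presumably be called as span_from_indexed_leaves(subtree.leaves()) on a subtree of the tree parsed back from the indexed string.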
Answer 2 (score: 0)
Use the following code to get the token offsets of each constituent in a parsed constituency tree:
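def get_tok_idx_of_tree(t, mapping_label_2_tok_idx, count_label, i):
    # t: current node; i: index of t among its siblings.
    # count_label: one-element list used as a mutable counter for unique keys.
    if isinstance(t, str):
        pass  # leaves carry no label, nothing to record
    else:
        if count_label[0] == 0:
            idx_start = 0  # root node
        elif i == 0:
            # First child: starts where its parent (the last key added) starts.
            idx_start = mapping_label_2_tok_idx[list(mapping_label_2_tok_idx.keys())[-1]][0]
        else:
            # Later child: starts right after the end of the last recorded node.
            idx_start = mapping_label_2_tok_idx[list(mapping_label_2_tok_idx.keys())[-1]][1] + 1
        idx_end = idx_start + len(t.leaves()) - 1
        mapping_label_2_tok_idx[t.label() + "_" + str(count_label[0])] = (idx_start, idx_end)
        count_label[0] += 1
        for i, child in enumerate(t):
            get_tok_idx_of_tree(child, mapping_label_2_tok_idx, count_label, i)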
The following is the constituency tree:
Output of the above code:
{'ROOT_0': (0, 3), 'S_1': (0, 3), 'VP_2': (0, 2), 'VB_3': (0, 0), 'NP_4': (1, 2), 'DT_5': (1, 1), 'NN_6': (2, 2), '._7': (3, 3)}
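For comparison, the same offsets can be computed without looking up the last dictionary key, by passing a running offset down the recursion. This is only a sketch, not the code that produced the output above. constituent_spans is a hypothetical helper, and it assumes non-leaf nodes expose label() and leaves() the way nltk.Tree does, while leaves are plain strings or (word, tag) tuples:

```python
def constituent_spans(tree, start=0):
    """Yield (label, start, end) for every subtree, with inclusive token indices."""
    yield tree.label(), start, start + len(tree.leaves()) - 1
    offset = start
    for child in tree:
        # Duck-typing check (an assumption of this sketch): only subtrees have label().
        if hasattr(child, "label"):
            yield from constituent_spans(child, offset)
            offset += len(child.leaves())
        else:
            offset += 1  # a single token
```

Filtering on the label should then give the noun phrase spans asked for in the question, for example [(s, e) for label, s, e in constituent_spans(result) if label == 'NP'].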