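I'm trying to extract the sequences of 'NN' elements (including 'NNP') from a list and append them to a new list, provided that an 'IN' or 'TO' was encountered right before the 'NN'. How can I do this?
I tried the following code, but it fails to capture other similar instances.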
[['Additional',
'condition',
'of',
'DeNOx',
'activation',
'shall',
'be',
'introduced',
'in',
'order',
'to',
'provide',
'flexibility',
'and',
'robustness',
'to',
'NSC',
'regeneration',
'management',
'.'],
['JJ',
'NN',
'IN',
'NNP',
'NN',
'MD',
'VB',
'VBN',
'IN',
'NN',
'TO',
'VB',
'NN',
'CC',
'NN',
'TO',
'NNP',
'NN',
'NN',
'.']].
But I would like to improve the code so that it produces the following output:
[['DeNOx', 'activation'], ['order'], ['NSC', 'regeneration', 'management']]
Each output chunk has an 'IN' or 'TO' occurring right before it.
In fact, the list above (new) holds the part-of-speech tags of this list:
{{1}}
How can I map the result back to this list so that I get
{{1}}
Answer 0 (score: 5)
You can use two handy itertools functions, groupby and takewhile:
from itertools import groupby, takewhile
nn = lambda x: x.startswith('NN')
to_in = lambda x: x in ('IN', 'TO')
list(filter(None, [list(takewhile(nn, g)) for k, g in groupby(new, key=to_in)][1:]))
# [['NNP', 'NN'], ['NN'], ['NNP', 'NN', 'NN']]
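This groups the initial list into chunks, split on whether an item is TO or IN. From every chunk except the first one (which skips any initial NNs that have no TO or IN before them), it takes the leading elements that start with NN. Finally, it filters out the falsy (empty) lists.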
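The steps described above can be unrolled into separate statements to make the intermediate chunks visible. This is only an illustrative sketch: the tag list is copied from the question and bound to the name new, and the helper names is_noun and is_to_in are assumptions that stand in for the lambdas in the answer.

```python
from itertools import groupby, takewhile

# Assumed input: the tag list from the question, bound to the name `new`
new = ['JJ', 'NN', 'IN', 'NNP', 'NN', 'MD', 'VB', 'VBN', 'IN', 'NN',
       'TO', 'VB', 'NN', 'CC', 'NN', 'TO', 'NNP', 'NN', 'NN', '.']

def is_noun(tag):
    # Matches NN, NNP and any other tag that starts with NN
    return tag.startswith('NN')

def is_to_in(tag):
    return tag in ('IN', 'TO')

# Step 1: split the list into alternating runs of IN/TO tags and other tags
chunks = [(key, list(group)) for key, group in groupby(new, key=is_to_in)]

# Step 2: skip the first chunk, then keep only the leading NN* tags of each chunk
leading_nouns = [list(takewhile(is_noun, group)) for _, group in chunks[1:]]

# Step 3: drop the empty lists (IN/TO chunks and chunks not starting with NN*)
result = [run for run in leading_nouns if run]

print(chunks[0])  # the skipped first chunk
print(result)
```

Printing chunks[0] shows the tags that are discarded because no IN or TO precedes them.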
Answer 1 (score: 1)
Another good answer was posted while I was typing this one. Here is a simple implementation without imports.
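full_list = []
# Start at 1 so that new[x-1] never wraps around to the last element
for x in range(1, len(new)):
    # A run starts at an NN/NNP tag directly preceded by IN or TO
    if 'NN' in new[x] and ('IN' in new[x-1] or 'TO' in new[x-1]):
        temp_list = [new[x]]
        temp_index = x + 1
        # Extend the run while the next tag is still NN/NNP (bounds-checked)
        while temp_index < len(new) and 'NN' in new[temp_index]:
            temp_list.append(new[temp_index])
            temp_index += 1
        full_list.append(temp_list)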
Answer 2 (score: 1)
You're not too far off. One way to make this easier is to get all the indices of 'IN' and 'TO':
starts = {'IN', 'TO'}
in_twos = [i for i, e in enumerate(new) if e in starts]
Which gives:
[2, 8, 10, 15]
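Then you just need to iterate over these indices, specifically over new[i+1:], and collect the elements that are 'NN' or 'NNP'. When you reach an element that is not one of these, break exits the loop.
Here is an example: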
result = []
take = {'NN', 'NNP'}
for i in in_twos:
    temp = []
    for x in new[i+1:]:
        if x not in take:
            break
        temp.append(x)
    # If this is empty, don't add it
    if temp:
        result.append(temp)
print(result)
Final output:
[['NNP', 'NN'], ['NN'], ['NNP', 'NN', 'NN']]
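As @schwobaseggl suggested, another, shorter approach is to use itertools.takewhile to simplify extracting the 'NN' elements. This function basically keeps taking elements until the predicate passed as its first argument returns false.
Here is what it looks like: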
from itertools import takewhile
# new, take and in_twos same as before
result = [l for l in [list(takewhile(lambda x: x in take, new[i+1:])) for i in in_twos] if l]
print(result)
# [['NNP', 'NN'], ['NN'], ['NNP', 'NN', 'NN']]
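To see the stopping behaviour of takewhile in isolation, here is a small sketch on a made-up tag list (the name sample is an assumption used only for this illustration). Note that the trailing 'NN' is not returned, because takewhile never resumes once the predicate has failed.

```python
from itertools import takewhile

take = {'NN', 'NNP'}
# Hypothetical input: a noun run, a non-noun tag, then another noun
sample = ['NNP', 'NN', 'MD', 'NN']

# Elements are taken from the front until the predicate first returns False
leading = list(takewhile(lambda tag: tag in take, sample))
print(leading)
```

This is why each slice new[i+1:] yields only the noun run that immediately follows the 'IN' or 'TO' at index i.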
Update
If you want to map the words and the part-of-speech tags together, you can do the following:
new = [['JJ', 'NN', 'IN', 'NNP', 'NN', 'MD', 'VB', 'VBN', 'IN', 'NN', 'TO', 'VB', 'NN', 'CC', 'NN', 'TO', 'NNP', 'NN', 'NN', '.'],
       ['Additional', 'condition', 'of', 'DeNOx', 'activation', 'shall', 'be', 'introduced', 'in', 'order', 'to', 'provide', 'flexibility', 'and', 'robustness', 'to', 'NSC', 'regeneration', 'management', '.']]
starts = {'IN', 'TO'}
in_twos = [i for i, e in enumerate(new[0]) if e in starts]
speech = []
words = []
take = {'NN', 'NNP'}
for i in in_twos:
    temp = []
    for x, y in zip(new[0][i+1:], new[1][i+1:]):
        if x not in take:
            break
        temp.append((x, y))
    # If this is empty, don't add it
    if temp:
        speech.append([x for x, _ in temp])
        words.append([y for _, y in temp])
print(speech)
print(words)
Which outputs:
[['NNP', 'NN'], ['NN'], ['NNP', 'NN', 'NN']]
[['DeNOx', 'activation'], ['order'], ['NSC', 'regeneration', 'management']]
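If the same extraction is needed in several places, the loop above could be wrapped in a function. The following is a minimal sketch, not part of the original answer: the name extract_noun_runs, its parameters, and the armed flag are all assumptions. It walks the zipped tags and words once instead of slicing the lists for every 'IN' or 'TO' index.

```python
def extract_noun_runs(tags, words, starts=('IN', 'TO'), take=('NN', 'NNP')):
    """Hypothetical helper: collect NN/NNP runs that directly follow IN or TO."""
    tag_runs, word_runs = [], []
    run = []       # (tag, word) pairs of the run currently being built
    armed = False  # True right after an IN/TO tag or while inside a run

    def flush():
        # Store the finished run, if any, and reset the buffer
        if run:
            tag_runs.append([t for t, _ in run])
            word_runs.append([w for _, w in run])
            run.clear()

    for tag, word in zip(tags, words):
        if armed and tag in take:
            run.append((tag, word))
        else:
            flush()
            armed = tag in starts
    flush()  # a run may reach the end of the sentence
    return tag_runs, word_runs


tags = ['JJ', 'NN', 'IN', 'NNP', 'NN', 'MD', 'VB', 'VBN', 'IN', 'NN',
        'TO', 'VB', 'NN', 'CC', 'NN', 'TO', 'NNP', 'NN', 'NN', '.']
words = ['Additional', 'condition', 'of', 'DeNOx', 'activation', 'shall',
         'be', 'introduced', 'in', 'order', 'to', 'provide', 'flexibility',
         'and', 'robustness', 'to', 'NSC', 'regeneration', 'management', '.']

speech, matched_words = extract_noun_runs(tags, words)
print(speech)
print(matched_words)
```

Keeping each tag paired with its word until a run is flushed means the two result lists always stay aligned.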