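I am new to Python and still learning how to write better code. I have two lists. One is a list of indices stored in the variable x; each index in x points to a tuple in the list bb whose tag is the string 'IN' and which has at least one tuple containing 'NN' on each side.
What I want from the code below is, for each index in x, the run of consecutive words whose tags start with 'NN' on both sides of that tuple in bb.
I tried the following code, but it is not efficient enough. Can anyone help me make it more efficient?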
bb = [('The', 'RB'),
('company', 'NN'),
('whose', 'NNS'),
('stock', 'IN'),
('has', 'NNP'),
('been', 'NNS'),
('on', 'NNP'),
('tear', 'VBJ'),
('this', 'VB'),
('week', 'NNS'),
('already', 'NN'),
('sells', 'IN'),
('its', 'NNP'),
('graphics', 'NNS'),
('processing', 'VB'),
('units', 'VBJ'),
('biggest', 'NNS'),
('cloud', 'NN'),
('companies', 'IN'),
('just', 'NNP'),
('that', 'IN')]
def solvr(bb):
    x = []
    for i in range(len(bb)-1):
        if bb[i][1] == 'IN':
            if 'NN' in (bb[i-1][1]) and 'NN' in (bb[i+1][1]):
                x.append(i)
    #===============================
    for i in range(len(bb)-1):
        if i in x:
            k=[]
            front = bb[i+1:]
            v = 0-i
            back = bb[:-v]
    #======================
    for i in back:
        if 'NN' in i[1]:
            k.append(i[0])
            [[] for i in k]
    #================================
    for i, j in enumerate(front):
        if front[i][1][:2] == 'NN':
            k.append(front[i][0])
        else:
            break
    return(k)
>> solvr(bb)
output:
['company',
'whose',
'has',
'been',
'on',
'week',
'already',
'its',
'graphics',
'biggest',
'cloud',
'just']
What I expect from the code is to get the result of each iteration in its own new list, with each list also including the 'IN' word:
[['company', 'whose', 'stock', 'has', 'been', 'on'],
['week', 'already', 'sells', 'its', 'graphics'],
['biggest', 'cloud', 'companies', 'just']]
I would appreciate it if anyone could suggest changes to my code.
Answer 0 (score: 3)
itertools.groupby
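This looks like a good problem for itertools.groupby, which groups consecutive elements of a list together based on whether each element is true or false according to a condition that you define.
In your case you can use the following:
import itertools
groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])
result = [list(b) for a,b in groups if a]
result = [[w[0] for w in b] for b in result if 'IN' in [w[1] for w in b]]
print(result)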
[['company', 'whose', 'stock', 'has', 'been', 'on'],
['week', 'already', 'sells', 'its', 'graphics'],
['biggest', 'cloud', 'companies', 'just', 'that']]
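This works because your original bb list is broken up into sublists every time the condition (the second element is 'IN' or starts with 'NN') goes from false to true (or vice versa). If we display groups, you can see how it is split up: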
groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])
print([(a,list(b)) for a,b in groups])
[(False, [('The', 'RB')]),
(True,
[('company', 'NN'),
('whose', 'NNS'),
('stock', 'IN'),
('has', 'NNP'),
('been', 'NNS'),
('on', 'NNP')]),
(False, [('tear', 'VBJ'), ('this', 'VB')]),
(True,
[('week', 'NNS'),
('already', 'NN'),
('sells', 'IN'),
('its', 'NNP'),
('graphics', 'NNS')]),
(False, [('processing', 'VB'), ('units', 'VBJ')]),
(True,
[('biggest', 'NNS'),
('cloud', 'NN'),
('companies', 'IN'),
('just', 'NNP'),
('that', 'IN')])]
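The boolean value indicates whether the list that follows contains elements that meet or do not meet the condition. Now all you have to do is keep only the sublists whose boolean value is True (the condition is met), and then keep those that contain 'IN' as one of the part-of-speech tags.
Just for fun, if you want the whole solution as one (almost unreadably long) one-liner, you can use: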
[[w[0] for w in b] for b in [list(b) for a,b in itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) if a] if 'IN' in [w[1] for w in b]]
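For comparison, the same logic can be spelled out as a small generator. This is only a sketch: the function name noun_prep_runs and the shortened sample list are made up for illustration, and the condition is copied from the one-liner above.

```python
from itertools import groupby

def noun_prep_runs(tagged):
    """Yield the words of each run of consecutive NN*/IN tokens that contains an 'IN'."""
    def is_wanted(token):
        # Same test as the one-liner: the first two characters of the tag are 'IN' or 'NN'.
        return token[1][:2] in ('IN', 'NN')

    for wanted, run in groupby(tagged, key=is_wanted):
        if not wanted:
            continue
        run = list(run)  # groupby yields an iterator, so materialise it before reuse
        if any(tag == 'IN' for _, tag in run):
            yield [word for word, _ in run]

# Hypothetical, shortened sample in the same (word, tag) format as bb.
sample = [('The', 'RB'), ('company', 'NN'), ('stock', 'IN'),
          ('has', 'NNP'), ('tear', 'VBJ'), ('week', 'NNS')]
print(list(noun_prep_runs(sample)))  # [['company', 'stock', 'has']]
```

The trailing run ['week'] is dropped because it contains no 'IN' token.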
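Edit
To keep only the sublists in which an 'IN' word has at least one 'NN' word on both sides, you can do the following:
Start with the same initial groups and result variables as before: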
groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])
result = [list(b) for a,b in groups if a]
Apply the same groupby function to each sublist, but this time make the condition that the part of speech equals 'IN':
result = [[(a,list(b)) for a,b in itertools.groupby(r, lambda x: x[1] == 'IN')] for r in result]
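Now iterate through result and remove every element of a sublist for which the groupby boolean is True (the POS is 'IN') and which sits at the left or right edge of the sublist (index 0 or -1):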
result = [[b for i,(a,b) in enumerate(r) if (a and i not in [0,len(r)-1]) or not a] for r in result]
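Now that we have removed these, we can join everything back together and strip the POS tags to get the correct output format (the nested comprehension flattens each list of sublists):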
result = [[w[0] for sub in r for w in sub] for r in result]
print(result)
[['company', 'whose', 'stock', 'has', 'been', 'on'],
['week', 'already', 'sells', 'its', 'graphics'],
['biggest', 'cloud', 'companies', 'just']]
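The steps above can also be folded into one function. The sketch below swaps the nested groupby for plain index trimming, and it adds one extra filter that is my assumption about what is wanted: a run that has no 'IN' left after trimming is dropped entirely. The name trim_edge_preps and the sample data are hypothetical.

```python
from itertools import groupby

def trim_edge_preps(tagged):
    """Return the NN*/IN runs with 'IN' tokens stripped from both edges."""
    results = []
    for wanted, run in groupby(tagged, key=lambda t: t[1][:2] in ('IN', 'NN')):
        if not wanted:
            continue
        run = list(run)
        # Trim 'IN' tokens from both ends so every remaining 'IN'
        # has at least one noun on each side.
        start, end = 0, len(run)
        while start < end and run[start][1] == 'IN':
            start += 1
        while end > start and run[end - 1][1] == 'IN':
            end -= 1
        core = run[start:end]
        # Assumption: a run without any 'IN' left is not a hit at all.
        if any(tag == 'IN' for _, tag in core):
            results.append([word for word, _ in core])
    return results

# Hypothetical sample: the second run ('of', 'it') loses its only 'IN' and is dropped.
sample = [('biggest', 'NNS'), ('cloud', 'NN'), ('companies', 'IN'),
          ('just', 'NNP'), ('that', 'IN'), ('ran', 'VB'),
          ('of', 'IN'), ('it', 'NN')]
print(trim_edge_preps(sample))  # [['biggest', 'cloud', 'companies', 'just']]
```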
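Answer 1 (score: 1)
I'm not sure how well this fits your requirements, since it may yield the same nouns as part of two different hits, e.g. [N*, N*, IN, N*, N*, IN, N*] -> [[N*, N*, IN, N*, N*], [N*, N*, IN, N*]]. If that is not what you want, you will need a different approach. Here you simply keep a back buffer and check whether the words meet the minimum requirement (N*, IN, N*). If they do, you build the full hit. I also use a generator, since this will probably be run on a lot of data.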
def solvr(bb):
    # keep a buffer of the preceding N* words
    back_buffer = []
    for i in range(len(bb)-1):
        word, tag = bb[i]
        _, next_tag = bb[i+1]
        # make sure there is a minimal hit of 3 tokens
        if tag == 'IN' and next_tag.startswith('N') and len(back_buffer) > 0:
            hit = back_buffer + [word]
            for it in bb[i+1:]:
                if it[1].startswith('N'):
                    hit.append(it[0])
                else:
                    break
            yield hit
        # add to the buffer
        if tag.startswith('N'):
            back_buffer.append(word)
        # reset the buffer as the sequence of N* tags has ended
        else:
            back_buffer = []
print(list(solvr(bb)))
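To make the overlap mentioned above concrete, here is a self-contained restatement of the same back-buffer idea run on a toy sequence. The name overlapping_hits and the toy data are made up for illustration; it is a sketch, not a replacement for the generator above.

```python
def overlapping_hits(tagged):
    """Yield N* ... IN ... N* hits; nouns between two 'IN' tokens appear in both hits."""
    nouns_before = []
    for i, (word, tag) in enumerate(tagged):
        if tag == 'IN' and nouns_before:
            hit = nouns_before + [word]
            # Extend the hit with the nouns that directly follow the 'IN' token.
            for next_word, next_tag in tagged[i + 1:]:
                if not next_tag.startswith('N'):
                    break
                hit.append(next_word)
            # Only a hit if at least one noun followed the 'IN' token.
            if len(hit) > len(nouns_before) + 1:
                yield hit
        # Grow the buffer on a noun; any other tag (including 'IN') resets it.
        nouns_before = nouns_before + [word] if tag.startswith('N') else []

toy = [('a', 'NN'), ('b', 'NN'), ('of', 'IN'), ('c', 'NN'),
       ('d', 'NN'), ('in', 'IN'), ('e', 'NN')]
print(list(overlapping_hits(toy)))
# [['a', 'b', 'of', 'c', 'd'], ['c', 'd', 'in', 'e']]
```

The words 'c' and 'd' show up in both hits, which is the behaviour described at the top of this answer.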
Answer 2 (score: 1)
Try this "magic":
>>> bb = [('The', 'RB'), ('company', 'NN'), ('whose', 'NNS'), ('stock', 'IN'), ('has', 'NNP'), ('been', 'NNS'), ('on', 'NNP'), ('tear', 'VBJ'), ('this', 'VB'), ('week', 'NNS'), ('already', 'NN'), ('sells', 'IN'), ('its', 'NNP'), ('graphics', 'NNS'), ('processing', 'VB'), ('units', 'VBJ'), ('biggest', 'NNS'), ('cloud', 'NN'), ('companies', 'IN'), ('just', 'NNP'), ('that', 'IN')]
>>> list(filter(None, map(str.strip, ' '.join([word if pos.startswith('NN') or pos == 'IN' else '|' for word, pos in bb]).split('|'))))
['company whose stock has been on', 'week already sells its graphics', 'biggest cloud companies just that']
Basically, without the crazy nested madness:
tmp = []
answer = []
for word, pos in bb:
    if pos.startswith('NN') or pos == 'IN':
        tmp.append(word)
    else:
        if tmp:
            answer.append(' '.join(tmp))
            tmp = []
if tmp: # Remember to flush out the last tmp.
    answer.append(' '.join(tmp))
You only have to iterate through bb once. This is similar to @bunji's itertools.groupby answer.
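If lists of words are wanted (the format shown in the question) rather than joined strings, the same single pass can collect lists instead. This is a minimal sketch; runs_as_lists and the sample data are hypothetical names, and like the loop above it keeps every NN*/IN run, including runs that contain no 'IN' at all.

```python
def runs_as_lists(tagged):
    """Collect each run of consecutive NN*/IN words as its own list, in one pass."""
    runs, current = [], []
    for word, pos in tagged:
        if pos.startswith('NN') or pos == 'IN':
            current.append(word)
        elif current:
            # A non-matching tag closes the current run.
            runs.append(current)
            current = []
    if current:  # flush the final run
        runs.append(current)
    return runs

sample = [('The', 'RB'), ('company', 'NN'), ('stock', 'IN'),
          ('has', 'NNP'), ('tear', 'VBJ'), ('week', 'NNS')]
print(runs_as_lists(sample))  # [['company', 'stock', 'has'], ['week']]
```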