How can I find strings in one list, based on a list of indices into it that satisfy certain conditions, in Python?

Date: 2017-05-12 11:01:42

Tags: python python-3.x nlp

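I am new to Python and am still learning how to write better code. I have two lists. One is a list of indices stored in the variable x, where each index in x is the position, in a list named bb, of a tuple tagged 'IN' that has at least one tuple whose tag contains 'NN' on each side.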

What I want from the code below is, for each index in x, the consecutive strings whose tags start with 'NN' on both sides of that tuple in bb.

I tried the following code, but it is not efficient. Can anyone help me make it more efficient?

bb = [('The', 'RB'),
      ('company', 'NN'),
      ('whose', 'NNS'),
      ('stock', 'IN'),
      ('has', 'NNP'),
      ('been', 'NNS'),
      ('on', 'NNP'),
      ('tear', 'VBJ'),
      ('this', 'VB'),
      ('week', 'NNS'),
      ('already', 'NN'),
      ('sells', 'IN'),
      ('its', 'NNP'),
      ('graphics', 'NNS'),
      ('processing', 'VB'),
      ('units', 'VBJ'),
      ('biggest', 'NNS'),
      ('cloud', 'NN'),
      ('companies', 'IN'),
      ('just', 'NNP'),
      ('that', 'IN')]

def solvr(bb):
    x = []
    for i in range(len(bb)-1):
        if bb[i][1] == 'IN':
            if 'NN' in (bb[i-1][1]) and 'NN' in (bb[i+1][1]):
                x.append(i)
    #===============================        

    for i in range(len(bb)-1):
        if i in x:
            k=[]
            front = bb[i+1:]
            v = 0-i
            back = bb[:-v]
    #======================

    for i in back:
        if 'NN' in i[1]:
            k.append(i[0])
            [[] for i in k] 
    #================================


    for i, j in enumerate(front):
        if front[i][1][:2] == 'NN':
            k.append(front[i][0])
        else:
            break
    return(k)

>> solvr(bb)

Output:

['company',
 'whose',
 'has',
 'been',
 'on',
 'week',
 'already',
 'its',
 'graphics',
 'biggest',
 'cloud',
 'just']

What I expect from the code is that the result of each iteration goes into a new list, and that each list also contains the 'IN' string:

 [['company', 'whose', 'stock', 'has', 'been', 'on'],
 ['week', 'already', 'sells', 'its', 'graphics'],
 ['biggest', 'cloud', 'companies', 'just']]

If anyone can suggest changes to my code, I would appreciate it.

3 answers:

Answer 0 (score: 3)

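This seems like a good problem for itertools.groupby, which groups consecutive elements of a list together according to whether each element is true or false under a condition you define.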
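As a small illustration of that grouping behaviour, here is a sketch on made-up data (the numbers and the is_even key function are assumptions for this example, not part of the question). groupby starts a new group each time the key value changes between neighbouring elements, so equal keys that are not adjacent end up in separate groups:

```python
import itertools

# Hypothetical sample data, chosen only to show how consecutive runs are grouped.
numbers = [2, 4, 1, 3, 6, 8, 5]

def is_even(n):
    """Key function: True for even numbers, False for odd ones."""
    return n % 2 == 0

# groupby yields (key_value, iterator) pairs; a new pair starts whenever
# the key value differs from that of the previous element.
runs = [(key, list(group)) for key, group in itertools.groupby(numbers, is_even)]
print(runs)
# [(True, [2, 4]), (False, [1, 3]), (True, [6, 8]), (False, [5])]
```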

In your case, you could use the following:

import itertools

# Group consecutive pairs by whether the tag is 'IN' or starts with 'NN'.
groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])
result = [list(b) for a,b in groups if a]
result = [[w[0] for w in b] for b in result if 'IN' in [w[1] for w in b]]

print(result)

[['company', 'whose', 'stock', 'has', 'been', 'on'], 
 ['week', 'already', 'sells', 'its', 'graphics'], 
 ['biggest', 'cloud', 'companies', 'just', 'that']]

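This works because your original bb list is split into sublists every time the condition (the second element is 'IN' or starts with 'NN') changes from false to true or vice versa. If we display the groups, you can see how the list is split: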

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 

print([(a,list(b)) for a,b in groups])

[(False, [('The', 'RB')]),
 (True,
  [('company', 'NN'),
   ('whose', 'NNS'),
   ('stock', 'IN'),
   ('has', 'NNP'),
   ('been', 'NNS'),
   ('on', 'NNP')]),
 (False, [('tear', 'VBJ'), ('this', 'VB')]),
 (True,
  [('week', 'NNS'),
   ('already', 'NN'),
   ('sells', 'IN'),
   ('its', 'NNP'),
   ('graphics', 'NNS')]),
 (False, [('processing', 'VB'), ('units', 'VBJ')]),
 (True,
  [('biggest', 'NNS'),
   ('cloud', 'NN'),
   ('companies', 'IN'),
   ('just', 'NNP'),
   ('that', 'IN')])]

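The boolean indicates whether the list that follows contains elements that satisfy the condition or elements that do not. Now all you have to do is keep only the groups whose boolean is True (the condition is met), and then keep the sublists that contain 'IN' as one of the part-of-speech tags.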

Just for fun, if you want the whole solution as a single (long and nearly unreadable) one-liner, you could use:

[[w[0] for w in b] for b in [list(b) for a,b in itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN'])  if a] if 'IN' in [w[1] for w in b]]

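Edit

To keep only sublists in which an 'IN' word has at least one 'NN' word on each side, you can do the following.

Start from the same initial groups and result variables as before: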

groups = itertools.groupby(bb, lambda x: x[1][:2] in ['IN', 'NN']) 
result = [list(b) for a,b in groups if a]

Apply the same groupby function to each sublist, but this time with the condition that the part of speech equals 'IN':

result = [[(a,list(b)) for a,b in itertools.groupby(r, lambda x: x[1] == 'IN')] for r in result]

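Now iterate through result and, within each sublist, remove every element whose groupby boolean is True (the POS is 'IN') and which sits at the left or right edge of the sublist (index 0 or -1):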

result = [[b for i,(a,b) in enumerate(r) if (a and i not in [0,len(r)-1]) or not a] for r in result]

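Now that those are removed, we can join everything back together and drop the POS tags to get the correct output format (for details on the list-flattening syntax, see here):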

result = [[w[0] for sub in r for w in sub] for r in result]

print(result)

[['company', 'whose', 'stock', 'has', 'been', 'on'],
 ['week', 'already', 'sells', 'its', 'graphics'],
 ['biggest', 'cloud', 'companies', 'just']]
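The steps of the edit can also be collected into one function. The following is only a sketch: the name in_phrases and the shortened sample list are made up for illustration, and the final check that an interior 'IN' word remains is my reading of the requirement, not something the snippets above do.

```python
import itertools

def in_phrases(tagged):
    """Return runs of NN*/IN words, with 'IN' words trimmed from the run edges."""
    def is_candidate(pair):
        return pair[1][:2] in ('IN', 'NN')

    phrases = []
    for keep, run in itertools.groupby(tagged, is_candidate):
        if not keep:
            continue
        # Split the run into alternating IN / non-IN pieces.
        pieces = [(is_in, list(piece))
                  for is_in, piece in itertools.groupby(run, lambda pair: pair[1] == 'IN')]
        last = len(pieces) - 1
        # Drop an IN piece only when it sits at either edge of the run.
        kept = [(is_in, piece) for i, (is_in, piece) in enumerate(pieces)
                if not is_in or i not in (0, last)]
        # Assumption: a run is only a hit if an interior 'IN' word is left.
        if any(is_in for is_in, _ in kept):
            phrases.append([word for _, piece in kept for word, _ in piece])
    return phrases

# Shortened, hypothetical sample in the same (word, tag) format as bb.
sample = [('The', 'RB'), ('company', 'NN'), ('whose', 'NNS'), ('stock', 'IN'),
          ('has', 'NNP'), ('tear', 'VBJ'), ('cloud', 'NN'), ('companies', 'IN'),
          ('just', 'NNP'), ('that', 'IN')]
print(in_phrases(sample))
# [['company', 'whose', 'stock', 'has'], ['cloud', 'companies', 'just']]
```

Each run is consumed inside the loop body before the outer groupby advances, which matters because the group iterators it yields share the underlying input.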

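Answer 1 (score: 1)

I am not sure how well this matches your requirements, because it can yield the same noun as part of two different hits, e.g. [N*, N*, IN, N*, N*, IN, N*] -> [[N*, N*, IN, N*, N*], [N*, N*, IN, N*]]. If that is not what you want, you will need a different approach. Here you simply keep a back buffer and check whether the words satisfy the minimal requirement (N*, IN, N*). If they do, you build the full hit. I also use a generator, since this may be run over a lot of data.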

def solvr(bb):

    # keep a buffer of the previous tags
    back_buffer = []

    for i in range(len(bb)-1):

        word, tag = bb[i]
        _, next_tag = bb[i+1]

        # make sure there is a minimal hit of 3 tokens
        if tag == 'IN' and next_tag.startswith('N') and len(back_buffer) > 0:
            hit = back_buffer + [word]
            for it in bb[i+1:]:
                if it[1].startswith('N'):
                    hit.append(it[0])
                else:
                    break
            yield hit

        # add to the buffer
        if tag.startswith('N'):
            back_buffer.append(word)

        # reset the buffer as the sequence of N* tags has ended
        else:
            back_buffer = []
print(list(solvr(bb)))
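To make the overlap described above concrete, here is a sketch that runs the same logic on a made-up sequence with two 'IN' words. The function is a compact restatement of the generator under the hypothetical name iter_hits, and the sample words are invented for illustration:

```python
def iter_hits(tagged):
    """Compact restatement of the generator above, under a hypothetical name."""
    back_buffer = []
    for i in range(len(tagged) - 1):
        word, tag = tagged[i]
        if tag == 'IN' and tagged[i + 1][1].startswith('N') and back_buffer:
            hit = back_buffer + [word]
            # Extend the hit with the run of N* words that follows the 'IN' word.
            for next_word, next_tag in tagged[i + 1:]:
                if not next_tag.startswith('N'):
                    break
                hit.append(next_word)
            yield hit
        # Extend the buffer on N* tags, otherwise start over.
        back_buffer = back_buffer + [word] if tag.startswith('N') else []

# Made-up input: two 'IN' words that share the nouns between them.
sample = [('a', 'NN'), ('b', 'NN'), ('of', 'IN'), ('c', 'NN'),
          ('d', 'NN'), ('in', 'IN'), ('e', 'NN')]
hits = list(iter_hits(sample))
print(hits)
# [['a', 'b', 'of', 'c', 'd'], ['c', 'd', 'in', 'e']]
```

The words 'c' and 'd' appear in both hits, which is the behaviour the answer warns about.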

Answer 2 (score: 1)

Try this "magic":

>>> bb = [('The', 'RB'), ('company', 'NN'), ('whose', 'NNS'), ('stock', 'IN'), ('has', 'NNP'), ('been', 'NNS'), ('on', 'NNP'), ('tear', 'VBJ'), ('this', 'VB'), ('week', 'NNS'), ('already', 'NN'), ('sells', 'IN'), ('its', 'NNP'), ('graphics', 'NNS'), ('processing', 'VB'), ('units', 'VBJ'), ('biggest', 'NNS'), ('cloud', 'NN'), ('companies', 'IN'), ('just', 'NNP'), ('that', 'IN')]

>>> list(filter(None, map(str.strip, ' '.join([word if pos.startswith('NN') or pos == 'IN' else '|' for word, pos in bb]).split('|'))))
['company whose stock has been on', 'week already sells its graphics', 'biggest cloud companies just that']

Basically the same thing, without the crazy nesting:

tmp = []
answer = []
for word, pos in bb:
    if pos.startswith('NN') or pos == 'IN':
        tmp.append(word)
    else:
        if tmp:
            answer.append(' '.join(tmp))
            tmp = []

if tmp:  # Remember to flush out the last tmp.
    answer.append(' '.join(tmp))

You only need to iterate over bb once. This is similar to @bunji's answer using itertools.groupby.
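If lists of words are preferred over joined strings, as in the format the question asks for, the same single pass can collect lists instead. This is a sketch only; the name group_words and the short sample are made up for illustration:

```python
def group_words(tagged):
    """Single pass: collect runs of NN*/IN words as lists instead of strings."""
    runs, current = [], []
    for word, pos in tagged:
        if pos.startswith('NN') or pos == 'IN':
            current.append(word)
        elif current:
            # A non-matching tag ends the current run.
            runs.append(current)
            current = []
    if current:  # flush the final run
        runs.append(current)
    return runs

# Shortened, hypothetical sample in the same (word, tag) format as bb.
sample = [('The', 'RB'), ('cloud', 'NN'), ('companies', 'IN'),
          ('just', 'NNP'), ('tear', 'VBJ'), ('that', 'IN')]
print(group_words(sample))
# [['cloud', 'companies', 'just'], ['that']]
```

Like the loop above, this does not check that an 'IN' word has nouns on both sides, so a lone 'IN' word such as 'that' still forms its own run.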