How do I filter out the sublists of a list of lists that do not contain all the elements of another list?

Asked: 2017-05-17 23:22:25

Tags: python python-3.x nlp

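I am trying to exclude, from the list of lists below, every sublist that does not contain a specific set of POS tags, but I cannot get it to work.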

a = ['VBG', 'RB', 'NNP']

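From the following list of lists of tuples, I only want the sublists that contain all of the tags above. (The tags below may not be linguistically correct; they are for illustration only.)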

  data = [[('User', 'NNP'),
      ('is', 'VBG'),
      ('not', 'RB'),
      ('able', 'JJ'),
      ('to', 'TO'),
      ('order', 'NN'),
      ('products', 'NNS'),
      ('from', 'IN'),
      ('iShopCatalog', 'NN'),
      ('Coala', 'NNP'),
      ('excluding', 'VBG'),
      ('articles', 'NNS'),
      ('from', 'IN'),
      ('VWR', 'NNP')],
     [('Arfter', 'NNP'),
      ('transferring', 'VBG'),
      ('the', 'DT'),
      ('articles', 'NNS'),
      ('from', 'IN'),
      ('COALA', 'NNP'),
      ('to', 'TO'),
      ('SRM', 'VB'),
      ('the', 'DT'),
      ('Category', 'NNP'),
      ('S9901', 'NNP'),
      ('Dummy', 'NNP'),
      ('is', 'VBZ'),
      ('maintained', 'VBN')],
     [('Due', 'JJ'),
      ('to', 'TO'),
      ('this', 'DT'),
      ('the', 'DT'),
      ('user', 'NN'),
      ('is', 'VBZ'),
      ('not', 'RB'),
      ('able', 'JJ'),
      ('to', 'TO'),
      ('order', 'NN'),
      ('the', 'DT'),
      ('product', 'NN')],
     [('All', 'DT'),
      ('other', 'JJ'),
      ('users', 'NNS'),
      ('can', 'MD'),
      ('order', 'NN'),
      ('these', 'DT'),
      ('articles', 'NNS')],
     [('She', 'PRP'),
      ('can', 'MD'),
      ('order', 'NN'),
      ('other', 'JJ'),
      ('products', 'NNS'),
      ('from', 'IN'),
      ('a', 'DT'),
      ('POETcatalog', 'NNP'),
      ('without', 'IN'),
      ('any', 'DT'),
      ('problems', 'NNS')],
     [('Furtheremore', 'IN'),
      ('she', 'PRP'),
      ('is', 'VBZ'),
      ('able', 'JJ'),
      ('to', 'TO'),
      ('order', 'NN'),
      ('products', 'NNS'),
      ('from', 'IN'),
      ('the', 'DT'),
      ('Vendor', 'NNP'),
      ('VWR', 'NNP'),
      ('through', 'IN'),
      ('COALA', 'NNP')],
     [('But', 'CC'),
      ('articles', 'NNP'),
      ('from', 'VBG'),
      ('all', 'RB'),
      ('other', 'JJ'),
      ('suppliers', 'NNS'),
      ('are', 'NNP'),
      ('not', 'VBG'),
      ('orderable', 'RB')],
     [('I', 'PRP'),
      ('already', 'RB'),
      ('spoke', 'VBD'),
      ('to', 'TO'),
      ('anic', 'VB'),
      ('who', 'WP'),
      ('maintain', 'VBP'),
      ('the', 'DT'),
      ('catalog', 'NN'),
      ('COALA', 'NNP'),
      ('and', 'CC'),
      ('they', 'PRP'),
      ('said', 'VBD'),
      ('that', 'IN'),
      ('the', 'DT'),
      ('reason', 'NN'),
      ('should', 'MD'),
      ('be', 'VB'),
      ('the', 'DT'),
      ('assignment', 'NN'),
      ('of', 'IN'),
      ('the', 'DT'),
      ('plant', 'NN')],
     [('User', 'NNP'),
      ('is', 'VBZ'),
      ('a', 'DT'),
      ('assinged', 'JJ'),
      ('to', 'TO'),
      ('Universitaet', 'NNP'),
      ('Regensburg', 'NNP'),
      ('in', 'IN'),
      ('Scout', 'NNP'),
      ('but', 'CC'),
      ('in', 'IN'),
      ('P17', 'NNP'),
      ('table', 'NN'),
      ('YESRMCDMUSER01', 'NNP'),
      ('she', 'PRP'),
      ('is', 'VBZ'),
      ('assigned', 'VBN'),
      ('to', 'TO'),
      ('company', 'NN'),
      ('001500', 'CD'),
      ('Merck', 'NNP'),
      ('KGaA', 'NNP')],
     [('Please', 'NNP'),
      ('find', 'VB'),
      ('attached', 'JJ'),
      ('some', 'DT'),
      ('screenshots', 'NNS')]]

My expected output is:

data = [[('User', 'NNP'),
  ('is', 'VBG'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('iShopCatalog', 'NN'),
  ('Coala', 'NNP'),
  ('excluding', 'VBG'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('VWR', 'NNP')],
  [('But', 'CC'),
  ('articles', 'NNP'),
  ('from', 'VBG'),
  ('all', 'RB'),
  ('other', 'JJ'),
  ('suppliers', 'NNS'),
  ('are', 'NNP'),
  ('not', 'VBG'),
  ('orderable', 'RB')]]

I tried to do this with the following code, but it does not work:

list1=[]
for i in data:
    list2 = []
    a = ['VBG', 'RB', 'NNP']
    for j in i:
        if all(i in j[1] for i in a):
            list2.append(j)
    list1.append(list2)
list1

It returns a list of empty lists. Can anyone provide simple, easy-to-understand code that produces my expected output? Thanks.

2 Answers:

Answer 0 (score: 3)

Your condition:

if all(i in j[1] for i in a):

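asks whether all of the tags in a are in j[1], and only then appends the item. But j[1] is a single tag, so at most one of them can match (given your data), which is why you get empty lists. Instead, you want: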

In [32]: from operator import itemgetter
    ...: list1=[]
    ...: a = ['VBG', 'RB', 'NNP']
    ...: for sub in data:
    ...:     tags = set(map(itemgetter(1), sub))
    ...:     if all(s in tags for s in a):
    ...:         list1.append(sub)
    ...:

This checks whether all the items in a are in the set of tags for each sublist, and keeps the whole sublist when they are:

In [33]: list1
Out[33]:
[[('User', 'NNP'),
  ('is', 'VBG'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('iShopCatalog', 'NN'),
  ('Coala', 'NNP'),
  ('excluding', 'VBG'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('VWR', 'NNP')],
 [('But', 'CC'),
  ('articles', 'NNP'),
  ('from', 'VBG'),
  ('all', 'RB'),
  ('other', 'JJ'),
  ('suppliers', 'NNS'),
  ('are', 'NNP'),
  ('not', 'VBG'),
  ('orderable', 'RB')]]
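
To make the difference between the two checks concrete, here is a minimal sketch. The shortened `sample` list and the name `required` are illustrative stand-ins, not part of the question's data.

```python
required = ['VBG', 'RB', 'NNP']

# Hypothetical, shortened stand-in for the question's `data`.
sample = [
    [('User', 'NNP'), ('is', 'VBG'), ('not', 'RB'), ('able', 'JJ')],
    [('All', 'DT'), ('other', 'JJ'), ('users', 'NNS')],
    [('Please', 'NNP'), ('find', 'VB')],
]

# Per-tuple check (the question's approach): pair[1] is one tag string,
# so it cannot contain 'VBG', 'RB' and 'NNP' at the same time.
per_tuple = [all(tag in pair[1] for tag in required)
             for sub in sample for pair in sub]

# Per-sublist check: gather each sublist's tags into a set, then require
# that every wanted tag is present somewhere in that sublist.
kept = [sub for sub in sample
        if set(required) <= {tag for _, tag in sub}]

print(any(per_tuple))  # no single tuple ever satisfies the condition
print(kept)            # only the first sublist has all three tags
```

The `<=` operator on sets is the subset test, so it should behave the same as the `all(s in tags for s in a)` loop above while keeping the filter to a single comprehension.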

Answer 1 (score: 2)

This solution may look strange, but it works:

a = set(a)  # the required tags as a set: {'VBG', 'RB', 'NNP'}
def match(x):
    # Split the (word, tag) pairs into parallel tuples of words and tags.
    words, tags = zip(*x)
    # True when every required tag occurs among this sublist's tags.
    return set(tags) & a == a
list(filter(match, data))
# [[('User', 'NNP'), ('is', 'VBG'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'),
#   ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('iShopCatalog', 'NN'),
#   ('Coala', 'NNP'), ('excluding', 'VBG'), ('articles', 'NNS'), ('from', 'IN'),
#   ('VWR', 'NNP')],
#  [('But', 'CC'), ('articles', 'NNP'), ('from', 'VBG'), ('all', 'RB'),
#   ('other', 'JJ'), ('suppliers', 'NNS'), ('are', 'NNP'), ('not', 'VBG'),
#   ('orderable', 'RB')]]
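
One caveat with zip(*x): unpacking its result into two names raises a ValueError when a sublist is empty. The variant below is a sketch that tolerates empty sublists, assuming every item is a (word, tag) pair. The helper name `has_all_tags` and the `sample` data are hypothetical and do not come from the answers above.

```python
def has_all_tags(sentence, required):
    """Return True if every tag in `required` occurs in `sentence`."""
    # A set comprehension works for an empty sentence too (it yields set()).
    tags = {tag for _word, tag in sentence}
    return set(required).issubset(tags)


# Hypothetical sample data, including an empty sublist.
sample = [
    [('But', 'CC'), ('articles', 'NNP'), ('from', 'VBG'), ('all', 'RB')],
    [],
    [('She', 'PRP'), ('can', 'MD'), ('order', 'NN')],
]

result = [s for s in sample if has_all_tags(s, ['VBG', 'RB', 'NNP'])]
print(result)
```

Passing the required tags as an argument, rather than rebinding a global `a` to a set, also keeps the original list untouched for other uses.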