How do I search ranges within a list of lists?

Asked: 2017-05-18 03:39:41

Tags: python python-3.x nlp

I want to find the POS tags that occur between two positions, where the positions are the index values of the NNP tags.

data = [[('User', 'NNP'),
  ('is', 'VBG'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('iShopCatalog', 'NN'),
  ('Coala', 'NNP'),
  ('excluding', 'VBG'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('VWR', 'NNP')],
 [('Arfter', 'NNP'),
  ('transferring', 'VBG'),
  ('the', 'DT'),
  ('articles', 'NNS'),
  ('from', 'IN'),
  ('COALA', 'NNP'),
  ('to', 'TO'),
  ('SRM', 'VB'),
  ('the', 'DT'),
  ('Category', 'NNP'),
  ('S9901', 'NNP'),
  ('Dummy', 'NNP'),
  ('is', 'VBZ'),
  ('maintained', 'VBN')],
 [('Due', 'JJ'),
  ('to', 'TO'),
  ('this', 'DT'),
  ('the', 'DT'),
  ('user', 'NN'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('the', 'DT'),
  ('product', 'NN')],
 [('All', 'DT'),
  ('other', 'JJ'),
  ('users', 'NNS'),
  ('can', 'MD'),
  ('order', 'NN'),
  ('these', 'DT'),
  ('articles', 'NNS')],
 [('She', 'PRP'),
  ('can', 'MD'),
  ('order', 'NN'),
  ('other', 'JJ'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('a', 'DT'),
  ('POETcatalog', 'NNP'),
  ('without', 'IN'),
  ('any', 'DT'),
  ('problems', 'NNS')],
 [('Furtheremore', 'IN'),
  ('she', 'PRP'),
  ('is', 'VBZ'),
  ('able', 'JJ'),
  ('to', 'TO'),
  ('order', 'NN'),
  ('products', 'NNS'),
  ('from', 'IN'),
  ('the', 'DT'),
  ('Vendor', 'NNP'),
  ('VWR', 'NNP'),
  ('through', 'IN'),
  ('COALA', 'NNP')],
 [('But', 'CC'),
  ('articles', 'NNP'),
  ('from', 'VBG'),
  ('all', 'RB'),
  ('other', 'JJ'),
  ('suppliers', 'NNS'),
  ('are', 'NNP'),
  ('not', 'VBG'),
  ('orderable', 'RB')],
 [('I', 'PRP'),
  ('already', 'RB'),
  ('spoke', 'VBD'),
  ('to', 'TO'),
  ('anic', 'VB'),
  ('who', 'WP'),
  ('maintain', 'VBP'),
  ('the', 'DT'),
  ('catalog', 'NN'),
  ('COALA', 'NNP'),
  ('and', 'CC'),
  ('they', 'PRP'),
  ('said', 'VBD'),
  ('that', 'IN'),
  ('the', 'DT'),
  ('reason', 'NN'),
  ('should', 'MD'),
  ('be', 'VB'),
  ('the', 'DT'),
  ('assignment', 'NN'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('plant', 'NN')],
 [('User', 'NNP'),
  ('is', 'VBZ'),
  ('a', 'DT'),
  ('assinged', 'JJ'),
  ('to', 'TO'),
  ('Universitaet', 'NNP'),
  ('Regensburg', 'NNP'),
  ('in', 'IN'),
  ('Scout', 'NNP'),
  ('but', 'CC'),
  ('in', 'IN'),
  ('P17', 'NNP'),
  ('table', 'NN'),
  ('YESRMCDMUSER01', 'NNP'),
  ('she', 'PRP'),
  ('is', 'VBZ'),
  ('assigned', 'VBN'),
  ('to', 'TO'),
  ('company', 'NN'),
  ('001500', 'CD'),
  ('Merck', 'NNP'),
  ('KGaA', 'NNP')],
 [('Please', 'NNP'),
  ('find', 'VB'),
  ('attached', 'JJ'),
  ('some', 'DT'),
  ('screenshots', 'NNS')]]

Here is my code.

list1 = []
list4 = []
for i in data:
    list2 = []
    list3 = []
    for l,j in enumerate(i):
        if j[1] == 'NNP':
            list2.append(l)
            list3.append(j[0])
    list1.append(list2)
    list4.append(list3)

Output:

list1:

[[0, 9, 13],
 [0, 5, 9, 10, 11],
 [],
 [],
 [7],
 [9, 10, 12],
 [1, 6],
 [9],
 [0, 5, 6, 8, 11, 13, 20, 21],
 [0]]

list4:

[['User', 'Coala', 'VWR'],
 ['Arfter', 'COALA', 'Category', 'S9901', 'Dummy'],
 [],
 [],
 ['POETcatalog'],
 ['Vendor', 'VWR', 'COALA'],
 ['articles', 'are'],
 ['COALA'],
 ['User',
  'Universitaet',
  'Regensburg',
  'Scout',
  'P17',
  'YESRMCDMUSER01',
  'Merck',
  'KGaA'],
 ['Please']]

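From list1 and list4 I can get the NNP strings and their indices. But for each inner list, I want to use the NNP index values to find out whether any VB, RB, or JJ tags occur between consecutive NNP tags.

For example, in the first inner list, how can I write code that searches the ranges (0-9) and (9-13) for tokens tagged VB, RB, or JJ?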

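2 Answers:

Answer 0: (score: 1)

Use a list comprehension that zips list1[0] with an offset copy of itself to get the start and end index of each range.
It outputs the ranges for which any element of the slice data[0][j:k] has a matching tag. The match is exact, so 'VBG' does not count as 'VB'.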

[[j, k] for j, k in zip(list1[0][:], list1[0][1:])
        if any(t[1] in ['VB', 'RB', 'JJ'] for t in data[0][j:k])]

Out[107]: [[0, 9]]
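
The comprehension above only looks at the first sentence. A minimal sketch of how it might be extended to every inner list follows; the helper name nnp_ranges_with_targets and the constant TARGET_TAGS are made up for illustration, and only the first two sentences of the question's data are copied in so the snippet runs on its own.

```python
# Two sentences copied from the question's data, to keep the sketch self-contained.
data = [
    [('User', 'NNP'), ('is', 'VBG'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'),
     ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('iShopCatalog', 'NN'),
     ('Coala', 'NNP'), ('excluding', 'VBG'), ('articles', 'NNS'), ('from', 'IN'),
     ('VWR', 'NNP')],
    [('Arfter', 'NNP'), ('transferring', 'VBG'), ('the', 'DT'), ('articles', 'NNS'),
     ('from', 'IN'), ('COALA', 'NNP'), ('to', 'TO'), ('SRM', 'VB'), ('the', 'DT'),
     ('Category', 'NNP'), ('S9901', 'NNP'), ('Dummy', 'NNP'), ('is', 'VBZ'),
     ('maintained', 'VBN')],
]

TARGET_TAGS = {'VB', 'RB', 'JJ'}  # hypothetical name: tags to look for between NNPs


def nnp_ranges_with_targets(sentence, targets=TARGET_TAGS):
    """Return [start, end] pairs of consecutive NNP indices whose slice holds a target tag."""
    # Same indices the question stores in list1, computed per sentence.
    nnp_idx = [i for i, (_, tag) in enumerate(sentence) if tag == 'NNP']
    # Pair each NNP index with the next one and keep pairs with an exact tag match.
    return [[j, k] for j, k in zip(nnp_idx, nnp_idx[1:])
            if any(tag in targets for _, tag in sentence[j:k])]


result = [nnp_ranges_with_targets(s) for s in data]
print(result)
```

Sentences with fewer than two NNP tags simply yield an empty list, because zip has no pairs to produce.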

Answer 1: (score: 0)

Assuming I understood your question correctly, the following should work:

search_list = ['VB', 'RB', 'JJ']
for index, indices in enumerate(list1):  # 'indices' rather than 'set', which would shadow the builtin
    temp = indices[::-1]  # reversed copy, so pop() yields the indices in their original order
    while len(temp) > 1:
        first = temp.pop()  # removes the last item (the first item of indices); this drives the while loop
        second = temp[-1]  # references the next item (the new last item)
        for i in range(first, second + 1):  # search all indices from first to second, inclusive
            if data[index][i][1] in search_list:  # index data by the same index as the current list1 item
                do_stuff()  # placeholder: replace with your own handling of the match

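Basically:

  1. Use enumerate in the outer for loop to keep an index that runs parallel to the original data.
  2. Make a copy of each list in list1 that you can freely mutate. I made a reversed copy because I personally don't like using pop() with an index, so when I want to repeatedly pop the first item of a list, I reverse the list. You could make a regular copy instead and use list.pop(0) to remove and return the first item.
  3. Pop the last (originally first) item of the list and reference the next item.
  4. Use these two items to build the range of indices into data and check the tags at those positions.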
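
The steps above can also be sketched without copying or popping, by pairing each index with its neighbour through zip. This is only an illustration, not the answer's own code: the function name matches_between is hypothetical, it skips the two NNP endpoints themselves, and it returns the matching tokens instead of calling a handler.

```python
def matches_between(indices, sentence, search_tags=('VB', 'RB', 'JJ')):
    """Map each (first, second) pair of neighbouring indices to the matching tokens between them."""
    found = {}
    for first, second in zip(indices, indices[1:]):  # neighbouring pairs, no pop() needed
        # Look strictly between the two NNP positions; tag comparison is exact.
        hits = [tok for tok in sentence[first + 1:second] if tok[1] in search_tags]
        if hits:
            found[(first, second)] = hits
    return found


# First sentence and its NNP indices, copied from the question (data[0] and list1[0]).
sentence = [('User', 'NNP'), ('is', 'VBG'), ('not', 'RB'), ('able', 'JJ'), ('to', 'TO'),
            ('order', 'NN'), ('products', 'NNS'), ('from', 'IN'), ('iShopCatalog', 'NN'),
            ('Coala', 'NNP'), ('excluding', 'VBG'), ('articles', 'NNS'), ('from', 'IN'),
            ('VWR', 'NNP')]
nnp_indices = [0, 9, 13]

found = matches_between(nnp_indices, sentence)
print(found)
```

Returning a dictionary keyed by the index pair keeps the range and its hits together, which may be easier to post-process than a side effect inside nested loops.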