Question

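I am writing a program that reads in a number of files and then indexes the terms in them. I am able to read the files into a 2D array (list) in Python, but then I need to remove the duplicates in the first column and store the DocID of each duplicate in a new column on the row where the word first occurs.

For example: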

['when', 1]
['yes', 1]
['', 1]
['greg', 1]
['17', 1]
['when',2]

The first column is the term and the second column is the DocID it came from. I want to be able to change it to:

['when', 1, 2]
['yes', 1]
['', 1]
['greg', 1]
['17', 1]

with the duplicate removed.

This is what I have so far:

for j in range(0,len(index)):
        for r in range(1,len(index)):
                if index[j][0] == index[r][0]:
                        index[j].append(index[r][1])
                        index.remove(index[r])

I keep getting an out-of-range error at:

if index[j][0] == index[r][0]:

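I think this is because I am removing an object from index, so it keeps getting smaller. Any ideas would be much appreciated. (And yes, I know I should not be modifying the original, but this is just a small-scale test.)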

Answer 1

Building a dict / defaultdict would probably be more appropriate.

Something like:

from collections import defaultdict

ar = [['when', 1],
      ['yes', 1],
      ['', 1],
      ['greg', 1],
      ['17', 1],
      ['when',2]] 

result = defaultdict(list)
for lst in ar:
    result[lst[0]].append(lst[1])

Output (the key order may differ depending on the Python version):

>>> for k,v in result.items():
...     print(repr(k),v)
'' [1]
'yes' [1]
'greg' [1]
'when' [1, 2]
'17' [1]
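
A small usage sketch, assuming the same data as above: once the dictionary is built, looking a term up gives its DocIDs directly. One caveat is that plain indexing on a defaultdict creates the key if it is missing, so .get() (or an "in" test) is the safer way to query. The term 'missing' is only an illustrative name.

```python
from collections import defaultdict

ar = [['when', 1], ['yes', 1], ['', 1], ['greg', 1], ['17', 1], ['when', 2]]

result = defaultdict(list)
for term, doc_id in ar:
    result[term].append(doc_id)

# Look up the DocIDs for a term that is present.
when_docs = result['when']

# Query an absent term without creating an entry for it.
missing_docs = result.get('missing', [])
size_after_get = len(result)

# Plain indexing on a defaultdict inserts the missing key as a side effect.
_ = result['missing']
size_after_index = len(result)

print(when_docs, missing_docs, size_after_get, size_after_index)
```

If the index is only queried after it has been built, converting it with dict(result) avoids the accidental inserts altogether.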

Answer 2

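Yes, your error comes from modifying the list while looping over it. Also, your solution is inefficient for long lists. It is better to use a dictionary and then convert it back to a list at the end: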

from collections import defaultdict
od = defaultdict(list)

for term, doc_id in index:
    od[term].append(doc_id)

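# Python 3: dict.items() replaces Python 2's iteritems(), and print is a function.
result = [[term] + doc_ids for term, doc_ids in od.items()]

print(result)
# On Python 3.7+, where dicts keep insertion order:
# [['when', 1, 2], ['yes', 1], ['', 1], ['greg', 1], ['17', 1]]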
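
To see why the loop from the question fails, here is a minimal reproduction using the sample data (the try/except wrapper and the error variable are additions for illustration). range(len(index)) is evaluated once, before any rows are removed, so r eventually points past the end of the shortened list. The inner loop also compares a row with itself whenever r equals j, which removes rows that have no duplicate at all.

```python
index = [['when', 1], ['yes', 1], ['', 1], ['greg', 1], ['17', 1], ['when', 2]]

error = None
try:
    # Both ranges are computed while the list still has 6 rows.
    for j in range(0, len(index)):
        for r in range(1, len(index)):
            # When r == j this compares a row with itself and "merges" it.
            if index[j][0] == index[r][0]:
                index[j].append(index[r][1])
                # The list shrinks here, but the ranges above do not.
                index.remove(index[r])
except IndexError as exc:
    error = exc

print(len(index), repr(error))
```

With this data the loop stops with an IndexError after the list has shrunk, which matches the error described in the question.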

Answer 3

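Actually, you can do this with range() and len(). However, the nice thing about Python is that you can iterate directly over the elements of a list without using indices.

Take a look at this code and try to understand it.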

#!/usr/bin/env python

def main():

    tot_array = \
    [ ['when', 1],
      ['yes', 1],
      ['', 1],
      ['greg', 1],
      ['17', 1],
      ['when',2]
    ]

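    for aList1 in tot_array:
        # Iterate over a copy so that removing rows does not skip elements.
        for aList2 in tot_array[:]:
            # 'is not' keeps a row from being merged with itself.
            if aList1[0] == aList2[0] and aList1 is not aList2:
                aList1.append(aList2[1])
                tot_array.remove(aList2)
    print(tot_array)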

if __name__ == '__main__':
    main()

The output is as follows:

*** Remote Interpreter Reinitialized  ***
>>> 
[['when', 1, 2], ['yes', 1], ['', 1], ['greg', 1], ['17', 1]]
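
As an alternative sketch that still iterates without indices but does not modify the list being looped over, the rows can be merged into a new list. The function name merge_duplicates and the helper dictionary row_for_term are hypothetical names chosen for this example, not part of the code above.

```python
def merge_duplicates(rows):
    """Return new rows with repeated terms merged, in first-seen order."""
    merged = []        # output rows, e.g. ['when', 1, 2]
    row_for_term = {}  # term -> the row in `merged` that holds it
    for term, doc_id in rows:
        row = row_for_term.get(term)
        if row is None:
            # First time this term is seen: start a new output row.
            row = [term]
            row_for_term[term] = row
            merged.append(row)
        row.append(doc_id)
    return merged


tot_array = [['when', 1], ['yes', 1], ['', 1], ['greg', 1], ['17', 1], ['when', 2]]
merged = merge_duplicates(tot_array)
print(merged)
# [['when', 1, 2], ['yes', 1], ['', 1], ['greg', 1], ['17', 1]]
```

Because each row is visited once and the lookup is a dictionary access, this avoids the nested loop, and the input list is left unchanged.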

Reducing duplicates in a Python list

3 answers: