Finding duplicate lines: a faster way

Time: 2016-03-26 08:24:02

Tags: python sorting python-3.x

This is how I find all duplicate lines in a text file:

import regex  # third-party regex module, used here like the built-in re
#capture all lines in buffer
r = f.readlines()
#create list of all linenumbers
lines = list(range(1,endline+1))
#merge both lists
z=[list(a) for a in zip(r, lines)]

#sort list
newsorting = sorted(z)

#put doubles in list
listdoubles = []
for i in range(len(newsorting)-1):
    # equal neighbours in the sorted list are duplicates; blank lines are skipped
    if (newsorting[i][0] == newsorting[i+1][0]) and (not regex.search(r'^\s*$',newsorting[i][0])):
        listdoubles.append(newsorting[i][1])
        listdoubles.append(newsorting[i+1][1])

#remove any line numbers that were added more than once
listdoubles = list(set(listdoubles))
#sort line numeric
listdoubles = sorted(listdoubles, key=int)
print(listdoubles)

But it is slow. With more than 10,000 lines, building this list takes 10 seconds.

Is there a way to do this faster?

1 answer:

Answer 0: (score: 4)

You can use a simpler approach:

For each line:
  • if it has been seen before, display it
  • otherwise, add it to the set of known lines

In code:

seen = set()
for L in f:
    if L in seen:
        print(L)
    else:
        seen.add(L)
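
As a self-contained sketch of the loop above, the same idea can be wrapped in a function and run against an in-memory file. The function name find_repeated_lines and the sample text are made up for illustration; f is assumed to be any open text file or other iterable of lines.

```python
import io


def find_repeated_lines(f):
    """Yield every line of f whose exact text already appeared earlier."""
    seen = set()
    for line in f:
        if line in seen:
            yield line
        else:
            seen.add(line)


# Hypothetical sample input standing in for a real file.
sample = io.StringIO("a\nb\na\nc\nb\na\n")
repeated = list(find_repeated_lines(sample))
print(repeated)
```

Note that lines are compared including their trailing newline, so a final line without a newline would not match an otherwise identical earlier line.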

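If you want to display the line numbers where duplicates occur, the code can simply be changed to use a dictionary that maps each line's content to the line number where that text was first seen: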

seen = {}
for n, L in enumerate(f):
    if L in seen:
        print("Line %i is a duplicate of line %i" % (n, seen[L]))
    else:
        seen[L] = n
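
The dictionary approach can also be adapted to produce the kind of list the question builds: the sorted numbers of every line that takes part in a duplicate, with blank lines ignored. The following is a minimal sketch under those assumptions; the name duplicate_line_numbers is hypothetical, and line numbers are made 1-based with enumerate(f, start=1) to match the question (the snippet above counts from 0).

```python
import io


def duplicate_line_numbers(f):
    """Return sorted 1-based numbers of all non-blank lines whose text occurs more than once."""
    first_seen = {}   # line text -> line number of its first occurrence
    doubles = set()   # line numbers involved in any duplication
    for n, line in enumerate(f, start=1):
        if not line.strip():
            continue  # ignore empty and whitespace-only lines
        if line in first_seen:
            doubles.add(first_seen[line])
            doubles.add(n)
        else:
            first_seen[line] = n
    return sorted(doubles)


# Hypothetical sample input: lines 1 and 4 match, lines 3 and 6 match.
sample = io.StringIO("x\n\ny\nx\n\ny\nz\n")
numbers = duplicate_line_numbers(sample)
print(numbers)
```

This makes a single pass over the input and needs no sort of the lines themselves; only the (usually much smaller) set of duplicate line numbers is sorted at the end.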

Both dict and set in Python are hash-based and provide constant-time lookup on average.
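
To get a rough feel for that difference, one can time a membership test against a list and against a set holding the same items. This is only an illustrative sketch: the collection size, the probe value, and the repeat count are arbitrary choices, and the absolute timings depend on the machine.

```python
import timeit

items = [str(i) for i in range(10_000)]
as_list = items
as_set = set(items)
probe = "9999"  # last element: the list has to scan all the way to the end

# "in" on a list compares element by element; on a set it hashes the probe.
list_time = timeit.timeit(lambda: probe in as_list, number=200)
set_time = timeit.timeit(lambda: probe in as_set, number=200)
print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```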

Edit

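If you only need the line number of the last duplicate of each line, the output obviously cannot be produced during processing; the whole input has to be processed first, before any output is emitted...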

# lastdup will be a map from line content to the line number the
# last duplicate was found. On first insertion the value is None
# to mark the line is not a duplicate
lastdup = {}
for n, L in enumerate(f):
    if L in lastdup:
        lastdup[L] = n
    else:
        lastdup[L] = None

# Now all values that are not None are the last duplicate of a line
result = sorted(x for x in lastdup.values() if x is not None)
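
For a quick check of the last-duplicate logic, the same steps can be put in a function and run on a small in-memory input. The function name last_duplicate_numbers and the sample text are assumptions for illustration, and line numbers here start at 1 rather than 0.

```python
import io


def last_duplicate_numbers(f):
    """Return the sorted 1-based line numbers of the last repeat of each duplicated line."""
    lastdup = {}
    for n, line in enumerate(f, start=1):
        # None on first sight; overwritten with n on every later occurrence.
        lastdup[line] = n if line in lastdup else None
    return sorted(n for n in lastdup.values() if n is not None)


# Hypothetical sample: "a" appears on lines 1, 3, 5 and "b" on lines 2, 6.
sample = io.StringIO("a\nb\na\nc\na\nb\n")
last_numbers = last_duplicate_numbers(sample)
print(last_numbers)
```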