Question

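I wrote some code that takes a text file as input and prints only the variants that occur more than once. By variant, I mean the chr position in the text file.

The input file looks like this: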

chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406
chr1 1049083 1049083 C intronic C1orf159 0.13 rs4970407
chr1 1049083 1049083 C intronic C1orf159 0.13 rs4970407
chr1 1113121 1113121 G intronic TTLL10 0.13 rs12092254

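As you can see, rows 2 and 3 are duplicates. I only take the first 3 columns and check whether they are the same. Here, chr1 1049083 1049083 appears in both row 2 and row 3, so I print it out, reporting that there is a duplicate and where it is.

I wrote the code below. Although it does what I want, it is quite slow: it takes about 5 minutes to run on a file with 700,000 rows. I would like to know whether there is a way to speed it up.

Thanks!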

#!/usr/bin/env python
""" takes in a input file and 
    prints out only the variants that occur more than once """

import shlex
import collections

rows = open('variants.txt', 'r').read().split("\n")
# removing the header and storing it in a new variable
header = rows.pop()
indices = []

for row in rows:
    var = shlex.split(row)
    indices.append("_".join(var[0:3]))

dup_list = []
ind_tuple = collections.Counter(indices).items()

for x, y in ind_tuple:
    if y>1:
        dup_list.append(x)

print dup_list    
print len(dup_list)

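Note: In this case, the whole of row 2 is a duplicate of row 3, but that is not always so. What I am looking for is duplicate chr positions (the first three columns).

EDIT: I edited the code as per damienfrancois's suggestion. Below is my new code: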

f = open('variants.txt', 'r')
indices = {}
for line in f:
    row = line.rstrip()   
    var = shlex.split(row)
    index = "_".join(var[0:3])
    if indices.has_key(index):
        indices[index] = indices[index] + 1
    else:
        indices[index] = 1

dup_pos = 0        
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1

print dup_pos

I used time to see how long each version of the code takes.

My original code:

time run remove_dup.py 
14428 
CPU times: user 181.75 s, sys: 2.46 s,total: 184.20 s 
Wall time: 209.31 s

Modified code:

time run remove_dup2.py 
14428
CPU times: user 177.99 s, sys: 2.17 s, total: 180.16 s
Wall time: 222.76 s

I don't see any noticeable improvement in the timing.

Answer 1

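A few suggestions:

do not read the whole file at once; read it line by line and process it on the fly; this will save you memory operations
make indices a defaultdict and increment the value at key "_".join(var[0:3]); this avoids the collections.Counter(indices).items() step, which is probably costly (that is a guess; a profiler should confirm it)
try pypy or a Python compiler
split your data into as many subsets as your computer has cores, run the program on each subset in parallel, then merge the results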
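
The first two suggestions can be combined in a few lines. What follows is only a sketch, not the asker's code: the function name count_positions and the sample rows are made up for illustration, and it is written to run on Python 3, whereas the thread itself uses Python 2 print statements. It streams over any iterable of lines, such as an open file, and counts each position key in a collections.defaultdict.

```python
from collections import defaultdict


def count_positions(lines):
    """Count how often each chr/start/end triple occurs (hypothetical helper)."""
    counts = defaultdict(int)  # missing keys start at 0
    for line in lines:
        fields = line.split()  # plain whitespace split
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        counts["_".join(fields[:3])] += 1
    return counts


# Made-up sample rows laid out like the question's input.
sample = [
    "chr1 1048989 1048989 A G intronic C1orf159 0.16 rs4970406",
    "chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407",
    "chr1 1049083 1049083 C A intronic C1orf159 0.13 rs4970407",
]
counts = count_positions(sample)
duplicates = [key for key, n in counts.items() if n > 1]
print(duplicates)
```

With a real file, the open file object can be passed directly, for example count_positions(open('variants.txt')), so the file is never held in memory as a whole.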

HTH

Answer 2

The big time sink is probably the if..has_key() portion of the code. In my experience, try-except is a lot faster...

f = open('variants.txt', 'r')
indices = {}
for line in f: 
    var = line.split()
    index = "_".join(var[0:3])
    try:
        indices[index] += 1
    except KeyError:
        indices[index] = 1
f.close()
dup_pos = 0        
for key, value in indices.items():
    if value > 1:
        dup_pos = dup_pos + 1

print dup_pos

Another option is to replace the four try-except lines with:

indices[index] = 1 + indices.get(index,0)

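This approach only tells you how many of the lines are duplicated, not how many times each one is repeated. (So if one line is duped 3x, it will still count as one...)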
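
To make the distinction concrete, here is a small sketch with made-up keys (not the asker's data), written for Python 3. From the same dictionary of counts it derives three different numbers: how many distinct positions are duplicated, how many rows belong to a duplicated position, and how many rows are surplus beyond the first occurrence.

```python
# Made-up position keys: one position occurs three times, one twice.
keys = [
    "chr1_100_100",
    "chr1_200_200",
    "chr1_200_200",
    "chr1_200_200",
    "chr1_300_300",
    "chr1_300_300",
]

indices = {}
for key in keys:
    indices[key] = 1 + indices.get(key, 0)

# Distinct positions seen more than once (what the loop over indices counts).
dup_positions = sum(1 for n in indices.values() if n > 1)
# Rows that belong to some duplicated position.
rows_in_dups = sum(n for n in indices.values() if n > 1)
# Rows beyond the first occurrence of each position.
surplus_rows = len(keys) - len(indices)

print(dup_positions, rows_in_dups, surplus_rows)  # prints: 2 5 3
```

Which of the three is the "right" duplicate count depends on what the result is used for, so it is worth deciding that before comparing outputs of different scripts.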

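If you are only trying to count the duplicates, not remove or record them, you could simply tally the lines of the file as you go and compare that total with the length of the indices dictionary; the difference is the number of duplicated lines (instead of looping back over the dictionary and re-counting). This probably saves a little time, but it gives a different answer:

#!/usr/bin/env python
f = open('variants.txt', 'r')
indices = {}
total_len = 0
for line in f:
    total_len += 1
    var = line.split()
    index = "_".join(var[0:3])
    indices[index] = 1 + indices.get(index, 0)
f.close()
print "Number of duplicated lines:", total_len - len(indices.keys())

I'd be curious to see what the benchmark is for the code that doesn't include the has_key() test...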
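
One way to get such a benchmark is the standard timeit module. The sketch below is an illustration under stated assumptions rather than a measurement from this thread: the keys are synthetic, the function names are invented, and because dict.has_key() no longer exists in Python 3, the membership test is written as key in counts. Timings depend on the machine, the Python version, and how many keys repeat, so no expected numbers are shown.

```python
import timeit

# Synthetic keys (made up): 100,000 rows, 10,000 positions occur twice.
keys = ["chr1_%d_%d" % (i % 90000, i % 90000) for i in range(100000)]


def count_with_in(keys):
    counts = {}
    for key in keys:
        if key in counts:  # Python 3 spelling of counts.has_key(key)
            counts[key] += 1
        else:
            counts[key] = 1
    return counts


def count_with_try(keys):
    counts = {}
    for key in keys:
        try:
            counts[key] += 1
        except KeyError:
            counts[key] = 1
    return counts


def count_with_get(keys):
    counts = {}
    for key in keys:
        counts[key] = 1 + counts.get(key, 0)
    return counts


results = {}
for func in (count_with_in, count_with_try, count_with_get):
    # Best of 3 runs; each run counts all keys once.
    results[func.__name__] = min(
        timeit.repeat(lambda: func(keys), number=1, repeat=3)
    )

for name, seconds in results.items():
    print("%-16s %.4f s" % (name, seconds))
```

If the three variants come out close to each other, the counting step is probably not the bottleneck. Note that the code in this answer also replaces shlex.split(row) with line.split(), which may be worth timing separately in the same way.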

Need help improving the speed of my Python code for removing duplicate columns
