Question

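I am currently working on a script that analyzes GC skew. Unfortunately, as the length of the input string increases, the run time becomes so long that I can't compute my answer.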

def SkewGC(file):
    countG = 0
    countC = 0
    diffGtoC = ""
    # first, we need to find number of G's.
    # the idea is, if G appears, we add it to the count.
    # We'll just do the same to each one.
    for pos in range(0,len(file)):
        if file[pos] == "G":
            countG = countG+1
        if file[pos] == "C":
            countC = countC+1
        diffGtoC = diffGtoC + str(countG-countC) + ","
    return diffGtoC.split(",")

SkewGCArray = SkewGC(data)
# This because I included extra "," at the end...
SkewGCArray = [int(i) for i in SkewGCArray[:len(SkewGCArray)-1]]

def min_locator(file):
    min_indices = ""
    for pos in range(0,len(file)):
        if file[pos] == min(file):
            min_indices = min_indices + str(pos) + " "
    return min_indices

print min_locator(SkewGCArray)

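Basically, this script counts the number of G's and C's (the nucleotides in DNA), records the difference at each position, and then tries to find the indices of the minimum. It works for short inputs (the input string), but once the length gets large, even something like 90000+, the script runs but cannot produce the answer in a reasonable amount of time (~4-5 minutes).

Can anyone point out what I could do to speed this up? I have considered taking the difference (diffGtoC), setting it as the minimum, and then recomputing each difference until I see something different, replacing the minimum as I go.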

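But my concern with this approach is finding and keeping the indices of the minimum. Say I have an array with these values:

[-4, -2, -5, -6, -5, -6]

I can see how updating the minimum as I go (-4 to -5 and then to -6) would be faster in terms of run time, but how would I keep track of the positions of the -6's? Not sure if this makes complete sense.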

Answer 1

A few suggestions to improve the performance of your code:

diffGtoC = diffGtoC + str(countG-countC) + ","
    return diffGtoC.split(",")

is effectively equivalent to:

diffGtoC = list()
diffGtoC.append(countG - countC)

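Strings are immutable in Python, so you are building a new string at every position, which is inefficient. Using a list also saves you the str and int conversions you are doing, as well as the truncation of the list. You could also use pop() to remove the last item of a list instead of generating a new one.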
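To make the list suggestion concrete, here is a minimal sketch of the skew computation that appends integers to a list instead of concatenating strings. The name skew_gc is just an illustrative choice, and the `elif` assumes a position can only hold one base at a time.

```python
def skew_gc(sequence):
    """Return the running (G count - C count) after each position as a list of ints."""
    count_g = 0
    count_c = 0
    skew = []
    for base in sequence:
        if base == "G":
            count_g += 1
        elif base == "C":
            count_c += 1
        # Appending an int avoids rebuilding a string and the later
        # split / int conversion / truncation steps.
        skew.append(count_g - count_c)
    return skew
```

The result is already a list of integers, so the extra line that strips the trailing "," and converts back to int is no longer needed.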

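A very simple alternative is to search for the minimum and store only the minimum and its position. Then iterate starting from that position and see whether the minimum occurs again; if it does, add that position alongside the first one. Less data manipulation saves both time and memory.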
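The steps above could be sketched roughly as follows. The name min_positions and the choice to return a (minimum, positions) tuple are assumptions made for illustration. Note that the minimum is computed once up front; in the question's min_locator, min(file) is re-evaluated on every loop iteration, which by itself makes that function quadratic in the input length.

```python
def min_positions(values):
    """Return (minimum, list of indices where the minimum occurs)."""
    if not values:
        return None, []
    minimum = min(values)          # one pass, computed a single time
    first = values.index(minimum)  # position of the first occurrence
    positions = [first]
    # Only positions after the first occurrence can hold further minima.
    for pos in range(first + 1, len(values)):
        if values[pos] == minimum:
            positions.append(pos)
    return minimum, positions
```

For the example array from the question, min_positions([-4, -2, -5, -6, -5, -6]) returns (-6, [3, 5]).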

Run time for GC skew is too long