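I start at line 1 and find all lines at Levenshtein distance 1 from it, adding all of them to a list. So in the example below, the first three lines are all added to the list and the fourth is not. I then iterate over that list and compare the characters found at positions 1, 2, 3, ... of each string. Finally, I output a string that is a kind of average: the first character is h because that is the most common letter at character position 1, the second character is e because that is the most common second character, and so on.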
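The per-position "average" described above can be sketched as follows. This is only an illustration: the function name `consensus` and the input strings are hypothetical, and it assumes all lines in the list have the same length.

```python
from collections import Counter

def consensus(lines):
    """Return the most common character at each position of equal-length lines."""
    # zip(*lines) yields one tuple of characters per column
    return ''.join(Counter(column).most_common(1)[0][0] for column in zip(*lines))

# hypothetical input, not the example from the question
close_lines = ["hello", "hellp", "jello"]
print(consensus(close_lines))  # hello
```

`Counter.most_common(1)` breaks ties in favor of the character seen first, so a real implementation may want an explicit tie rule.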







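On Edit: I am adding a second version that relies less on a bioinformatics point of view, but I am keeping the first version, since it might be better for strictly bioinformatics purposes.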

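Here is a possible answer for the alphabet "ATGC". I base it on the idea of DNA fingerprinting. I have a large collection of long strands. I randomly pick a subset of positions to serve as the "fingerprint". In my test data the strands have length 10,000 and I randomly pick 100 positions. I form the fingerprint of the first line from its characters at those positions. Then, from that fingerprint, I create a set of "smudged fingerprints", consisting of the original fingerprint together with every fingerprint at Hamming distance 1 from it. I then iterate over the rest of the list. I first fingerprint each strand and check whether that fingerprint is in the smudge set. If it is, I check whether the whole strand is within Hamming distance 1 of the first strand. If so, I update a dictionary that I keep for each position, keyed by character. Finally, I create the summary string. With 10,000 strands, each of length 10,000, it takes only a few seconds to create those strings and compute the summary string: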

import random

def randDNA(n):
    return ''.join(random.choice("ATGC") for i in range(n))

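def mutate(strand,times):
    #NOTE: this overwrites the first `times` positions (index i), not random ones.
    #The sample output below depends on that: with mutations scattered over
    #random positions, two strands would almost never be within Hamming distance 1.
    nucleotides = list(strand)
    for i in range(times):
        nucleotides[i] = random.choice("ATGC")
    return ''.join(nucleotides)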

def HammingClose(s,t):
    #assumes s and t are strings of the same length
    #returns True if s,t are Hamming distance at most 1
    #otherwise returns False

    clashes = 0
    for x,y in zip(s,t):
        if x != y:
            clashes += 1
            if clashes > 1: return False
    return True

def takeFingerPrint(strand,places):
    return ''.join(strand[i] for i in places)

def smudges(fp):
    s = set([fp])
    for i,c in enumerate(fp):
        s.update([fp[:i] + d + fp[(i+1):] for d in "ATGC" if d != c])
    return s

def summary(strandList, fpSize = 10):
    n = len(refStrand) #fingerprint positions index into a strand, not into the list
    refStrand = strandList[0]
    ncount = 1 #count of refstrand + Hamming-neighbors
    position = {}

    for i,c in enumerate(refStrand):
        position[i] = dict.fromkeys("ATGC",0)
        position[i][c] += 1

    #take a random fingerprint of size fpSize
    places = random.sample(range(n),fpSize)
    refPrint = takeFingerPrint(refStrand,places)
    s = smudges(refPrint)

    for strand in strandList[1:]:
        fp = takeFingerPrint(strand,places)
        if fp in s: #maybe a hit!
            if HammingClose(strand,refStrand):
                ncount += 1
                for i,c in enumerate(strand):
                    position[i][c] += 1

    #assemble summary strand

    mode = []
    for i in range(len(refStrand)):
        c = "A"
        m = position[i]["A"]
        for x in "TGC":
            if position[i][x] > m:
                c = x
                m = position[i][x]
        mode.append(c) #most common character at position i
    return ncount, ''.join(mode)


#example problem

strand = randDNA(10000)
strandList = [mutate(strand,5) for i in range(10000)]

n,s = summary(strandList,100)
print(n, "close strands found")
print("First 30 positions in summary strand are ", s[:30])


158 close strands found
First 30 positions in summary strand are  CAAGGTCGTCGCCCATAAACGTTTTTCCCA
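The fingerprint test works as a safe pre-filter because a strand within Hamming distance 1 of the reference can differ from it in at most one fingerprint position, so its fingerprint always lands in the smudge set. A minimal sketch of that check, using hypothetical helper names, a short strand, and an arbitrary seed:

```python
import random

def hamming(s, t):
    # number of positions at which two equal-length strings differ
    return sum(x != y for x, y in zip(s, t))

def fingerprint(strand, places):
    return ''.join(strand[i] for i in places)

def smudge_set(fp, alphabet="ATGC"):
    # the fingerprint itself plus every single-character substitution of it
    out = {fp}
    for i, c in enumerate(fp):
        out.update(fp[:i] + d + fp[i+1:] for d in alphabet if d != c)
    return out

random.seed(0)  # arbitrary seed, only for repeatability
ref = ''.join(random.choice("ATGC") for _ in range(200))
places = random.sample(range(len(ref)), 20)
smudged = smudge_set(fingerprint(ref, places))

# every strand within Hamming distance 1 of ref must pass the filter
for i in range(len(ref)):
    for d in "ATGC":
        neighbor = ref[:i] + d + ref[i+1:]
        assert hamming(neighbor, ref) <= 1
        assert fingerprint(neighbor, places) in smudged

print(len(smudged))  # 1 + 3 * 20 = 61
```

The filter can still let through strands that are far from the reference outside the fingerprint positions, which is why the full Hamming check is needed afterwards.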


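Second version. You can replace ALPHA in the code with whatever character set you are using. The fingerprint is now an initial slice. I form the set of all strings at Hamming distance exactly 1 from the initial slice of the first line. Then, while iterating, I check two cases: either the initial slice of the line equals the reference slice and the rest of the line is within Hamming distance 1 of the rest of the first line, or the initial slice is in the set of slices at Hamming distance 1, in which case the rest of the line must equal the rest of the first line. I am assuming that the Python interpreter can test strings for equality faster than I can do so in a loop. The resulting code seems to be about twice as fast as my initial code: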

import random

ALPHA = "ATGC" #replace with whatever character set you are using

def randDNA(n):
    return ''.join(random.choice(ALPHA) for i in range(n))

def mutate(strand,times):
    #NOTE: as in the first version, this overwrites the first `times`
    #positions (index i), not random ones
    nucleotides = list(strand)
    for i in range(times):
        nucleotides[i] = random.choice(ALPHA)
    return ''.join(nucleotides)

def HammingOne(s,t):
    #assumes s and t are strings of the same length
    #returns True if s,t are Hamming distance 1
    #otherwise returns False

    clashes = 0
    for x,y in zip(s,t):
        if x != y:
            clashes += 1
            if clashes > 1: return False
    return clashes == 1

def neighbors(s):
    n = set()
    for i,c in enumerate(s):
        n.update([s[:i] + d + s[(i+1):] for d in ALPHA if d != c])
    return n

def summary(sList, fpSize = 10):
    n = len(sList)
    refString = sList[0]
    ncount = 0 #count of Hamming-neighbors
    position = {}

    for i,c in enumerate(refString):
        position[i] = dict.fromkeys(ALPHA,0)
        position[i][c] += 1

    refPrint = refString[:fpSize]
    s = neighbors(refPrint)
    refTail = refString[fpSize:]

    for strand in sList[1:]:
        fp = strand[:fpSize]
        #fp in s is a set lookup equivalent to HammingOne(fp,refPrint)
        tail = strand[fpSize:]
        if (fp == refPrint and (tail == refTail or HammingOne(tail,refTail))) or \
           (fp in s and tail == refTail):
            ncount += 1
            for i,c in enumerate(strand):
                position[i][c] += 1

    #assemble summary strand

    mode = []
    for i in range(len(refString)):
        c = ALPHA[0]
        m = position[i][c]
        for x in ALPHA[1:]:
            if position[i][x] > m:
                c = x
                m = position[i][x]
        mode.append(c) #most common character at position i
    return ncount, ''.join(mode)


#example problem

strand = randDNA(10000)
sList = [mutate(strand,5) for i in range(10000)]

n,s = summary(sList,100)
print(n, "close strands found")
print("First 30 positions in summary strand are ", s[:30])
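The second version relies on membership in the precomputed neighbor set being equivalent to a direct "Hamming distance exactly 1" test. The sketch below checks that exhaustively for one short slice; the names `ALPHABET`, `hamming_one`, and the reference slice are hypothetical, and it assumes every string uses only characters from the alphabet.

```python
from itertools import product

ALPHABET = "ATGC"  # assumed alphabet; any character set works the same way

def hamming_one(s, t):
    # True when equal-length strings differ in exactly one position
    return sum(x != y for x, y in zip(s, t)) == 1

def neighbors(s):
    # all strings obtained from s by substituting exactly one character
    return {s[:i] + d + s[i+1:] for i, c in enumerate(s) for d in ALPHABET if d != c}

ref = "GATTA"  # hypothetical reference slice
nbrs = neighbors(ref)

# compare the set lookup with the direct check on all 4**5 candidate strings
for chars in product(ALPHABET, repeat=len(ref)):
    candidate = ''.join(chars)
    assert (candidate in nbrs) == hamming_one(candidate, ref)

print(len(nbrs))  # 3 * 5 = 15
```

The set has (len(ALPHABET) - 1) * len(slice) entries, so building it once is cheap compared with looping over every strand character by character.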