Longest common sequence among many subsequences

Asked: 2017-03-03 09:30:55

Tags: python dictionary sequence

Fancy title :) I have a file containing the following:

>sequence_40
ABCDABDCABCDBACDBACDBACDBACDABDCDC
ACDCCDCABDCADCADBCACBDCABD
>sequence_41
DCBACDBACDADCDCDCABCDCACBDCBDACBDC
BCDBABABBABACDCDBCACDBACDBACDBACDC
BCDB
...

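Then I have a function that returns a dictionary (called dict) with the sequence names as keys and the strings (joined onto a single line) as the values. The sequences range from 40 to 59. I want to take the dictionary of sequences and return the longest common subsequence found in all of them. I managed to find some help on Stack Overflow and put together code that only compares the last two strings in the dictionary, not all of them :). Here is the code: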

def longest_common_sequence(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

for i in range(40,59):
    s1=str(dictionar['sequence_'+str(i)])
    s2=str(dictionar['sequence_'+str(i+1)])
longest_common_sequence(s1,s2)

How can I modify it to get the common subsequence across all the sequences in the dictionary? Thanks!

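3 Answers:

Answer 0: (score: 2)

Edit: As @lmcarreiro pointed out, there is a relevant difference between substrings (or subarrays/sublists) and subsequences. As I understand it, we are all talking about substrings here, so I will use that term in my answer.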
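To make the distinction concrete, here is a small sketch. The helper names `is_substring` and `is_subsequence` are made up for illustration and do not come from any answer in this thread. A substring has to be contiguous, while a subsequence only has to keep its characters in order.

```python
def is_substring(sub, s):
    # Contiguous match: the `in` operator on strings does exactly this.
    return sub in s

def is_subsequence(sub, s):
    # The characters of `sub` must appear in `s` in order, gaps allowed.
    # `ch in it` advances the iterator, so each search continues
    # from where the previous match stopped.
    it = iter(s)
    return all(ch in it for ch in sub)

print(is_substring('BCD', 'ABCDE'))    # True
print(is_substring('ACE', 'ABCDE'))    # False
print(is_subsequence('ACE', 'ABCDE'))  # True
print(is_subsequence('AEC', 'ABCDE'))  # False
```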

Guillaume's answer can be improved:

def eachPossibleSubstring(string):
  for size in range(len(string), 0, -1):
    for start in range(len(string) - size + 1):
      yield string[start:start+size]

def findLongestCommonSubstring(strings):
  shortestString = min(strings, key=len)
  for substring in eachPossibleSubstring(shortestString):
    if all(substring in string
        for string in strings if string != shortestString):
      return substring

print(findLongestCommonSubstring([
  'ABCDABDCABCDBACDBACDBACDBACDABDCDCACDCCDCABDCADCADBCACBDCABD',
  'DCBACDBACDADCDCDCABCDCACBDCBDACBDCBCDBABABBABACDCDBCACDBACDBACDBACDCBCDB',
]))

Prints:

ACDBACDBACDBACD

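This is faster because I search from longest to shortest and return the first match found.

The basic idea is this: take every possible substring of the shortest string (in order from longest to shortest) and check whether it can be found in all the other strings. If so, return it; otherwise try the next substring.

You need to understand generators. Try, e.g., this: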

for substring in eachPossibleSubstring('abcd'):
  print(substring)

print(list(eachPossibleSubstring('abcd')))
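To tie this back to the question, the same idea can be applied to a dictionary of sequences by passing in its values. The following is a self-contained Python 3 sketch. The names `each_possible_substring` and `longest_common_substring` and the sample dictionary are assumptions made for illustration, not the asker's real data.

```python
def each_possible_substring(string):
    # Yield substrings from longest to shortest.
    for size in range(len(string), 0, -1):
        for start in range(len(string) - size + 1):
            yield string[start:start + size]

def longest_common_substring(strings):
    strings = list(strings)  # accept dict views and other iterables
    if not strings:
        return ''
    shortest = min(strings, key=len)
    for candidate in each_possible_substring(shortest):
        if all(candidate in s for s in strings):
            return candidate
    return ''  # not even a single character is shared by all strings

# Hypothetical stand-in for the dictionary described in the question.
dictionar = {
    'sequence_40': 'ABCDABDCABCD',
    'sequence_41': 'DCABCDBA',
    'sequence_42': 'BBABCDD',
}
print(longest_common_substring(dictionar.values()))  # ABCD
```

Because the search runs from longest to shortest, the first candidate found in every string is also a longest one, so the function can return immediately.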

Answer 1: (score: 1)

I would first define a function that returns all possible subsequences of a given sequence:

from itertools import combinations
def subsequences(sequence):
    "returns all possible subsequences of a given sequence"
    # range(len + 1) lets stop reach the end, so slices can include the last item;
    # combinations() already guarantees start < stop
    for start, stop in combinations(range(len(sequence) + 1), 2):
        yield sequence[start:stop]

Then I would use another function to check whether a given subsequence is present in all the given sequences:

def is_common_subsequence(sub, sequences):
    "returns True if <sub> is a common subsequence in all <sequences>"
    return all(sub in sequence for sequence in sequences)

Then, using the two functions above, it is easy to get all the common subsequences in a given set of sequences:

def common_sequences(sequences):
    "return all subsequences common in sequences"
    shortest_seq = min(sequences, key=len)
    return set(subsequence for subsequence in subsequences(shortest_seq) \
       if is_common_subsequence(subsequence, sequences))

...and extract the longest one:

def longuest_common_subsequence(sequences):
    "returns the longest common subsequence in sequences"
    return max(common_sequences(sequences), key=len)

Result:

sequences = {
    41: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    42: '123ABCDEFGHIJKLMNOPQRSTUVW',
    43: '123456ABCDEFGHIJKLMNOPQRST'
}

sequences2 = {
    0: 'ABCDEFGHIJ',
    1: 'DHSABCDFKDDSA',
    2: 'SGABCEIDEFJRNF'
}

print(longuest_common_subsequence(sequences.values()))
>>> ABCDEFGHIJKLMNOPQRST

print(longuest_common_subsequence(sequences2.values()))
>>> ABC
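One edge case worth keeping in mind: if the sequences share no character at all, `common_sequences` returns an empty set, and `max` raises `ValueError` on an empty iterable. A possible variant is sketched below, assuming Python 3.4 or later, where `max` accepts a `default` argument. The names `substrings` and `longest_common_or_empty` are made up for illustration.

```python
from itertools import combinations

def substrings(sequence):
    # All non-empty contiguous slices of `sequence`.
    for start, stop in combinations(range(len(sequence) + 1), 2):
        yield sequence[start:stop]

def longest_common_or_empty(sequences):
    sequences = list(sequences)
    if not sequences:
        return ''
    shortest = min(sequences, key=len)
    common = {sub for sub in substrings(shortest)
              if all(sub in seq for seq in sequences)}
    # `default` avoids ValueError when nothing is shared.
    return max(common, key=len, default='')

print(longest_common_or_empty(['ABCDEFGHIJ', 'DHSABCDFKDDSA', 'SGABCEIDEFJRNF']))  # ABC
print(repr(longest_common_or_empty(['ABC', 'XYZ'])))  # ''
```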

Answer 2: (score: 0)

Here is an approach that works. First, let's define a function that returns the longest common substring between two strings:

def longest_substring(s1, s2):
    t = [[0]*(1+len(s2)) for i in range(1+len(s1))]
    l, xl = 0, 0
    for x in range(1,1+len(s1)):
        for y in range(1,1+len(s2)):
            if s1[x-1] == s2[y-1]:
                t[x][y] = t[x-1][y-1] + 1
                if t[x][y]>l:
                    l = t[x][y]
                    xl  = x
            else:
                t[x][y] = 0
    return s1[xl-l: xl]

Now I will create a random dict of sequences for the example:

import random
import string

d = {i : ''.join(random.choice(string.ascii_uppercase) for _ in range(50)) for i in range(10)}

print(d)

{0: 'ASCUCEVJNIGWVMWMBBQQBZYBBNGQAJRYXACGFEIFWHMBCNYRGL', 1: 'HKUKZOJJUCRTSBLNZXCIBARLPNAPAABRBZEVGVILJAFCGWGQVV', 2: 'MMHCYPKECRJFEWTGYITMHZSNHAFEZVFYDAVILRYRKIDDBEFRVX', 3: 'DGBULRFJINFZEELDASRFBIRSADWMRAYMGCDAOJDKQIMXIRLTEI', 4: 'VDUFWZSXLRGOIMAHOAMZAIWDPTHDVDXUACRBASJMCUHREDORRH', 5: 'RFGAVHOWNKRZMYMSFSSNUGCKEWUNVETCDWJXSPBJHKSTPFNSJO', 6: 'HFMLMHCFSOEXBXWFAROIRGJNPRTKRWCEPLFOKGMXNUPCPWREWX', 7: 'CNPGSHGVIRLDXAADXUVWCTJCXUHQLALBUOJMXQBKXWHKGSJHEH', 8: 'UWDXXTRCFNCBUBEYGYTDWTPLNTRHYQWKTHPRVCBAWIMNGHULDC', 9: 'OOCJRXBZKJIGHZEJOOIKWKMQKIEQVPEDTFPJQAUQKJQVLOMGJB'}

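Finally, we take the longest common substring over all pairs of sequences (this is the longest substring shared by some pair, which is not necessarily present in every sequence):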

import itertools
max([longest_substring(i,j) for i,j in itertools.combinations(d.values(), 2)], key=len)

Output:

'VIL'
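Keep in mind that taking the maximum over all pairs gives the longest substring shared by some pair of sequences, which may not occur in every sequence. The sketch below uses three made-up strings to show the two results differing. `common_to_all` is a hypothetical brute-force helper added only for comparison.

```python
import itertools

def longest_substring(s1, s2):
    # Same dynamic-programming idea as in this answer: t[x][y] is the length
    # of the common substring ending at s1[x-1] and s2[y-1].
    t = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    best, end = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                t[x][y] = t[x - 1][y - 1] + 1
                if t[x][y] > best:
                    best, end = t[x][y], x
    return s1[end - best:end]

def common_to_all(strings):
    # Brute force: try substrings of the shortest string, longest first.
    shortest = min(strings, key=len)
    for size in range(len(shortest), 0, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in s for s in strings):
                return candidate
    return ''

strings = ['XXABCDEXX', 'YYABCDEYY', 'ZZABZZ']
pairwise = max((longest_substring(a, b)
                for a, b in itertools.combinations(strings, 2)), key=len)
print(pairwise)                # ABCDE
print(common_to_all(strings))  # AB
```

Here the first two strings share `ABCDE`, but the third contains only `AB`, so the pairwise maximum overstates what all three have in common.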