仅使用掉期的最小成本字符串对齐

时间:2014-03-04 18:29:08

标签: string algorithm alignment permutation dynamic-programming

我一直在为自己的教化而研究这个问题,我似乎无法破解它:

给定两个字符串,这些字符串是相同字符超集的排列,找到对齐它们所需的最小交换次数。字符串是循环的(即您可以交换第一个和最后一个字符),最长可达200​​0个字符,并且可以对任一字符串执行交换。

我尝试了几种不同的方法。一种方法是对两个字符串进行冒泡排序并缓存其所有中间状态,然后找到两者共有的中间状态,以最小化每个字符串达到该中间状态的步数之和。这没有产生正确的数字(我有一些正确答案的例子),它显然不适用于非常大的字符串。我现在有点难过了。我想我可能需要使用Demerau-Levenshtein距离的修改版本,但是当只允许一种类型的操作时,如何选择最小成本操作?

有人能指出我正确的方向吗?

1 个答案:

答案 0 :(得分:1)

考虑基于A *算法的动态编程解决方案。我们的想法是将此问题建模为图形,并找到目标节点的最佳路径。

首先进行一些基本的澄清。图中的每个节点都是一个字符串,当且仅当它们因相邻字符的一次(循环)交换而不同时,两个节点才是邻居。起始节点是第一个字符串,目标节点是第二个字符串。找到从开始到目标的最短路径,可以为您提供最佳解决方案。

现在讨论。 A *实现非常通用,这里唯一感兴趣的是启发式函数的选择。我已经实现了两个启发式,一个已知是可接受的,因此A *保证找到最佳路径。我怀疑我实施的另一种启发式应该是可以接受的。我所做的所有经验性尝试都表明它是不可接受的,这种尝试都失败了,这使我对它可以接受的信心增强了。第一个启发式算法非常慢,它似乎扩展了具有字符串大小的指数数量的节点。第二种启发式似乎扩展了一个更合理的数字(可能是字符串大小的多项式。)

然而,这是一个非常有趣的问题,一旦我有更多的时间,我可能会尝试在理论上证明第二种启发式是合理的。

即使是better_heuristic似乎也扩展了字符串大小的指数节点,这对于这两种启发式方法都不是好兆头。

下面的代码。请随意提出自己的启发式方法并试用它们。这些是我首先想到的。

#!/usr/bin/python

import random
import string
import copy
from Queue import PriorityQueue as priority_queue

# SWAP AND COPY UTILITY FUNCTION
def swap(c,i,j):
    if j >= len(c):
        j = j % len(c)
    c = list(c)
    c[i], c[j] = c[j], c[i]
    return ''.join(c)

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE MAXIMUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET 
# TO ITS GOAL POSITION.
# NOTE: THIS HEURISTIC IS GUARANTEED TO BE AN ADMISSIBLE HEURISTIC, THEREFORE
# A* WILL ALWAYS FIND THE OPTIMAL SOLUTION USING THIS HEURISITC. IT IS HOWEVER
# NOT A STRONG HEURISTIC
def terrible_heuristic(x, y):
    lut = {}
    for i in range(len(y)):
        c = y[i]
        if lut.has_key(c):
            lut[c].append(i)
        else:
            lut[c] = [i]
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min([ min((i-cpos[j])%len(x),(cpos[j]-i)%len(x)) for j in range(len(cpos)) ]))
    return max(longest_swaps)-1

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE SUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET 
# TO ITS GOAL POSITION DIVIDED BY SOME CONSTANT. THE LOWER THE CONSTANT
# THE FASTER A* COMPUTES THE SOLUTION, 
# NOTE: THIS HEURISTIC IS CURRENTLY NOT THEORETICALLY JUSTIFIED. A PROOF
# SHOULD BE FORMULATED AND THEORETICAL WORK SHOULD BE DONE TO DISCOVER
# WHAT IS THE MINIMAL CONSTANT ALLOWED FOR THIS HEURISTIC TO BE ADMISSIBLE.
def better_heuristic(x, y):
    lut = {}
    for i in range(len(y)):
        c = y[i]
        if lut.has_key(c):
            lut[c].append(i)
        else:
            lut[c] = [i]
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min([ min((i-cpos[j])%len(x),(cpos[j]-i)%len(x)) for j in range(len(cpos)) ]))
    d = 0.
    for x in longest_swaps:
        d += x-1
    # THE CONSTANT TO DIVIDE THE SUM OF THE MINIMUM SWAPS, 1.5 SEEMS TO BE THE LOWEST
    # ONE CAN CHOOSE BEFORE A* NO LONGER RETURNS CORRECT SOLUTIONS
    constant = 1.5
    d /= constant
    return d


# GET ALL STRINGS ONE CAN FORM BY SWAPPING TWO CHARACTERS ONLY
def ngbs(x):
    n = set() # WE USE SET FOR THE PATHOLOGICAL CASE OF WHEN len(x) = 2
    for i in xrange(len(x)):
        n.add(swap(x,i,i+1))
    return n

# CONVENIENCE WRAPPER AROUND PYTHON's priority_queue
class sane_priority_queue(priority_queue):
    def __init__(self):
        priority_queue.__init__(self)
        self.counter = 0

    def put(self, item, priority):
        priority_queue.put(self, (priority, self.counter, item))
        self.counter += 1

    def get(self, *args, **kwargs):
        _, _, item = priority_queue.get(self, *args, **kwargs)
        return item

# AN A* IMPLEMENTATION THAT USES EXPANDING DATA-TYPES BECAUSE OUR FULL SEARCH
# SPACE COULD BE MASSIVE. HEURISTIC FUNCTION CAN BE SPECIFIED AT RUNTIME.
def a_star(x0,goal,heuristic_func=terrible_heuristic):
    visited = set()
    frontier_visited = set()
    frontier = sane_priority_queue()
    distances = {}
    predecessors = {}

    predecessors[x0] = x0
    distances[x0] = 0
    frontier.put(x0,heuristic_func(x0,goal))

    while not frontier.empty():
        current = frontier.get()
        if current == goal:
            print "goal found, distance: ", distances[current], ' nodes explored: ', len(visited)
            return predecessors, distances
        visited.add(current)

        for n in ngbs(current):
            if n in visited:
                continue

            tentative_distance = distances[current] + 1
            if not distances.has_key(n) or tentative_distance < distances[n]:
                predecessors[n] = current
                distances[n] = tentative_distance
                heuristic_distance = tentative_distance + heuristic_func(n,goal)
                frontier.put(n,heuristic_distance)



# SIZE OF STRINGS TO WORK WITH
n = 10

# GENERATE RANDOM STRING
str1 = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(n)])

# RANDOMLY SHUFFLE
str2 = copy.deepcopy(str1)
l = list(str2)
random.shuffle(l)
str2 = ''.join(l)

# PRINT THE STRING FOR VISUAL DISPLAY
print 'str1', str1
print 'str2', str2

# RUN A* WITH THE TERRIBLE HEURISITIC FOR A KNOWN OPTIMAL SOLUTION
print 'a_star with terrible_heuristic:'
predecessors, distances = a_star(str1,str2,terrible_heuristic)
current = str2
while current != predecessors[current]:
    print current
    current = predecessors[current]
print str1

# RUN A* WITH A BETTER HEURISTIC THAT IS NOT JUSTIFIED THEORETICALLY
# TO BE ADMISSIBLE. THE PURPORSE IS TO COMPARE AGAINST THE KNOWN
# ADMISSIBLE HEURISTIC TO SEE EMPIRICALLY WHAT THE LOWEST WE CAN
# GO IS.
print 'a_star with better_heuristic:'
predecessors, distances = a_star(str1,str2,better_heuristic)
current = str2
while current != predecessors[current]:
    print current
    current = predecessors[current]
print str1