Question

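I have a set of bit strings: {'0011', '1100', '1110'} (all bit strings within a set have the same length).

I want to quickly find the bit string of the same length that has the smallest max-similarity to the set. The max-similarity can be computed like this: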

def max_similarity(bitstring, set):
    max = 0
    for item in set:
        temp = 0
        for i in range(len(bitstring)):
            if bitstring[i] == item[i]:
                temp += 1
        if temp > max:
            max = temp
    return max

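I know that I can iterate through all possible bit strings of that length, compute the max-similarity for each one, and keep the one with the smallest value. But that solves the problem in O(2^n). I would like to know if anyone sees a quicker alternative.

I've been playing around with Python's XOR: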
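For reference, the exhaustive search described above might look like the sketch below. It is only an O(2^n) baseline for checking faster ideas against; `brute_force` is a name made up for this illustration, and `max_similarity` is rewritten here so that it does not shadow the built-ins `set` and `max`.

```python
from itertools import product


def max_similarity(bitstring, bitstrings):
    """Largest number of positions in which `bitstring` agrees with any member."""
    return max(sum(a == b for a, b in zip(bitstring, item)) for item in bitstrings)


def brute_force(bitstrings):
    """Return (lowest max-similarity, sorted candidates that achieve it)."""
    size = len(next(iter(bitstrings)))
    # Score every one of the 2**size candidate bit strings
    scores = {}
    for bits in product("01", repeat=size):
        candidate = "".join(bits)
        scores[candidate] = max_similarity(candidate, bitstrings)
    best = min(scores.values())
    return best, sorted(c for c, s in scores.items() if s == best)


best, candidates = brute_force({'0011', '1100', '1110'})
print(best, candidates)
```

For this set the lowest achievable max-similarity is 2, and '0101' is one of the candidates that reaches it.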

def int2bin(integer, digits):
    if integer >= 0:
        return bin(integer)[2:].zfill(digits)
    else:
        return bin(2**digits + integer)[2:]


def XOR(bitset):  
    intset = [int('{}'.format(bitstring), 2) for bitstring in bitset]

    # Read the length without removing an element from the caller's set
    digits = len(next(iter(bitset)))

    if len(intset) == 1:
        return int2bin(~intset.pop(), digits)        
    else:
        curr = 0    
        while intset:
            curr = curr ^ intset.pop()

        return int2bin(curr, digits)

if __name__ == '__main__':
    bitset1 = {'0011', '1100', '1110'}
    bitset2 = {'01001', '11100', '10111'}
    print(XOR(bitset1))
    print(XOR(bitset2))

>>> python test.py
0001
00010

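(The function int2bin was borrowed from here)

But I've found that it works for some inputs and not for others. In the test above it finds the correct solution for bitset2, but not for bitset1. Is there any solution below O(2^n)?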

Answer 1

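This question is partly algorithmic (what is the best algorithm to reach the solution) and partly a Python question (which parts of Python to use to implement that algorithm efficiently).

On the algorithm: you define the max distance from a bit string to a set of (same size) bit strings as the largest number of bits in which the target bit string differs from any of the strings in the set. The goal of the algorithm is to find a new bit string, with the same length as the strings in the set, that has the lowest max distance.

It is assumed that all the starting bit strings are different (since they are given as a set, not a list). The distance you are computing is known as the Hamming distance, so you are looking for a new bit string with the minimum Hamming distance to a set of starting strings.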
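As a small illustration of the terms used here, the Hamming distance and the similarity count from the question can be written as below. The helper names `hamming` and `similarity` are chosen for this sketch only. Note that for strings of equal length, similarity is simply the length minus the Hamming distance.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity(a, b):
    """Number of positions at which two equal-length bit strings agree."""
    return len(a) - hamming(a, b)


print(hamming('0011', '1100'))     # 4: the strings are complements
print(similarity('0011', '1110'))  # 1: only the third position agrees
```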

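Generating all possible bit strings of the right length and calculating the max distance to each starting string is brute-forcing the problem, which can be optimised(*) using backtracking (discarding a candidate bit string as soon as it exceeds the lowest max found so far).

(*: for people looking to correct my spelling, please consider the fact that I'm using UK English, not US English - feel free to propose improvements with that in mind)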
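A minimal sketch of that brute-force search with early discarding is shown below. It is not a full backtracking search over partial strings, only the simpler "stop scoring a candidate once it is already worse than the best so far" idea, and `lowest_max_distance` is a name invented for this example.

```python
from itertools import product


def lowest_max_distance(bitstrings):
    """Exhaustive search for the candidates with the lowest max Hamming distance."""
    strings = list(bitstrings)
    size = len(strings[0])
    best, winners = size + 1, []  # size + 1 is worse than any real result
    for bits in product("01", repeat=size):
        candidate = "".join(bits)
        worst = 0
        for s in strings:
            worst = max(worst, sum(a != b for a, b in zip(candidate, s)))
            if worst > best:
                break  # already worse than the best so far: discard early
        else:
            # Loop finished without a break, so worst <= best
            if worst < best:
                best, winners = worst, [candidate]
            else:
                winners.append(candidate)
    return best, set(winners)


print(lowest_max_distance({'1000', '0001', '0011'}))
```

This still visits all 2^n candidates, so it only trims the constant factor; it is mainly useful as a reference to check faster approaches against.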

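However, the problem can also be viewed as follows.

For bit strings of length 1, the entire space of strings has just two options, {'0', '1'}. This can be visualised as '0' and '1' sitting at either end of a line segment of length 1, a distance of 1 away from each other.

For bit strings of length 2, the entire space of strings has 4 options, namely the bit representations of 0 through 3: {'00', '01', '10', '11'}. 0 is distance 1 away from both 1 and 2, each of which is distance 1 away from 3. When visualised, they form the four corners of a square, none of them more than 2 steps from any other.

For bit strings of length 3, the entire space has 8 options, namely the bit representations of 0 through 7. When visualised, they form the 8 corners of a cube, none of them more than 3 steps from any other.

This pattern continues (into a 4D hypercube, 5D, etc.), and finding the answer to the problem effectively translates into: given a set of corners on one of these graphs, find the point with the lowest maximum distance to any of them.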
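To make the graph picture concrete, the sketch below walks the hypercube with a breadth-first search, treating each bit string as an integer and each single-bit flip as one edge. The function name `steps_from` is hypothetical; the point is only to show that the number of steps along edges equals the Hamming distance.

```python
from collections import deque


def steps_from(start, size):
    """Breadth-first search over the `size`-dimensional hypercube from corner `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for bit in range(size):
            neighbour = node ^ (1 << bit)  # flipping one bit walks along one edge
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


dist = steps_from(0b001, 3)
print(dist[0b110])  # 3: the opposite corner of the cube
```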

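An algorithm to find such a point, given a graph like that, would be to:

1. Start with a list of sets, each containing one of the starting points by itself; if there is only one point, that is the trivial answer.
2. Set the current distance to 1.
3. For each of the sets, add to it every point that is only one step away from a point already in the set.
4. Intersect all of the resulting sets; if the intersection is not empty, it contains all of the points that are the current distance (or less) away from every starting point; otherwise, increase the current distance by 1 and go to step 3.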

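This could probably be optimised further (for long bit strings) by keeping track of the points already visited as they are added to the sets, to avoid adding the same points over and over, which quickly slows down the given algorithm. That is, instead of turning {'001'} into {'001', '101', '011', '000'}, you could go from [{'001'}] to [{'001'}, {'101', '011', '000'}] - the union of the sets still gives you all of the points reachable in 1 step or fewer, but the next step in the series is easier to compute: you find all points that are one step farther out while excluding points in the direction you came from.

Finding the points one step out is actually quite simple if you turn the strings into the numbers they represent and compute the bitwise exclusive or of each number with all of the single-'1'-bit numbers of the same bit length. For example, to find all points one step away from '001', you can XOR 1 with 4, 2 and 1, yielding {5, 3, 0}, which matches the correct points.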
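The XOR trick just described can be tried in isolation. This is a minimal sketch, with `one_step_away` being a helper name made up for the illustration:

```python
def one_step_away(bitstring):
    """Integers whose bit pattern differs from `bitstring` in exactly one position."""
    size = len(bitstring)
    n = int(bitstring, 2)
    # XOR with each power of two below 2**size flips exactly one bit
    return {n ^ (1 << i) for i in range(size)}


neighbours = one_step_away('001')
print(sorted(neighbours))                         # [0, 3, 5]
print(sorted(f"{m:03b}" for m in neighbours))     # ['000', '011', '101']
```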

Putting all of that together in a tight bit of Python (without the optimisation for longer strings):

def closest(strings):
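    if len(strings) == 1:
        # A single string is its own closest point, at distance 0;
        # return the same (distance, set) shape as the general case
        return 0, set(strings)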

    size = len(next(iter(strings)))
    points = [{int(s, 2)} for s in strings]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        points = [{n ^ p for p in powers for n in nums} | nums for nums in points]
        intersection = set.intersection(*points)
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}


print(closest({'1000', '0001', '0011'}))

Note that closest returns the actual distance and all of the optimal answers, not just one. Output:

(2, {'0000', '0010', '1001', '0001', '1011'})

Adding the discussed optimisation to closest:

def closest_optimised(strings):
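    if len(strings) == 1:
        # A single string is its own closest point, at distance 0;
        # return the same (distance, set) shape as the general case
        return 0, set(strings)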

    size = len(next(iter(strings)))
    points = [({int(s, 2)}, {int(s, 2)}) for s in strings]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        new_points = [{n ^ p for p in powers for n in rp} - ap for ap, rp in points]
        points = [(ap | np, np) for (ap, _), np in zip(points, new_points)]
        intersection = set.intersection(*[ap for ap, _ in points])
        if len(intersection) > 0:
            return d, {f"{n:b}".zfill(size) for n in intersection}

Note that running these through a profiler shows the optimised code taking about half the time on average for these settings:

from random import randint

s = 10
x = 500
numbers = [randint(0, 2**s-1) for _ in range(x)]
number_strings = {f"{n:b}".zfill(s) for n in numbers}
print(number_strings)
print(closest_optimised(number_strings))
print(closest(number_strings))

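Edit: Out of curiosity, I ran my example against the original code given in the question and found that it frequently returns results that are far from optimal. I haven't tried to find out why, but I figured it was worth mentioning.

Someone pointed out that the OP may actually want the point with the maximum Hamming distance to all of the provided bit strings. With a similar approach: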

def farthest(strings):
    s = next(iter(strings))
    size = len(s)
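    if len(strings) == 1:
        # The complement differs in every position, so it is `size` steps away;
        # return the same (distance, set) shape as the general case
        return size, {''.join('0' if c == '1' else '1' for c in s)}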

    all_visited = {int(s, 2) for s in strings}
    visited = [set(), all_visited]
    powers = {1 << n for n in range(size)}

    d = 0
    while True:
        d += 1
        visited.append({n ^ p for p in powers for n in visited[-1]} - all_visited)
        all_visited = all_visited | visited[-1]
        if len(all_visited) == 2**size:
            return d, {f"{n:b}".zfill(size) for n in visited[-1]}

Answer 2

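Here's an algorithm with cost O(n * b), where n is the size of the set and b is the fixed bit length.

The intuition behind this algorithm is to check which bit value (0 or 1) is in the majority at each bit index and to score accordingly.

A higher score means that a given bit string went against the majority at more bit positions. I haven't handled ties, though.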

import operator

def max_hamming(bitstrings):
    n_bits = len(bitstrings[0])
    # Track bit set/unset for each bit position
    scores = {
        n: {'0': [], '1': []} for n in range(n_bits)
    }
    # Increment on each bit position if not with the majority
    total = {b: 0 for b in bitstrings}

    # O(b * n)
    for n in range(n_bits):
        n_set = 0
        for b in bitstrings:
            is_set = b[n]
            scores[n][is_set].append(b)
            # b[n] is the character '0' or '1'; both are truthy, so compare explicitly
            if is_set == '1':
                n_set += 1

        # If majority have this bit set, give a point to those with unset or vice versa
        outliers = scores[n]['0'] if n_set > len(bitstrings) / 2 else scores[n]['1']
        for s in outliers:
            total[s] += 1

    return max(total.items(), key=operator.itemgetter(1))[0]

Also note that I'm passing it a list instead of a set, because Python sets have no deterministic order.

Usage:

bitset1 = [
    '0011',
    '1100',
    '1110'
]
bitset2 = [
    '01001',
    '11100',
    '10111'
]
print(max_hamming(bitset1))
print(max_hamming(bitset2))

Answer 3

Can I use numpy, or is this supposed to be an algorithm? Let's assume everything is a bitstring, like you have it.

import numpy as np

def bitstring2np(bitstring):
    """
    Convert a bitstring to np.array
    i.e. '0011' to np.array([0, 0, 1, 1])
    """
    return np.array([int(bit) for bit in bitstring], dtype=int)

def unlike(bitset):
    """
    Gets the most 'unlike' string between a bitset.
    Accomplishes this by creating a 2D array from the bitsets,
    figuring out the number of 1s in a column, and if that
    number of 1s is >=50%, then gives it a 0 in that place, otherwise
    gives it a 1.
    """
    bset = list(bitset)
    # Create an empty 2D array to store the bitsets into
    arr = np.empty((len(bset), len(bset[0])), dtype=int)
    for idx in range(len(bset)):
        # Store that bitset into the row of our array
        arr[idx,:] = bitstring2np(bset[idx])

    # Count the number of 1's in each column
    nonzero = np.count_nonzero(arr, axis=0)
    total = len(bset) # how many to compare against
    # Since you want the most unlike and since we are counting
    # number of 1s in a column, if the rate is >=.5 give it a 0, otherwise 
    # 1
    most_unlike = ''.join('0' if count/total >=.5 else '1' for count in nonzero)

    return most_unlike


>>> print(unlike(bitset1))
0001
>>> print(unlike(bitset2))  
00010

Now, I know you said 0001 is not the correct solution for bitset1, but I'm pretty sure it is, unless I'm not understanding the question correctly.
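One way to check this is to score candidates with the max-similarity definition from the question. The sketch below uses a compact rewrite of that function, and '0101' is a comparison candidate picked by hand for this illustration, not something taken from the answers above.

```python
def max_similarity(bitstring, bitstrings):
    """Largest number of positions in which `bitstring` agrees with any member."""
    return max(sum(a == b for a, b in zip(bitstring, item)) for item in bitstrings)


bitset1 = {'0011', '1100', '1110'}
print(max_similarity('0001', bitset1))  # 3: agrees with '0011' in three positions
print(max_similarity('0101', bitset1))  # 2
```

Under that definition '0001' scores 3 while '0101' scores 2, which is consistent with the question's remark that 0001 is not the best answer for bitset1.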

Search for the bit string most unlike a set of bit strings
