Question

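What is a general, efficient algorithm for finding the minimal subset of columns in a matrix of discrete values that makes the rows unique?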

For example, consider this matrix (with named columns):

    a b c d
    2 1 0 0
    2 0 0 0
    2 1 2 2
    1 2 2 2
    2 1 1 0

Every row in the matrix is unique. However, if we drop columns a and d, the same property still holds.

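I could enumerate all possible combinations of columns, but that quickly becomes intractable as the matrix grows. Is there a faster, optimal algorithm for doing this?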
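For reference, the enumeration mentioned above can be sketched as follows. This is only a baseline under stated assumptions: the helper names `rows_unique_on` and `smallest_column_subset` are made up for illustration, and the sample values are the ones the answers below work with. It tries column subsets in order of increasing size, so the first hit is a minimum, but the number of subsets grows exponentially with the column count.

```python
from itertools import combinations

def rows_unique_on(rows, cols):
    # Rows are unique on `cols` if their projections onto those columns are all distinct.
    return len({tuple(row[c] for c in cols) for row in rows}) == len(rows)

def smallest_column_subset(rows):
    # Try subsets of size 0, 1, 2, ... so the first valid subset has minimum size.
    n_cols = len(rows[0])
    for size in range(n_cols + 1):
        for cols in combinations(range(n_cols), size):
            if rows_unique_on(rows, cols):
                return cols
    return None  # the rows are not unique even when every column is kept

rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0],
]

print(smallest_column_subset(rows))  # (1, 2), i.e. columns b and c
```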

Answer 1

Actually, my original formulation wasn't very good. This is better posed as a set cover problem.

import pulp

# Input data
A = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]

# Preprocess the data a bit.
# Bikj = 1 if Aij != Akj, 0 otherwise
B = []
for i in range(len(A)):
    Bi = []
    for k in range(len(A)):
        Bik = [int(A[i][j] != A[k][j]) for j in range(len(A[i]))]
        Bi.append(Bik)
    B.append(Bi)

model = pulp.LpProblem('Tim', pulp.LpMinimize)

# Variables turn on and off columns.
x = [pulp.LpVariable('x_%d' % j, cat=pulp.LpBinary) for j in range(len(A[0]))]

# Every pair of distinct rows must differ in at least one selected column.
for i in range(len(A)):
    for k in range(i + 1, len(A)):
        model += sum(B[i][k][j] * x[j] for j in range(len(A[i]))) >= 1

model.setObjective(pulp.lpSum(x))
assert model.solve() == pulp.LpStatusOptimal
print([xi.value() for xi in x])
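To make the set cover view concrete without a solver, here is a small sketch in plain Python. The helper names `distinguishing_sets` and `covers` are hypothetical and not part of PuLP. Each pair of rows yields the set of columns in which the two rows differ, and a column selection is feasible exactly when it intersects every such set, which is what the constraints in the model above express.

```python
from itertools import combinations

def distinguishing_sets(A):
    # For each pair of rows (i, k), collect the columns in which the two rows differ.
    return {
        (i, k): {j for j in range(len(A[0])) if A[i][j] != A[k][j]}
        for i, k in combinations(range(len(A)), 2)
    }

def covers(selected, sets):
    # A selection is feasible if it hits every pair's set of distinguishing columns.
    return all(cols & selected for cols in sets.values())

A = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0],
]

sets = distinguishing_sets(A)
# A pair that differs in exactly one column forces that column into every solution.
forced = set().union(*(s for s in sets.values() if len(s) == 1))
print(sorted(forced))        # [1, 2]
print(covers(forced, sets))  # True
```

For this input the forced columns already cover every pair, so they are the optimum. In general the forced columns are only a starting point and the solver still has to choose the rest.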

Answer 2

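Here's my greedy solution. (Yes, this doesn't meet your "optimal" criterion.) Randomly pick a column that can safely be thrown away and throw it away. Keep going until there are no more such columns. I'm sure is_valid could be optimized.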

rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0]
]

col_names = [0, 1, 2, 3]

def is_valid(rows, col_names):
    # it's valid if every row has a distinct "signature"
    signatures = { tuple(row[col] for col in col_names) for row in rows }
    return len(signatures) == len(rows)

import random
def minimal_distinct_columns(rows, col_names):
    col_names = col_names[:]
    random.shuffle(col_names)
    for i, col in enumerate(col_names):
        fewer_col_names = col_names[:i] + col_names[(i+1):]
        if is_valid(rows, fewer_col_names):
            return minimal_distinct_columns(rows, fewer_col_names)
    return col_names

Because it's greedy, it doesn't always get the best answer, but it should be relatively fast (and simple).
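One possible simplification, offered as a sketch rather than as the answer's own method: dropping columns can only merge row signatures, so a column that cannot be removed now cannot be removed later either. A single pass over a shuffled column order therefore reaches a minimal (not necessarily minimum) set without recursion. The name `greedy_single_pass` and the `seed` parameter are assumptions made for illustration.

```python
import random

def is_valid(rows, col_names):
    # Valid if every row has a distinct "signature" on the chosen columns.
    signatures = {tuple(row[col] for col in col_names) for row in rows}
    return len(signatures) == len(rows)

def greedy_single_pass(rows, col_names, seed=None):
    rng = random.Random(seed)
    order = list(col_names)
    rng.shuffle(order)
    kept = list(col_names)
    for col in order:
        trial = [c for c in kept if c != col]
        # Drop the column only if the remaining columns still separate all rows.
        if is_valid(rows, trial):
            kept = trial
    return kept

rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0],
]

print(greedy_single_pass(rows, [0, 1, 2, 3], seed=0))  # [1, 2]
```

On this particular matrix every order ends at columns 1 and 2, because both are needed to separate some pair of rows. On other inputs the result can depend on the shuffle.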

Answer 3

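Observation: if M has unique rows without both columns i and j, then it also has unique rows without column i alone and without column j alone (in other words, adding a column to a matrix with unique rows cannot make the rows non-unique). Therefore, you should be able to find a minimum (not just minimal) solution by using a depth-first search.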

def has_unique_rows(M):
    return len(set([tuple(i) for i in M])) == len(M)

def remove_cols(M, cols):
    ret = []
    for row in M:
        new_row = []
        for i in range(len(row)):
            if i in cols:
                continue
            new_row.append(row[i])
        ret.append(new_row)
    return ret


def minimum_unique_rows(M):
    if not has_unique_rows(M):
        raise ValueError("M must have unique rows")

    cols = list(range(len(M[0])))

    def _cols_to_remove(M, removed_cols=(), max_removed_cols=()):
        for i in set(cols) - set(removed_cols):
            new_removed_cols = removed_cols + (i,)
            new_M = remove_cols(M, new_removed_cols)
            if not has_unique_rows(new_M):
                continue
            if len(new_removed_cols) > len(max_removed_cols):
                max_removed_cols = new_removed_cols
            return _cols_to_remove(M, new_removed_cols, max_removed_cols)
        return max_removed_cols

    removed_cols = _cols_to_remove(M)
    return remove_cols(M, removed_cols), removed_cols

(Note that my variable naming is terrible.)

Here's your matrix:

In [172]: rows = [
   .....:     [2, 1, 0, 0],
   .....:     [2, 0, 0, 0],
   .....:     [2, 1, 2, 2],
   .....:     [1, 2, 2, 2],
   .....:     [2, 1, 1, 0]
   .....: ]

In [173]: minimum_unique_rows(rows)
Out[173]: ([[1, 0], [0, 0], [1, 2], [2, 2], [1, 1]], (0, 3))

I generated a random matrix (using sympy.randMatrix), shown below:

⎡0  1  0  1  0  1  1⎤
⎢                   ⎥
⎢0  1  1  2  0  0  2⎥
⎢                   ⎥
⎢1  0  1  1  1  0  0⎥
⎢                   ⎥
⎢1  2  2  1  1  2  2⎥
⎢                   ⎥
⎢2  0  0  0  0  1  1⎥
⎢                   ⎥
⎢2  0  2  2  1  1  0⎥
⎢                   ⎥
⎢2  1  2  1  1  0  1⎥
⎢                   ⎥
⎢2  2  1  2  1  0  1⎥
⎢                   ⎥
⎣2  2  2  1  1  2  1⎦

(Note that sorting the rows of M helps a lot when checking these things by hand.)

In [224]: M1 = [[0, 1, 0, 1, 0, 1, 1], [0, 1, 1, 2, 0, 0, 2], [1, 0, 1, 1, 1, 0, 0], [1, 2, 2, 1, 1, 2, 2], [2, 0, 0, 0, 0, 1, 1], [2, 0, 2, 2, 1, 1, 0], [2, 1, 2, 1, 1, 0, 1], [2, 2, 1, 2, 1, 0, 1], [2, 2, 2, 1, 1, 2, 1]]

In [225]: minimum_unique_rows(M1)
Out[225]: ([[1, 1, 1], [2, 0, 2], [1, 0, 0], [1, 2, 2], [0, 1, 1], [2, 1, 0], [1, 0, 1], [2, 0, 1], [1, 2, 1]], (0, 1, 2, 4))

Here is a brute-force check that this is a minimum answer (there are actually quite a few minimum solutions).

In [229]: from itertools import combinations

In [230]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 6)])
[False, False, False, False, False, False, False]

In [231]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 5)])
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]

In [232]: print([has_unique_rows(remove_cols(M1, r)) for r in combinations(range(7), 4)])
[False, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, True, False, False, False, False, False, True, True]
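
The brute-force check above confirms the result for this matrix. Note that `_cols_to_remove`, as written, returns from the first column whose removal keeps the rows unique, so it follows a single branch. A variant that backtracks over every branch might look like the sketch below. The function name `max_removable_cols` and the memo set `seen` are assumptions added for illustration, and the search is still exponential in the worst case.

```python
def has_unique_rows(M):
    return len({tuple(row) for row in M}) == len(M)

def remove_cols(M, cols):
    drop = set(cols)
    return [[v for i, v in enumerate(row) if i not in drop] for row in M]

def max_removable_cols(M):
    if not has_unique_rows(M):
        raise ValueError("M must have unique rows")
    n_cols = len(M[0])
    best = frozenset()
    seen = set()  # removal sets already explored, so each is visited once

    def dfs(removed):
        nonlocal best
        if removed in seen:
            return
        seen.add(removed)
        if len(removed) > len(best):
            best = removed
        for i in range(n_cols):
            if i in removed:
                continue
            candidate = removed | {i}
            # Only descend while the remaining columns still keep the rows unique.
            if has_unique_rows(remove_cols(M, candidate)):
                dfs(candidate)

    dfs(frozenset())
    return tuple(sorted(best))

rows = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0],
]

print(max_removable_cols(rows))  # (0, 3)
```

Because removal sets are stored as frozensets, two orders of removing the same columns are explored only once.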

Answer 4

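While I'm sure there are better approaches, this reminded me of some of the genetic algorithm work I did in the 90s. I wrote up a quick version using R's GA package.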

library(GA)

matrix_to_minimize <- matrix(c(2,2,1,1,2, 
                               1,0,1,2,1,
                               0,0,2,2,1,
                               0,0,2,2,0), ncol=4)

evaluate <- function(indices) {
  if(all(indices == 0)) {
    return(0)
  }
  selected_cols <- matrix_to_minimize[, as.logical(indices), drop=FALSE]

  are_unique <- nrow(selected_cols) == nrow(unique(selected_cols))
  if (are_unique == FALSE) {
    return(0)
  }

  retval <- (1/sum(as.logical(indices)))
  return(retval)
}

ga_results <- ga("binary", evaluate, 
             nBits=ncol(matrix_to_minimize), 
             popSize=10 * ncol(matrix_to_minimize), #why not
             maxiter=1000,
             run=10) #probably want to play with this

print("Best Solution: ")
print(ga_results@solution)

I don't know whether it's good or optimal, but I bet it will provide a reasonably good answer in a reasonable amount of time? :)
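
For readers following the Python answers, the fitness function above can be sketched in Python as well. This is an illustrative port, not part of the GA package: the name `fitness` is hypothetical, and the matrix is the one from the Python answers. Scoring every bit string exhaustively, which is only feasible for a handful of columns, shows the optimum a genetic search would be aiming for.

```python
from itertools import product

matrix = [
    [2, 1, 0, 0],
    [2, 0, 0, 0],
    [2, 1, 2, 2],
    [1, 2, 2, 2],
    [2, 1, 1, 0],
]

def fitness(bits):
    # bits[j] == 1 means column j is selected.
    cols = [j for j, b in enumerate(bits) if b]
    if not cols:
        return 0.0
    signatures = {tuple(row[j] for j in cols) for row in matrix}
    if len(signatures) != len(matrix):
        return 0.0  # the selected columns do not keep the rows unique
    return 1.0 / len(cols)  # fewer selected columns gives a higher score

best = max(product([0, 1], repeat=len(matrix[0])), key=fitness)
print(best, fitness(best))  # (0, 1, 1, 0) 0.5
```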
