Question

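I'm trying to write a basic script that will help me find how many similar columns there are between rows. The data is very simple, something like: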

array = np.array([[0, 1, 0, 0, 1, 0, 0], [0, 0, 1, 0, 1, 1, 0]])

I have to run this script across all pairings of the rows in the list, so row 1 compared with row 2, row 1 compared with row 3, and so on.

Any help would be greatly appreciated.

Answer 1

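Your title question can be solved with basic numpy techniques. Let's assume you have a two-dimensional numpy array a and you want to compare rows m and n: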

row_m = a[m, :] # this selects row index m and all column indices, thus: row m
row_n = a[n, :]
shared = row_m == row_n # this compares row_m and row_n element-by-element storing each individual result (True or False) in a separate cell, the result thus has the same shape as row_m and row_n
overlap = shared.sum() # this sums over all elements in shared, since False is encoded as 0 and True as 1 this returns the number of shared elements.
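
As a minimal runnable sketch of the recipe above, here it is applied to the two rows from the question. The choice m = 0 and n = 1 is just an assumption for illustration:

```python
import numpy as np

# The two rows from the question, written as a proper 2-D array
a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0]])

m, n = 0, 1                   # row indices chosen for illustration
shared = a[m, :] == a[n, :]   # boolean array, one entry per column
overlap = int(shared.sum())   # True counts as 1, False as 0

print(shared)
print(overlap)
```

Here the rows agree in columns 0, 3, 4 and 6, so overlap comes out as 4.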

The simplest way to apply this recipe to all pairs of rows is broadcasting:

first = a[:, None, :] # None creates a new dimension to make space for a second row axis
second = a[None, :, :] # Same but new dim in first axis
# observe that axes 0 and 1 in these two arrays are arranged as for a distance map
# a binary operation between arrays laid out like this will trigger broadcasting, i.e. numpy will compute all possible pairs in the appropriate positions
full_overlap_map = first == second # has shape nrow x nrow x ncol
similarity_table = full_overlap_map.sum(axis=-1) # shape nrow x nrow
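
To make the broadcasting version concrete, here is a small sketch. The first two rows are from the question; the third row is made up for illustration. Pulling the unique pairs out with np.triu_indices is one possible way to read the table, not something the recipe requires:

```python
import numpy as np

a = np.array([[0, 1, 0, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1, 0, 0]])  # third row is a made-up example

# Compare every row with every other row in one broadcasted operation
similarity_table = (a[:, None, :] == a[None, :, :]).sum(axis=-1)

# The table is symmetric and its diagonal is just the number of columns,
# so the strict upper triangle (k=1) holds each unordered pair exactly once
i_idx, j_idx = np.triu_indices(a.shape[0], k=1)
pair_counts = {(int(i), int(j)): int(similarity_table[i, j])
               for i, j in zip(i_idx, j_idx)}

print(similarity_table)
print(pair_counts)
```

Note that the intermediate boolean array has shape nrow x nrow x ncol, so for a large number of rows memory use can become the limiting factor.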

Answer 2

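If you can rely on all rows being binary-valued, and only columns where both rows are 1 should count as similar, the "similar columns" count is just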

def count_sim_cols(row0, row1):
    return np.sum(row0*row1)

If a wider range of values is possible, or matching zeros should count too, you just replace the product with a comparison

def count_sim_cols(row0, row1):
    return np.sum(row0 == row1)

If you want a tolerance on the "similarity", say tol, some small value, this is just

def count_sim_cols(row0, row1):
    return np.sum(np.abs(row0 - row1) < tol)

Then you can use a doubly nested loop to get the counts. Assuming X is a numpy array with n rows:

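sim_counts = {}
n = X.shape[0]  # number of rows
for i in range(n):  # range replaces the Python 2-only xrange
    for j in range(i + 1, n):  # j > i, so each pair is visited once
        sim_counts[(i, j)] = count_sim_cols(X[i], X[j])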
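
The same pairwise loop can also be sketched with itertools.combinations, which yields each pair of row indices once. The helper name count_close_cols, the explicit tol parameter, and the sample data below are assumptions made for illustration, not part of the answer above:

```python
from itertools import combinations

import numpy as np


def count_close_cols(row0, row1, tol):
    """Count columns whose values differ by less than tol (hypothetical helper)."""
    return int(np.sum(np.abs(row0 - row1) < tol))


# Made-up floating-point data, three rows
X = np.array([[0.00, 1.0, 2.0],
              [0.05, 1.0, 2.5],
              [0.00, 1.2, 2.0]])

tol = 0.1
sim_counts = {
    (i, j): count_close_cols(X[i], X[j], tol)
    for i, j in combinations(range(X.shape[0]), 2)
}

print(sim_counts)
```

Passing tol as an argument rather than relying on a global keeps the helper self-contained; whether that matters depends on how the script is organized.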
