Question

I have a text file listing pairs, for example

10,1
2,7
3,1
10,1

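I then turn this into a symmetric matrix, so the (1,10) entry is the number of times the pair (1,10) appears in the list. I now want to subsample this matrix. By subsample I mean that I want to build the matrix that would result from using only a random 30% of the lines of the original text file. So in this example, if I removed 70% of the text file, the pair (1,10) might show up only once instead of twice, and the (1,10) entry of the matrix would be 1 instead of 2.

If I actually have the original text file, this is easy: just use random.sample to pick 30% of the lines in the file. But if I only have the matrix, how can I randomly take out 70% of the data?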
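
For reference, the file-based procedure described in the question can be sketched as follows. This is only an illustration: the helper name `build_matrix`, the 1-based labels, and the hard-coded example lines are assumptions, not part of the original post.

```python
import random

def build_matrix(pairs, n):
    """Return an (n+1) x (n+1) symmetric count matrix for labels 1..n."""
    matrix = [[0] * (n + 1) for _ in range(n + 1)]
    for a, b in pairs:
        matrix[a][b] += 1
        if a != b:
            matrix[b][a] += 1  # mirror the count to keep the matrix symmetric
    return matrix

# The four example lines from the question
lines = ["10,1", "2,7", "3,1", "10,1"]
pairs = [tuple(int(v) for v in line.split(",")) for line in lines]
full = build_matrix(pairs, n=10)

# Sub-sample the file itself: keep a random 30% of the lines
# (at least one line here, because the example is tiny).
keep = max(1, int(0.3 * len(pairs)))
sub = build_matrix(random.sample(pairs, keep), n=10)
```

Here `full[1][10]` and `full[10][1]` are both 2, while in `sub` they can only be 0 or 1 because a single line is kept.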

Answer 1

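Unfortunately, examples two and three do not follow the correct distribution with respect to how often each line occurs in the original file.

Instead of removing tuples from the original data, you can randomly remove counts from the matrix. To do so, generate random indices and decrement the corresponding count. Be sure never to decrement a zero count; generate a new index instead. Repeat this until the total number of counted tuples has been reduced to 30%. Basically it could look like this: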

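import random

# overall_amount: number of lines in the original file; n: matrix size
amount_to_decrease = 0.7 * overall_amount

decreased = 0

while decreased < amount_to_decrease:
    # randrange(n) yields 0..n-1; randint(0, n) could return n (out of range)
    x = random.randrange(n)
    y = random.randrange(n)
    if matrix[x][y] > 0:
        matrix[x][y]-=1
        decreased+=1
        if x != y:
            matrix[y][x]-=1  # keep the matrix symmetric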

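~~If the matrix is well filled, this should work fine. If it is not~~, you may want to recreate the list of tuples from the matrix and then pick a random subset of it. Afterwards, recreate the matrix from the remaining tuples: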

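tuples = []
# Walk the upper triangle only, so each symmetric pair is listed once
for y in range(n):
    for x in range(y+1):
        for _ in range(matrix[x][y]):
            tuples.append((x,y))
# Keep 30% of the tuples (that is, drop 70% of them)
remaining = random.sample(tuples, int(overall_amount*0.3) )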
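
The last step, recreating the matrix from the sampled tuples, is not shown above. A minimal sketch of the whole round trip might look like this; the function name `subsample_via_tuples` and the small example matrix are made up for illustration. Because every tuple stands for exactly one line of the original file, sampling tuples without replacement has the same distribution as sampling lines from the file.

```python
import random

def subsample_via_tuples(matrix, keep_fraction=0.3, rng=random):
    """Return a new symmetric matrix built from a random subset of the pairs."""
    n = len(matrix)
    # One tuple per original line; only the upper triangle (x <= y) is read.
    tuples = [(x, y)
              for y in range(n)
              for x in range(y + 1)
              for _ in range(matrix[x][y])]
    remaining = rng.sample(tuples, int(len(tuples) * keep_fraction))

    # Rebuild the symmetric matrix from the tuples that were kept.
    result = [[0] * n for _ in range(n)]
    for x, y in remaining:
        result[x][y] += 1
        if x != y:
            result[y][x] += 1
    return result

# Hypothetical symmetric input standing for 10 lines (2 + 1 + 7 pairs)
example = [[0, 2, 1],
           [2, 0, 7],
           [1, 7, 0]]
sampled = subsample_via_tuples(example)
```

The memory cost grows with the total count rather than with the matrix size, so this suits matrices whose counts are not huge.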

~~Or you can combine the two: in a first pass, find all indices with a non-zero count, then sample from those indices to decrement the counts:~~

# First pass: collect the upper-triangle indices whose count is non-zero
valid_indices = []
for y in range(n):
    for x in range(y+1):
        if matrix[x][y] > 0:
            valid_indices.append((x,y))

amount_to_decrease = 0.7 * overall_amount
decreased = 0
while decreased < amount_to_decrease:
    x,y = random.choice(valid_indices)
    matrix[x][y]-=1
    decreased+=1
    if x != y:
        matrix[y][x]-=1
    # Drop the index once its count is used up
    if matrix[x][y] == 0:
        valid_indices.remove((x,y))

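There is another approach that uses the correct probabilities but may not give you an exact reduction. The idea is to set a probability of keeping each line/count. If your goal is to reduce to 30%, this would be 0.3. You then go through the matrix and decide for every single count whether it is kept.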

keep_chance = 0.3
for y in range(n):
    for x in range(y+1):
        # range() is evaluated once, so decrementing inside the loop is safe
        for _ in range(matrix[x][y]):
            if random.random() > keep_chance:
                matrix[x][y] -= 1
                if x != y:
                    matrix[y][x]-=1
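
Keeping each of a cell's counts independently with probability `keep_chance` is the same as drawing the new count from a binomial distribution, so the loop can also be written in vectorized form. The following is a sketch assuming NumPy is available and the input is a symmetric integer matrix; the function name `thin_counts` is hypothetical.

```python
import numpy as np

def thin_counts(matrix, keep_chance=0.3, seed=None):
    """Keep every counted pair independently with probability keep_chance."""
    rng = np.random.default_rng(seed)
    mat = np.asarray(matrix, dtype=np.int64)
    # Work on the upper triangle (diagonal included) so each pair is drawn once.
    upper = np.triu(mat)
    # Binomial(count, p) per cell; cells with count 0 stay 0.
    kept = rng.binomial(upper, keep_chance)
    # Mirror the strict upper triangle back to restore symmetry.
    return kept + np.triu(kept, k=1).T

example = [[0, 2, 1],
           [2, 0, 7],
           [1, 7, 0]]
thinned = thin_counts(example, seed=0)
```

As with the loop version, the total that remains is random: it is 30% of the original only on average.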

Answer 2

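I think the best approach depends on where your data is large:

Do you have a huge matrix with mostly small counts in it? Or
Do you have a moderately sized matrix with huge counts in it?

Here is a solution suited to the second case, although it would also work fine for the first.

Basically, the fact that the counts happen to sit in a 2D matrix is not that important: this is fundamentally a problem of sampling from a population that has been binned. So what we can do is extract the bins directly and forget about the matrix: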

import numpy as np
import random

# Input counts matrix
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)

# Build a list of (row,col) pairs, and a list of counts
keys, counts = zip(*[
    ((i,j), mat[i,j])
        for i in range(mat.shape[0])
        for j in range(mat.shape[1])
        if mat[i,j] > 0
])

Then sample from these bins using a cumulative counts array:

# Make the cumulative counts array
counts = np.array(counts, dtype=np.int64)
sum_counts = np.cumsum(counts)

# Decide how many counts to include in the sample
frac_select = 0.30
count_select = int(sum_counts[-1] * frac_select)

# Choose unique counts (range replaces the Python 2-only xrange)
ind_select = sorted(random.sample(range(int(sum_counts[-1])), count_select))

# A vector to hold the new counts
out_counts = np.zeros(counts.shape, dtype=np.int64)

# Perform basically the merge step of merge-sort, finding where
# the counts land in the cumulative array
i = 0
j = 0
while i<len(sum_counts) and j<len(ind_select):
    if ind_select[j] < sum_counts[i]:
        j += 1
        out_counts[i] += 1
    else:
        i += 1

# Rebuild the matrix using the `keys` list from before
out_mat = np.zeros(mat.shape, dtype=np.int64)
for i in range(len(out_counts)):
    out_mat[keys[i]] = out_counts[i]

Now you have the sampled matrix in out_mat.
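
Drawing a fixed number of items without replacement from binned counts is a multivariate hypergeometric draw, so the same result can be obtained in a single call. This is a sketch under the assumption that NumPy 1.18 or later is installed (which provides `Generator.multivariate_hypergeometric`); the wrapper name `subsample_exact` is made up here. Like the code above, it treats every cell as its own bin.

```python
import numpy as np

def subsample_exact(mat, frac_select=0.30, seed=None):
    """Draw int(total * frac_select) counts without replacement from the cells."""
    rng = np.random.default_rng(seed)
    mat = np.asarray(mat, dtype=np.int64)
    counts = mat.ravel()                       # one bin per matrix cell
    n_select = int(counts.sum() * frac_select)
    sampled = rng.multivariate_hypergeometric(counts, n_select)
    return sampled.reshape(mat.shape)

# Same input matrix as in the answer above
mat = np.array([
    [5, 5, 2],
    [1, 1, 3],
    [6, 0, 4]
], dtype=np.int64)
out_mat = subsample_exact(mat, seed=0)
```

For a symmetric matrix, one would apply this to the upper triangle only and mirror the result, since each pair is stored in two cells.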

Answer 3

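Assume the pairs 1,10 and 10,1 are different, so mat[1][10] is not necessarily the same as mat[10][1] (if they are the same, read the second part below).

First compute the sum of all values in the matrix.

Let this sum be S. It equals the number of lines in the file.

Let x and y be the dimensions of the matrix.

Now loop n from 0 to [70% of S]:

Pick a random integer between 1 and x. Let this be j
Pick a random integer between 1 and y. Let this be k
If mat[j][k] > 0, decrement mat[j][k] and do n++

Since you increment a single value in the matrix for every line in the file, randomly decrementing a positive value in the matrix is the same as removing a line from the file.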
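
The steps above can be sketched in Python as follows, using 0-based indices instead of the 1-based ones in the description. The function name `remove_random_counts` and the example matrix are assumptions for illustration. Note that this picks cells uniformly at random, so a cell holding a large count is not chosen more often than a cell holding a small one.

```python
import random

def remove_random_counts(mat, remove_fraction=0.7, rng=random):
    """Decrement randomly chosen positive cells until 70% of S is removed."""
    rows, cols = len(mat), len(mat[0])
    total = sum(sum(row) for row in mat)       # S: number of lines in the file
    to_remove = int(total * remove_fraction)   # [70% of S], rounded down
    removed = 0                                # plays the role of n
    while removed < to_remove:
        j = rng.randrange(rows)
        k = rng.randrange(cols)
        if mat[j][k] > 0:                      # zero cells are simply retried
            mat[j][k] -= 1
            removed += 1
    return mat

# Hypothetical non-symmetric count matrix with S = 27
mat = [[5, 5, 2],
       [1, 1, 3],
       [6, 0, 4]]
remove_random_counts(mat)
```

The matrix is modified in place; pass a copy if the original counts are still needed.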

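If 10,1 is the same as 1,10, you do not need half of the matrix, so you can change the algorithm:

Loop n from 0 to [70% of S]:

Pick a random integer between 1 and x. Let this be j
Pick a random integer between 1 and j. Let this be k
If mat[j][k] > 0, decrement mat[j][k] and do n++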
