Question

I have a dataset X for which X.shape gives (10000, 9). I want to select a subset of X using the following code:

import numpy as np

X = np.asarray(np.random.normal(size = (10000,9)))
train_fraction = 0.7 # fraction of X that will be marked as train data
train_size = int(X.shape[0]*train_fraction) # fraction converted to number
test_size = X.shape[0] - train_size # remaining rows will be marked as test data
train_ind = np.asarray([False]*X.shape[0])     
train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True # mark True at 70% of the places

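The problem is that np.sum(train_ind) is not the expected value of 7000. Instead, it gives seemingly random values such as 5033.

I initially thought np.random.randint(low = X.shape[0], size = (train_size,)) might be the culprit. But when I check np.random.randint(low = X.shape[0], size = (train_size,)).shape, I get (7000,).

Where am I going wrong?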

Answer 1

Use np.random.choice(np.arange(0,X.shape[0]), size = train_size, replace = False) instead.

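The problem is that np.random.randint is not injective: it samples with replacement, so the same number, say 1, can come up twice. Index 1 is then set to True twice, while some other index is never set to True. The returned array still has shape (7000,), but it contains fewer than 7000 distinct indices, which is why np.sum(train_ind) comes out lower than expected.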
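
A minimal sketch of this effect, assuming the same sizes as in the question; the seed and the names n, k, idx and mask are illustrative and not part of the original code. It counts how many distinct indices the draw contains and compares that with the expected number of distinct values when drawing k times with replacement from n, which is n * (1 - (1 - 1/n)**k), roughly 5034 here and close to the 5033 reported in the question.

```python
import numpy as np

np.random.seed(0)  # seed chosen only for reproducibility (assumption)
n, k = 10000, 7000  # number of rows and requested train size, as in the question

# Same call pattern as in the question: draws k integers from [0, n) WITH replacement
idx = np.random.randint(low=n, size=(k,))
n_unique = np.unique(idx).size  # number of distinct indices actually drawn

mask = np.zeros(n, dtype=bool)
mask[idx] = True  # duplicate indices set the same position to True again

# Expected number of distinct values among k draws with replacement from n
expected_unique = n * (1 - (1 - 1 / n) ** k)

print(idx.shape)               # the shape alone does not reveal the duplicates
print(n_unique, mask.sum())    # these two counts always agree
print(round(expected_unique))
```

The shape check in the question passes because it only counts how many numbers were drawn, not how many of them are distinct.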

The np.random.choice function makes sure that each number occurs at most once (if you set replace = False).
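
Put together, the corrected split could look like the sketch below. It reuses the variable names from the question; the seed and the names train_idx, X_train and X_test are assumptions added for illustration. Passing the integer X.shape[0] to np.random.choice is equivalent to passing np.arange(0, X.shape[0]).

```python
import numpy as np

np.random.seed(0)  # seed chosen only for reproducibility (assumption)
X = np.random.normal(size=(10000, 9))
train_fraction = 0.7  # fraction of X that will be marked as train data
train_size = int(X.shape[0] * train_fraction)

# Draw train_size distinct row indices (sampling WITHOUT replacement)
train_idx = np.random.choice(X.shape[0], size=train_size, replace=False)

train_ind = np.zeros(X.shape[0], dtype=bool)
train_ind[train_idx] = True  # every index is distinct, so exactly train_size entries become True

X_train = X[train_ind]   # rows marked as train data
X_test = X[~train_ind]   # remaining rows are test data

print(np.sum(train_ind), X_train.shape, X_test.shape)
```

With unique indices, np.sum(train_ind) equals train_size, and the train and test rows together cover X exactly once.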
