I have a dataset X, and X.shape gives (10000, 9). I want to select a subset of X using the following code:
import numpy as np
X = np.asarray(np.random.normal(size = (10000,9)))
train_fraction = 0.7 # fraction of X that will be marked as train data
train_size = int(X.shape[0]*train_fraction) # fraction converted to number
test_size = X.shape[0] - train_size # remaining rows will be marked as test data
train_ind = np.asarray([False]*X.shape[0])
train_ind[np.random.randint(low = X.shape[0], size = (train_size,))] = True # mark True at 70% of the places
The problem is that np.sum(train_ind) is not the expected 7000. Instead, it returns a seemingly random value such as 5033.
I initially thought np.random.randint(low = X.shape[0], size = (train_size,)) might be the culprit. But when I check np.random.randint(low = X.shape[0], size = (train_size,)).shape, I get (7000,).
Where am I going wrong?
Answer 0 (score: 1)
Use np.random.choice(np.arange(0,X.shape[0]), size = train_size, replace = False) instead.
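The problem is that np.random.randint samples with replacement, so its output is not injective: the same number, say 1, can appear more than once. Index 1 is then set to True twice, which has no more effect than setting it once, while some other index is never set to True. The 7000 draws therefore mark fewer than 7000 distinct positions.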
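To make the effect concrete, here is a minimal sketch that counts how many distinct indices 7000 draws with replacement actually hit. The seed and the variable names (n, k, idx, mask) are assumptions chosen for illustration, not part of the question. The formula for the expected count, n * (1 - (1 - 1/n)**k), is the standard expectation for the number of distinct values in k uniform draws from n items; for n = 10000 and k = 7000 it is roughly 5034, which is consistent with the 5033 reported above.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the run is repeatable

n, k = 10000, 7000  # number of rows, number of draws (values from the question)

# Same call as in the question: integers in [0, n), drawn WITH replacement.
idx = np.random.randint(low=n, size=(k,))

# Number of distinct indices among the k draws.
n_unique = np.unique(idx).size

# Assigning True at a repeated index has no additional effect,
# so the mask ends up with exactly n_unique True entries.
mask = np.zeros(n, dtype=bool)
mask[idx] = True

# Expected number of distinct indices when drawing k times from n items.
expected_unique = n * (1 - (1 - 1 / n) ** k)

print(idx.shape, n_unique, mask.sum(), round(expected_unique))
```

The shape of idx is still (7000,), which is why checking .shape does not reveal the problem: the array has 7000 entries, but many of them are repeats.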
The np.random.choice function guarantees that each number appears at most once (if replace = False is set).
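Putting the suggestion into the original snippet might look like the sketch below. The names X_train and X_test and the final row selection are assumptions added for illustration; the question only builds the boolean mask. Passing the integer X.shape[0] directly to np.random.choice is equivalent to passing np.arange(0, X.shape[0]).

```python
import numpy as np

X = np.random.normal(size=(10000, 9))
train_fraction = 0.7  # fraction of X that will be marked as train data
train_size = int(X.shape[0] * train_fraction)

# Draw train_size distinct row indices (sampling WITHOUT replacement).
train_idx = np.random.choice(X.shape[0], size=train_size, replace=False)

# Boolean mask with exactly train_size True entries.
train_ind = np.zeros(X.shape[0], dtype=bool)
train_ind[train_idx] = True

# Hypothetical follow-up: split the rows using the mask.
X_train = X[train_ind]
X_test = X[~train_ind]

print(np.sum(train_ind), X_train.shape, X_test.shape)
```

Because the indices are distinct, np.sum(train_ind) equals train_size, and the remaining rows form the test set.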