Question

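Hello, I am working with the scikit-learn digits dataset and have split the data, so I have the arrays x_train and y_train.

The arrays are related by index, so x[0] belongs to y[0].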

print(x_train.shape)
(1347, 64)
print(y_train.shape)
(1347,)
print(set(y_train))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

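I want to draw a random sample from x_train given set(y), that is, resample my data by extracting just one random observation for set(y). I don't know whether I can do this with numpy or pandas. Does anyone know how to approach this?

Thank you very much.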

Answer 1

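It is not clear what you want to do. set(y) contains all the labels available for the dataset X.

In general (until you specify exactly what you need), use np.random.choice:

You have this: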

print(set(y))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

First, convert it to a list:

index_all = list(set(y))

Now draw a random sample from set(y):

import numpy as np

# Draw one random value (a class/label) from 0 to 9.
random_index = np.random.choice(index_all)

Now I see two possibilities (I believe you want case 2):

1) Resample x directly with this random index (a random number taken from set(y)). If x is a numpy array:

x[random_index, :]

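This returns the row of x at position random_index. Because the value comes from set(y), it is only ever one of rows 0 to 9, and the labels themselves play no part in the selection.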

2) Resample x, but keep only observations that carry a given label y. The label is the one drawn at random above (random_index):

x[y==random_index]

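This returns every observation in x whose label equals random_index, not a single one. To end up with one random observation, pick one row at random from that subset.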
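A minimal sketch of case 2 taken one step further, since the question can be read two ways: one random observation for a randomly chosen label, or one random observation for every label. The stand-in arrays and both function names (sample_row_for_random_label, sample_one_row_per_label) are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

# Stand-in data (hypothetical): 1347 rows of 64 features, labels 0-9.
rng = np.random.default_rng(seed=0)
x = rng.random((1347, 64))
y = rng.integers(0, 10, size=1347)


def sample_row_for_random_label(x, y, rng):
    """Pick a random label, then one random row of x carrying that label."""
    label = rng.choice(np.unique(y))
    candidates = np.flatnonzero(y == label)  # row positions with this label
    row = rng.choice(candidates)
    return label, row, x[row]


def sample_one_row_per_label(x, y, rng):
    """Return one randomly chosen row position for every distinct label."""
    rows = [rng.choice(np.flatnonzero(y == label)) for label in np.unique(y)]
    return np.array(rows)


# Reading 1: a single observation whose label was drawn at random.
label, row, observation = sample_row_for_random_label(x, y, rng)

# Reading 2: a resampled dataset with exactly one observation per label.
rows = sample_one_row_per_label(x, y, rng)
x_resampled, y_resampled = x[rows], y[rows]
```

Returning row positions rather than the rows themselves keeps x and y aligned, because the same positions can index both arrays.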

Answer 2

This is how I usually build a DataFrame and extract data from it.

import numpy as np
import pandas as pd

# Dummy arrays for x and y
x_train = np.zeros((1347, 64))
y_train = np.ones(1347)

# First we pair up the arrays according to their index using zip. Only use this
# method if both arrays are of equal length.
training_dataset = list(zip(x_train, y_train))

# Next we load the dataset as a DataFrame using pandas
df = pd.DataFrame(data=training_dataset)
# Check that the DataFrame is what you want
df.head()

# If you would like to extract a random row, you may use
df.sample(n=1)

# Alternatively, if you would like to extract a specific row (e.g. the 10th row, i.e. index 9)
df.iloc[9]
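If the goal is one random observation per label rather than one random row overall, a grouped sample may be closer to what was asked. The sketch below is only one possible approach: it assumes pandas 1.1 or newer (where groupby(...).sample is available), uses random stand-in data, and the "label" column name is my own choice.

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical), same shapes as in the question.
rng = np.random.default_rng(seed=0)
x_train = rng.random((1347, 64))
y_train = rng.integers(0, 10, size=1347)

# One column per feature plus a "label" column (the column name is an assumption).
df = pd.DataFrame(x_train)
df["label"] = y_train

# One random row for each label; random_state makes the draw repeatable.
one_per_label = df.groupby("label").sample(n=1, random_state=0)

# Back to arrays, if that is what the downstream code expects.
x_sample = one_per_label.drop(columns="label").to_numpy()
y_sample = one_per_label["label"].to_numpy()
```

Storing each feature in its own column, instead of zipping whole rows into a single cell, keeps the features easy to convert back to a 2-D array afterwards.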

I hope I have understood what you are trying to achieve; if not, feel free to let me know so I can revise my answer!

Sources:

Pandas Docs

Selecting Rows and Columns in Pandas Dataframes

How do I get a random sample given 2 arrays?
