Question

I have the following function, which generates a random sample from an original dataset:

import numpy as np

def randomSampling(originalData):
    # Random integer weights in [0, 500] (randint's upper bound is exclusive)
    a = np.random.randint(0, 501, size=originalData.shape)

    # Number of elements in the result: half the rows, because we want a 50% sample
    N = int(round(originalData.shape[0] / 2))

    result = np.zeros(originalData.shape)
    ia = np.arange(result.size)

    # Sum of the flattened weights, cast to float
    tw = float(np.sum(a.ravel()))

    # Mark N positions of the flattened array with 1, chosen without replacement
    result.ravel()[np.random.choice(ia, p=a.ravel() / tw, size=N, replace=False)] = 1
    return result

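My goal is to obtain a subset of the original data whose size is 50% of the original set. That way I can run this function twice, once to build the training subset and once to build the test subset.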
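One caveat with that plan: two independent calls draw two unrelated samples, so the training and test subsets could share records. A minimal sketch of an alternative, assuming the data is a 2-D NumPy array with one record per row, is to shuffle the row indices once and cut them in half. The name `split_in_half` and its `seed` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np

def split_in_half(data, seed=None):
    """Hypothetical helper: split the rows of `data` into two disjoint halves."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once so that the two halves cannot overlap.
    order = rng.permutation(data.shape[0])
    half = data.shape[0] // 2
    return data[order[:half]], data[order[half:]]

# Made-up data: 10 records with 3 fields each.
demo = np.arange(30).reshape(10, 3)
train, test = split_in_half(demo, seed=0)
```

Every record ends up in exactly one of the two halves. With an odd number of rows, the second half gets the extra record.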

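My problem is that the original data has a field named status whose value is 0 or 1. I want to keep the proportion between the two classes in both the training and the test subsets.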

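How can I do this in Python? Also, I am not sure that the function really creates a sample with half of the records of the original set. In theory, computing N as round(shape[0]/2) of the data should guarantee a size of 50%.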
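A sketch of one possible way to keep the class proportions, assuming the labels are available as a 1-D array (called `status` here to match the field name) next to a 2-D data array with one record per row: split the row indices of each class separately, then merge the halves. `stratified_half_split` is a hypothetical name, not a function from an existing library. Note also that when the data is two-dimensional, the mask built in the function above has the shape of the whole array, so it marks N cells rather than N rows; working with row indices avoids that.

```python
import numpy as np

def stratified_half_split(data, status, seed=None):
    """Hypothetical helper: 50/50 row split that preserves the status ratio."""
    rng = np.random.default_rng(seed)
    first, second = [], []
    for label in np.unique(status):
        # Row indices that belong to this class, in random order.
        rows = rng.permutation(np.flatnonzero(status == label))
        half = len(rows) // 2
        first.append(rows[:half])
        second.append(rows[half:])
    first = np.concatenate(first)
    second = np.concatenate(second)
    return data[first], status[first], data[second], status[second]

# Made-up example: 12 records, 8 with status 1 and 4 with status 0.
demo_data = np.arange(24).reshape(12, 2)
demo_status = np.array([1] * 8 + [0] * 4)
x_train, y_train, x_test, y_test = stratified_half_split(
    demo_data, demo_status, seed=0
)
```

In this example each half holds 6 records, 4 of them with status 1, so the 2:1 ratio of the full set is preserved. If scikit-learn is an option, `sklearn.model_selection.train_test_split` accepts a `stratify` argument that serves the same purpose.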

Improving a random sampling function

0 answers: