I've been trying to write a function that generates a stratified sample from a dataset (since sklearn doesn't have one), and I've come up with the following. The function builds an array of indexes that I then want to use to slice the original dataset, but for some reason, when it reaches the line
sampleData = dataset[indexes]
it throws an
IndexError: indices are out-of-bounds
error. However,
sampleData = dataset.ix[indexes]
works. Still, that feels wrong to me, and it messes up my processing later on. Does anyone have any ideas? :)
Here is the full code:
import numpy as np

def stratifiedSampleGenerator(dataset, target, subsample_size=0.1):
    print('Generating stratified sample of size ' + str(round(len(dataset)*subsample_size, 2)))
    dic = {}
    indexes = np.array([])
    # find how many samples each class should contribute
    for label in target.unique():
        labelSize = len(target[target == label])
        dic[label] = int(labelSize * subsample_size)
    # build a sample that preserves the class ratios stored in dic
    for label in dic:
        classIndex = target[target == label].index  # obtain indexes of this class
        counts = dic[label]                         # number of rows to draw for this class
        newIndex = np.random.choice(classIndex, counts, replace=False)
        indexes = np.concatenate((indexes, newIndex), axis=0)
    indexes = indexes.astype(int)
    sampleData = dataset[indexes]     # throws the error
    sampleData = dataset.ix[indexes]  # doesn't
Thanks! :)
Answer 0 (score: 1)
In fact, sklearn
does have a way to split a dataset in a stratified fashion.
Wouldn't something like this be enough in your case?
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
dataset = ['A']*100 + ['B']*20 + ['C']*10
target = [0]*100 + [1]*20 + [2]*10
X_fit, X_eval, y_fit, y_eval = train_test_split(dataset, target, test_size=0.1, stratify=target)
print(X_eval.count('A'))  # output: 10
print(X_eval.count('B'))  # output: 2
print(X_eval.count('C'))  # output: 1
Check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
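As a side note on the original IndexError: on a pandas DataFrame, `dataset[indexes]` looks up *column names*, and on a Series with a non-default integer index, `[]` is label-based, so sampled positions that aren't valid labels go out of bounds. The deprecated `.ix` papered over this by mixing label and positional lookup; the explicit accessors `.loc` (labels) and `.iloc` (positions) avoid the ambiguity. A minimal sketch with hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Toy Series whose index labels (10..17) differ from its positions (0..7).
dataset = pd.Series(list('AAAAABBC'), index=range(10, 18))
indexes = np.array([10, 12, 15])  # these are index LABELS, not positions

by_label = dataset.loc[indexes]        # label-based, replaces deprecated .ix
by_position = dataset.iloc[[0, 2, 5]]  # position-based

print(list(by_label))     # values stored at labels 10, 12, 15
print(list(by_position))  # values stored at positions 0, 2, 5
```

Since `np.random.choice` in the question draws from `target[...].index` (i.e. labels), `dataset.loc[indexes]` is the drop-in replacement for `dataset.ix[indexes]`.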