I am trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?
These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers.
In Weka there is a tool called spreadsubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample
(I know about weighting, but that's not what I'm looking for.)
Answer 0 (score: 23)
Here is my first version, which seems to work fine; feel free to copy it or make suggestions on how it could be more efficient (I have fairly long experience with programming in general, but not that long with Python or NumPy).
This function creates a single random balanced subsample.
Edit: the subsample size now samples down the minority classes; this should probably be changed.
import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):

    class_xs = []
    min_elems = None

    # collect the samples of each class and track the smallest class size
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    # draw use_elems samples from every class
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
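As a quick usage sketch (X and y here are hypothetical NumPy arrays of features and labels, not part of the original answer):

X_bal, y_bal = balanced_subsample(X, y)                        # every class reduced to the minority-class count
X_half, y_half = balanced_subsample(X, y, subsample_size=0.5)  # half the minority-class count per class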
For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:
Replace the np.random.shuffle line with

this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

Replace the np.concatenate lines with

xs = pd.concat(xs)
ys = pd.Series(data=np.concatenate(ys), name='target')
Answer 1 (score: 20)
There is now a complete Python package to address imbalanced data. It is available as a scikit-learn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn
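As an illustrative sketch (not part of the original answer), drawing a single balanced subsample with that package's random under-sampler might look like this, assuming a reasonably recent imbalanced-learn version (the API has changed across releases):

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# hypothetical imbalanced two-class data, roughly 95% / 5%
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

rus = RandomUnderSampler(random_state=0)
X_bal, y_bal = rus.fit_resample(X, y)  # both classes now have the minority-class count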
Answer 2 (score: 7)
A version for a pandas Series:
import numpy as np

def balanced_subsample(y, size=None):

    subsample = []

    if size is None:
        # use the size of the smallest class
        n_smp = y.value_counts().min()
    else:
        # split the requested size evenly across classes
        n_smp = int(size / len(y.value_counts().index))

    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample
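A quick usage sketch (df and its 'target' column are hypothetical, not part of the original answer); since the function returns index values, you select the balanced rows with .loc:

idx = balanced_subsample(df['target'])  # balanced list of index values
balanced_df = df.loc[idx]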
Answer 3 (score: 3)
The closest thing to your needs among the built-in data-splitting techniques exposed in sklearn.cross_validation is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same imbalance that is in your main dataset. While this is not what you are looking for, you may be able to use the code from it and always change the imposed ratio to 50/50.
(This would probably be a very good contribution to scikit-learn if you feel up to it.)
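For illustration, a minimal sketch of that stratified (imbalance-preserving) approach, using the newer sklearn.model_selection location of the class; X, y and n are placeholder names, not from the original answer:

from sklearn.model_selection import StratifiedShuffleSplit

# n random, possibly overlapping subsamples, each keeping the original class proportions
sss = StratifiedShuffleSplit(n_splits=n, train_size=0.1, random_state=0)
for subsample_idx, _ in sss.split(X, y):
    X_sub, y_sub = X[subsample_idx], y[subsample_idx]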
Answer 4 (score: 3)
A version of the above code that works for multiclass groups (in my tested case, groups 0, 1, 2, 3, 4):
import numpy as np

def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by sampling all classes with sample_size
        current version is developed on assumption that the positive
        class is the minority.

        Parameters:
        ===========
        X: {numpy.ndarray}
        y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)
This also returns the indices, so they can be used for other datasets and to keep track of how frequently each data point was used (helpful for training).
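A quick usage sketch (X and y are hypothetical arrays, not part of the original answer):

X_bal, y_bal, idx = balanced_sample_maker(X, y, sample_size=500, random_seed=42)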
Answer 5 (score: 2)
Below is my Python implementation for creating a balanced copy of the data. Assumptions: 1. the target variable (y) is a binary class (0 vs. 1); 2. 1 is the minority.
from numpy import unique
from numpy import random

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling minority class
        current version is developed on assumption that the positive
        class is the minority.

        Parameters:
        ===========
        X: {numpy.ndarray}
        y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversampling on observations of positive label
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
Answer 6 (score: 1)
A slight modification to mikkom's top answer, for when you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle.
Instead of
if len(this_xs) > use_elems:
    np.random.shuffle(this_xs)
do this:

if len(this_xs) > use_elems:
    ratio = len(this_xs) // use_elems
    this_xs = this_xs[::ratio]
Answer 7 (score: 1)
I found the best solution here, and this is the one I think is the simplest.
dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)
Then you can use the X_rus, y_rus data.
Answer 8 (score: 0)
My version of subsampling, hope this helps:
import random

def subsample_indices(y, size):
    indices = {}
    target_values = set(y)
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])
Answer 9 (score: 0)
Although it is already answered, I stumbled upon your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:
from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array  # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n, shuffle=True)

batches = []
for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])
It is important to include the _, because skf.split() is designed to create stratified folds for K-fold cross-validation, so it returns two lists of indices: train (n - 1/n of the elements) and test (1/n of the elements).
Please note that this is as of sklearn 0.18. In sklearn 0.17, the same functionality can be found in the cross_validation module instead.
Answer 10 (score: 0)
A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or by oversampling (uspl=False), balanced by a specified column of that DataFrame that has two or more values.
For uspl=True, this code takes a random sample without replacement of size equal to the smallest stratum across all strata. For uspl=False, this code takes a random sample with replacement of size equal to the largest stratum across all strata.
import pandas as pd

def balanced_spl_by(df, lblcol, uspl=True):
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    return pd.concat([f.sample(n=(min(lsz) if uspl else max(lsz)), replace=(not uspl)).copy() for f in datas_l], axis=0).sample(frac=1)
This only works with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code, as far as I can tell.
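A quick usage sketch (df and its 'label' column are hypothetical, not part of the original answer):

df_under = balanced_spl_by(df, 'label', uspl=True)   # under-sample every class down to the smallest one
df_over = balanced_spl_by(df, 'label', uspl=False)   # over-sample every class up to the largest one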
Answer 11 (score: 0)
Use the code below to select 100 rows from each class, with duplicates. activity is my class (the label of the dataset).
balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
Answer 12 (score: 0)
Here is my solution, which can be tightly integrated into an existing sklearn pipeline:
from sklearn.model_selection import RepeatedKFold
import numpy as np

class DownsampledRepeatedKFold(RepeatedKFold):

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get index of major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get index of minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get length of minor class
            len_minor = len(idxs_class1)
            # subsample of major class of size minor class
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        super(DownsampledRepeatedKFold, self).__init__(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )
Use it as usual:

for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Answer 13 (score: 0)
Here is a solution which is fast (apart from one for loop, it is pure NumPy) and cheap to resample from (you just draw again with the precomputed weights), which helps generate different shuffled and balanced samples between training epochs:

import numpy as np

def stratified_random_sample_weights(labels):
    # labels is a one-hot (num_samples, n_classes) array;
    # num_samples and n_classes are assumed to be defined globally (see the full example below)
    sample_weights = np.zeros(num_samples)
    for class_i in range(n_classes):
        class_indices = np.where(labels[:, class_i] == 1)  # find indices where class_i is 1
        class_indices = np.squeeze(class_indices)  # get rid of extra dim
        num_samples_class_i = len(class_indices)
        assert num_samples_class_i > 0, f"No samples found for class index {class_i}"
        sample_weights[class_indices] = 1.0 / num_samples_class_i  # note: samples with no classes present will get weight=0
    return sample_weights / sample_weights.sum()  # sum(weights) == 1
Then you use these weights over and over to generate balanced indices with np.random.choice():
sample_weights = stratified_random_sample_weights(labels)
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)
Full example:
# generate data
from sklearn.preprocessing import OneHotEncoder
num_samples = 10000
n_classes = 10
ground_truth_class_weights = np.logspace(1,3,num=n_classes,base=10,dtype=float) # exponentially growing
ground_truth_class_weights /= ground_truth_class_weights.sum() # sum to 1
labels = np.random.choice(list(range(n_classes)), size=num_samples, p=ground_truth_class_weights)
labels = labels.reshape(-1, 1) # turn each element into a list
labels = OneHotEncoder(sparse=False).fit_transform(labels)
print(f"original counts: {labels.sum(0)}")
# [ 38. 76. 127. 191. 282. 556. 865. 1475. 2357. 4033.]
sample_weights = stratified_random_sample_weights(labels)
sample_size = 1000
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)
print(f"rebalanced counts: {labels[chosen_indices].sum(0)}")
# [104. 107. 88. 107. 94. 118. 92. 99. 100. 91.]
Answer 14 (score: 0)
Here are my 2 cents. Assume that we have the following imbalanced dataset:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())
The first rows:
  Category  Sentiment Gender
0        C          1      M
1        B          0      M
2        B          0      M
3        B          0      M
4        A          0      M
Now assume that we want to get a balanced dataset by Sentiment:
df_grouped_by = df.groupby(['Sentiment'])
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))
df_balanced = df_balanced.droplevel(['Sentiment'])
df_balanced
print(df_balanced.head())
The first rows of the balanced dataset:
  Category  Sentiment Gender
0        C          0      F
1        C          0      M
2        C          0      F
3        C          0      M
4        C          0      M
Let's verify that it is balanced in terms of Sentiment:
df_balanced.groupby(['Sentiment']).size()
We get:
Sentiment
0 369
1 369
dtype: int64
As we can see, we end up with 369 positive and 369 negative Sentiment labels.