I am trying to split my dataset into training and test sets using the train_test_split
function from scikit-learn, but I am getting this error:
In [1]: y.iloc[:,0].value_counts()
Out[1]:
M2 38
M1 35
M4 29
M5 15
M0 15
M3 15
In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]:
Traceback (most recent call last):
File "run_ok.py", line 48, in <module>
xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
train, test = next(cv.split(X=arrays[0], y=stratify))
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
for train, test in self._iter_indices(X, y, groups):
File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
However, all classes have at least 15 samples. Why am I getting this error?
X is a pandas DataFrame representing the data points, and y is a pandas DataFrame with one column containing the target variable.
I cannot post the original data because it is proprietary, but the problem is fairly reproducible by creating a random pandas DataFrame (X) of 1,000 rows × 500 columns and a random pandas DataFrame (y) with the same number of rows (1,000) holding the target variable (a categorical label) for each row. The y DataFrame should have several distinct categorical labels (e.g. 'class1', 'class2', ...), each occurring at least 15 times.
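For reference, the exact ValueError in the traceback fires whenever the least populated class in the stratify array has a single member; a minimal sketch with made-up data (all names here are illustrative, not the proprietary dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up stand-in data: "class2" occurs only once
X = pd.DataFrame(np.random.rand(10, 3))
y = pd.Series(["class1"] * 9 + ["class2"])

try:
    train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
except ValueError as e:
    msg = str(e)
    print(msg)  # mentions the least populated class having only 1 member
```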
Answer 0 (score: 7)
Remove stratify=y while splitting the training and test data.
Hope this helps!
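As a sketch of this suggestion (made-up data; note that dropping stratify also gives up the class-balance guarantee), without stratify the split never inspects class counts, so even a singleton class passes:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up data: "class2" has a single row, so stratify=y would fail
X = pd.DataFrame(np.random.rand(9, 2))
y = pd.Series(["class1"] * 8 + ["class2"])

# without stratify, class counts are never checked
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
```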
Answer 1 (score: 2)
The problem is that train_test_split takes arrays as input, but here y is a one-column matrix (a DataFrame). Passing only the first column of y makes it work:
xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:, 0], test_size=1/3,
                                                random_state=85, stratify=y.iloc[:, 0])
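Selecting a single column with .iloc yields a 1-d pandas Series, which stratify handles as expected; a sketch with made-up data mirroring the question's one-column y:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up one-column target DataFrame
X = pd.DataFrame({"feature": range(10)})
y = pd.DataFrame({"target": ["M1"] * 5 + ["M2"] * 5})

# y.iloc[:, 0] is a 1-d Series, so stratification works per label
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y.iloc[:, 0], test_size=0.2, random_state=85, stratify=y.iloc[:, 0])
```

With two equally sized classes and a test set of two rows, the stratified split puts one row of each class in the test set.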
Answer 2 (score: 0)
Try it this way; it also worked for me, and it is also mentioned here:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)
Answer 3 (score: 0)
The point is that if you use stratified CV, you get this warning whenever the number of splits cannot produce CV folds that all keep the same ratio of every class. E.g., if you have 2 samples of one class and 5 folds, there will be 2 folds containing one such sample each and 3 folds containing 0, so the ratio of that class is not equal across all folds. But it only becomes a problem when some fold has 0 samples of a class, so if you have at least as many samples per class as the number of CV splits (5 in this case), this warning does not appear.
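A sketch of that situation with made-up labels: the smallest class has 2 members while we request 5 folds, so StratifiedKFold still runs but emits the "least populated class" warning, since 3 of the 5 test folds get 0 samples of that class:

```python
import warnings
import numpy as np
from sklearn.model_selection import StratifiedKFold

# made-up labels: class "a" has only 2 members, fewer than n_splits=5
X = np.zeros((12, 1))
y = np.array(["a"] * 2 + ["b"] * 10)

skf = StratifiedKFold(n_splits=5)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    folds = list(skf.split(X, y))  # runs, but warns about class "a"
```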
Answer 4 (score: 0)
Following on from user2340939's answer: if you really need a stratified train-test split even when some classes have very few rows, you can try the approach below. I usually use the same approach, copying all rows of such classes into both the train and the test datasets.
from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
    For any class with fewer rows than the min_required_rows corresponding
    to the input test_size, all rows associated with that class get a copy
    in both the train and the test splits.

    Example: if test_size is 0.2 (in other words, 20%),
    min_required_rows = 5 (obtained from 1 / test_size, i.e. 1 / 0.2),
    and the resulting splits will have 4 train rows (80%) and 1 test row (20%).
    """
    id_col = "id"
    temp_col = "same-class-rows"

    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])

    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)

    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()

    # notice: this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)

    # the under-populated rows get copied into both splits
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()

    df.drop([temp_col], axis=1, inplace=True)

    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)

    print(f"number of rows in the original dataset: {len(df)}")

    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print(f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")

    return train_df, test_df
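The same idea can be condensed into a self-contained sketch (all data and column names here are made up for illustration): stratify only on the well-populated classes, then copy the rare-class rows into both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up toy data: class "rare" is too small to stratify on its own
df = pd.DataFrame({
    "id": range(12),
    "label": ["a"] * 5 + ["b"] * 5 + ["rare"] * 2,
})

test_size = 0.25
min_required = 1 / test_size  # 4 rows needed per class for this test_size

counts = df["label"].value_counts()
rare = df[df["label"].map(counts) < min_required]
common = df[df["label"].map(counts) >= min_required]

# stratified split over the well-populated classes only
train_df, test_df = train_test_split(
    common, test_size=test_size, random_state=43, stratify=common["label"])

# copy the rare rows into both splits
train_df = pd.concat([train_df, rare])
test_df = pd.concat([test_df, rare])
```

The trade-off of this design is deliberate leakage: rare-class rows appear in both splits, so metrics on those classes are optimistic.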