
Time: 2017-04-03 08:00:45

Tags: python scikit-learn train-test-split


In [1]: y.iloc[:,0].value_counts()
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Traceback (most recent call last):
  File "", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.


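X is a pandas DataFrame representing the data points, and y is a pandas DataFrame with one column that contains the target variable.

I cannot post the original data because it is proprietary, but the problem is fairly reproducible: create a random pandas DataFrame (X) with 1k rows × 500 columns and a random pandas DataFrame (y) with the same number of rows (1k) that holds, for each row, the target variable (a categorical label). The y DataFrame should contain several distinct categorical labels (e.g. 'class1', 'class2', ...), and each label should occur at least 15 times.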
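The setup described above can be approximated with synthetic data. The following is a minimal sketch, not the original data: the label names class1 to class6, the column name "label", the seed, and the helper name make_synthetic_data are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def make_synthetic_data(n_rows=1000, n_cols=500, n_classes=6, min_per_class=15, seed=0):
    """Build a random feature frame X and a one-column label frame y.

    Every label is guaranteed to occur at least `min_per_class` times.
    """
    rng = np.random.default_rng(seed)
    labels = [f"class{i + 1}" for i in range(n_classes)]

    # Start with the guaranteed minimum for each label, then fill the rest at random.
    guaranteed = np.repeat(labels, min_per_class)
    filler = rng.choice(labels, size=n_rows - guaranteed.size)
    target = np.concatenate([guaranteed, filler])
    rng.shuffle(target)

    X = pd.DataFrame(rng.random((n_rows, n_cols)))
    y = pd.DataFrame({"label": target})
    return X, y


X, y = make_synthetic_data()
print(y["label"].value_counts())
```

Whether a split on this data raises the error may depend on the scikit-learn version and on how many columns y actually has, so treat it as a starting point for a reproduction rather than a guaranteed one.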

5 answers:

Answer 0: (score: 7)

Remove stratify=y when splitting the training and test data.



Answer 1: (score: 2)


xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:,1], test_size=1/3,
                                                random_state=85, stratify=y.iloc[:,1])
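
One way to see why passing a single column can help: in recent scikit-learn versions, when the object given to stratify has more than one column, each distinct row is treated as its own class. The sketch below assumes y carries an extra, unique column (the hypothetical sample_id) next to the label. The original y is not shown in the question, so this is only an illustration on made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 120

# Hypothetical data: 6 labels with 20 rows each, plus a unique id per row.
labels = np.repeat(["M0", "M1", "M2", "M3", "M4", "M5"], 20)
X = pd.DataFrame(rng.random((n_rows, 5)))
y = pd.DataFrame({"sample_id": np.arange(n_rows), "label": labels})

# Stratifying on the whole two-column frame: every (sample_id, label) row is
# unique, so each "class" has a single member and the split is rejected.
try:
    train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Stratifying on the label column alone succeeds.
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y["label"], test_size=1/3, random_state=85, stratify=y["label"]
)
print(ytest.value_counts())
```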

Answer 2: (score: 0)


x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.33, random_state=42)

Answer 3: (score: 0)



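Answer 4: (score: 0)

Continuing from user2340939's answer: if you really need to stratify your train-test split even though some classes have only a few rows, you can try the following approach. I generally use it myself; it copies all rows of such classes into both the training and test datasets.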

from sklearn.model_selection import train_test_split

def get_min_required_rows(test_size=0.2):
    return 1 / test_size

def make_stratified_splits(df, y_col="label", test_size=0.2):
    """
    For any class with fewer rows than the min_required_rows implied by the input test_size,
    all rows of that class are copied into both the train and test splits.
    Example: if test_size is 0.2 (i.e., 20%),
    min_required_rows = 5 (obtained from 1 / test_size, i.e., 1 / 0.2),
    so that a class with 5 rows yields 4 train rows (80%) and 1 test row (20%).
    """
    id_col = "id"  # df is expected to have a unique "id" column
    temp_col = "same-class-rows"  # temporary column holding each row's class size
    class_to_counts = df[y_col].value_counts()
    df[temp_col] = df[y_col].apply(lambda y: class_to_counts[y])
    min_required_rows = get_min_required_rows(test_size)
    copy_rows = df[df[temp_col] < min_required_rows].copy(deep=True)
    valid_rows = df[df[temp_col] >= min_required_rows].copy(deep=True)
    X = valid_rows[id_col].tolist()
    y = valid_rows[y_col].tolist()
    # notice, this train_test_split is a stratified split
    X_train, X_test, _, _ = train_test_split(X, y, test_size=test_size, random_state=43, stratify=y)
    X_test = X_test + copy_rows[id_col].tolist()
    X_train = X_train + copy_rows[id_col].tolist()
    df.drop([temp_col], axis=1, inplace=True)
    test_df = df[df[id_col].isin(X_test)].copy(deep=True)
    train_df = df[df[id_col].isin(X_train)].copy(deep=True)
    print(f"number of rows in the original dataset: {len(df)}")
    test_prop = round(len(test_df) / len(df) * 100, 2)
    train_prop = round(len(train_df) / len(df) * 100, 2)
    print(f"number of rows in the splits: {len(train_df)} ({train_prop}%), {len(test_df)} ({test_prop}%)")
    return train_df, test_df
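
The threshold used in the function above is 1 / test_size: a class needs about that many rows for a stratified split to place at least one of them in the test set. As a quick pre-check before splitting, the following sketch lists the classes that fall below that threshold. The helper name classes_below_threshold and the sample labels are hypothetical and are not part of the answer's code.

```python
from collections import Counter


def classes_below_threshold(labels, test_size=0.2):
    """Return {label: count} for classes with fewer than 1 / test_size rows."""
    min_required_rows = 1 / test_size
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_required_rows}


# Hypothetical labels: "a" has 6 rows, "b" has 3 rows, "c" has 1 row.
labels = ["a"] * 6 + ["b"] * 3 + ["c"]
small = classes_below_threshold(labels, test_size=0.2)
print(small)  # {'b': 3, 'c': 1}
```

Classes reported here are the ones the function above would copy into both splits, so a non-empty result is a hint to check whether that duplication is acceptable for your evaluation.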