Question

When I try to do a stratified split on a (categorical) column, I get an error.

Country     ColumnA    ColumnB   ColumnC   Label
AB            0.2        0.5       0.1       14  
CD            0.9        0.2       0.6       60
EF            0.4        0.3       0.8       5
FG            0.6        0.9       0.2       15

Here is my code:

X = df.loc[:, df.columns != 'Label']
y = df['Label']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)

This gives me the following error:

ValueError: could not convert string to float: 'AB'

Answer 1

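When reproducing your code, I found that the error comes from trying to fit a linear regression model on a set of features that contains strings, not from the stratified split itself. This answer gives you some options. I would suggest using X_train, X_test = pd.get_dummies(X_train.Country), pd.get_dummies(X_test.Country) after your train_test_split() to one-hot encode the country while keeping the class balance you are looking for.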
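As a rough sketch of that suggestion (not necessarily what the answerer had in mind): the version below stratifies on the raw string column, then one-hot encodes Country after the split while keeping the numeric columns. The toy data, the X_train_enc / X_test_enc names, and the reindex step are my own assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 4 countries, 20 rows each.
df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [0.2, 0.9, 0.4, 0.6] * 20,
    'ColumnB': [0.5, 0.2, 0.3, 0.9] * 20,
    'Label': [14, 60, 5, 15] * 20,
})

X = df.drop(columns='Label')
y = df['Label']

# Stratifying on the string column works; only the model needs numeric input.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country'])

# One-hot encode Country after the split, keeping the numeric columns.
X_train_enc = pd.get_dummies(X_train, columns=['Country'], dtype=float)
X_test_enc = pd.get_dummies(X_test, columns=['Country'], dtype=float)

# Align the test columns to the training columns, in case a country
# happens to be missing from one of the two sets.
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

lm = LinearRegression()
lm.fit(X_train_enc, y_train)
lm_predictions = lm.predict(X_test_enc)
```

Encoding after the split means the two frames can end up with different dummy columns, which is why the sketch reindexes the test frame against the training columns.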

Answer 2

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
        'Country': ['AB', 'CD', 'EF', 'FG']*20,
        'ColumnA' : [1]*20*4,'ColumnB' : [10]*20*4, 'Label': [1,0,1,0]*20
    })

df['Country_Code'] = df['Country'].astype('category').cat.codes

X = df.loc[:, df.columns.drop(['Label','Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train,y_train)
lm_predictions = lm.predict(X_test)

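Convert the string values in Country to numbers and save them as a new column.
When creating the X training data, drop Label (y) as well as the string Country column.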
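If you want to confirm that the split really is stratified, a quick check like the following might help. It is only a sketch on the same toy data as above; train_share and test_share are names I made up for the comparison.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Country': ['AB', 'CD', 'EF', 'FG'] * 20,
    'ColumnA': [1] * 80, 'ColumnB': [10] * 80, 'Label': [1, 0, 1, 0] * 20,
})
df['Country_Code'] = df['Country'].astype('category').cat.codes

X = df.drop(columns=['Label', 'Country'])
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=df['Country_Code'])

# Share of each country code in the train and test sets.
train_share = X_train['Country_Code'].value_counts(normalize=True).sort_index()
test_share = X_test['Country_Code'].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

With a stratified split, the two columns should show the same (or nearly the same) proportions for every country code.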

Method 2

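If test data to predict on will arrive later, you will need a mechanism to convert its Country values to codes before making predictions. In that case, the recommended approach is LabelEncoder: use its fit method to learn the mapping from strings to labels, then use transform to encode the Country column of the test data.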

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

df = pd.DataFrame({
        'Country': ['AB', 'CD', 'EF', 'FG']*20,
        'ColumnA' : [1]*20*4,'ColumnB' : [10]*20*4, 'Label': [1,0,1,0]*20
    })

# Train-Validation 
le = preprocessing.LabelEncoder()
df['Country_Code'] = le.fit_transform(df['Country'])
X = df.loc[:, df.columns.drop(['Label','Country'])]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=df.Country_Code)
lm = LinearRegression()
lm.fit(X_train,y_train)

# Test
test_df = pd.DataFrame({'Country': ['AB'], 'ColumnA' : [1],'ColumnB' : [10] })
test_df['Country_Code'] = le.transform(test_df['Country'])
print (lm.predict(test_df.loc[:, test_df.columns.drop(['Country'])]))
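One caveat with this approach: LabelEncoder.transform raises a ValueError for a value it did not see during fit. A minimal sketch of one way to guard against that follows; the 'ZZ' country and the new_df, known and unseen names are made up for illustration, and leaving unknown countries as NaN is just one possible policy.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['AB', 'CD', 'EF', 'FG'])

# 'ZZ' is a hypothetical country that was not present when fitting.
new_df = pd.DataFrame({'Country': ['AB', 'ZZ']})

# Separate the rows the encoder knows from the ones it does not.
known = new_df['Country'].isin(le.classes_)
unseen = new_df.loc[~known, 'Country'].tolist()

# Encode only the known rows; unknown rows are left as NaN after alignment.
codes = pd.Series(le.transform(new_df.loc[known, 'Country']),
                  index=new_df.index[known])
new_df['Country_Code'] = codes

print(new_df)
print('Unseen countries:', unseen)
```

The rows with NaN codes would still need to be dropped or handled separately before calling predict.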
