Question

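I am fitting a Logistic Regression model to my data. From what I understand (for example, from here: Pandas vs. Numpy Dataframes), it is better to use numpy.ndarray with sklearn than Pandas DataFrames. This can be done with the .values attribute of the DataFrame. I did that, but I get ValueError: Specifying the columns using strings is only supported for pandas DataFrames. Clearly I am doing something wrong in my code. Any insights would be appreciated.

Interestingly, my code works when I do not use .values and simply pass X as a DataFrame and y as a Pandas Series.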

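import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# We will train our classifier with the following features:
# Numeric features to be scaled: LIMIT_BAL, AGE, PAY_X, BILL_AMTX, and PAY_AMTX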
# Categorical features: SEX, EDUCATION, MARRIAGE

# We create the preprocessing pipelines for both numeric and categorical data
numeric_features = ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 
                 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 
                 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

data['PAY_1'] = data.PAY_1.astype('float64')
data['PAY_2'] = data.PAY_2.astype('float64')
data['PAY_3'] = data.PAY_3.astype('float64')
data['PAY_4'] = data.PAY_4.astype('float64')
data['PAY_5'] = data.PAY_5.astype('float64')
data['PAY_6'] = data.PAY_6.astype('float64')
data['AGE'] = data.AGE.astype('float64')


numeric_transformer = Pipeline(steps=[
('scaler', MinMaxScaler())
])

categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(categories='auto'))
])

preprocessor = ColumnTransformer(
transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

y = data['default'].values
X = data.drop('default', axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
random_state=10, stratify=y)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
lr = Pipeline(steps=[('preprocessor', preprocessor),
                 ('classifier', LogisticRegression(solver='liblinear'))])

param_grid_lr = {
'classifier__C': np.logspace(-5, 8, 15)
}

lr_cv = GridSearchCV(lr, param_grid_lr, cv=10, iid=False)

lr_cv.fit(X_train, y_train)

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Answer 1

You are using ColumnTransformer as if you had a DataFrame, but you don't have one. From the documentation:

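columns : string or int, array-like of string or int, slice, boolean mask array or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where the transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above.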

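If you pass strings for the columns, you need to pass a DataFrame. If you want to use a numpy array, you might not need the transformer in the first place, and you need to specify integers instead of strings as column indices.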
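The two options can be sketched as follows. This is a minimal illustration, not the asker's actual setup: the toy `data` frame, its reduced column set, and the `build_pipeline` helper are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for the asker's `data` frame.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "LIMIT_BAL": rng.uniform(1e4, 5e5, n),
    "AGE": rng.integers(21, 70, n).astype("float64"),
    "SEX": rng.integers(1, 3, n),
    "EDUCATION": rng.integers(1, 5, n),
    "default": rng.integers(0, 2, n),
})
numeric_features = ["LIMIT_BAL", "AGE"]
categorical_features = ["SEX", "EDUCATION"]


def build_pipeline(num_cols, cat_cols):
    """Hypothetical helper: preprocessing plus classifier for the given columns."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", MinMaxScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(solver="liblinear")),
    ])


X_df = data.drop("default", axis=1)
y = data["default"].values

# Option 1: keep X as a DataFrame, so string column names can be resolved.
model_df = build_pipeline(numeric_features, categorical_features).fit(X_df, y)

# Option 2: pass a NumPy array and select columns by integer position.
num_idx = [X_df.columns.get_loc(c) for c in numeric_features]
cat_idx = [X_df.columns.get_loc(c) for c in categorical_features]
X_np = X_df.values
model_np = build_pipeline(num_idx, cat_idx).fit(X_np, y)

# Mixing the two (array input, string column names) raises the ValueError.
try:
    build_pipeline(numeric_features, categorical_features).fit(X_np, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Option 1 is usually the simpler choice, since the column lists stay readable and do not depend on column order. Option 2 works only as long as the integer positions match the layout of the array being passed in.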

Using numpy.ndarray vs. Pandas DataFrame in sklearn's .fit() method
