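I'm working through a multiple linear regression exercise using scikit-learn.
I have a dataset with column/label names, and I'm passing it through OneHotEncoder to encode the categorical labels.
I can get the coefficients, but what I really want to do is map them back to the original column names.
I'm trying to do this by getting the feature names from the ColumnTransformer.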
print(ct.get_feature_names())
['encoder__x0_California', 'encoder__x0_Florida', 'encoder__x0_New York', 'x0', 'x1', 'x2']
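As you can see above, the passthrough columns come back as x0, x1, x2.
The actual X column headings are
["R&D Spend", "Administration", "Marketing Spend", "State"]
State is the column that is one-hot encoded.
Any idea what the best approach is to see the coefficient for each feature name?
# Multiple Linear Regression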
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
o = OneHotEncoder()
ct = ColumnTransformer(transformers=[('encoder',o, [3])], remainder='passthrough')
ft = ct.fit_transform(X)
X = np.array(ft)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Test': y_test, 'Prediction': y_pred}, columns=['Test', 'Prediction'])
print(df)
# Output coefficients to dataframe with labels
print(ct.get_feature_names())
# ColumnTransformer.get_feature_names() takes no arguments, so dataset.columns is not passed here
df_coef = pd.DataFrame({'feature_names': ct.get_feature_names(),
                        'coef': np.squeeze(regressor.coef_)})
print(df_coef)