Question

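I am working on the well-known Kaggle "House Prices" challenge. I want to train on my dataset with LinearRegression from sklearn.linear_model.

After reading the following article: https://developers.google.com/machine-learning/crash-course/representation/feature-engineering

I wrote a function that converts all string values in the train DataFrame to lists. For example, a feature whose raw values are [Ex, Gd, Ta, Po] looks like this after conversion: [1,0,0,0] [0,1,0,0] [0,0,1,0] [0,0,0,1].

When I try to train on the data, I get the following error: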

Traceback (most recent call last):
  File "C:/Users/Owner/PycharmProjects/HousePrices/main.py", line 27, in <module>
    linereg.fit(train_df, target)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 567, in check_array
    array = array.astype(np.float64)
ValueError: setting an array element with a sequence.

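This only happens once I have converted some of the columns as described above.

Is there a way to train a linear regression model on features whose values are vectors?

Here is my conversion function: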

def feature_to_boolean_vector(df, feature_name, new_name):
    vectors_list = [] #each tuple will represent an option
    feature_options = df[feature_name].unique()
    feature_options_length = len(feature_options)

    # creating a list the size of feature_options_length, all 0's
    list_to_be_vector = [0 for i in range(feature_options_length)]

    for i in range(feature_options_length):
        list_to_be_vector[i] = 1 # inserting 1 representing option number i
        vectors_list.append(list_to_be_vector.copy())
        list_to_be_vector[i] = 0

    mapping = dict(zip(feature_options, vectors_list)) # dict from values to vectors
    df[new_name] = df[feature_name].map(mapping)
    df.drop([feature_name], axis=1, inplace=True)

Here is my training attempt (after preprocessing):

from sklearn.linear_model import LinearRegression

linereg = LinearRegression()
linereg.fit(train_df, target)

Thanks.

Answer 1

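LinearRegression does not support list-valued features. scikit-learn converts the input to a 2-D float array, and a cell that holds a list cannot be cast to a single float, which is what raises the ValueError. I see you are using one-hot encoding; each dimension of the vector can be used as its own feature column instead. Alternatively, pandas offers a simpler way to do this: pd.get_dummies.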

print(df['feature'])
0    Ex
1    Gd
2    Ta
3    Po
Name: feature, dtype: object

df = pd.get_dummies(df['feature'])
print(df)
   Ex  Gd  Po  Ta
0   1   0   0   0
1   0   1   0   0
2   0   0   0   1
3   0   0   1   0
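
To tie this back to the question, here is a minimal sketch of the whole flow on toy data. The column names LotArea and KitchenQual and all the values are made up for illustration, not taken from your dataset. It assumes a reasonably recent pandas and scikit-learn. Passing dtype=int is my addition so the dummies come out as 0/1 even on pandas versions that default to booleans.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the training data; names and values are hypothetical.
train_df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "KitchenQual": ["Gd", "Ta", "Gd", "Ex", "Po", "Ta"],
})
target = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])

# Option 1: one-hot encode the string column directly.
# Each category becomes its own 0/1 column, so every cell is a scalar.
encoded = pd.get_dummies(train_df, columns=["KitchenQual"], dtype=int)

# Option 2: if a column already holds lists, spread each position
# of the list into a separate column.
vectors = pd.Series([[1, 0], [0, 1], [1, 0]], name="vec")
expanded = pd.DataFrame(vectors.tolist(), index=vectors.index).add_prefix("vec_")

# With only scalar cells, fit no longer hits the ValueError.
linereg = LinearRegression()
linereg.fit(encoded, target)
print(encoded.columns.tolist())
```

Option 1 is usually enough on its own. Option 2 is only useful if you want to keep your existing conversion function and flatten its output afterwards.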
