Question

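I have a dataset with more than 50 columns, and I'd like to use a loop to drop the features that have a low correlation with the target, so that I don't have to drop them manually.

I tried: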

for feature in df:
        if df[feature].corr() < threshold: df.drop(feature, axis=1, inplace=True)

...which obviously doesn't work. I'm very new to Python.

Any suggestions would be greatly appreciated.

Answer 1

Assuming the target is in df['y']:

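import numpy as np
import pandas as pd

# Example data: 'a' and 'c' track the target exactly, 'b' and 'd' are random noise
df = pd.DataFrame({
    'a': range(500),
    'b': np.random.randint(0, 500, 500),
    'c': range(500),
    'd': np.random.randint(0, 500, 500),
    'y': range(500)})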

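threshold = 0.5
# Loop over a separate list of feature names so columns can be deleted safely inside the loop
for feature in [c for c in df.columns if c != 'y']:
    # Compare the absolute value so that strong negative correlations are kept too
    if abs(df[feature].corr(df['y'])) < threshold:
        del df[feature]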

df.head()

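With this data, the random columns b and d have a near-zero correlation with y and are dropped, so df.head() shows only a, c and y.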
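The same filtering can also be done without an explicit loop. The following is only a sketch, not part of the original answer: the helper name drop_low_corr is made up for illustration, it assumes the target column is numeric, and it only considers numeric feature columns. It relies on DataFrame.corrwith to compute every feature's correlation with the target in one call.

```python
import numpy as np
import pandas as pd


def drop_low_corr(df, target, threshold=0.5):
    """Return a copy of df without the numeric features whose absolute
    Pearson correlation with the target column is below threshold.

    Hypothetical helper for illustration. Non-numeric columns are left
    untouched, and a feature whose correlation is NaN (for example a
    constant column) is kept, because NaN < threshold is False.
    """
    # Candidate features: every numeric column except the target
    features = df.drop(columns=[target]).select_dtypes(include="number")
    # Absolute correlation of each feature with the target, as a Series
    corr = features.corrwith(df[target]).abs()
    # Names of the weakly correlated features
    weak = corr[corr < threshold].index
    # drop returns a new DataFrame, so the input is not modified
    return df.drop(columns=weak)


# Same shape of data as in the answer, with a seeded generator for repeatability
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'a': range(500),
    'b': rng.integers(0, 500, 500),
    'c': range(500),
    'd': rng.integers(0, 500, 500),
    'y': range(500)})

result = drop_low_corr(df, 'y', threshold=0.5)
print(list(result.columns))
```

Unlike the loop, this version returns a new DataFrame and leaves df unchanged, which makes it easier to try several thresholds on the same data.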
