Question

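I'm doing some machine learning on Kaggle (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#description) and have been given a train CSV and a test CSV.

I want to drop the columns in which at least 30% of the values are null. If I were doing this for my training set only, I would do: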

import pandas as pd

train_df = pd.read_csv("train.csv")
train_len = len(train_df)
test_df = pd.read_csv("test.csv")
# Keep only columns with at least 70% non-null values
threshold = int(0.7 * train_len)
train_df.dropna(axis=1, thresh=threshold, inplace=True)

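This works. However, I also want to drop those columns from my test set. Specifically, I want to find the columns that have 30% or more nulls in the training set and drop them from both the training set and the test set.

I'm thinking of combining my DataFrames: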

combined_df = pd.concat([train_df, test_df], axis=0)

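If more than 30% of the values in a column of combined_df[:train_len, :] are null, drop that column from combined_df.

How can I do this? To be clear, I don't want to loop over the train set, find the columns with more than 30% nulls, drop them from train, and then drop them from test.

Thanks!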

Answer 1

After dropping the columns from train_df, just select the same columns in test_df using the remaining column names:

train_df.dropna(axis=1, thresh=threshold, inplace=True)
test_df = test_df[train_df.columns]
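One caveat: a Kaggle training file usually carries a target column that the test file lacks (SalePrice in this competition), in which case test_df[train_df.columns] would raise a KeyError. A minimal sketch of one way around that, assuming small made-up frames in place of train.csv and test.csv, is to keep only the surviving columns that exist in both frames:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for train.csv and test.csv (not the real data)
train_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "B": [2.0, np.nan, 4.0, np.nan, np.nan, 6.0, np.nan, 8.0, np.nan, 10.0],
    "SalePrice": list(range(10)),
})
test_df = pd.DataFrame({"A": [5.0, 6.0], "B": [np.nan, 7.0]})

# Same rule as in the question: keep columns with at least 70% non-null values
threshold = int(0.7 * len(train_df))
train_df = train_df.dropna(axis=1, thresh=threshold)

# Select only the surviving columns that the test set actually has
kept = train_df.columns.intersection(test_df.columns)
test_df = test_df[kept]
```

Here B is half null in the training frame, so it is dropped from both frames, while SalePrice stays in train_df only.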


Answer 2

train_df.columns[train_df.isnull().sum()/len(train_df)>0.3]
Out[1391]: Index(['B'], dtype='object')

combined_df.loc[:,train_df.isnull().sum()/len(train_df)>0.3]
Out[1394]: 
     B
0  2.0
1  NaN
2  4.0
3  NaN

Data input

     A    B
0  1.0  2.0
1  NaN  NaN
2  3.0  4.0
3  4.0  NaN
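The expressions above only select the columns that exceed the threshold. A minimal sketch of turning that selection into an actual drop on both frames, assuming train_df is the sample data above and test_df is a made-up frame with the same columns:

```python
import numpy as np
import pandas as pd

# Sample data from this answer, used as the training set
train_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [2.0, np.nan, 4.0, np.nan],
})
# Hypothetical test set, invented for illustration
test_df = pd.DataFrame({"A": [5.0, np.nan], "B": [6.0, 7.0]})

# Fraction of nulls per column, measured on the training set only
null_frac = train_df.isnull().mean()

# Names of the columns whose null fraction exceeds 30%
to_drop = null_frac[null_frac > 0.3].index

# Drop the same columns from both frames; errors="ignore" skips
# any name that is missing from the test set
train_df = train_df.drop(columns=to_drop)
test_df = test_df.drop(columns=to_drop, errors="ignore")
```

The comparison uses > 0.3 to match the answer; for the "30% or more" rule stated in the question, >= 0.3 would be the closer fit.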

Pandas: drop columns if a submatrix has enough nulls
