Question

I'm reading in two CSVs: one contains the data, and the other is used to carry over the data types from another notebook.

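I'm using the dtypes to separate numeric from categorical features, and all of the fields are numeric. I'll be dropping and adding columns frequently, so a static list isn't a great option.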

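At one point this code did set the columns to object, but for some reason my notebook has been getting stricter. For example, on the same dataset I now have to use .info(verbose=True, null_count=True) where before I only needed .info().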

The dtypes CSV looks like:

Column	Dtype
field1	float64
field2	float64
field3	int64
field4	object

Read-in code:

for rows, cols in data_types.iterrows():
    if data_types.iloc[rows].Dtype == 'int64':
        train_test_df[cols[0]] = train_test_df[cols[0]].astype(np.int64)
    elif data_types.iloc[rows].Dtype == 'float64':
        train_test_df[cols[0]] = train_test_df[cols[0]].astype(float)
    elif data_types.iloc[rows].Dtype == 'object':
        train_test_df[cols[0]] = train_test_df[cols[0]].astype(object)

Later I need to split this into numeric and categorical features.

    categorical_features = df.select_dtypes(include = ["object"]).columns
    numeric_features = df.select_dtypes(exclude = ["object"]).columns
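To illustrate the split I'm after, here is a minimal sketch on a made-up frame. The values and the toy frame itself are assumptions for illustration; only the column names follow the dtypes CSV above.

```python
import pandas as pd

# Hypothetical toy data: every field holds numbers, as in the real dataset.
df = pd.DataFrame({
    "field1": [1.0, 2.5],
    "field3": [1, 2],
    "field4": [10, 20],
})

# field4 holds numeric codes that should be treated as categorical,
# so it is cast to object to mark it as such.
df["field4"] = df["field4"].astype(object)

categorical_features = df.select_dtypes(include=["object"]).columns
numeric_features = df.select_dtypes(exclude=["object"]).columns

print(list(categorical_features))  # ['field4']
print(list(numeric_features))      # ['field1', 'field3']
```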

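I tried using a complex dtype, but sklearn PCA doesn't like that dtype.

I also tried setting the numbers to strings, but that quickly got me into trouble as well.

Any ideas on an alternative dtype or approach that would let me flexibly drop and re-add columns without a lot of overhead?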

Answer 1

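I was able to work around this, but it's janky, and I'm still not sure why it's needed, given that the code can:

see that row == object (i.e. the condition triggers correctly)
run the exact same code in a separate code cell (with the df name updated) (i.e. the code inside the condition executes as expected)

The workaround is to collect all the object columns in a list and set their dtype later in the cell.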

import numpy as np
import pandas as pd

# load cleaned data from earlier notebook (exported to csv)

train_test_df = pd.read_csv('train_test_cleaned_w_added_features_TRIMMED.csv', header = 0)
# drop the first column, then the row labelled 0
train_test_df = train_test_df.iloc[:,1:].drop(index=0)

# reset datatypes to match earlier notebook (exported to csv)

data_types = pd.read_csv('dtypes_TRIMMED.csv', names = ['Column',"Dtype"])

cat = []
for rows, cols in data_types.iterrows():
    if data_types.iloc[rows].Dtype == 'object':

        #insert janky workaround here
        cat.append(data_types.iloc[rows].Column)
    elif data_types.iloc[rows].Dtype == 'int64':
        train_test_df[cols[0]] = train_test_df[cols[0]].astype(np.int64)
    elif data_types.iloc[rows].Dtype == 'float64':
        train_test_df[cols[0]] = train_test_df[cols[0]].astype(float)

train_test_df = train_test_df.reset_index(drop=True)

#execute janky workaround 
train_test_df[cat] = train_test_df[cat].astype(object)
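
A possible alternative to the loop, offered as a sketch rather than an explanation of the behaviour above: build a column-to-dtype mapping from the dtypes CSV and pass it to astype in one call. The in-memory CSV contents below are made-up stand-ins for the two files, and dtype_map is a hypothetical name.

```python
import io
import pandas as pd

# Stand-ins for the two CSV files (hypothetical contents, for illustration only).
data_csv = io.StringIO("field1,field2,field3,field4\n1.0,2.0,3,7\n4.0,5.0,6,8\n")
dtypes_csv = io.StringIO("field1,float64\nfield2,float64\nfield3,int64\nfield4,object\n")

train_test_df = pd.read_csv(data_csv)
data_types = pd.read_csv(dtypes_csv, names=["Column", "Dtype"])

# Map each column name to its dtype string, keeping only columns that are
# still present, so dropped columns are skipped instead of raising an error.
dtype_map = {
    col: dtype
    for col, dtype in zip(data_types["Column"], data_types["Dtype"])
    if col in train_test_df.columns
}

# astype accepts a {column: dtype} mapping and converts all columns at once.
train_test_df = train_test_df.astype(dtype_map)

print(train_test_df.dtypes)
```

Because the mapping is filtered against the columns actually in the frame, dropping or re-adding a column should not require editing this code, only the dtypes CSV.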

Force fields to object dtype
