How do I align pandas get_dummies across training and test data?

Asked: 2019-06-24 19:17:10

Tags: python pandas

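A previous question helped me realize that I can split my data into training and validation sets. Here is the code I use to load the training and test data.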

import numpy as np
import pandas as pd

def load_data(datafile):
    training_data = pd.read_csv(datafile, header=0, low_memory=False)
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)

    training_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    training_x.fillna(training_x.mean(), inplace=True)
    training_x.fillna(0, inplace=True)
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns

    training_x = pd.get_dummies(training_x, columns=categorical_data)
    return training_x, training_y

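datafile is my training file. I have another file, test.csv, which has the same columns as the training file but may be missing some categories. How can I run get_dummies on the test file and make sure the categories are encoded the same way?

Also, my test data has no job_performance column. How do I handle that in the function?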
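
To make the problem concrete, here is a minimal sketch of the mismatch. The two toy frames and the column name color are made up for illustration and are not taken from the actual files.

```python
import pandas as pd

# Hypothetical toy frames standing in for the training and test files.
train = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'x': [4.0, 5.0], 'color': ['red', 'purple']})

train_dummies = pd.get_dummies(train, columns=['color'])
test_dummies = pd.get_dummies(test, columns=['color'])

# Each call only sees the values in its own frame, so the column sets differ.
train_cols = list(train_dummies.columns)
test_cols = list(test_dummies.columns)
print(train_cols)
print(test_cols)
```

A model fitted on the training columns cannot consume the test frame directly, because the test frame lacks some training columns and has one the model never saw.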

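2 Answers:

Answer 0: (score: 2)

When you are dealing with several columns, it is best to use sklearn.preprocessing.OneHotEncoder. It keeps track of your categories and, with handle_unknown='ignore', handles unknown categories gracefully.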

sys.version
# '3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) \n[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]'
sklearn.__version__
# '0.20.0'
np.__version__
# '1.15.0'
pd.__version__
# '0.24.2'

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'data': [1, 2, 3],
    'cat1': ['a', 'b', 'c'],
    'cat2': ['dog', 'cat', 'bird']
})

ohe = OneHotEncoder(handle_unknown='ignore')
categorical_columns = df.select_dtypes(['category', object]).columns
dummies = pd.DataFrame(ohe.fit_transform(df[categorical_columns]).toarray(), 
                       index=df.index, 
                       dtype=int)

df_ohe = pd.concat([df.drop(categorical_columns, axis=1), dummies], axis=1)
df_ohe

   data  0  1  2  3  4  5
0     1  1  0  0  0  0  1
1     2  0  1  0  0  1  0
2     3  0  0  1  1  0  0

You can inspect the categories and their order:

ohe.categories_
# [array(['a', 'b', 'c'], dtype=object),
#  array(['bird', 'cat', 'dog'], dtype=object)]

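Now, to reproduce this encoding on new data, all we need are the categories from before. There is no need to pickle or unpickle any model here.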

df2 = pd.DataFrame({
    'data': [1, 2, 1],
    'cat1': ['b', 'a', 'b'],
    'cat2': ['cat', 'dog', 'cat']
})

# Reuse the categories learned above so the output columns line up.
ohe2 = OneHotEncoder(categories=ohe.categories_)

dummies = pd.DataFrame(ohe2.fit_transform(df2[categorical_columns]).toarray(), 
                       index=df2.index, 
                       dtype=int)
pd.concat([df2.drop(categorical_columns, axis=1), dummies], axis=1)

   data  0  1  2  3  4  5
0     1  0  1  0  0  1  0
1     2  1  0  0  0  0  1
2     1  0  1  0  0  1  0
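
The example above only contains categories that were seen before. As a hedged sketch of the "unknown category" case, the snippet below fits an encoder with handle_unknown='ignore' and then transforms a frame containing a value it has never seen. The frames and the value 'z' are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'cat1': ['a', 'b', 'c']})
new = pd.DataFrame({'cat1': ['b', 'z']})  # 'z' was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# With handle_unknown='ignore', an unseen value becomes a row of zeros
# instead of raising an error.
encoded = enc.transform(new).toarray()
print(encoded)
```

With the default handle_unknown='error', the same data would raise a ValueError instead, so the option matters whenever the test data may contain new values.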

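What does that mean for you? You will want to change your function so that it works for both training and test data. Add an extra parameter, categories, to your function: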

def load_data(datafile, categories=None):
    data = pd.read_csv(datafile, header=0, low_memory=False)
    # The test file has no target column, so only split it off when present.
    if 'job_performance' in data.columns:
        data_y = data[['job_performance']]
        data_x = data.drop(['job_performance'], axis=1)
    else:
        data_x = data
        data_y = None

    data_x.replace([np.inf, -np.inf], np.nan, inplace=True)
    # numeric_only=True keeps mean() from failing on string columns in newer pandas.
    data_x.fillna(data_x.mean(numeric_only=True), inplace=True)
    data_x.fillna(0, inplace=True)

    # Training call: learn the categories ('auto'). Test call: reuse the given ones.
    ohe = OneHotEncoder(handle_unknown='ignore',
                        categories='auto' if categories is None else categories)

    categorical_data = data_x.select_dtypes(object)
    dummies = pd.DataFrame(
                ohe.fit_transform(categorical_data.astype(str)).toarray(),
                index=data_x.index,
                dtype=int)

    data_x = pd.concat([
        data_x.drop(categorical_data.columns, axis=1), dummies], axis=1)

    # The learned categories are returned only on the training call.
    return (data_x, data_y) + ((ohe.categories_, ) if categories is None else ())

Your function can then be called like this:

# Load training data.
X_train, y_train, categories = load_data('train.csv')
...
# Load test data.
X_test, y_test = load_data('test.csv', categories=categories)

and the code should work as expected.
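
A possible variation, not part of the answer above, is to pass the fitted encoder itself rather than its categories, so that fit happens once on the training data and only transform runs on the test data. The helper encode and the toy frames below are hypothetical names used for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def encode(frame, cat_cols, encoder=None):
    """One-hot encode cat_cols; fit a new encoder only when none is passed in."""
    if encoder is None:
        encoder = OneHotEncoder(handle_unknown='ignore')
        encoder.fit(frame[cat_cols].astype(str))
    dummies = pd.DataFrame(
        encoder.transform(frame[cat_cols].astype(str)).toarray().astype(int),
        index=frame.index)
    encoded = pd.concat([frame.drop(cat_cols, axis=1), dummies], axis=1)
    return encoded, encoder

train = pd.DataFrame({'data': [1, 2, 3], 'cat1': ['a', 'b', 'c']})
test = pd.DataFrame({'data': [4, 5], 'cat1': ['c', 'd']})  # 'd' is unseen

X_train, enc = encode(train, ['cat1'])
X_test, _ = encode(test, ['cat1'], encoder=enc)
print(X_test)
```

Both frames end up with the same columns in the same order, and the unseen value 'd' is encoded as all zeros.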

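Answer 1: (score: 2)

If you want to use pandas get_dummies, you have to manually add columns for values that appear in the training data but not in the test data, and drop columns for values that appear in the test data but not in the training data.

You can rely on the dummy column names (by default "origcolumn_value") and use separate functions for training and testing.

Something along these lines (not tested):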

def load_and_clean(datafile_path, labeled=False):
    data = pd.read_csv(datafile_path, header=0, low_memory=False)

    if labeled:
        job_performance = data['job_performance']
        data = data.drop(['job_performance'], axis=1)

    data.replace([np.inf, -np.inf], np.nan, inplace=True)
    data.fillna(data.mean(numeric_only=True), inplace=True)
    data.fillna(0, inplace=True)

    if labeled:
        data['job_performance'] = job_performance

    return data

def dummies_train(training_data):
    training_y = training_data[['job_performance']]
    training_x = training_data.drop(['job_performance'], axis=1)
    categorical_data = training_x.select_dtypes(
        include=['category', object]).columns
    training_x = pd.get_dummies(training_x, columns=categorical_data)
    return training_x, training_y, training_x.columns

def dummies_test(test_data, model_columns):
    categorical_data = test_data.select_dtypes(
        include=['category', object]).columns
    test_data = pd.get_dummies(test_data, columns=categorical_data)
    for c in model_columns:
        if c not in test_data.columns:
            test_data[c] = 0
    return test_data[model_columns]

training_x, training_y, model_columns = dummies_train(load_and_clean(<train_data_path>, labeled=True))
test_x = dummies_test(load_and_clean(<test_data_path>), model_columns)
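
The column alignment done by the loop in dummies_test can also be written with DataFrame.reindex. The following is a minimal sketch on made-up toy frames, not a drop-in replacement for the functions above.

```python
import pandas as pd

# Hypothetical toy frames; 'purple' appears only in the test data.
train = pd.DataFrame({'x': [1, 2, 3], 'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'x': [4, 5], 'color': ['red', 'purple']})

train_x = pd.get_dummies(train, columns=['color'], dtype=int)
test_x = pd.get_dummies(test, columns=['color'], dtype=int)

# Align to the training layout: columns missing from the test data are
# added and filled with 0, and columns unseen in training are dropped.
test_x = test_x.reindex(columns=train_x.columns, fill_value=0)
print(test_x)
```

After the reindex, test_x has exactly the training columns in the training order, so it can be passed to a model fitted on train_x.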