scikit-learn: ColumnTransformer and OneHotEncoder: how to report errors for all new categorical levels across all fields?

Time: 2019-01-15 14:44:03

Tags: python scikit-learn

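I am trying to use scikit-learn's ColumnTransformer class both as an actual DataFrame transformer and as a "monitoring" transformer, i.e. an object that watches for new levels appearing in the categorical features of my dataset.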

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Original DataFrame off of which transformers are fit
orig_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'a'],
        'b': ([np.nan] * 3) + ['a', 'a'],
        'c': np.random.randn(5)
    }
)

# New DataFrame that will be transformed using already fitted transformer
new_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'c'],
        'b': ([np.nan] * 4) + ['b'],
        'c': np.random.randn(5)
    }
)

# Cast NaNs to str to play nicely with OneHotEncoder
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)

# Create master transformer for each of the three columns a, b, and c
transformer_config = [
    ('a', OneHotEncoder(sparse=False, handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(sparse=False, handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
]

transformer = ColumnTransformer(transformer_config)

# Fit to original dataset
transformer.fit(orig_df)

# Transform new dataset
transformer.transform(new_df)

This produces:

  File "<stdin>", line 2, in <module>
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 495, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/pipeline.py", line 605, in _transform_one
    res = transformer.transform(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 591, in transform
    return self._transform_new(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 553, in _transform_new
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 109, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['c'] in column 0 during transform

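This raises the error I generally want, but only for one column. As you can see in new_df, column b also has a new level ('b'). Is there a simple way to report all new levels across all fields that use this OneHotEncoder class, rather than only the first offending level?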

My first idea was to loop over each field separately and catch each ValueError, but that does not work well with ColumnTransformer:

>>> transformer.transform(new_df[['b']])
KeyError: "None of [['a']] are in the [columns]"

1 answer:

Answer 0 (score: 1)

A suggested solution, tailored to your example only:

from sklearn.base import BaseEstimator

for _, t_inst, t_col in transformer.transformers_:
    try:
        if isinstance(t_inst, BaseEstimator):
            t_inst.transform(new_df[t_col])
        else:
            pass

    except Exception as e:
        print('During transformation of column {} the following error occurred: {}'.format(t_col, e))

Output:

During transformation of column ['a'] the following error occurred: Found unknown categories ['c'] in column 0 during transform
During transformation of column ['b'] the following error occurred: Found unknown categories ['b'] in column 0 during transform

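It simply tries to apply each fitted transformation one at a time, so an error in one column does not hide errors in the others.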
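As a variation on the loop above, the new levels could also be collected without relying on exceptions, by comparing each column's values against the fitted encoder's categories_ attribute. This is only a minimal sketch: find_unseen_levels is a hypothetical helper name, it assumes every encoder's columns were given as a list of column names, and the NaNs from the question are written directly as the string 'nan' to keep the example self-contained.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def find_unseen_levels(fitted_transformer, df):
    """Hypothetical helper: map each encoded column to the levels not seen at fit time."""
    unseen = {}
    for _, t_inst, t_cols in fitted_transformer.transformers_:
        # Skip entries that are not one-hot encoders (e.g. 'passthrough')
        if not isinstance(t_inst, OneHotEncoder):
            continue
        # categories_ holds one array of known levels per encoded column
        for col, known in zip(t_cols, t_inst.categories_):
            new_levels = set(df[col].unique()) - set(known)
            if new_levels:
                unseen[col] = sorted(new_levels)
    return unseen


# Same categorical data as in the question, with NaN already cast to 'nan'
orig_df = pd.DataFrame({
    'a': ['nan', 'a', 'b', 'b', 'a'],
    'b': ['nan', 'nan', 'nan', 'a', 'a'],
    'c': [0.1, 0.2, 0.3, 0.4, 0.5],
})
new_df = pd.DataFrame({
    'a': ['nan', 'a', 'b', 'b', 'c'],
    'b': ['nan', 'nan', 'nan', 'nan', 'b'],
    'c': [0.5, 0.4, 0.3, 0.2, 0.1],
})

transformer = ColumnTransformer([
    ('a', OneHotEncoder(handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
])
transformer.fit(orig_df)

result = find_unseen_levels(transformer, new_df)
print(result)  # {'a': ['c'], 'b': ['b']}
```

Returning a dict rather than printing makes it possible to log or raise a single combined error afterwards, which may suit the "monitoring" use case better.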

Note that the .transformers_ attribute is only available after fitting.
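A small illustration of that last point, as a sketch: demo_df and demo_transformer are made-up names for this example, and hasattr is just one simple way to check whether the attribute exists yet.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frame and transformer (hypothetical names)
demo_df = pd.DataFrame({'a': ['x', 'y', 'x']})
demo_transformer = ColumnTransformer([('a', OneHotEncoder(), ['a'])])

# transformers_ is created by fit(), so it does not exist beforehand
fitted_before = hasattr(demo_transformer, 'transformers_')
demo_transformer.fit(demo_df)
fitted_after = hasattr(demo_transformer, 'transformers_')

print(fitted_before, fitted_after)  # False True
```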