Categorical variables in an sklearn pipeline with DictVectorizer

Time: 2016-10-17 20:21:35

Tags: python pipeline categorical-data dictvectorizer

I want to apply a pipeline with numeric and categorical variables, as shown below:

import numpy as np
import pandas as pd
from sklearn import linear_model,  pipeline, preprocessing
from sklearn.feature_extraction import DictVectorizer 

df = pd.DataFrame({'a':range(12), 'b':[1,2,3,1,2,3,1,2,3,3,1,2], 'c':['a', 'b', 'c']*4, 'd': ['m', 'f']*6})
y = df['a']
X = df[['b', 'c', 'd']]

I create a boolean index for the numeric variables
numeric = ['b']
numeric_indices = np.array([(column in numeric) for column in X.columns], dtype = bool)

and for the categorical variables

categorical = ['c', 'd'] 
categorical_indices = np.array([(column in categorical) for column in X.columns], dtype = bool)
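For illustration, here is a minimal sketch of what the two boolean masks select. It assumes the frame is converted to a NumPy array first (the `values` name is only for this example), since `[:, mask]` indexing is a NumPy idiom rather than a DataFrame one.

```python
import numpy as np
import pandas as pd

# Feature columns from the example above.
X = pd.DataFrame({
    'b': [1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 1, 2],
    'c': ['a', 'b', 'c'] * 4,
    'd': ['m', 'f'] * 6,
})

numeric = ['b']
categorical = ['c', 'd']

# One boolean per column of X, True where the column belongs to the group.
numeric_indices = np.array([column in numeric for column in X.columns], dtype=bool)
categorical_indices = np.array([column in categorical for column in X.columns], dtype=bool)

# The [:, mask] indexing works on NumPy arrays, so convert the frame first.
values = X.to_numpy()
numeric_part = values[:, numeric_indices]
categorical_part = values[:, categorical_indices]

print(numeric_indices)
print(categorical_indices)
print(numeric_part.shape, categorical_part.shape)
```

The categorical slice is still an object array of strings at this point, so it needs encoding before any estimator can use it.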

Then I create a pipeline:

regressor = linear_model.SGDRegressor()
encoder = DictVectorizer(sparse = False)

estimator = pipeline.Pipeline(steps = [       
    ('feature_processing', pipeline.FeatureUnion(transformer_list = [        

            #numeric
            ('numeric_variables_processing', pipeline.Pipeline(steps = [
                ('selecting', preprocessing.FunctionTransformer(lambda data: data[:, numeric_indices])),
                ('scaling', preprocessing.StandardScaler(with_mean = 0.))            
                        ])),

            #categorical
            ('categorical_variables_processing', pipeline.Pipeline(steps = [
                ('selecting', preprocessing.FunctionTransformer(lambda data: data[:, categorical_indices])),
                ('DictVectorizer', encoder )           
                        ])),
        ])),
    ('model_fitting', regressor)
    ]
)

When I fit it, I get:

estimator.fit(X, y)
ValueError: could not convert string to float: 'f'
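The same kind of error can be reproduced outside the pipeline. The sketch below assumes the failure comes from the raw string columns being cast to float somewhere before they are encoded; the small frame and the `message` variable are only for illustration.

```python
import pandas as pd

# A small frame mixing one numeric column with two string columns.
X = pd.DataFrame({'b': [1, 2, 3], 'c': ['a', 'b', 'c'], 'd': ['m', 'f', 'm']})

try:
    # Casting the mixed object array to float fails on the first string.
    X.to_numpy().astype(float)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Any step that validates its input as a numeric array would hit this conversion, which is why the string columns have to be encoded before they reach such a step.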

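I know that I have to apply encoder.fit() inside the pipeline, but I don't understand how to do that. Alternatively, we could use preprocessing.OneHotEncoder(), but then we would still need to convert the strings to floats.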

How can this be fixed?
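As background, DictVectorizer works on a list of dicts (one dict per row) rather than on a 2-D array. A minimal sketch, assuming the categorical columns are turned into records with `DataFrame.to_dict(orient='records')`:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Only the categorical columns from the example.
X_cat = pd.DataFrame({'c': ['a', 'b', 'c'] * 4, 'd': ['m', 'f'] * 6})

# One dict per row, e.g. {'c': 'a', 'd': 'm'}.
records = X_cat.to_dict(orient='records')

encoder = DictVectorizer(sparse=False)
encoded = encoder.fit_transform(records)

# String values become one indicator column per "feature=value" pair.
print(encoder.feature_names_)
print(encoded.shape)
```

Each row of `encoded` contains exactly two ones, one for the value of `c` and one for the value of `d`.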

1 answer:

Answer 0: (score: 1)

This is how I see it:
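A minimal sketch of one way to make the pipeline from the question fit, not necessarily the only or best one. It assumes the selection steps receive the DataFrame itself: the numeric selector returns a float array, and the categorical selector returns a list of dicts for DictVectorizer. The helper names `select_numeric` and `select_categorical_records` are made up for this example, and `random_state=0` is added only for repeatability.

```python
import pandas as pd
from sklearn import linear_model, pipeline, preprocessing
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'a': range(12),
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 1, 2],
                   'c': ['a', 'b', 'c'] * 4,
                   'd': ['m', 'f'] * 6})
y = df['a']
X = df[['b', 'c', 'd']]

numeric = ['b']
categorical = ['c', 'd']


def select_numeric(frame):
    # Hypothetical helper: numeric columns as a float array.
    return frame[numeric].to_numpy(dtype=float)


def select_categorical_records(frame):
    # Hypothetical helper: one dict per row, the input DictVectorizer expects.
    return frame[categorical].to_dict(orient='records')


estimator = pipeline.Pipeline(steps=[
    ('feature_processing', pipeline.FeatureUnion(transformer_list=[
        # numeric
        ('numeric_variables_processing', pipeline.Pipeline(steps=[
            ('selecting', preprocessing.FunctionTransformer(select_numeric, validate=False)),
            ('scaling', preprocessing.StandardScaler(with_mean=False)),
        ])),
        # categorical
        ('categorical_variables_processing', pipeline.Pipeline(steps=[
            ('selecting', preprocessing.FunctionTransformer(select_categorical_records, validate=False)),
            ('DictVectorizer', DictVectorizer(sparse=False)),
        ])),
    ])),
    ('model_fitting', linear_model.SGDRegressor(random_state=0)),
])

estimator.fit(X, y)
predictions = estimator.predict(X)
features = estimator.named_steps['feature_processing'].transform(X)
print(features.shape, predictions.shape)
```

Passing `validate=False` keeps FunctionTransformer from trying to coerce the input to a numeric array, so DictVectorizer is fitted on the records as part of `estimator.fit`. In scikit-learn 0.20 and later, `ColumnTransformer` combined with `OneHotEncoder` is another route to the same result.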
