How to use sklearn ColumnTransformer?

Time: 2019-01-12 14:11:00

Tags: python scikit-learn

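I am trying to convert categorical values (in my case, the Country column) into encoded values using LabelEncoder followed by OneHotEncoder, and the conversion itself works. However, I get a warning that the OneHotEncoder 'categorical_features' keyword is deprecated: "use the ColumnTransformer instead". How can I use ColumnTransformer to achieve the same result?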

Below are my input data set and the code I tried.

Input Data set

Country Age Salary
France  44  72000
Spain   27  48000
Germany 30  54000
Spain   38  61000
Germany 40  67000
France  35  58000
Spain   26  52000
France  48  79000
Germany 50  83000
France  37  67000


import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is my dataset variable name

label_encoder = LabelEncoder()
x.iloc[:,0] = label_encoder.fit_transform(x.iloc[:,0]) #LabelEncoder is used to encode the country value
hot_encoder = OneHotEncoder(categorical_features = [0])
x = hot_encoder.fit_transform(x).toarray()

This is the output I get. How can I get the same output using ColumnTransformer?

0(fran) 1(ger) 2(spain) 3(age)  4(salary)
1         0       0      44        72000
0         0       1      27        48000
0         1       0      30        54000
0         0       1      38        61000
0         1       0      40        67000
1         0       0      35        58000
0         0       1      26        52000
1         0       0      48        79000
0         1       0      50        83000
1         0       0      37        67000

I tried the following code:

from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(

    ( [0], OneHotEncoder())
)
x = preprocess.fit_transform(x).toarray()

With the code above I can encode the Country column, but the Age and Salary columns are missing from x after the transformation.

9 answers:

Answer 0: (score: 3)

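It is odd that you want to encode continuous data such as Salary. That makes no sense unless you have binned the salary into specific ranges/categories. If I were you, I would do: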

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder



numeric_features = ['Salary']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['Age','Country']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

From here you can chain it with a classifier in a pipeline, e.g.

from sklearn.linear_model import LogisticRegression

clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('classifier', LogisticRegression(solver='lbfgs'))])

Use it like this:

clf.fit(X_train,y_train)

This applies the preprocessor and then passes the transformed data to the predictor.
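As a rough illustration of what the preprocessor step produces on the question's data, here is a minimal sketch. It deviates from the answer in one labeled way: Age is treated as a numeric feature alongside Salary and only Country is one-hot encoded. That grouping is an assumption made for this example, not the answer's own choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# The input data set from the question.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

# Assumption for this sketch: Age and Salary are both numeric features.
numeric_features = ["Age", "Salary"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])

categorical_features = ["Country"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)])

features = preprocessor.fit_transform(df)

# The stacked result may be sparse or dense, so convert defensively.
if hasattr(features, "toarray"):
    features = features.toarray()
features = np.asarray(features, dtype=float)

# Two scaled numeric columns followed by three one-hot country columns.
print(features.shape)
```

The columns come out in the order the transformers are listed, so the scaled numeric columns precede the country indicators here.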

Answer 1: (score: 3)

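I don't think the poster is trying to transform Age and Salary. According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), ColumnTransformer (and make_column_transformer) only keeps the columns specified in the transformers (in your example, [0]). You should set remainder="passthrough" to get the rest of the columns. In other words: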

preprocessor = make_column_transformer( (OneHotEncoder(),[0]),remainder="passthrough")
x = preprocessor.fit_transform(x)
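To check this against the question's data, here is a self-contained sketch. The DataFrame construction and the variable names `df` and `encoded` are assumptions for the example; the transformer call is the one from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# The input data set from the question.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, 26, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, 67000,
               58000, 52000, 79000, 83000, 67000],
})

# One-hot encode column 0 (Country); pass Age and Salary through unchanged.
preprocessor = make_column_transformer(
    (OneHotEncoder(), [0]),
    remainder="passthrough",
)
encoded = preprocessor.fit_transform(df)

# The stacked result may be sparse or dense, so convert defensively.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()
encoded = np.asarray(encoded, dtype=float)

print(encoded.shape)  # (10, 5)
print(encoded[0])     # France, 44, 72000 -> [1, 0, 0, 44, 72000]
```

The encoded columns come first (France, Germany, Spain in sorted order) and the passed-through columns follow, which matches the layout the question asks for.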

Answer 2: (score: 1)

from sklearn.compose import make_column_transformer
preprocess = make_column_transformer(
    (OneHotEncoder(categories='auto'), [0]), 
    remainder="passthrough")
X = preprocess.fit_transform(X)

I fixed the same problem using the code above.

Answer 3: (score: 1)

You can use OneHotEncoder directly, without LabelEncoder.

#  Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
transformer = ColumnTransformer(
    transformers=[
        ("OneHotEncoder",
         OneHotEncoder(),
         [0]              # country column or the column on which categorical operation to be performed
         )
    ],
    remainder='passthrough'
)
X = transformer.fit_transform(X.tolist())

Answer 4: (score: 0)

@Fawwaz Yusran To get rid of this warning...

FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values. If you want the future behaviour and silence this warning, you can specify "categories='auto'". In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly. warnings.warn(msg, FutureWarning)

Remove the following...

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

Since OneHotEncoder is used directly, LabelEncoder is not needed.
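A minimal sketch of that point, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string categories. The small `countries` array is made up for the illustration, using country names from the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single string-valued column; no LabelEncoder step beforehand.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# Used on its own, OneHotEncoder returns a sparse matrix by default.
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories are sorted: France, Germany, Spain
print(one_hot)
```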

Answer 5: (score: 0)

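Since you only want to transform the Country column (i.e. [0] in your example), use remainder="passthrough" to get the remaining columns, so that they come through unchanged.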

Try:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder=LabelEncoder()
x[:,0]=labelencoder.fit_transform(x[:,0])
preprocess = ColumnTransformer(transformers=[('onehot', OneHotEncoder(),
                               [0])],remainder="passthrough")
# np.int was removed from recent NumPy releases; the builtin int is equivalent
x = np.array(preprocess.fit_transform(x), dtype=int)

Answer 6: (score: 0)

The simplest way is to use pandas get_dummies on the DataFrame loaded from the CSV file.

dataset = pd.read_csv("yourfile.csv")
dataset = pd.get_dummies(dataset,columns=['Country'])

Done. Your dataset will then have the Country column replaced by one indicator column per country.
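To make the resulting layout concrete, here is a small sketch on the first three rows of the question's data. It builds the DataFrame in memory instead of reading a CSV, and passes `dtype=int` so the indicators print as 0/1 regardless of the pandas version; both are choices made for this example.

```python
import pandas as pd

# First three rows of the question's data, built in memory for the example.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

# dtype=int keeps the indicator columns as 0/1 integers.
dummies = pd.get_dummies(dataset, columns=["Country"], dtype=int)

print(dummies.columns.tolist())
print(dummies.iloc[0].tolist())
```

Note that get_dummies appends the indicator columns after Age and Salary, whereas the ColumnTransformer approach with remainder="passthrough" puts the encoded columns first.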

Answer 7: (score: 0)

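The biggest advantage of OneHotEncoder is that it can transform multiple columns at once; see this example passing multiple columns:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [1,3,5,6,7,8,9,13])],remainder='passthrough')

For a single column, you can go the traditional way:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_previsores = LabelEncoder()

onehotencorder = ColumnTransformer(transformers=[("OneHot", OneHotEncoder(), [0])],remainder='passthrough')
x= onehotencorder.fit_transform(x).toarray()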

One more suggestion.

Don't use variable names like x, y, z. Name each variable after what it represents, for example: predictors, classes, countries, etc.

Answer 8: (score: 0)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
print(X[:, 0])
ct = ColumnTransformer([("Country", OneHotEncoder(), [1])], remainder = 'passthrough')
#onehotencoder = OneHotEncoder(categorical_features = [0])
X = ct.fit_transform(X).toarray()
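Several snippets in this thread call .toarray() on the result. Whether ColumnTransformer returns a sparse matrix depends on the density of the stacked output and on its sparse_threshold parameter (default 0.3), so that call can fail when a dense array comes back. A sketch of one way to avoid the guesswork, assuming a DataFrame built from the first three rows of the question's data, is to set sparse_threshold=0 so the output is always dense:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the question's data, built in memory for the example.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
})

ct = ColumnTransformer(
    [("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,  # always stack the result into a dense array
)
X = ct.fit_transform(df)

# No .toarray() call is needed: X is already a NumPy array.
print(type(X).__name__, X.shape)
```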