Encoding string values as integers using Python or sklearn

Time: 2016-05-11 08:36:09

Tags: python nlp scikit-learn

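How can I encode the string-valued columns of a data table as integers? For example, I have two feature variables: color (possible string values R, G and B) and skills (possible string values C++, Java, SQL and Python). The data table has two columns: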

Color  -> R, G, B, B, G, R, B, G, G, R, G
Skills -> Java, C++, SQL, Java, Python, Python, SQL, C++, Java, SQL, Java

I would like to know which sklearn function or method converts the two columns above using R = 0, G = 1, B = 2 and C++ = 0, Java = 1, SQL = 2, Python = 3:

Color:  0, 1, 2, 2, 1, 0, 2, 1, 1, 0, 1
Skills: 1, 0, 2, 1, 3, 3, 2, 0, 1, 2, 1

How can I do this?

1 answer:

Answer 0 (score: 4)

Use scikit-learn's LabelEncoder:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'colors': ["R", "G", "B", "B", "G", "R", "B", "G", "G", "R", "G"],
    'skills': ["Java", "C++", "SQL", "Java", "Python", "Python", "SQL", "C++", "Java", "SQL", "Java"]
})

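def encode_df(dataframe):
    # Replace the strings in every column with integer codes, in place.
    # LabelEncoder numbers the labels of each column in sorted order.
    for column in dataframe.columns:
        # A fresh encoder per column keeps the fitted mappings separate.
        dataframe[column] = LabelEncoder().fit_transform(dataframe[column])
    return dataframe

# Encode the dataframe
encode_df(df)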
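
Note that LabelEncoder assigns codes in sorted label order, so on this data it yields B = 0, G = 1, R = 2 and C++ = 0, Java = 1, Python = 2, SQL = 3, which is not the mapping asked for in the question. If the exact mapping matters, one possible approach (a sketch, not part of the original answer) is OrdinalEncoder with an explicit `categories` list. The variable names `color_order` and `skill_order` are made up for illustration, and the orderings are simply copied from the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "colors": ["R", "G", "B", "B", "G", "R", "B", "G", "G", "R", "G"],
    "skills": ["Java", "C++", "SQL", "Java", "Python", "Python",
               "SQL", "C++", "Java", "SQL", "Java"],
})

# Assumed orderings, taken from the question:
# the position of a label in its list becomes its integer code.
color_order = ["R", "G", "B"]
skill_order = ["C++", "Java", "SQL", "Python"]

# One category list per column, in the same order as the selected columns.
encoder = OrdinalEncoder(categories=[color_order, skill_order], dtype=int)
codes = encoder.fit_transform(df[["colors", "skills"]])

# Wrap the resulting 2-D array back into a DataFrame with the original labels.
encoded = pd.DataFrame(codes, columns=df.columns, index=df.index)
print(encoded["colors"].tolist())  # [0, 1, 2, 2, 1, 0, 2, 1, 1, 0, 1]
print(encoded["skills"].tolist())  # [1, 0, 2, 1, 3, 3, 2, 0, 1, 2, 1]
```

Because the encoder is kept around, `encoder.inverse_transform(codes)` should recover the original strings if they are needed later.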