LabelBinarizer does not work with 2 categorical values

Asked: 2017-08-08 20:00:24

Tags: python dataframe scikit-learn

The code I am using is as follows:

import logging

import pandas as pd
from sklearn.preprocessing import LabelBinarizer
logging.info('performing binary encoding')

other_CSV = pd.read_csv('/home/bluedata/decisionengine/cc1.txt', sep='|', encoding='ISO-8859-1')
other_CSV_0 = other_CSV.copy(deep=True)
print(other_CSV_0)

lb_style = LabelBinarizer()

rating_text = lb_style.fit_transform(other_CSV["rating_text"])
rating_text_df = pd.DataFrame(rating_text, columns=lb_style.classes_)
other_CSV_1 = other_CSV.join(rating_text_df)

print(other_CSV_1)

user_foodie_level = lb_style.fit_transform(other_CSV["user_foodie_level"])
user_foodie_level_df = pd.DataFrame(user_foodie_level, columns=lb_style.classes_)
other_CSV_2 = other_CSV_1.join(user_foodie_level_df)

print(other_CSV_2)

lb_style = LabelBinarizer()
class_name = lb_style.fit_transform(other_CSV["class_name"])
class_name_df = pd.DataFrame(class_name, columns=lb_style.classes_)
other_CSV_3 = other_CSV_2.join(class_name_df)    
other_CSV_3.to_csv("/home/bluedata/decisionengine/ec1.txt",sep = "|", index=False, encoding = 'utf-8')

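user_foodie_level is binary, so it contains two values: Foodie and Big Foodie.

Binarizing this column with the code above gives me an error:

ValueError: Shape of passed values is (1, 5), indices imply (2, 5)

If I supply more than 2 categorical values for the column user_foodie_level, it gives me the desired output. I cannot understand why it does not work when the column has only two categorical values.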

Data that I have used for this code

2 Answers:

Answer 0 (score: 2)

The problem is in the following line:

user_foodie_level_df = pd.DataFrame(user_foodie_level, columns=lb_style.classes_)

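The problem is that user_foodie_level has dimensions (1, 5), while you are telling pandas that the dimensions are (2, 5) by passing two column names, ['Big Foodie' 'Foodie'], to the DataFrame constructor. You need to change it to: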

user_foodie_level_df = pd.DataFrame(user_foodie_level, columns=['binarized_user_foodie_level'])

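To understand why, see the following.

Explanation

Label binarization of a two-valued (binary) categorical variable is a special case: LabelBinarizer() returns a single column, unlike for categorical variables with more than two values. In the latter case the number of columns equals the number of elements in lb_style.classes_. This means that the way you build the DataFrame is correct only when the categorical variable you are trying to binarize has more than two values.

The following snippet helps you see the difference in LabelBinarizer's output between the two cases: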

import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from io import StringIO


data = """
user_foodie_level
Big Foodie
Foodie
Foodie
Foodie
Big Foodie
Foodie
"""


data1 = """
user_foodie_level
Big Foodie
Foodie
Foodie
Foodie
Big Foodie
Foodie
New Foodie
"""


def test_binarization(data):

    data = pd.read_csv(StringIO(data))
    print(data.head())

    lb_style = LabelBinarizer() 
    user_foodie_level = lb_style.fit_transform(data["user_foodie_level"]) 
    print(user_foodie_level)

    print("lb.classes_")
    print(lb_style.classes_)


print("two values categorical variable test")
test_binarization(data)

print("Three values categorical variable test")
test_binarization(data1)

Output of the snippet:

two values categorical variable test
  user_foodie_level
0        Big Foodie
1            Foodie
2            Foodie
3            Foodie
4        Big Foodie
[[0]
 [1]
 [1]
 [1]
 [0]
 [1]]
lb.classes_
['Big Foodie' 'Foodie']

Three values categorical variable test
  user_foodie_level
0        Big Foodie
1            Foodie
2            Foodie
3            Foodie
4        Big Foodie
[[1 0 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [1 0 0]
 [0 1 0]
 [0 0 1]]
lb.classes_
['Big Foodie' 'Foodie' 'New Foodie']
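
Building on the two outputs above, here is a minimal sketch of a helper that chooses the column names from the shape of the encoded array, so the same call works for two or more categories. The function name binarize_column, its binary_name parameter, and the generated column name are hypothetical and not part of scikit-learn or pandas.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer


def binarize_column(series, binary_name=None):
    """Binarize a categorical Series and return a labelled DataFrame.

    Hypothetical helper for illustration; not a scikit-learn or pandas API.
    """
    lb = LabelBinarizer()
    encoded = lb.fit_transform(series)

    if encoded.shape[1] == len(lb.classes_):
        # One output column per class (the more-than-two-values case).
        columns = list(lb.classes_)
    else:
        # Two classes: a single 0/1 column in which 1 marks lb.classes_[1].
        columns = [binary_name or "{}_is_{}".format(series.name, lb.classes_[1])]

    return pd.DataFrame(encoded, columns=columns, index=series.index)


two = pd.Series(["Big Foodie", "Foodie", "Foodie"], name="user_foodie_level")
three = pd.Series(["Big Foodie", "Foodie", "New Foodie"], name="user_foodie_level")

two_df = binarize_column(two)      # one column
three_df = binarize_column(three)  # three columns
print(two_df)
print(three_df)
```

Reusing the Series index for the new DataFrame keeps a later join aligned with the original rows.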

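Answer 1 (score: 1)

First of all, it works as expected.

The error occurs when you try to instantiate a DataFrame from the binarized user_foodie_level and the classes obtained with lb_style.classes_. To fix this, you should label the single column of user_foodie_level_df. A preferable way to do this is as follows:

from sklearn.preprocessing import LabelBinarizer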
import pandas as pd

col1 = ['yes', 'no', 'yes', 'yes', 'yes']
col2 = ['the worst' ,'bad', 'okay', 'good', 'the best']
data = pd.DataFrame(data=[col1, col2])

print(data)

>>>            0    1     2     3         4
    0        yes   no   yes   yes       yes
    1  the worst  bad  okay  good  the best

lb = LabelBinarizer()

col1_lb = pd.DataFrame(lb.fit_transform(col1), columns=['example'])
col2_lb = lb.fit_transform(col2)
col2_tags = lb.classes_
col2_lb = pd.DataFrame(data=col2_lb, columns=col2_tags)

print(col1_lb)

>>>    example
    0        1
    1        0
    2        1
    3        1
    4        1

print(col2_lb)

>>>    bad  good  okay  the best  the worst
    0    0     0     0         0          1
    1    1     0     0         0          0
    2    0     0     1         0          0
    3    0     1     0         0          0
    4    0     0     0         1          0

data = col2_lb.join(col1_lb)

print(data)

>>>    bad  good  okay  the best  the worst  example
    0    0     0     0         0          1        1
    1    1     0     0         0          0        0
    2    0     0     1         0          0        1
    3    0     1     0         0          0        1
    4    0     0     0         1          0        1

We can reproduce the same error by doing the following:

col1_lb = lb.fit_transform(col1)
col1_tags = lb.classes_

df = pd.DataFrame(col1_lb, columns=col1_tags)
  

ValueError: Shape of passed values is (1, 5), indices imply (2, 5)

This means you are passing two column names for a single existing data column.
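
As a quick diagnostic, the mismatch can be seen by comparing the shape of the encoded array with the number of classes before building the DataFrame. This is only a sketch using the col1 data from this answer. The exact wording of the ValueError may differ between pandas versions, so the code prints whatever message is raised rather than assuming one.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

col1 = ['yes', 'no', 'yes', 'yes', 'yes']

lb = LabelBinarizer()
col1_lb = lb.fit_transform(col1)

print(col1_lb.shape)     # (rows, columns) of the encoded array
print(len(lb.classes_))  # number of class labels found

try:
    # Two column names for an array that has only one column.
    pd.DataFrame(col1_lb, columns=lb.classes_)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```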

Hope that helps.

Digging a little deeper

If there are two values, such as ['yes', 'no'], binarization creates a single column:

>>> [[1], 
     [0]] 

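This means you can apply only one name to this column.

If there are three values, such as ['yes', 'no', "don't know"], binarization creates a matrix like the following: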

>>> [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]

That is exactly three columns, so three names are appropriate.
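
If one named column per class is wanted even in the two-value case, one possible approach (not taken from either answer) is to stack the complement of the single column next to it. The sketch below assumes LabelBinarizer's default labels of 0 and 1, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

answers = ['yes', 'no', 'yes']

lb = LabelBinarizer()
single = lb.fit_transform(answers)  # shape (3, 1); 1 marks lb.classes_[1]

# Column 0 is the complement (lb.classes_[0]); column 1 is the original column.
both = np.hstack([1 - single, single])
both_df = pd.DataFrame(both, columns=lb.classes_)
print(both_df)
```

The two columns are fully redundant with each other, so whether to keep both depends on what the downstream code expects.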