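How do I binary-encode a multi-valued categorical variable in pandas?

Time: 2019-11-15 06:42:45

Tags: python pandas

I have the following DataFrame, in which one column holds multiple values per row: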

             my column

         0 - ["A", "B"]
         1 - ["B", "C", "D"]
         2 - ["B", "D"]

How can I get a DataFrame like this, where each column is named after one of the values in "my column"?

         "A"  "B"  "C"  "D"
      0 - 1    1    0    0
      1 - 0    1    1    1
      2 - 0    1    0    1

4 answers:

Answer 0: (score: 2)

If the column holds lists, use Series.str.join together with Series.str.get_dummies:

df = df['my column'].str.join('|').str.get_dummies()
print (df)
   A  B  C  D
0  1  1  0  0
1  0  1  1  1
2  0  1  0  1
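
For reference, a self-contained sketch of the same idea that keeps the other columns instead of overwriting df. The sample data and the extra "id" column are assumptions added for illustration; they are not part of the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the question; "id" is an extra column
# added only to show that the remaining columns can be kept.
df = pd.DataFrame({
    "id": [10, 11, 12],
    "my column": [["A", "B"], ["B", "C", "D"], ["B", "D"]],
})

# Join each list into one string ("A|B"), then split it into indicator columns.
dummies = df["my column"].str.join("|").str.get_dummies(sep="|")

# Attach the indicators to the remaining columns.
encoded = df.drop(columns="my column").join(dummies)
print(encoded)
```

Note that this relies on "|" never appearing inside a value; if it might, pick a separator that cannot occur in the data and pass it to both str.join and str.get_dummies.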

Alternatively, use MultiLabelBinarizer from scikit-learn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['my column']),columns=mlb.classes_)
print (df)
   A  B  C  D
0  1  1  0  0
1  0  1  1  1
2  0  1  0  1

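If the column holds strings instead, use Series.str.strip with Series.str.get_dummies, and finally remove the " characters from the column names if necessary: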

df = (df['my column'].str.strip('[]')
                     .str.get_dummies(', ')
                     .rename(columns=lambda x: x.strip('"')))
print (df)
   A  B  C  D
0  1  1  0  0
1  0  1  1  1
2  0  1  0  1
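
Another option for the string case, sketched here under the assumption that every cell is a valid Python list literal such as '["A", "B"]', is to parse the strings into real lists first and then reuse the list-based approach. The sample data is made up for illustration.

```python
import ast

import pandas as pd

# Assumed input: each cell is the *string* '["A", "B"]', not a Python list.
df = pd.DataFrame({"my column": ['["A", "B"]', '["B", "C", "D"]', '["B", "D"]']})

# ast.literal_eval only evaluates Python literals, so it does not run arbitrary code.
parsed = df["my column"].apply(ast.literal_eval)

# From here on the column holds lists, so the join/get_dummies approach applies.
dummies = parsed.str.join("|").str.get_dummies()
print(dummies)
```

This avoids hand-tuning the strip and separator arguments, at the cost of failing loudly on cells that are not well-formed literals.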

Answer 1: (score: 1)

Just for fun, here is a naive implementation of dummy encoding:

import pandas as pd

my_column = pd.Series([['A','B'],['B','C','D'],['B','D']])

frame = pd.DataFrame({'my_column': my_column})
#extract all new headers from the DataFrame rows, in sorted order:
headers = sorted({x for y in frame['my_column'] for x in y})
#make a list of the DataFrame rows (stored as lists):
rows = frame['my_column'].tolist()

builder = {}               #construct a dictionary to build a new DataFrame from
for header in headers:
    column = []
    for row in rows:
        if header in row:
            column.append(1)
        else:
            column.append(0)
    builder.update({header:column})

frameB = pd.DataFrame(builder)

print(frameB)

Result:

   A  B  C  D
0  1  1  0  0 
1  0  1  1  1
2  0  1  0  1

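Answer 2: (score: 0)

I think what you are looking for is the get_dummies() function in pandas, which is described in the pandas documentation.

From the documentation: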

s = pd.Series(list('abca'))
pd.get_dummies(s)

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
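
The documentation example encodes a column with a single value per row. One possible way to adapt it to the multi-valued column from the question, assuming pandas 0.25 or later (for Series.explode), is to expand the lists first and then collapse the indicators back to one row per original index. The sample data below is reconstructed from the question.

```python
import pandas as pd

df = pd.DataFrame({"my column": [["A", "B"], ["B", "C", "D"], ["B", "D"]]})

# explode() yields one row per list element and repeats the original index.
exploded = df["my column"].explode()

# Build one indicator row per element, then take the max per original index
# so that each original row ends up with a single 0/1 row.
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
print(dummies)
```

Taking the max rather than the sum keeps the result binary even if a value is repeated within one list; the final astype(int) is there because newer pandas versions return booleans from get_dummies.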

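Answer 3: (score: 0)

You can use CountVectorizer. It takes a text corpus and builds a token-count matrix, which amounts to one-hot encoding as long as each value appears at most once per row.

Note: I am using "Cat", "Dog", "Cow", "Tiger" instead of "A", "B", "C", "D".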

Code:

Imports:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

A helper that joins the list elements into a single string:

def get_string(listt):
  return ' '.join(listt)

Create the DataFrame from the lists:

my_column = pd.Series([['Cat','Dog'],['Dog','Cow','Tiger'],['Dog','Tiger']])
df = pd.DataFrame(my_column, columns=['my_column'])
print(df)
df['text_data'] = df.my_column.apply(get_string)
print(df)

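Perform the text vectorization:

tf_vectorizer = CountVectorizer(stop_words = None)
vectorized_data = tf_vectorizer.fit_transform(df.text_data)

Prepare the final DataFrame:

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
final_df = pd.DataFrame(vectorized_data.toarray(),columns=tf_vectorizer.get_feature_names_out())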
print(final_df)

Output:

Our DataFrame:

           my_column
0         [Cat, Dog]
1  [Dog, Cow, Tiger]
2       [Dog, Tiger]

DataFrame with the text column:

           my_column      text_data
0         [Cat, Dog]        Cat Dog
1  [Dog, Cow, Tiger]  Dog Cow Tiger
2       [Dog, Tiger]      Dog Tiger

Expected result:

   cat  cow  dog  tiger
0    1    0    1      0
1    0    1    1      1
2    0    0    1      1