pandas: one-hot encoding - how to get a one-hot encoding if a column contains CSV values

Time: 2018-01-22 03:48:08

Tags: python encoding data-science

Say the data looks like this:

import pandas as pd

d = {'col1': ['a,b', 'b', 'c,d', 'a,c'], 'col2': [3, 4, 5, 6]}
s = pd.DataFrame(d)
    col1    col2
0   a,b      3
1   b        4
2   c,d      5
3   a,c      6

I want to one-hot encode col1, as follows:

    a   b   c   d
0   1   1   0   0   
1   0   1   0   0   
2   0   0   1   1
3   1   0   1   0

Thanks

1 answer:

Answer 0 (score: 2)

You can do this in 4 lines of code using list and dict comprehensions (or 3 lines if you collapse the 3rd and 4th).

# 1. Create a list of lists, where each sublist contains the values
#    found in one cell of the column (split on the comma, so values
#    longer than one character also work)
separated_data = [el.split(',') for el in s['col1']]
# separated_data is [['a', 'b'], ['b'], ['c', 'd'], ['a', 'c']]


# 2. (optional) find the set of keys contained in your dataframe,
#        if you don't already know that
keys = sorted({key for sublist in separated_data for key in sublist})
# keys is ['a', 'b', 'c', 'd'] (sorted so the column order is deterministic)


# 3. Create a dictionary, where each character is a key and each value
#     is a list. The n-th element of the list is 1 if the character is
#     contained in the n-th row, 0 otherwise
columns = {key: [1 if key in sublist else 0 for sublist in separated_data]
                for key in keys}
# columns is {'a': [1, 0, 0, 1], 'b': [1, 1, 0, 0], 'c': [0, 0, 1, 1], 'd': [0, 0, 1, 0]}


# 4. Your dataframe
onehot_dataframe = pd.DataFrame(columns)
# onehot_dataframe is:
#    a  b  c  d
# 0  1  1  0  0
# 1  0  1  0  0
# 2  0  0  1  1
# 3  1  0  1  0
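
As an alternative to the comprehensions, pandas has a string method that should do the same thing in one call: `Series.str.get_dummies`, which splits each cell on a separator and returns one indicator column per distinct value. The sketch below assumes the same `s` as in the question. The names `onehot` and `result` are only illustrative, and the `pd.concat` step is an optional extra in case you want to keep `col2` next to the encoded columns.

```python
import pandas as pd

d = {'col1': ['a,b', 'b', 'c,d', 'a,c'], 'col2': [3, 4, 5, 6]}
s = pd.DataFrame(d)

# Split each cell of col1 on ',' and build one 0/1 column per distinct value.
onehot = s['col1'].str.get_dummies(sep=',')

# Optional: keep the remaining original columns alongside the indicators.
result = pd.concat([s.drop(columns='col1'), onehot], axis=1)

print(onehot)
print(result)
```

This also handles values longer than one character, since the split is on the separator rather than on individual characters.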