Say the data looks like this:
import pandas as pd
d = {'col1': ['a,b', 'b', 'c,d', 'a,c'], 'col2': [3, 4, 5, 6]}
s = pd.DataFrame(d)
  col1  col2
0  a,b     3
1    b     4
2  c,d     5
3  a,c     6
I want to one-hot encode col1, like this:
   a  b  c  d
0  1  1  0  0
1  0  1  0  0
2  0  0  1  1
3  1  0  1  0
Thanks
Answer 0 (score: 2)
You can do this in 4 lines of code using list and dict comprehensions (3 lines if you collapse the 3rd and 4th steps).
# 1. Create a list of lists, where each sublist contains the values
#    found in that row of the column (split on the comma, so this also
#    works for values longer than one character)
separated_data = [el.split(',') for el in s['col1']]
# separated_data is [['a', 'b'], ['b'], ['c', 'd'], ['a', 'c']]
# 2. (optional) find the set of keys contained in your dataframe,
#    if you don't already know it; sorting keeps the column order stable
keys = sorted({key for sublist in separated_data for key in sublist})
# keys is ['a', 'b', 'c', 'd']
# 3. Create a dictionary where each character is a key and each value
#    is a list. The n-th element of the list is 1 if the character is
#    contained in the n-th row, 0 otherwise
columns = {key: [1 if key in sublist else 0 for sublist in separated_data]
           for key in keys}
# columns is {'a': [1, 0, 0, 1], 'b': [1, 1, 0, 0], 'c': [0, 0, 1, 1], 'd': [0, 0, 1, 0]}
# 4. Build your dataframe (note the capitalization: pd.DataFrame)
onehot_dataframe = pd.DataFrame(columns)
# onehot_dataframe is:
#    a  b  c  d
# 0  1  1  0  0
# 1  0  1  0  0
# 2  0  0  1  1
# 3  1  0  1  0
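For reference, the four steps above can be folded into one function. This is only a minimal sketch: the helper name `one_hot_from_delimited` and its `sep` parameter are made up for illustration and are not part of pandas. It also assumes you want the columns in sorted order and that stray whitespace around values should be ignored. As a cross-check, the last line uses pandas' built-in `Series.str.get_dummies`, which is a different technique from the comprehensions but should give the same table for this data.

```python
import pandas as pd


def one_hot_from_delimited(series, sep=','):
    """Hypothetical helper: one-hot encode a Series of delimited strings."""
    # Step 1: split every row into its individual values
    separated = [[part.strip() for part in el.split(sep)] for el in series]
    # Step 2: collect the distinct values; sorted for a stable column order
    keys = sorted({key for sublist in separated for key in sublist})
    # Step 3: one list of 0/1 flags per distinct value
    columns = {key: [1 if key in sublist else 0 for sublist in separated]
               for key in keys}
    # Step 4: build the dataframe, keeping the original row index
    return pd.DataFrame(columns, index=series.index)


d = {'col1': ['a,b', 'b', 'c,d', 'a,c'], 'col2': [3, 4, 5, 6]}
s = pd.DataFrame(d)

onehot = one_hot_from_delimited(s['col1'])

# Built-in alternative, shown only for comparison
builtin = s['col1'].str.get_dummies(sep=',')
```

If you need the encoded columns next to `col2`, something like `s.join(onehot)` should work, since the helper keeps the original index.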