I have the DataFrame below:
df[['row_num','set_id']].head()
row_num path_id_set
988681 [31672, 0]
988680 [31965, 0]
988679 [0, 78464]
I am trying to use MultiLabelBinarizer, but it fails with the error "'float' object is not iterable":
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(df['set_id'].str.split(','))
TypeError: 'float' object is not iterable
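Answer 0 (score: 1)
I think the problem is missing values: NaN is a float, so MultiLabelBinarizer cannot iterate over it. You can use: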
print (df)
row_num set_id
0 988681 NaN
1 988680 [31965,0]
2 988679 [0,78464]
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# boolean mask that is True for the non-NaN rows
mask = df['set_id'].notnull()
# binarize only the non-NaN rows: strip the brackets, then split on commas
arr = mlb.fit_transform(df.loc[mask, 'set_id'].str.strip('[]').str.split(','))
# build a DataFrame from the result and reindex to restore the NaN rows, filled with 0
df = (pd.DataFrame(arr, index=df.index[mask], columns=mlb.classes_)
        .reindex(df.index, fill_value=0))
print (df)
0 31965 78464
0 0 0 0
1 1 1 0
2 1 0 1
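
For reference, here is a self-contained sketch of the same masking approach wrapped in a function. It assumes the cells of set_id are strings such as '[31965, 0]' (with or without a space after the comma). The helper name binarize_labels and the sample data are hypothetical, made up for illustration. The only change from the approach in this answer is that each label is also stripped of surrounding whitespace, so ' 0' and '0' end up in the same column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def binarize_labels(series):
    """Hypothetical helper: one-hot encode string cells like '[31965, 0]'.

    Rows that are NaN come back as all zeros.
    """
    # True for rows that hold a value, False for NaN rows
    mask = series.notna()
    # Strip the brackets, split on commas, trim whitespace around each label
    tokens = (series[mask]
              .str.strip('[]')
              .str.split(',')
              .apply(lambda parts: [p.strip() for p in parts]))
    mlb = MultiLabelBinarizer()
    arr = mlb.fit_transform(tokens)
    out = pd.DataFrame(arr, index=series.index[mask], columns=mlb.classes_)
    # Restore the rows that were NaN, filled with zeros
    return out.reindex(series.index, fill_value=0)


# Sample data (assumed): one missing value and two bracketed strings
df = pd.DataFrame({
    'row_num': [988681, 988680, 988679],
    'set_id': [np.nan, '[31965, 0]', '[0, 78464]'],
})
encoded = binarize_labels(df['set_id'])
print(encoded)
```

Note that the labels stay strings here, so the column names are '0', '31965' and '78464' rather than integers.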