The DataFrame looks like this (blank cells are '', and field and extra_dimensions are the columns):
field | extra_dimensions
------------------------
a |
b | [abc, def]
c | [ghi]
I have a list of required dimensions and a list of extra dimensions:
required_dimensions = [123, 456]
extra_dimensions = ['abc', 'def', 'ghi']
Desired output:
field | 123 | 456 | abc | def | ghi
-----------------------------------
a | 1 | 1 | 0 | 0 | 0
b | 1 | 1 | 1 | 1 | 0
c | 1 | 1 | 0 | 0 | 1
My attempt:
columns = ['field', 'extra_dimensions'] + required_dimensions + extra_dimensions
df = df.reindex(columns=columns)
for i in required_dimensions:
    df[i].fillna('1', inplace=True)
for i in extra_dimensions:
    df[i][df['extra_dimensions'].str.contains(i)] = '1'
But I get:
ValueError: cannot index with vector containing NA / NaN values
I would appreciate any comments on my attempt, or any ideas for a better approach. Thanks in advance!
Answer 0: (score: 0)
Use get_dummies again:
.....
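import pandas as pd

required_dimensions = ['123', '456']
df = pd.DataFrame({'field': list('abc'), 'extra_dimensions': [[], ['abc', 'def'], ['ghi']]})
# One-hot encode every list element, then sum per field (sum(level=0) was removed in pandas 2.0)
df = pd.get_dummies(df.set_index('field')['extra_dimensions'].apply(pd.Series).stack()).groupby(level=0).sum().reindex(df.field).fillna(0)
# Required dimensions are simply constant columns of 1
d = dict.fromkeys(required_dimensions, 1)
df.assign(**d)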
Out[283]:
abc def ghi 123 456
field
a 0.0 0.0 0.0 1 1
b 1.0 1.0 0.0 1 1
c 0.0 0.0 1.0 1 1
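
For comparison, here is a minimal sketch of the same idea that produces integer columns in the order shown in the desired output. It is not the approach from the answer above: it swaps `apply(pd.Series).stack()` for `Series.explode`. It assumes, as the answer does, that each cell of extra_dimensions holds a Python list (empty for blank rows) and that the dimension names are strings. The names `exploded`, `dummies`, `required`, and `result` are only illustrative.

```python
import pandas as pd

required_dimensions = ['123', '456']
extra_dimensions = ['abc', 'def', 'ghi']

# Assumption: each cell is a list; blank rows are represented as empty lists.
df = pd.DataFrame({
    'field': list('abc'),
    'extra_dimensions': [[], ['abc', 'def'], ['ghi']],
})

# One row per (field, dimension) pair; an empty list becomes a single NaN row.
exploded = df.set_index('field')['extra_dimensions'].explode()

# get_dummies ignores NaN by default, so fields with no extras end up all zero.
dummies = (
    pd.get_dummies(exploded)
    .groupby(level=0).sum()
    .reindex(index=df['field'].tolist(), columns=extra_dimensions, fill_value=0)
    .astype(int)
)

# Required dimensions are constant columns of 1 for every field.
required = pd.DataFrame(1, index=dummies.index, columns=required_dimensions)

result = pd.concat([required, dummies], axis=1).rename_axis('field').reset_index()
print(result)
```

Reindexing on `extra_dimensions` also keeps a column for any dimension that never appears in the data, which the plain get_dummies result would omit.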