I have a DataFrame that looks like this:
import pandas as pd

df = pd.DataFrame({'fruit': ['berries', 'berries', 'berries', 'tropical',
                             'tropical', 'tropical', 'berries', 'nuts'],
                   'code': [100, 100, 100, 200, 200, 300, 400, 500],
                   'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                               '400A', '500A']})
   code     fruit subcode
0   100   berries    100A
1   100   berries    100B
2   100   berries    100C
3   200  tropical    200A
4   200  tropical    200B
5   300  tropical    300A
6   400   berries    400A
7   500      nuts    500A
I would like to transform the DataFrame into the following format:
   code     fruit subcode1 subcode2 subcode3
0   100   berries     100A     100B     100C
3   200  tropical     200A     200B
5   300  tropical     300A
6   400   berries     400A
7   500      nuts     500A
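Unfortunately, I'm stuck on how to proceed. I've looked at posts such as "Unmelt Pandas DataFrame" and tried combinations of stack and unstack. I suspect concatenation is involved as well. Any advice pointing me in the right direction would be appreciated!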
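Answer 0 (score: 4)
Play around with set_index and unstack for a while and you'll get it. Here the letter suffix of each subcode (A, B, C) is extracted and used as the column key.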
(df.set_index(['code', 'fruit'])
.set_index(df.subcode.str.extract('([a-zA-Z]+)', expand=False), append=True)
.subcode
.unstack()
.fillna('') # these last three
.reset_index() # operations are
.rename_axis(None, axis=1) # not important
)
code fruit A B C
0 100 berries 100A 100B 100C
1 200 tropical 200A 200B
2 300 tropical 300A
3 400 berries 400A
4 500 nuts 500A
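The version above relies on every subcode ending in a distinct letter. If that can't be assumed, a possible variation is to number the rows within each (code, fruit) group with cumcount and unstack on that position instead, which also yields the subcode1, subcode2, subcode3 headers from the question. This is only a sketch; the names pos and wide are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A'],
})

# 1-based position of each row within its (code, fruit) group
pos = df.groupby(['code', 'fruit']).cumcount() + 1

wide = (
    df.set_index(['code', 'fruit', pos])['subcode']
      .unstack()               # the position level becomes the columns 1, 2, 3
      .add_prefix('subcode')   # 1 -> 'subcode1', 2 -> 'subcode2', ...
      .fillna('')              # blank out the missing cells
      .reset_index()           # turn code and fruit back into columns
)
print(wide)
```

Because the column key is the row's position rather than something parsed out of the value, this should keep working if the subcodes are arbitrary strings.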
Answer 1 (score: 4)
You can use groupby, take the values of each group, and convert them to a Series.
(df.groupby(['code', 'fruit'])['subcode']
   .apply(lambda x: x.values)   # one array of subcodes per group
   .apply(pd.Series)            # expand each array into columns
   .add_prefix('subcode_')
)
subcode_0 subcode_1 subcode_2
code fruit
100 berries 100A 100B 100C
200 tropical 200A 200B NaN
300 tropical 300A NaN NaN
400 berries 400A NaN NaN
500 nuts 500A NaN NaN
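Calling apply(pd.Series) builds one Series per group, which may get slow when there are many groups. One possible alternative, sketched here under the same sample data, is to collect each group into a list and pass the lists to the DataFrame constructor in one go. The names lists and wide are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'fruit': ['berries', 'berries', 'berries', 'tropical',
              'tropical', 'tropical', 'berries', 'nuts'],
    'code': [100, 100, 100, 200, 200, 300, 400, 500],
    'subcode': ['100A', '100B', '100C', '200A', '200B', '300A',
                '400A', '500A'],
})

# One Python list of subcodes per (code, fruit) group
lists = df.groupby(['code', 'fruit'])['subcode'].agg(list)

# The constructor pads the shorter lists with missing values
wide = pd.DataFrame(lists.tolist(), index=lists.index).add_prefix('subcode_')
print(wide)
```

As in the answer above, code and fruit stay in the index; append .reset_index() if they should be ordinary columns.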
Answer 2 (score: 3)
Using defaultdict
from collections import defaultdict

d = defaultdict(list)
# Select the columns explicitly so the unpacking does not depend on column order
for f, c, s in df[['fruit', 'code', 'subcode']].itertuples(index=False):
    d[(f, c)].append(s)

pd.DataFrame.from_dict(
    {k: dict(enumerate(v)) for k, v in d.items()}, orient='index'
).add_prefix('subcode').rename_axis(['fruit', 'code']).reset_index()
fruit code subcode0 subcode1 subcode2
0 berries 100 100A 100B 100C
1 berries 400 400A NaN NaN
2 nuts 500 500A NaN NaN
3 tropical 200 200A 200B NaN
4 tropical 300 300A NaN NaN