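In short, I need to transform the input DataFrame below into the output shown after it.
After a few hours of combining several earlier Stack Overflow questions, I found a way to transform the DataFrame, but because I rely on pivot and apply, it takes a very long time on large DataFrames.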
import numpy as np
import pandas as pd
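df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "day": pd.Timestamp('20190529'),
                   # subject/score values chosen to match the input table shown below
                   "subject": ["math,english,economics", "math,economics", "philosophy,english,business", "physics,sociology", "Math"],
                   "score": pd.Categorical(["68,62,49", "58,87", "28,32,46", "72,66", "93"]),
                   "Department": pd.Categorical(["Economics", "Computer Science", "Sociology", "Business", "Math"])})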
---Input DataFrame---
   id         day                      subject     score        Department
0   1  2019-05-29       math,english,economics  68,62,49         Economics
1   2  2019-05-29               math,economics     58,87  Computer Science
2   3  2019-05-29  philosophy,english,business  28,32,46         Sociology
3   4  2019-05-29            physics,sociology     72,66          Business
4   5  2019-05-29                         Math        93              Math
The output should look like this:
---Output DataFrame---
id         day        Department  Math  business  economics  english  math  philosophy  physics  sociology
 1  2019-05-29         Economics   NaN       NaN         49       62    68         NaN      NaN        NaN
 2  2019-05-29  Computer Science   NaN       NaN         87      NaN    58         NaN      NaN        NaN
 3  2019-05-29         Sociology   NaN        46        NaN       32   NaN          28      NaN        NaN
 4  2019-05-29          Business   NaN       NaN        NaN      NaN   NaN         NaN       72         66
 5  2019-05-29              Math    93       NaN        NaN      NaN   NaN         NaN      NaN        NaN
My approach is as follows:
df["subject"] = df["subject"].str.split(",")
df["score"] = df["score"].str.split(",")
subject = df.apply(lambda x: pd.Series(x['subject']),axis=1).stack().reset_index(level=1, drop=True)
score = df.apply(lambda x: pd.Series(x['score']),axis=1).stack().reset_index(level=1, drop=True)
subject.name = 'subject'
score.name = 'score'
subject_score = pd.concat([subject, score],join='outer', axis=1)
pdf = df.drop('subject', axis=1).drop("score", axis=1).join(subject_score)
pivot = pdf.pivot(columns="subject",values="score")
concate_table = df.drop("subject",axis = 1).drop("score", axis=1)
output = concate_table.join(pivot)
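I only started learning pandas recently, and I am sure this is not the best way to turn these values into columns.
I would appreciate any suggestions on how to optimize this code.
Thanks.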
Answer 0: (score: 0)
I would use unnesting first, then pivot and concat.
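def unnesting(df, explode):
    # Repeat each index label once per list element in the first exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining (non-exploded) columns; drop() needs axis/columns as a keyword in pandas 2.x
    return df1.join(df.drop(columns=explode), how='left')

s = df[['subject', 'score']]
s = unnesting(s.apply(lambda x: x.str.split(','), axis=1), ['subject', 'score']).pivot(columns='subject', values='score')
# Note: the original subject and score columns are still present after this concat
df = pd.concat([df, s], axis=1)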
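On newer pandas versions the hand-written unnesting helper above may not be needed. The following is only a sketch, assuming pandas 1.3 or later (where DataFrame.explode accepts a list of columns) and assuming every row has the same number of subjects and scores. The names long, wide and result are made up for illustration, and the sample frame is rebuilt from the input table in the question.

```python
import pandas as pd

# Sample frame mirroring the input table in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "day": pd.Timestamp("20190529"),
    "subject": ["math,english,economics", "math,economics",
                "philosophy,english,business", "physics,sociology", "Math"],
    "score": ["68,62,49", "58,87", "28,32,46", "72,66", "93"],
    "Department": ["Economics", "Computer Science", "Sociology", "Business", "Math"],
})

# Split both string columns into lists, then explode them together so that
# each (subject, score) pair gets its own row; the original index is repeated.
long = (df[["subject", "score"]]
        .apply(lambda col: col.str.split(","))
        .explode(["subject", "score"]))

# Pivot on the repeated index: one column per subject, one row per original row.
wide = long.pivot(columns="subject", values="score")

# Attach the subject columns to the remaining columns of the original frame.
result = df.drop(columns=["subject", "score"]).join(wide)
print(result)
```

This avoids the row-wise apply, which is usually the slow part on large frames, but I have not benchmarked it against the other approaches here.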
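Answer 1: (score: 0)
I would define a custom function stack_str that unpacks a string column into a Series, using str.split with expand=True followed by stack and reset_index.
Applying stack_str to the two string columns gives df1, which contains 2 columns.
Next, pivot df1 so that the values of subject become the columns and the values of score become the values. Finally, join the result back to df with the two string columns dropped.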
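def stack_str(x):
    # Split on commas into separate columns, stack them into one long Series,
    # and drop the inner index level so only the original row label remains
    s = x.str.split(',', expand=True).stack().reset_index(level=-1, drop=True)
    return s

df1 = df[['subject', 'score']].apply(stack_str)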
Out[984]:
      subject  score
0        math     68
0     english     62
0   economics     49
1        math     58
1   economics     87
2  philosophy     28
2     english     32
2    business     46
3     physics     72
3   sociology     66
4        Math     93
df2 = df.drop(['subject', 'score'], axis=1).join(df1.pivot(columns='subject', values='score'))
Out[986]:
   id         day        Department  Math  business  economics  english  math  \
0   1  2019-05-29         Economics   NaN       NaN         49       62    68
1   2  2019-05-29  Computer_Science   NaN       NaN         87      NaN    58
2   3  2019-05-29         Sociology   NaN        46        NaN       32   NaN
3   4  2019-05-29          Business   NaN       NaN        NaN      NaN   NaN
4   5  2019-05-29              Math    93       NaN        NaN      NaN   NaN

   philosophy  physics  sociology
0         NaN      NaN        NaN
1         NaN      NaN        NaN
2          28      NaN        NaN
3         NaN       72         66
4         NaN      NaN        NaN
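One thing to keep in mind: because the scores come out of str.split, the pivoted columns hold strings rather than numbers. If numeric scores are needed, something like the following might help. It is just a sketch; the small pivoted frame is a hypothetical stand-in for a few of the subject columns above, not the real output.

```python
import pandas as pd

# Hypothetical stand-in for some of the pivoted score columns
# (string scores, missing where a student did not take the subject).
pivoted = pd.DataFrame({
    "economics": ["49", "87", None],
    "math": ["68", "58", None],
    "philosophy": [None, None, "28"],
})

# pd.to_numeric works on one Series at a time, so apply it column by column.
numeric = pivoted.apply(pd.to_numeric)

print(numeric.dtypes)
print(numeric.mean())
```

After the conversion the missing entries are NaN and the columns can be used in numeric operations such as mean or sum.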