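I have two different CSV files. I combine them into a single DataFrame and group it by the "class_name" column. The groupby works as expected, but I don't know how to perform operations that compare the groups with each other. Relative to r1.csv, the algebra class lost 5 students, so I expect -5; calculus gained 5, so it should be +5. These values must be added as new columns in a separate DataFrame. The same goes for the date arithmetic.
Here is what I have tried so far: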
import pandas as pd
report_1_df = pd.read_csv('r1.csv')
report_2_df = pd.read_csv('r2.csv')
for group, elements in pd.concat([report_1_df, report_2_df], axis=0, sort=False).groupby('class_name'):
    print(elements)
I can see that the grouping works. I tried .sum() and .diff(), but neither seems to do what I want. What can I do here? Thanks.
r1.csv
class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
r2.csv
class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
Desired output:
class_name,student_count,student_count_change,start_time,start_time_delay,end_time,end_time_delay
algebra,10,-5,"2019,Dec,09",1,"2019,Dec,10",1
calculus,15,5,"2019,Dec,09",1,"2019,Dec,10",1
statistics,12,-12,"2019,Dec,08",0,"2019,Dec,09",0
trigonometry,12,12,"2019,Dec,09",0,"2019,Dec,10",0
Answer 0 (score: 0)
I'm not sure whether there is a more direct way, but you can start by appending the missing classes to both DataFrames:
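import numpy as np
import pandas as pd
df1 = pd.read_csv("r1.csv")  # first report
df2 = pd.read_csv("r2.csv")  # second report
# Series.append was removed in pandas 2.0, so use pd.concat instead
classes = pd.concat([df1["class_name"], df2["class_name"]]).unique()
def fill_data(df):
    # Append a zero-count row (dates copied from the first row) for each missing class
    for i in np.setdiff1d(classes, df["class_name"].values):
        df.loc[df.shape[0]] = [i, 0, *df.iloc[0, 2:].values]
    return df
df1 = fill_data(df1)
df2 = fill_data(df2)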
With the missing classes filled in, you can use groupby to assign the difference to a new column, and finally drop_duplicates to keep one row per class:
df = pd.concat([df1, df2], axis=0).reset_index(drop=True)
df["diff"] = df.groupby("class_name")["student_count"].diff().fillna(df["student_count"])
print(df.drop_duplicates("class_name", keep="last"))
     class_name  student_count   start_time     end_time  diff
4      calculus             15  2019,Dec,09  2019,Dec,10   5.0
5       algebra             10  2019,Dec,09  2019,Dec,10  -5.0
6  trigonometry             12  2019,Dec,09  2019,Dec,10  12.0
7    statistics              0  2019,Dec,09  2019,Dec,10 -12.0
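The output above covers the student count difference but not the date delay columns from the question, and it reports statistics with a count of 0 rather than its last known 12. As a possible alternative (not the approach used in this answer), an outer merge on class_name puts the old and new values side by side so that each change is a plain column subtraction. The following is only a minimal sketch under some assumptions: the function name compare_reports is hypothetical, the sample data is inlined instead of read from r1.csv and r2.csv, the dates are assumed to follow the "%Y,%b,%d" pattern, and a class missing from one report is treated as having a count of 0 and a delay of 0.

```python
import io
import pandas as pd

# Inline copies of the sample files from the question (stand-ins for r1.csv / r2.csv)
R1 = '''class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
'''
R2 = '''class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
'''

DATE_COLS = ["start_time", "end_time"]
DATE_FMT = "%Y,%b,%d"  # assumed date layout, e.g. "2019,Dec,08"


def compare_reports(old, new):
    """Hypothetical helper: one row per class with count change and date delays."""
    old, new = old.copy(), new.copy()
    for col in DATE_COLS:
        old[col] = pd.to_datetime(old[col], format=DATE_FMT)
        new[col] = pd.to_datetime(new[col], format=DATE_FMT)

    # The outer merge keeps classes that appear in only one of the reports
    m = old.merge(new, on="class_name", how="outer",
                  suffixes=("_old", "_new"), sort=True)

    out = pd.DataFrame({"class_name": m["class_name"]})
    # Latest known count: the new value if present, otherwise the old one
    out["student_count"] = (
        m["student_count_new"].fillna(m["student_count_old"]).astype(int)
    )
    # A class missing from a report counts as 0 students in that report
    out["student_count_change"] = (
        m["student_count_new"].fillna(0) - m["student_count_old"].fillna(0)
    ).astype(int)

    for col in DATE_COLS:
        latest = m[f"{col}_new"].fillna(m[f"{col}_old"])
        out[col] = latest.dt.strftime(DATE_FMT)
        # Delay in whole days; 0 when the class is missing from either report
        delay = (m[f"{col}_new"] - m[f"{col}_old"]).dt.days
        out[f"{col}_delay"] = delay.fillna(0).astype(int)
    return out


report_1_df = pd.read_csv(io.StringIO(R1))
report_2_df = pd.read_csv(io.StringIO(R2))
result = compare_reports(report_1_df, report_2_df)
print(result.to_csv(index=False))
```

With real files, the two io.StringIO calls would be replaced by pd.read_csv('r1.csv') and pd.read_csv('r2.csv'). Whether a missing class should yield a delay of 0 or a missing value is a design choice; 0 is used here only because the desired output in the question shows it.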