I have two DataFrames, as shown below:
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[12, 13, 14, 15, 15, 18],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)
year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[15, 15, 15, 15, 14, 14],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)
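The DataFrames are not indexed by date (the index is some other column). I want to combine the rows whose DATE values match in both DataFrames and add a new column computed from the matched rows: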
df_FINAL['AVG_TEMP'] = (df1['TEMP'] + df2['TEMP']) / 2
So the final DataFrame should look like this:
DAY TEMP DATE AVG_TEMP
0 MON 15 01/01/20 13.5
1 MON 15 02/01/20 14.0
2 MON 15 03/01/20 14.5
3 TUE 15 06/01/20 15.0
4 TUE 14 07/01/20 14.5
How can I achieve this?
Answer 0 (score: 2)
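You can use pd.merge on the DATE and DAY columns, since the same date always falls on the same day. Take the mean of the TEMP_x and TEMP_y columns created by the merge as AVG_TEMP, then drop the TEMP_x and TEMP_y columns.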
import pandas as pd
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[12, 13, 14, 15, 15, 18],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)
year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[15, 15, 15, 15, 14, 14],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)
df_result = df1.merge(df2, on=["DATE","DAY"])
df_result['AVG_TEMP'] = (df_result['TEMP_x'] + df_result['TEMP_y']) / 2
df_result = df_result.drop(columns=['TEMP_x','TEMP_y'])
Output:
>>> df_result
DAY DATE AVG_TEMP
0 MON 01/01/20 13.5
1 MON 02/01/20 14.0
2 MON 03/01/20 14.5
3 TUE 06/01/20 15.0
4 TUE 07/01/20 14.5
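The output above no longer has a TEMP column, while the result requested in the question keeps df2's TEMP. One possible variation (a sketch, not part of the answer above) is to merge only df1's DATE and TEMP columns into df2. The suffix `_df1` and the names `merged` and `df_final` are illustrative choices, not anything pandas requires.

```python
import pandas as pd

df1 = pd.DataFrame({
    'DAY': ['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP': [12, 13, 14, 15, 15, 18],
    'DATE': ['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20'],
})
df2 = pd.DataFrame({
    'DAY': ['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP': [15, 15, 15, 15, 14, 14],
    'DATE': ['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20'],
})

# Bring in only DATE and TEMP from df1, so df2's DAY and TEMP keep their names.
# The empty first suffix leaves df2's TEMP untouched; df1's becomes TEMP_df1.
merged = df2.merge(df1[['DATE', 'TEMP']], on='DATE', how='inner',
                   suffixes=('', '_df1'))

# Average the two temperatures for each matched date.
merged['AVG_TEMP'] = (merged['TEMP'] + merged['TEMP_df1']) / 2

# Drop the helper column once the average has been computed.
df_final = merged.drop(columns='TEMP_df1')
print(df_final)
```

This assumes each DATE appears at most once per DataFrame; repeated dates would produce one row per pairing.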
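Answer 1 (score: 0)
Call pd.merge() on both columns with an inner join (a value must be present in both DataFrames to appear in the result) to create an intermediate DataFrame. Then create a new column that computes the average: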
df3 = df1.merge(df2,on=['DATE','DAY'],how='inner')
df3['AVG_TEMP'] = (df3.TEMP_x + df3.TEMP_y)/2
df3.drop(['TEMP_x','TEMP_y'],inplace=True,axis=1)
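To see which rows the inner join leaves out, a small sketch like the following can help. It runs the same merge as an outer join with `indicator=True`, which adds a `_merge` column saying whether each row came from the left frame, the right frame, or both. The names `check` and `unmatched` are made up for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'DAY': ['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP': [12, 13, 14, 15, 15, 18],
    'DATE': ['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20'],
})
df2 = pd.DataFrame({
    'DAY': ['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP': [15, 15, 15, 15, 14, 14],
    'DATE': ['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20'],
})

# An outer join keeps every row from both frames; indicator=True records
# the origin of each row in a column named "_merge".
check = df1.merge(df2, on=['DATE', 'DAY'], how='outer', indicator=True)

# Rows not marked "both" are exactly the ones an inner join would discard.
unmatched = check[check['_merge'] != 'both']
print(unmatched[['DATE', 'DAY', '_merge']])
```

With the sample data, the unmatched rows are 08/01/20 (present only in df1) and 10/01/20 (present only in df2), which is why neither shows up in the inner-join result.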
Answer 2 (score: 0)
You can do all of this with the merge command and a lambda function. I have also included a few alternative options so you know they are available.
import pandas as pd
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[12, 13, 14, 15, 15, 18],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)
year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[15, 15, 15, 15, 14, 14],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)
#inner join, matching your example
#you can use either rename or suffixes; suffixes is used here
#the first suffix is empty, so df2's columns keep their names
#the second suffix is _y, which marks df1's columns to be dropped later
#the .rename line is left commented out in case you want to try that option
#the answer to your question starts here
df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))
#.rename(columns={'DAY_x':'DAY','TEMP_x':'TEMP'})
.assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2))
#drop the _y columns as you don't need them
df_FINAL.drop(list(df_FINAL.filter(regex='_y$')), axis=1, inplace=True)
print(df_FINAL)
Another way to do this is to chain everything into a single expression, as follows:
#merge on inner join based on your example
#the first suffix is empty, the second is _y, which marks columns to be dropped later
#after the processing, filter out the column with _y
df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))
.assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2)
.filter(regex='^(?!.*_y)'))
The final result looks like this:
DAY TEMP DATE AVG_TEMP
0 MON 15 01/01/20 13.5
1 MON 15 02/01/20 14.0
2 MON 15 03/01/20 14.5
3 TUE 15 06/01/20 15.0
4 TUE 14 07/01/20 14.5
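The two regular expressions used above are not quite equivalent: `'_y$'` matches only names that end in `_y`, while `'^(?!.*_y)'` rejects any name containing `_y` anywhere. The sketch below shows the difference on the column names produced by the merge, plus one hypothetical column, `TEMP_yesterday`, added purely for illustration. The variant pattern `'^(?!.*_y$)'` is a suggestion of mine, not part of the answer.

```python
import pandas as pd

# Column names as they stand after the merge and assign steps,
# plus a hypothetical TEMP_yesterday column to expose the difference.
cols = ['DAY', 'TEMP', 'DATE', 'DAY_y', 'TEMP_y', 'AVG_TEMP', 'TEMP_yesterday']
demo = pd.DataFrame(columns=cols)

# Pattern from the answer: drops every name that contains "_y" anywhere.
kept_lookahead = list(demo.filter(regex=r'^(?!.*_y)').columns)

# Anchored variant: drops only names that end in "_y".
kept_anchored = list(demo.filter(regex=r'^(?!.*_y$)').columns)

print(kept_lookahead)
print(kept_anchored)
```

For the DataFrames in the question both patterns keep the same four columns, so the choice only matters when other column names happen to contain `_y`.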
Answer 3 (score: 0)
Use pd.concat() and df.groupby:
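#stack df2 on top of df1 with a fresh index so rows are uniquely labelled
df3 = pd.concat([df2, df1], ignore_index=True)
temp_by_date = df3.groupby('DATE')['TEMP']
#per-row mean for each DATE, kept only where the date occurs in both frames
df3['AVG_TEMP'] = temp_by_date.transform('mean').where(temp_by_date.transform('count') > 1)
#keep the first row per DATE (df2's values) and drop the unmatched dates
df3 = df3.groupby('DATE', as_index=False).first().dropna()
print(df3)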
Output:
DATE DAY TEMP AVG_TEMP
0 01/01/20 MON 15 13.5
1 02/01/20 MON 15 14.0
2 03/01/20 MON 15 14.5
3 06/01/20 TUE 15 15.0
4 07/01/20 TUE 14 14.5