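I have two DataFrames, test1 and test2. For each ID value in test2, I want to check the date in test2 and compare it with the date range for the same ID in test1. If any of the dates in test2 fall within the date range in test1, sum the amount column and assign that sum as an additional column in test1.
Desired output: the new test1 df will have a column amount_sum, which is the sum of all the amounts in test2 whose date falls within the date range in test1 for that ID.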
Answer 0 (score: 3)
Use:
#outer join both df by ID columns
df = test1.merge(test2, on='ID', how='outer')
#filter by range
df = df[(df.date > df.date1) & (df.date < df.date2)]
#thank you @Abhi for alternative
#(pandas >= 1.3 expects inclusive='neither'; older versions used inclusive=False)
#df = df[df.date.between(df.date1, df.date2, inclusive='neither')]
#aggregate sum
s = df.groupby(['ID','date1','date2'])['amount'].sum()
#add new column to test1
test = test1.join(s, on=['ID','date1','date2'])
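One caveat with the steps above: because the sums are grouped by `['ID','date1','date2']`, two rows of `test1` that share the same ID and the same date range would each receive a doubled total. The following is a minimal sketch of a variant that groups by row position instead. The helper name `add_amount_sum`, the temporary `_row` column, and the small hand-made frames are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def add_amount_sum(test1, test2):
    """Hypothetical helper: per-row sum of test2.amount strictly inside (date1, date2)."""
    left = test1.copy()
    left["_row"] = range(len(left))  # positional id of each test1 row
    # pair every test1 row with every test2 row that has the same ID
    merged = left.merge(test2, on="ID", how="inner")
    # keep only pairs whose date lies strictly between date1 and date2
    in_range = merged[(merged["date"] > merged["date1"]) & (merged["date"] < merged["date2"])]
    # sum per original test1 row rather than per (ID, date1, date2) key
    sums = in_range.groupby("_row")["amount"].sum()
    out = test1.copy()
    # rows without any match stay NaN
    out["amount_sum"] = sums.reindex(range(len(test1))).to_numpy()
    return out

# illustrative data: the first two rows are intentional duplicates
test1 = pd.DataFrame({
    "ID": ["a", "a", "b"],
    "date1": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-01"]),
    "date2": pd.to_datetime(["2010-03-01", "2010-03-01", "2010-02-01"]),
})
test2 = pd.DataFrame({
    "ID": ["a", "a", "a", "b"],
    "amount": [5, 7, 1, 3],
    "date": pd.to_datetime(["2010-02-01", "2010-02-15", "2010-04-01", "2010-01-01"]),
})
result = add_amount_sum(test1, test2)
print(result)
```

Here both duplicated `a` rows get 12 (5 + 7), and the `b` row stays NaN because its only date equals `date1` and the comparison is strict. If the boundaries should count as inside the range, `>=` and `<=` could be used instead.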
Sample:
import numpy as np
import pandas as pd
#https://stackoverflow.com/q/21494489
np.random.seed(123)
#https://stackoverflow.com/a/50559321/2901002
def gen(start, end, n):
    #convert nanosecond timestamps to whole seconds
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
n = 10
test1 = pd.DataFrame({
    'ID': np.random.choice(list('abc'), n),
    'date1': gen(pd.to_datetime('2010-01-01'), pd.to_datetime('2010-03-01'), n).floor('d'),
    'date2': gen(pd.to_datetime('2010-03-01'), pd.to_datetime('2010-06-01'), n).floor('d')
})
m = 5
test2 = pd.DataFrame({
    'ID': np.random.choice(list('abc'), m),
    'amount': np.random.randint(10, size=m),
    'date': gen(pd.to_datetime('2010-01-01'), pd.to_datetime('2010-06-01'), m).floor('d')
})
print (test1)
  ID      date1      date2
0  c 2010-01-15 2010-05-22
1  b 2010-02-08 2010-04-16
2  c 2010-01-24 2010-04-12
3  c 2010-02-01 2010-04-09
4  a 2010-01-19 2010-05-20
5  c 2010-01-27 2010-05-24
6  c 2010-02-23 2010-03-15
7  b 2010-01-31 2010-05-09
8  c 2010-02-23 2010-03-29
9  b 2010-01-08 2010-03-07
print (test2)
  ID  amount       date
0  a       4 2010-05-15
1  b       6 2010-03-26
2  a       1 2010-01-07
3  b       5 2010-02-07
4  a       6 2010-04-13
#outer join both df by ID columns
df = test1.merge(test2, on='ID', how='outer')
#filter by range
df = df[(df.date > df.date1) & (df.date < df.date2)]
print (df)
   ID      date1      date2  amount       date
6   b 2010-02-08 2010-04-16     6.0 2010-03-26
8   b 2010-01-31 2010-05-09     6.0 2010-03-26
9   b 2010-01-31 2010-05-09     5.0 2010-02-07
11  b 2010-01-08 2010-03-07     5.0 2010-02-07
12  a 2010-01-19 2010-05-20     4.0 2010-05-15
14  a 2010-01-19 2010-05-20     6.0 2010-04-13
#thank you @Abhi for alternative
#(pandas >= 1.3 expects inclusive='neither'; older versions used inclusive=False)
#df = df[df.date.between(df.date1, df.date2, inclusive='neither')]
#aggregate sum
s = df.groupby(['ID','date1','date2'])['amount'].sum()
#add new column to test1
test = test1.join(s, on=['ID','date1','date2'])
print (test)
  ID      date1      date2  amount
0  c 2010-01-15 2010-05-22     NaN
1  b 2010-02-08 2010-04-16     6.0
2  c 2010-01-24 2010-04-12     NaN
3  c 2010-02-01 2010-04-09     NaN
4  a 2010-01-19 2010-05-20    10.0
5  c 2010-01-27 2010-05-24     NaN
6  c 2010-02-23 2010-03-15     NaN
7  b 2010-01-31 2010-05-09    11.0
8  c 2010-02-23 2010-03-29     NaN
9  b 2010-01-08 2010-03-07     5.0
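Rows of test1 with no matching date in test2 end up as NaN, and the new column is called `amount` rather than the `amount_sum` named in the question. If zeros and the requested name are preferred, one possible follow-up step is sketched below. The three-row frame only mirrors a few values from the output above, and treating "no match" as 0 is an assumption rather than something the answer specifies.

```python
import numpy as np
import pandas as pd

# small stand-in for the joined result (values copied from rows 0, 1 and 4 above)
test = pd.DataFrame({
    "ID": ["c", "b", "a"],
    "amount": [np.nan, 6.0, 10.0],
})

# rename to the column name requested in the question
test = test.rename(columns={"amount": "amount_sum"})
# assumption: a row with no in-range dates should show 0 instead of NaN
test["amount_sum"] = test["amount_sum"].fillna(0)
print(test)
```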