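Sorry, I'm new to Python.
I have a DataFrame of entities that record a value once a month. For each unique entity in the DataFrame, I find the maximum value and then the month in which that maximum occurs. Using that peak month, I calculate the time difference, in days, between each of the entity's months and its peak month. This works for small DataFrames.
I know my loop performs poorly and won't scale to larger DataFrames (e.g., 3M rows, 156+ MB). After weeks of research, I've gathered that my loop is degenerate, and I have a feeling there is a NumPy solution or something more Pythonic. Can anyone see a more performant way to calculate the time delta in days?
I've tried various value.shift(x) calculations in a lambda function, but the peaks are not consistent. I've also tried computing additional columns to minimize the calculations done inside the loop.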
import pandas as pd
df = pd.DataFrame({'entity':['A','A','A','A','B','B','B','C','C','C','C','C'], 'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019','1/31/2009','2/28/2009','3/31/2009','8/31/2011','9/30/2011','10/31/2011','11/30/2011','12/31/2011'], 'value':['80','600','500','400','150','300','100','200','250','300','200','175'], 'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']})
df['month'] = df['month'].apply(pd.to_datetime)
for entity in set(df['entity']):
    # set peak value
    peak_value = df.loc[df['entity'] == entity, 'value'].max()
    # set peak value date
    peak_date = df.loc[(df['entity'] == entity) & (df['value'] == peak_value), 'month'].min()
    # subtract peak date from current date
    delta = df.loc[df['entity'] == entity, 'month'] - peak_date
    # update days_delta with delta in days
    df.loc[df['entity'] == entity, 'days_delta'] = delta
Result:
entity      month value month_number days_delta
     A 2018-10-31    80            1     0 days
     A 2018-11-30   600            2    30 days
     A 2018-12-31   500            3    61 days
     A 2019-01-31   400            4    92 days
     B 2009-01-31   150            1   -28 days
     B 2009-02-28   300            2     0 days
     B 2009-03-31   100            3    31 days
     C 2011-08-31   200            1   -61 days
     C 2011-09-30   250            2   -31 days
     C 2011-10-31   300            3     0 days
     C 2011-11-30   200            4    30 days
     C 2011-12-31   175            5    61 days
Answer 0 (score: 0)
First, let's make sure that value is numeric:
df = pd.DataFrame({
'entity':['A','A','A','A','B','B','B','C','C','C','C','C'],
'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019',
'1/31/2009','2/28/2009','3/31/2009','8/31/2011',
'9/30/2011','10/31/2011','11/30/2011','12/31/2011'],
'value':['80','600','500','400','150','300','100','200','250','300','200','175'],
'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']
})
df['month'] = pd.to_datetime(df['month'])
df['value'] = pd.to_numeric(df['value'])
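To see why the conversion matters, here is a small sketch using entity A's values from the question. The names raw, string_peak and numeric_peak are made up for illustration. As strings, the values are compared lexicographically, which is consistent with the question's output putting A's peak on the row with value 80.

```python
import pandas as pd

# Entity A's values exactly as they appear in the question: strings, not numbers.
raw = pd.Series(['80', '600', '500', '400'])

# String comparison is character by character, so '80' ranks above '600'.
string_peak = raw.max()

# After conversion the comparison is numeric and 600 is the peak.
numeric_peak = pd.to_numeric(raw).max()

print(string_peak, numeric_peak)
```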
Then use transform and idxmax:
max_months = df.groupby('entity').value.transform('idxmax').map(df.month)
df.assign(days_delta=df.month - max_months)
    entity      month  value month_number days_delta
0        A 2018-10-31     80            1   -30 days
1        A 2018-11-30    600            2     0 days
2        A 2018-12-31    500            3    31 days
3        A 2019-01-31    400            4    62 days
4        B 2009-01-31    150            1   -28 days
5        B 2009-02-28    300            2     0 days
6        B 2009-03-31    100            3    31 days
7        C 2011-08-31    200            1   -61 days
8        C 2011-09-30    250            2   -31 days
9        C 2011-10-31    300            3     0 days
10       C 2011-11-30    200            4    30 days
11       C 2011-12-31    175            5    61 days
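If you want whole-number days rather than a Timedelta column, one possible variant is sketched below. It rests on a few assumptions that are not part of the answer above: the index is unique (map looks each peak row up by its index label), the intermediate names peak_idx and peak_month are just illustrative, and .dt.days is used to convert the result to integers.

```python
import pandas as pd

df = pd.DataFrame({
    'entity': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'month': ['10/31/2018', '11/30/2018', '12/31/2018', '1/31/2019',
              '1/31/2009', '2/28/2009', '3/31/2009', '8/31/2011',
              '9/30/2011', '10/31/2011', '11/30/2011', '12/31/2011'],
    'value': ['80', '600', '500', '400', '150', '300', '100',
              '200', '250', '300', '200', '175'],
})
df['month'] = pd.to_datetime(df['month'], format='%m/%d/%Y')
df['value'] = pd.to_numeric(df['value'])

# For every row, the index label of the row holding its entity's maximum value.
peak_idx = df.groupby('entity')['value'].transform('idxmax')

# Look up the month stored at each of those labels (assumes a unique index).
peak_month = peak_idx.map(df['month'])

# Timedelta -> integer number of days.
df['days_delta'] = (df['month'] - peak_month).dt.days

print(df)
```

Note that idxmax picks the first row holding the maximum, so with tied peaks this only matches the loop's .min() over peak dates when the rows are sorted by month within each entity.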