Here is my dataframe:
import pandas as pd
df = pd.DataFrame({
'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
'START_DATE': ['2018-01-05', '2018-01-04', '2018-01-01', '2018-01-23', '2018-02-01', '2018-03-11'],
'STOP_DATE': ['2018-01-22', '2018-03-10', '2018-01-31', '2018-02-15', '2018-04-01', '2018-07-21'],
'AMOUNT': [5, 3, 11, 14, 7, 9],
})
df.START_DATE = pd.to_datetime(df.START_DATE, format='%Y-%m-%d')
df.STOP_DATE = pd.to_datetime(df.STOP_DATE, format='%Y-%m-%d')
df
>>> AMOUNT KEY START_DATE STOP_DATE
0 5 A 2018-01-05 2018-01-22
1 3 B 2018-01-04 2018-03-10
2 11 C 2018-01-01 2018-01-31
3 14 A 2018-01-23 2018-02-15
4 7 A 2018-02-01 2018-04-01
5 9 B 2018-03-11 2018-07-21
Considering a linear distribution of AMOUNT between START_DATE and STOP_DATE (pro-rated by day), I am trying to sum AMOUNT per KEY and per month. The expected output is shown below:
DAYS AMOUNT
A 2018_01 27 10.250000
2018_02 43 12.016667
2018_03 31 3.616667
2018_04 1 0.116667
B 2018_01 28 1.272727
2018_02 28 1.272727
2018_03 31 1.875598
2018_04 30 2.030075
2018_05 31 2.097744
2018_06 30 2.030075
2018_07 21 1.421053
C 2018_01 31 11.000000
2018_02 0 0.000000
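To make the pro-rata rule concrete, here is how a single row splits across months (a standalone sketch; the helper name `prorate` is my own, not part of the question):

```python
import pandas as pd

def prorate(start, stop, amount):
    """Split `amount` linearly over the months spanned by [start, stop]."""
    days = pd.date_range(start, stop, freq='D')                   # every charged day
    per_month = days.to_period('M').value_counts().sort_index()   # DAYS per month
    return per_month, per_month / len(days) * amount              # DAYS, AMOUNT

# Row 1 above: KEY B, AMOUNT 3, 2018-01-04 .. 2018-03-10 -> 66 days in total
days, amounts = prorate('2018-01-04', '2018-03-10', 3)
# January covers 28 charged days, so it gets 3 * 28/66 ≈ 1.272727
```

This reproduces the B 2018_01 line of the table above; the B 2018_03 line is larger only because a second interval of KEY B also overlaps March.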
I also want to keep track of the number of days charged within each month. For example, KEY = A has overlapping periods in February, so the total number of charged days there can be greater than 28.

I came up with the solution detailed below, but it is highly inefficient and takes ages to run on a dataset of roughly 100 million rows. I am looking for an improved version but could not manage to vectorize the row-by-row part:

from pandas.tseries.offsets import MonthEnd
# Prepare the final dataframe (filled with zeros)
bounds = df.groupby('KEY').agg({'START_DATE': min, 'STOP_DATE': max}).reset_index()
multiindex = []
for row in bounds.itertuples():
    dates = pd.date_range(start=row.START_DATE, end=row.STOP_DATE + MonthEnd(),
                          freq='M').strftime('%Y_%m')
    multiindex.extend([(row.KEY, date) for date in dates])
index = pd.MultiIndex.from_tuples(multiindex)
final = pd.DataFrame(0, index=index, columns=['DAYS', 'AMOUNT'])

# Run the actual iteration over rows
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
for row in df.itertuples():
    data = pd.Series(index=pd.date_range(start=row.START_DATE, end=row.STOP_DATE),
                     dtype=float)
    data = data.resample('MS').size().rename('DAYS').to_frame()
    data['AMOUNT'] = data.DAYS / row.TOTAL_DAYS * row.AMOUNT
    data.index = data.index.strftime('%Y_%m')
    # Add data to the final dataframe
    final.loc[(row.KEY, data.index.tolist()), 'DAYS'] += data.DAYS.values
    final.loc[(row.KEY, data.index.tolist()), 'AMOUNT'] += data.AMOUNT.values
Not sure whether numba's @jit would help here; adding the tag just in case.
Answer (score: 2)
I finally came up with the solution below (largely inspired by @jezrael's answer to this post). It is probably not the most memory-efficient approach, but memory is not my main concern here; execution time is!
from pandas.tseries.offsets import MonthBegin
df['ID'] = range(len(df))
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
df
>>> AMOUNT KEY START_DATE STOP_DATE ID TOTAL_DAYS
0 5 A 2018-01-05 2018-01-22 0 18
1 3 B 2018-01-04 2018-03-10 1 66
2 11 C 2018-01-01 2018-01-31 2 31
3 14 A 2018-01-23 2018-02-15 3 24
4 7 A 2018-02-01 2018-04-01 4 60
5 9 B 2018-03-11 2018-07-21 5 133
final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
           .reset_index(level=-1, drop=True)
           .rename('DATE_AFTER')
           .to_frame())
final = final.groupby('ID').apply(
    lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], how='left', on=['ID'])
final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()
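The stack/resample step above is the key trick: it turns each (START_DATE, STOP_DATE) pair into one row per spanned month, stamped at each month end. Isolating it for a single interval (a minimal sketch using the same idea):

```python
import pandas as pd

# One interval: 2018-01-04 .. 2018-03-10 (ID 1 in the example above)
ends = pd.DataFrame({'DATE_AFTER': pd.to_datetime(['2018-01-04', '2018-03-10'])})

# Resampling between the two endpoints at month-end frequency yields one
# row per month the interval touches: 2018-01-31, 2018-02-28, 2018-03-31
expanded = ends.set_index('DATE_AFTER').resample('M').asfreq().reset_index()
```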
At this point, final looks like this:
final
>>> ID DATE_AFTER KEY AMOUNT TOTAL_DAYS PERIOD DATE_BEFORE
0 0 2018-01-31 A 5 18 2018-01 2018-01-01
1 1 2018-01-31 B 3 66 2018-01 2018-01-01
2 1 2018-02-28 B 3 66 2018-02 2018-02-01
3 1 2018-03-31 B 3 66 2018-03 2018-03-01
4 2 2018-01-31 C 11 31 2018-01 2018-01-01
5 3 2018-01-31 A 14 24 2018-01 2018-01-01
6 3 2018-02-28 A 14 24 2018-02 2018-02-01
7 4 2018-02-28 A 7 60 2018-02 2018-02-01
8 4 2018-03-31 A 7 60 2018-03 2018-03-01
9 4 2018-04-30 A 7 60 2018-04 2018-04-01
10 5 2018-03-31 B 9 133 2018-03 2018-03-01
11 5 2018-04-30 B 9 133 2018-04 2018-04-01
12 5 2018-05-31 B 9 133 2018-05 2018-05-01
13 5 2018-06-30 B 9 133 2018-06 2018-06-01
14 5 2018-07-31 B 9 133 2018-07 2018-07-01
Then we merge the initial df back in twice: once on the stop month (to recover the true STOP_DATE) and once on the start month (to recover the true START_DATE), falling back to the month boundaries everywhere else:
final = pd.merge(
    final,
    df[['ID', 'STOP_DATE']].assign(PERIOD=df.STOP_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])
final = pd.merge(
    final,
    df[['ID', 'START_DATE']].assign(PERIOD=df.START_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])
final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)
final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1
final = final.drop(columns=['ID', 'DATE_AFTER', 'DATE_BEFORE'])
final.AMOUNT *= final.DAYS / final.TOTAL_DAYS
final = final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': 'sum', 'DAYS': 'sum'})
This gives the expected result:
AMOUNT DAYS
KEY PERIOD
A 2018-01 10.250000 27
2018-02 12.016667 43
2018-03 3.616667 31
2018-04 0.116667 1
B 2018-01 1.272727 28
2018-02 1.272727 28
2018-03 1.875598 31
2018-04 2.030075 30
2018-05 2.097744 31
2018-06 2.030075 30
2018-07 1.421053 21
C 2018-01 11.000000 31
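For convenience, the whole pipeline above can be wrapped into one self-contained function (same steps and column names as the answer; the wrapper name `monthly_split` is mine):

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin

def monthly_split(df):
    """Pro-rate AMOUNT by day across months, using the merge-based approach above."""
    df = df.copy()
    df['ID'] = range(len(df))
    df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
    # One row per month spanned by each [START_DATE, STOP_DATE] interval
    final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
               .reset_index(level=-1, drop=True).rename('DATE_AFTER').to_frame())
    final = final.groupby('ID').apply(
        lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
    final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], how='left', on='ID')
    final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
    final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()
    # Recover the true start/stop dates on the boundary months
    final = final.merge(
        df[['ID', 'STOP_DATE']].assign(PERIOD=df.STOP_DATE.dt.to_period('M')),
        how='left', on=['ID', 'PERIOD'])
    final = final.merge(
        df[['ID', 'START_DATE']].assign(PERIOD=df.START_DATE.dt.to_period('M')),
        how='left', on=['ID', 'PERIOD'])
    final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
    final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)
    final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1
    final['AMOUNT'] *= final.DAYS / final.TOTAL_DAYS
    return final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': 'sum', 'DAYS': 'sum'})

df = pd.DataFrame({
    'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
    'START_DATE': pd.to_datetime(['2018-01-05', '2018-01-04', '2018-01-01',
                                  '2018-01-23', '2018-02-01', '2018-03-11']),
    'STOP_DATE': pd.to_datetime(['2018-01-22', '2018-03-10', '2018-01-31',
                                 '2018-02-15', '2018-04-01', '2018-07-21']),
    'AMOUNT': [5, 3, 11, 14, 7, 9],
})
result = monthly_split(df)
```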