python pandas: trying to use date_range

Posted: 2018-03-25 21:21:15

Tags: python pandas vectorization numba

Here is my dataframe:

import pandas as pd
df = pd.DataFrame({
    'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
    'START_DATE': ['2018-01-05', '2018-01-04', '2018-01-01', '2018-01-23', '2018-02-01', '2018-03-11'],
    'STOP_DATE': ['2018-01-22', '2018-03-10', '2018-01-31', '2018-02-15', '2018-04-01', '2018-07-21'],
    'AMOUNT': [5, 3, 11, 14, 7, 9],
    })
df.START_DATE = pd.to_datetime(df.START_DATE, format='%Y-%m-%d')
df.STOP_DATE = pd.to_datetime(df.STOP_DATE, format='%Y-%m-%d')
df
>>>   AMOUNT KEY  START_DATE  STOP_DATE
0       5     A   2018-01-05  2018-01-22
1       3     B   2018-01-04  2018-03-10
2      11     C   2018-01-01  2018-01-31
3      14     A   2018-01-23  2018-02-15
4       7     A   2018-02-01  2018-04-01
5       9     B   2018-03-11  2018-07-21

Considering a linear distribution of AMOUNT between START_DATE and STOP_DATE (i.e. a constant amount per day), I am trying to sum AMOUNT per KEY and per month. The expected output is shown below. I also want to keep track of the number of charged days in each month; for example, KEY A has overlapping periods in February, so its number of charged days can be greater than 28.

             DAYS     AMOUNT
KEY
A   2018_01    27  10.250000
    2018_02    43  12.016667
    2018_03    31   3.616667
    2018_04     1   0.116667
B   2018_01    28   1.272727
    2018_02    28   1.272727
    2018_03    31   1.875598
    2018_04    30   2.030075
    2018_05    31   2.097744
    2018_06    30   2.030075
    2018_07    21   1.421053
C   2018_01    31  11.000000
    2018_02     0   0.000000
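To make the proration concrete, here is a minimal sketch (not part of the original question) showing how a single row would be split by month; the values are taken from the row KEY A, 2018-01-23 to 2018-02-15, AMOUNT 14:

import pandas as pd

# One entry per charged day of this example row
days = pd.Series(1, index=pd.date_range('2018-01-23', '2018-02-15'))
per_month = days.resample('MS').sum()           # charged days per month: Jan 9, Feb 15
prorated = per_month / per_month.sum() * 14     # linear share of AMOUNT: 5.25, 8.75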

I came up with the solution detailed below, but it is very inefficient and takes forever to run on my dataset of roughly 100 million rows. I am looking for an improved version but could not manage to vectorize the row-by-row resample part. Not sure whether numba / @jit would help; I added the tags just in case.

from pandas.tseries.offsets import MonthEnd

# Prepare the final dataframe (filled with zeros)
bounds = df.groupby('KEY').agg({'START_DATE': min, 'STOP_DATE': max}).reset_index()
multiindex = []
for row in bounds.itertuples():
    dates = pd.date_range(start=row.START_DATE, end=row.STOP_DATE + MonthEnd(), freq='M').strftime('%Y_%m')
    multiindex.extend([(row.KEY, date) for date in dates])
index = pd.MultiIndex.from_tuples(multiindex)
final = pd.DataFrame(0, index=index, columns=['DAYS', 'AMOUNT'])

# Run the actual iteration over rows
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
for row in df.itertuples():
    data = pd.Series(index=pd.date_range(start=row.START_DATE, end=row.STOP_DATE))
    data = data.resample('MS').size().rename('DAYS').to_frame()
    data['AMOUNT'] = data.DAYS / row.TOTAL_DAYS * row.AMOUNT
    data.index = data.index.strftime('%Y_%m')
    # Add data to the final dataframe
    final.loc[(row.KEY, data.index.tolist()), 'DAYS'] += data.DAYS.values
    final.loc[(row.KEY, data.index.tolist()), 'AMOUNT'] += data.AMOUNT.values

1 Answer:

Answer 0 (score: 2)

I ended up with the solution below (largely inspired by @jezrael's answer to this post). It is probably not the most memory-efficient solution, but that is not my main concern here; execution time is!

from pandas.tseries.offsets import MonthBegin

df['ID'] = range(len(df))
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
df
>>>   AMOUNT KEY  START_DATE  STOP_DATE   ID  TOTAL_DAYS
0       5     A   2018-01-05  2018-01-22   0      18
1       3     B   2018-01-04  2018-03-10   1      66
2      11     C   2018-01-01  2018-01-31   2      31
3      14     A   2018-01-23  2018-02-15   3      24
4       7     A   2018-02-01  2018-04-01   4      60
5       9     B   2018-03-11  2018-07-21   5     133

# Stack START_DATE / STOP_DATE per ID, then resample to get one row per
# month-end between the two dates of each ID
final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
           .reset_index(level=-1, drop=True)
           .rename('DATE_AFTER')
           .to_frame())
final = final.groupby('ID').apply(
    lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
# Bring back the row attributes and add the month boundaries
final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], how='left', on=['ID'])
final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()

At this point, final looks like this:

final
>>> ID DATE_AFTER KEY  AMOUNT  TOTAL_DAYS  PERIOD DATE_BEFORE
0    0 2018-01-31   A       5          18 2018-01  2018-01-01
1    1 2018-01-31   B       3          66 2018-01  2018-01-01
2    1 2018-02-28   B       3          66 2018-02  2018-02-01
3    1 2018-03-31   B       3          66 2018-03  2018-03-01
4    2 2018-01-31   C      11          31 2018-01  2018-01-01
5    3 2018-01-31   A      14          24 2018-01  2018-01-01
6    3 2018-02-28   A      14          24 2018-02  2018-02-01
7    4 2018-02-28   A       7          60 2018-02  2018-02-01
8    4 2018-03-31   A       7          60 2018-03  2018-03-01
9    4 2018-04-30   A       7          60 2018-04  2018-04-01
10   5 2018-03-31   B       9         133 2018-03  2018-03-01
11   5 2018-04-30   B       9         133 2018-04  2018-04-01
12   5 2018-05-31   B       9         133 2018-05  2018-05-01
13   5 2018-06-30   B       9         133 2018-06  2018-06-01
14   5 2018-07-31   B       9         133 2018-07  2018-07-01

Then we merge the initial df in twice (on the start month and on the stop month):

final = pd.merge(
    final,
    df[['ID', 'STOP_DATE']].assign(PERIOD = df.STOP_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])

final = pd.merge(
    final,
    df[['ID', 'START_DATE']].assign(PERIOD = df.START_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])

# Keep the real START/STOP date when it falls inside the month,
# otherwise fall back to the month boundaries
final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)

final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1

final = final.drop(columns=['ID', 'DATE_AFTER', 'DATE_BEFORE'])
final.AMOUNT *= final.DAYS / final.TOTAL_DAYS
final = final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': sum, 'DAYS': sum})

Expected result:

             AMOUNT     DAYS
KEY PERIOD                  
A   2018-01  10.250000    27   
    2018-02  12.016667    43
    2018-03   3.616667    31
    2018-04   0.116667     1
B   2018-01   1.272727    28
    2018-02   1.272727    28
    2018-03   1.875598    31
    2018-04   2.030075    30
    2018-05   2.097744    31
    2018-06   2.030075    30
    2018-07   1.421053    21
C   2018-01  11.000000    31
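
As a quick sanity check (not part of the original answer), the prorated amounts should add back up to the input totals per KEY, and KEY A should show 43 charged days in February because of its two overlapping periods:

# Hypothetical verification snippet: totals per KEY must match the input df
print(final.groupby('KEY').AMOUNT.sum())   # A 26.0, B 12.0, C 11.0
print(df.groupby('KEY').AMOUNT.sum())      # same totals as the input
print(final.loc[('A', pd.Period('2018-02', freq='M')), 'DAYS'])  # 43 (> 28 days)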