Conditional cumulative sum in a pandas DataFrame based on a date range

Time: 2020-10-10 11:09:38

Tags: python pandas dataframe

I have a pandas DataFrame:

         Date            Party    Status
-------------------------------------------
0        01-01-2018      John     Sent
1        13-01-2018      Lisa     Received
2        15-01-2018      Will     Received
3        19-01-2018      Mark     Sent
4        02-02-2018      Will     Sent
5        28-02-2018      John     Received

I want to add new columns that perform a .cumsum(), but conditional on the date. The result should look like this:

                                                Num of Sent         Num of Received
         Date            Party    Status        in Past 30 Days     in Past 30 Days
-----------------------------------------------------------------------------------
0        01-01-2018      John     Sent          1                   0
1        13-01-2018      Lisa     Received      1                   1
2        15-01-2018      Will     Received      1                   2
3        19-01-2018      Mark     Sent          2                   2
4        02-02-2018      Will     Sent          2                   2
5        28-02-2018      John     Received      1                   1

I managed to get what I need by writing the following code:

def inner_func(date_var, status_var, date_array, status_array):
    sent_increment = 0
    received_increment = 0

    for k in range(0, len(date_array)):
        if((date_var - date_array[k]).days <= 30):
            if(status_array[k] == "Sent"):
                sent_increment += 1
            elif(status_array[k] == "Received"):
                received_increment += 1

    return sent_increment, received_increment

import pandas as pd

# The date strings are day-first (dd-mm-yyyy), so say so explicitly when parsing.
df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"], dayfirst=True),
                   "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                   "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})

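# Sort by date and renumber the rows so that the label-based loop below visits them in date order.
df = df.sort_values("Date").reset_index(drop=True)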
date_array = []
status_array = []

for i in range(0, len(df)):
    # Record the current row, then count every row recorded so far that lies within 30 days of it.
    date_var = df.loc[i, "Date"]
    date_array.append(date_var)
    status_var = df.loc[i, "Status"]
    status_array.append(status_var)
    sent_count, received_count = inner_func(date_var, status_var, date_array, status_array)
    df.loc[i, "Num of Sent in Past 30 days"] = sent_count
    df.loc[i, "Num of Received in Past 30 days"] = received_count

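However, when df is large this is computationally expensive and slow, because the nested loops go over the DataFrame twice. Is there a more Pythonic way to achieve this without iterating over the DataFrame the way I am doing?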

Update 2

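Michael provided a solution for what I was looking for: here. Now suppose I want to apply that solution to a groupby object, for example using the rolling solution to compute the cumulative sums for each party: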

                                                Sent past 30       Received past 30
         Date            Party    Status        days by party      days by party
-----------------------------------------------------------------------------------
0        01-01-2018      John     Sent          1                   0
1        13-01-2018      Lisa     Received      0                   1
2        15-01-2018      Will     Received      0                   1
3        19-01-2018      Mark     Sent          1                   0
4        02-02-2018      Will     Sent          1                   1
5        28-02-2018      John     Received      0                   1

I tried to reproduce the solution with the groupby approach below:

l = []
grp_obj = df.groupby("Party")
grp_obj.rolling('30D',  min_periods=1)["dummy"].apply(lambda x: l.append(x.value_counts()) or 0)
df.reset_index(inplace=True)

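But I ended up with incorrect values. I know this is because concat combines the DataFrames without taking the original index into account, since groupby orders the data differently. Is there a way to modify the list appending so that it includes the original index, so that the value_counts DataFrame can be merged/joined onto the original index?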
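The index mismatch described above can be illustrated with a minimal sketch. The two frames here are made up for illustration and are not taken from the question. As far as I know, pd.concat with axis=1 lines rows up by index label, so a frame built from a plain list (labels 0, 1, 2, ...) ends up attached to whichever rows carry those labels, regardless of the order in which the list was filled.

```python
import pandas as pd

# Hypothetical frames, not taken from the question.
left = pd.DataFrame({"a": [10, 20, 30]}, index=[2, 0, 1])  # rows in a shuffled label order
right = pd.DataFrame({"b": ["x", "y", "z"]})                # built from a list: labels 0, 1, 2

combined = pd.concat([left, right], axis=1)

# The row labelled 2 receives "z" (label 2 in `right`), not "x" (the first position).
print(combined.loc[2, "a"], combined.loc[2, "b"])
```

So if the list is filled in groupby order while the DataFrame is in another order, the counts land on the wrong rows unless both sides share the same labels.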

1 Answer:

Answer 0: (score: 2)

If you set Date as the index and temporarily convert Status to a categorical, you can use DataFrame.rolling together with a small trick:

df = df.set_index('Date')
df['dummy'] = df['Status'].astype('category',copy=False).cat.codes
l = []
df.rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
df.reset_index(inplace=True)
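pd.concat(
    [df,
    (pd.DataFrame(l)
        .reset_index(drop=True)  # row labels may be taken from the Series names; fall back to positions
        .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
        .fillna(0)
        .astype('int'))
    ], axis=1).drop(columns='dummy')  # keyword form; recent pandas no longer accepts a positional axis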

Out:

        Date Party    Status  Received past 30 Days  Sent past 30 Days
0 2018-01-01  John      Sent                      0                  1
1 2018-01-13  Lisa  Received                      1                  1
2 2018-01-15  Will  Received                      2                  1
3 2018-01-19  Mark      Sent                      2                  2
4 2018-02-02  Will      Sent                      2                  2
5 2018-02-28  John  Received                      1                  1
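As an alternative to collecting value_counts in a list, here is a minimal sketch that should give the same counts using one 0/1 indicator column per status and a time-based rolling sum. This is not the method used in the answer above. The names flags and counts are introduced only for this illustration, and dayfirst=True assumes the date strings are dd-mm-yyyy.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"],
        dayfirst=True),  # assumption: the strings are day-first
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
}).sort_values("Date").reset_index(drop=True)

# One 0/1 column per status value, indexed by date so a time-based window can be used.
flags = pd.get_dummies(df["Status"]).astype(int)
flags.index = pd.DatetimeIndex(df["Date"])

# Sum the indicators over the trailing 30-day window ending at each row.
counts = flags.rolling("30D").sum().astype(int)

# Rows are in the same order as df, so the values can be assigned positionally.
df["Sent past 30 Days"] = counts["Sent"].to_numpy()
df["Received past 30 Days"] = counts["Received"].to_numpy()
print(df)
```

The rolling sum is computed by pandas itself rather than by a Python lambda with side effects, which may also help on larger inputs, although that is not measured here.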

Keeping the original index to allow a subsequent merge

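The data is adjusted slightly so that the Date index is no longer in sorted order. (The strings are parsed without dayfirst=True, so "03-01-2018" and "08-02-2018" appear as March 1 and August 2 in the output below.)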

df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "03-01-2018", "19-01-2018", "08-02-2018", "22-02-2018"]),
                   "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                   "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})
df

Out:

        Date Party    Status
0 2018-01-01  John      Sent
1 2018-01-13  Lisa  Received
2 2018-03-01  Will  Received
3 2018-01-19  Mark      Sent
4 2018-08-02  Will      Sent
5 2018-02-22  John  Received

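After sorting by Date, store the original index in a column, then restore it as the index once the operations on the Date-sorted DataFrame are done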

df = df.sort_values('Date')
df = df.reset_index()
df = df.set_index('Date')
df['dummy'] = df['Status'].astype('category',copy=False).cat.codes
l = []
df.rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
df.reset_index(inplace=True)
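df = pd.concat(
      [df,
      (pd.DataFrame(l)
          .reset_index(drop=True)  # row labels may be taken from the Series names; fall back to positions
          .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
          .fillna(0)
          .astype('int'))
      ], axis=1).drop(columns='dummy')  # keyword form; recent pandas no longer accepts a positional axis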
df.set_index('index')

Out:

            Date Party    Status  Received past 30 Days  Sent past 30 Days
index                                                                     
0     2018-01-01  John      Sent                      0                  1
1     2018-01-13  Lisa  Received                      1                  1
3     2018-01-19  Mark      Sent                      1                  2
5     2018-02-22  John  Received                      1                  0
2     2018-03-01  Will  Received                      2                  0
4     2018-08-02  Will      Sent                      0                  1

Computing the values per group

Sort by Party and Date first so that the grouped counts are appended in the correct order

df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"], dayfirst=True),
                   "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
                   "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]})
df = df.sort_values(['Party','Date'])

Reset the index before the concat so that the counts are attached to the right rows

df = df.set_index('Date')
df['dummy'] = df['Status'].astype('category',copy=False).cat.codes
l = []
df.groupby('Party').rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
df.reset_index(inplace=True)

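pd.concat(
      [df,
      (pd.DataFrame(l)
          .reset_index(drop=True)  # row labels may be taken from the Series names; fall back to positions
          .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
          .fillna(0)
          .astype('int'))
      ], axis=1).drop(columns='dummy').sort_values('Date')  # keyword form; recent pandas no longer accepts a positional axis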

Out:

        Date Party    Status  Received past 30 Days  Sent past 30 Days
0 2018-01-01  John      Sent                      0                  1
2 2018-01-13  Lisa  Received                      1                  0
4 2018-01-15  Will  Received                      1                  0
3 2018-01-19  Mark      Sent                      0                  1
5 2018-02-02  Will      Sent                      1                  1
1 2018-02-28  John  Received                      1                  0
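If the original row labels need to survive, one possible sketch is to compute the rolling counts group by group, put each group's original labels back on its result, and join on the index. This is a variation for illustration, not the answer's code. The names parts, flags, rolled, counts and result are hypothetical, and dayfirst=True assumes dd-mm-yyyy strings.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"],
        dayfirst=True),  # assumption: the strings are day-first
    "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"],
    "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"],
})

parts = []
for party, grp in df.sort_values("Date").groupby("Party"):
    # 0/1 indicators; reindex so both columns exist even if a party has only one status.
    flags = (pd.get_dummies(grp["Status"])
               .reindex(columns=["Sent", "Received"], fill_value=0)
               .astype(int))
    flags.index = pd.DatetimeIndex(grp["Date"])       # time-based windows need a datetime index
    rolled = flags.rolling("30D").sum().astype(int)
    rolled.index = grp.index                          # put the original row labels back
    parts.append(rolled)

counts = pd.concat(parts).add_suffix(" past 30 Days")
result = df.join(counts)                              # aligns on the original index
print(result)
```

Because the join is label-based, the order in which groupby visits the parties does not matter for where the counts end up.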

Microbenchmark

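Since this solution also iterates over the dataset, I compared the running times of the two approaches. Only very small datasets were used, because the running time of the original solution grows quickly.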
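When comparing the two approaches, it may be worth keeping in mind that they are not strictly equivalent at the edge of the window. The following minimal sketch uses two made-up dates exactly 30 days apart: the loop's `.days <= 30` test counts the older row, while rolling('30D') with its default closed='right' excludes it. Passing closed='both' is one way to include it for date-only data.

```python
import pandas as pd

# Hypothetical dates, exactly 30 days apart.
dates = pd.to_datetime(["2018-01-01", "2018-01-31"])
s = pd.Series([1, 1], index=dates)

# Default window is (t - 30D, t]: the row exactly 30 days back is left out.
default_window = s.rolling("30D").sum().tolist()

# closed="both" makes the window [t - 30D, t], which includes that row.
closed_both = s.rolling("30D", closed="both").sum().tolist()

# The loop-style test from the question counts it as well.
loop_counts = (dates[1] - dates[0]).days <= 30

print(default_window, closed_both, loop_counts)
```

With timestamps that carry a time of day, `.days <= 30` also accepts gaps of up to just under 31 days, so the two definitions can differ further.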

Results

[Image: benchmark results]

Code to reproduce the benchmark

import pandas as pd
import perfplot

def makedata(n=1):
  df = pd.DataFrame({"Date": pd.to_datetime(["01-01-2018", "13-01-2018", "15-01-2018", "19-01-2018", "02-02-2018", "28-02-2018"]*n, dayfirst=True),
                   "Party": ["John", "Lisa", "Will", "Mark", "Will", "John"]*n,
                   "Status": ["Sent", "Received", "Received", "Sent", "Sent", "Received"]*n})

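  # Renumber after sorting so that the label-based loop in forloop visits rows in date order.
  return df.sort_values("Date").reset_index(drop=True)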

def rolling(df):
  df = df.set_index('Date')
  df['dummy'] = df['Status'].astype('category',copy=False).cat.codes
  l = []
  df.rolling('30D', min_periods=1)['dummy'].apply(lambda x: l.append(x.value_counts()) or 0)
  df.reset_index(inplace=True)
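  return pd.concat(
      [df,
      (pd.DataFrame(l)
          .reset_index(drop=True)  # row labels may be taken from the Series names; fall back to positions
          .rename(columns={1.0: "Sent past 30 Days", 0.0: "Received past 30 Days"})
          .fillna(0)
          .astype('int'))
      ], axis=1).drop(columns='dummy')  # keyword form; recent pandas no longer accepts a positional axis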

def forloop(df):
  date_array = []
  status_array = []
  def inner_func(date_var, status_var, date_array, status_array):
      sent_increment = 0
      received_increment = 0

      for k in range(0, len(date_array)):
          if((date_var - date_array[k]).days <= 30):
              if(status_array[k] == "Sent"):
                  sent_increment += 1
              elif(status_array[k] == "Received"):
                  received_increment += 1

      return sent_increment, received_increment

  for i in range(0, len(df)):
          date_var = df.loc[i,"Date"]
          date_array.append(date_var)
          status_var = df.loc[i,"Status"]
          status_array.append(status_var)
          sent_count, received_count = inner_func(date_var, status_var, date_array, status_array)
          df.loc[i, "Num of Sent in Past 30 days"] = sent_count
          df.loc[i, "Num of Received in Past 30 days"] = received_count
  return df

perfplot.show(
    setup=makedata,
    kernels=[forloop, rolling],
    n_range=[x for x in range(5, 105, 5)],
    equality_check=None,
    xlabel='len(df)'
)