How can I optimize a pandas DataFrame workflow that uses groupby and aggregation?

Time: 2020-05-22 11:21:48

Tags: python pandas optimization

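I am reading an Excel file of about 300,000 rows into a pandas DataFrame. I then use groupby to split it into about 18,000 groups. Next, I loop over each group and apply a filter within it (a date filter that selects one month of data) to compute a sum. The whole process takes about 60 minutes. Is there any way to optimize this? The code is below: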

    import datetime
    import pandas as pd

    qgift_dl = pd.read_csv(file, encoding='latin1')  # read csv file
    qgift_dl['user_id'] = qgift_dl['user_id'].astype(str)
    qgift_dl['Gift Date'] = pd.to_datetime(qgift_dl['Gift Date'])
    min_date = qgift_dl['Gift Date'].min()
    today = datetime.datetime.today()
    qgift_dates = get_date_range(min_date, today) # get all dates between
    q_grouped = qgift_dl.groupby(['user_id'])
    details= []
    for group in q_grouped:
        d_rows = group[1]
        d_row_data = [group[0]]  # add donor id
        for dt in qgift_dates:
            lower = dt.strftime('%Y-%m-01')
            upper = dt.strftime('%Y-%m-%d')
            filtered = d_rows[(d_rows['Gift Date'] >= lower) & (d_rows['Gift Date'] <= upper)]
            d_row_data.append(filtered['Amount'].sum())
        details.append(d_row_data)

Below is the get_date_range function. It returns a list of dates (Y-m-d) between the two endpoints, one per month. In my case, the range is "2008-04-30" to "2020-05-30".

    from dateutil.relativedelta import relativedelta
    import datetime, calendar

    def get_date_range(start, end):
        result = []
        while start <= end:
            result.append(start)
            start += relativedelta(months=1)
        return result
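For comparison, the same monthly steps can be generated without a Python loop. The sketch below is only an illustration under stated assumptions, not the code used above: `month_starts` is a hypothetical helper name, and it returns the first day of each month rather than the rolling day-of-month that `get_date_range` produces.

```python
import pandas as pd

def month_starts(start, end):
    """Return the first day of every month from start's month to end's month.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Snap the start back to the 1st so that its own month is included.
    first = pd.Timestamp(start).replace(day=1)
    # freq="MS" means "month start"; normalize=True drops any time-of-day part.
    return pd.date_range(first, pd.Timestamp(end), freq="MS", normalize=True)

months = month_starts("2008-04-30", "2020-05-30")
print(months[0].date(), months[-1].date(), len(months))
```

Each element is a `pandas.Timestamp`, so it can be compared directly against a datetime column.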

Sample Excel data is available in this example file: https://docs.google.com/spreadsheets/d/1YeH35w0rqVoHukGTSDtISlztdZAiDYsmfLWVia2x1U0/edit?usp=sharing

1 answer:

Answer 0 (score: 1)

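Judging from the expected result, you want the total amount per user and per month. The pandas tools for that are groupby and sum, plus unstack if you want the dates as columns. The date expression below maps every gift date to the first day of its month, so all gifts in the same month fall into the same group: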

    result = df.groupby(
        ['user_id',
         pd.to_datetime(df['Gift Date'], dayfirst=True)
         + pd.offsets.Day() - pd.offsets.MonthBegin()]
    )[['Amount']].sum().unstack()
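To see the shape of the output, here is a minimal self-contained sketch of the same groupby / sum / unstack idea. The sample rows are made up, the day-first date format is an assumption based on the `dayfirst=True` argument above, and the month key is built with `dt.to_period("M")`, which is a swapped-in alternative to the offset arithmetic rather than the answer's exact expression.

```python
import pandas as pd

# Made-up sample data; column names follow the question.
df = pd.DataFrame({
    "user_id": ["1", "1", "1", "2", "2"],
    "Gift Date": ["30/04/2008", "15/05/2008", "31/05/2008",
                  "01/05/2008", "20/06/2008"],
    "Amount": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# Parse the dates (assumed day/month/year) and reduce each one to its month.
dates = pd.to_datetime(df["Gift Date"], format="%d/%m/%Y")
month = dates.dt.to_period("M").rename("month")

# One row per user, one column per month, missing months filled with 0.
monthly = (
    df.groupby(["user_id", month])["Amount"]
      .sum()
      .unstack(fill_value=0)
)
print(monthly)
```

This produces whole-calendar-month totals in a single vectorized pass, so no per-user or per-month Python loop is needed. Months in which no user made a gift do not appear as columns; reindexing the columns would be one way to add them if they are required.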