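I am reading an Excel file of about 300,000 rows into a pandas DataFrame. I then use groupby to split it into about 18,000 groups. Next, I loop over each group and apply a filter within that group (a date filter on the monthly data) to compute a sum. The whole process takes about 60 minutes. Is there any way to optimize this? The code is as follows: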
import datetime

import pandas as pd

qgift_dl = pd.read_csv(file, encoding='latin1')  # read csv file
qgift_dl['user_id'] = qgift_dl['user_id'].astype(str)
qgift_dl['Gift Date'] = pd.to_datetime(qgift_dl['Gift Date'])
min_date = qgift_dl['Gift Date'].min()
today = datetime.datetime.today()
qgift_dates = get_date_range(min_date, today)  # get all dates between
q_grouped = qgift_dl.groupby(['user_id'])
details = []
for group in q_grouped:
    d_rows = group[1]
    d_row_data = [group[0]]  # add donor id
    for dt in qgift_dates:
        # sum the gifts from the first of the month up to this date
        lower = dt.strftime('%Y-%m-01')
        upper = dt.strftime('%Y-%m-%d')
        filtered = d_rows[(d_rows['Gift Date'] >= lower) & (d_rows['Gift Date'] <= upper)]
        d_row_data.append(filtered['Amount'].sum())
    details.append(d_row_data)
Below is the get_date_range function. It returns all the dates (Y-m-d) between two bounds, stepping one month at a time. In my case the range is "2008-04-30" to "2020-05-30".
from dateutil.relativedelta import relativedelta
import datetime, calendar

def get_date_range(start, end):
    result = []
    while start <= end:
        result.append(start)
        start += relativedelta(months=1)
    return result
A sample of the Excel data is available at the following link: https://docs.google.com/spreadsheets/d/1YeH35w0rqVoHukGTSDtISlztdZAiDYsmfLWVia2x1U0/edit?usp=sharing
Answer 0 (score: 1)
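From the expected result, you want the total amount per user and per month. The pandas tools for that are groupby and sum, plus unstack if you want the dates as columns: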
result = df.groupby(
    ['user_id',
     # map every gift date to the first day of its month
     pd.to_datetime(df['Gift Date'], dayfirst=True)
     + pd.offsets.Day() - pd.offsets.MonthBegin()]
)[['Amount']].sum().unstack()
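To make the shape of the result concrete, here is a minimal sketch on a few made-up rows. The data values are hypothetical; only the column names (user_id, Gift Date, Amount) come from the question. It builds the month key with dt.to_period('M') instead of the offset arithmetic above, which should give the same first-of-month dates and may be easier to read. The fill_value=0 argument is also an addition, so that months without gifts show 0 rather than NaN.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question.
df = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "Gift Date": ["30/04/2008", "02/05/2008", "15/05/2008",
                  "30/04/2008", "01/06/2008"],
    "Amount": [10.0, 5.0, 2.5, 7.0, 1.0],
})

# Parse the day-first strings once, then map each date to the first day of its month.
gift_date = pd.to_datetime(df["Gift Date"], dayfirst=True)
month_start = gift_date.dt.to_period("M").dt.to_timestamp()

# A single grouped sum replaces the nested Python loops;
# unstack moves the month level of the index into the columns.
monthly = (
    df.groupby(["user_id", month_start])["Amount"]
    .sum()
    .unstack(fill_value=0)
)
print(monthly)
```

The result has one row per user and one column per month that appears in the data. Note that this sums whole calendar months, whereas the loop in the question stops at the day of the month carried by each generated date, so the two can differ for gifts made late in a month.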