I have a relatively large data frame (about 10 million rows). It has an id column and a DateTimeIndex. For each row, I need to count the number of entries with the same id within a given period (last week/month/year). I wrote my own function using relativedelta and stored the dates in a separate dictionary {id: [dates]}, but it runs very slowly. How can I do this quickly and correctly?
P.S.: I have heard about pandas.rolling(), but I don't know how to use it correctly.
P.P.S.: My function:
def isinrange(date, listdate, delta):
    date, listdate = datetime.datetime.strptime(date, format), datetime.datetime.strptime(listdate, format)
    return date - delta <= listdate
The main code, which contains a lot of unnecessary operations:
dictionary = dict()  # structure {id: [dates]}
for row in df.itertuples():  # filling the dictionary
    if row.id in dictionary:
        dictionary[row.id].append(row.DateTimeIndex)
    else:
        dictionary[row.id] = [row.DateTimeIndex,]
week, month, year = relativedelta(days=7), relativedelta(months=1), relativedelta(years=1)  # relativedelta init
for row, i in zip(df.itertuples(), range(df.shape[0])):  # iterating over the dataframe
    cnt1 = cnt2 = cnt3 = 0  # weekly, monthly, yearly - for each row
    for date in dictionary[row.id]:  # for each date with the id from this row
        index_date = row.DateTimeIndex
        if date <= index_date:  # if the date from the dictionary is earlier than the one from the row
            if isinrange(index_date, date, year):
                cnt1 += 1
            if isinrange(index_date, date, month):
                cnt2 += 1
            if isinrange(index_date, date, week):
                cnt3 += 1
    df.loc[[i, 36], 'Weekly'] = cnt1  # add values to the data frame
    df.loc[[i, 37], 'Monthly'] = cnt2
    df.loc[[i, 38], 'Yearly'] = cnt3
Example:
id date
1 2015-05-19
1 2015-05-22
2 2018-02-21
2 2018-02-23
2 2018-02-27
Expected result:
id date last_week
1 2015-05-19 0
1 2015-05-22 1
2 2018-02-21 0
2 2018-02-23 1
2 2018-02-27 2
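Not part of the question itself, but a minimal sketch of the `pandas.rolling()` approach the asker mentions, assuming the frame is sorted by id and date (a time-based rolling window needs a sorted DatetimeIndex within each group); the same pattern with `"30D"` or `"365D"` would cover the monthly and yearly counts:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2015-05-19", "2015-05-22",
                            "2018-02-21", "2018-02-23", "2018-02-27"]),
})

# For each row, count earlier rows with the same id inside the past 7 days.
# rolling("7D") includes the current row, so subtract 1 to exclude it.
counts = (
    df.set_index("date")
      .groupby("id")["id"]
      .rolling("7D")
      .count()
      .sub(1)
      .astype(int)
      .reset_index(drop=True)   # back to positional order (df is sorted by id, date)
)
df["last_week"] = counts
print(df)
```

This is vectorized inside pandas, so it avoids the per-row Python loop and should scale far better to 10 million rows than the dictionary approach.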
Answer 0 (score: 0)
year_range = ["2018"]
month_range = ["06"]
day_range = [str(x) for x in range(18, 25)]
date_range = [year_range, month_range, day_range]
# df = your dataframe
your_result = (
    df[df.date.apply(lambda x: sum(x.split("-")[i] in date_range[i] for i in range(3)) == 3)]
    .groupby("id")
    .size()
    .reset_index(name="counts")
)
print(your_result[:5])
I'm not sure I understood you correctly, but are you looking for something like this?
Timing on a "test" data frame with 15 million rows is about 15 seconds:
id counts
0 0 454063
1 1 454956
2 2 454746
3 3 455317
4 4 454312
Wall time: 14.5 s
The "test" data frame:
id date
0 4 2018-06-06
1 2 2018-06-18
2 4 2018-06-06
3 3 2018-06-18
4 5 2018-06-06
Answer 1 (score: 0)
import pandas as pd
src = "path/data.csv"
df = pd.read_csv(src, sep=",")
print(df)
# id date
# 0 1 2015-05-19
# 1 1 2015-05-22
# 2 2 2018-02-21
# 3 2 2018-02-23
# 4 2 2018-02-27
# Convert date column to a datetime
df['date'] = pd.to_datetime(df['date'])
# Retrieve rows in the date range
date_ini = '2015-05-18'
date_end = '2016-05-18'
filtered_rows = df.loc[(df['date'] > date_ini) & (df['date'] <= date_end)]
print(filtered_rows)
# id date
# 0 1 2015-05-19
# 1 1 2015-05-22
# Group rows by id
grouped_by_id = filtered_rows.groupby(['id']).agg(['count'])
print(grouped_by_id)
# count
# id
# 1 2
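As a minor variant of the last two steps above (not from the original answer), the filter-then-count can also be written with a boolean mask and `value_counts`, which avoids reading the data from disk and gives the same per-id counts:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2015-05-19", "2015-05-22",
                            "2018-02-21", "2018-02-23", "2018-02-27"]),
})

# Keep rows inside the date range, then count occurrences of each id.
mask = (df["date"] > "2015-05-18") & (df["date"] <= "2016-05-18")
counts = df.loc[mask, "id"].value_counts()
print(counts)
```

Only id 1 has dates inside the range, so `counts` holds a single entry with the value 2, matching the grouped result above.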