Pandas: how to count entries per ID within a selected date range

Time: 2018-06-26 09:34:46

Tags: python pandas datetime dataframe pandas-groupby

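I have a relatively large data frame (about 10 million rows). It has an id column and a DateTimeIndex column. For each row, I need to count the entries with the same id that fall within a preceding period (last week / month / year). I wrote my own function using relativedelta and stored the dates in a separate dictionary {id: [dates]}, but it runs very slowly. How can I do this quickly and correctly?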

P.S.: I have heard of pandas.rolling(), but I don't know how to use it correctly.

P.P.S.: My function:

def isinrange(date, listdate, delta):
    date,listdate = datetime.datetime.strptime(date,format),datetime.datetime.strptime(listdate,format)
    return date-delta<=listdate

The main code, which contains a lot of unnecessary operations:

dictionary = dict() #structure {id: [dates]}
for row in df.itertuples():#filling a dictionary
    if row.id in dictionary:
        dictionary[row.id].append(row.DateTimeIndex)
    else:
        dictionary[row.id] = [row.DateTimeIndex,]

week,month,year = relativedelta(days =7),relativedelta(months = 1),relativedelta(years = 1)#relative delta init

for row, i in zip(df.itertuples(),range(df.shape[0])):#iterating over dataframe
    cnt1=cnt2=cnt3=0 #weekly,monthly, yearly - for each row
    for date in dictionary[row.id]:#for each date with an id from row
        index_date=row.DateTimeIndex 
        if date<=index_date: #if date from dictionary is lesser than from a row 
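            if isinrange(index_date,date,week): #cnt1 is written to 'Weekly' below
                cnt1+=1
            if isinrange(index_date,date,month): #cnt2 is written to 'Monthly' below
                cnt2+=1
            if isinrange(index_date,date,year): #cnt3 is written to 'Yearly' below
                cnt3+=1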
    df.loc[[i,36],'Weekly'] = cnt1 #add values to a data frame
    df.loc[[i,37],'Monthly'] = cnt2
    df.loc[[i,38],'Yearly']=cnt3
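The nested loops above compare every row with every date of the same id and parse strings on each comparison. As a rough sketch of one faster alternative (not the asker's method), the dates of each id can be sorted once and the window boundaries located with a binary search. The helper name `count_previous`, the column name `date` (taken from the example further down), and the choice to count only strictly earlier entries (so that the first entry of an id gets 0, as in the expected result) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def count_previous(dates: pd.Series, offset: pd.DateOffset) -> np.ndarray:
    """Count, for each date, the strictly earlier dates in [date - offset, date).

    `dates` must be sorted ascending and belong to a single id (assumption).
    """
    values = dates.to_numpy()
    starts = (dates - offset).to_numpy()
    # Position of the first date that is not before the window start.
    first_in_window = np.searchsorted(values, starts, side="left")
    # Position of the first date that is not before the current date.
    first_not_earlier = np.searchsorted(values, values, side="left")
    return first_not_earlier - first_in_window


# Sample data copied from the question's example.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2015-05-19", "2015-05-22", "2018-02-21", "2018-02-23", "2018-02-27"]
    ),
})
df = df.sort_values(["id", "date"]).reset_index(drop=True)

windows = {
    "Weekly": pd.DateOffset(days=7),
    "Monthly": pd.DateOffset(months=1),
    "Yearly": pd.DateOffset(years=1),
}
for name, offset in windows.items():
    counts = np.empty(len(df), dtype=np.int64)
    # After reset_index, each group's index holds its row positions in df.
    for _, group_dates in df.groupby("id")["date"]:
        counts[group_dates.index.to_numpy()] = count_previous(group_dates, offset)
    df[name] = counts

print(df)
```

Because `pd.DateOffset(months=1)` and `pd.DateOffset(years=1)` are calendar-aware, this handles months and years of varying length, which a fixed-length window cannot.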

Example:

id  date
1   2015-05-19
1   2015-05-22
2   2018-02-21
2   2018-02-23
2   2018-02-27

Expected result:

id  date    last_week
1   2015-05-19  0
1   2015-05-22  1
2   2018-02-21  0
2   2018-02-23  1
2   2018-02-27  2
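For the 7-day case, the expected result above can be reproduced with a time-based rolling window per id. This is a minimal sketch, assuming the `date` column has been converted to datetimes and the rows are sorted by id and date; the helper column `one` is introduced here only for illustration. `closed="left"` keeps the window as [date - 7 days, date), so the current row is not counted. Rolling windows need a fixed length, so this sketch does not cover calendar months or years.

```python
import pandas as pd

# Sample data copied from the example above.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2015-05-19", "2015-05-22", "2018-02-21", "2018-02-23", "2018-02-27"]
    ),
})
df = df.sort_values(["id", "date"]).reset_index(drop=True)
df["one"] = 1  # hypothetical helper column: summing it counts rows

rolled = (
    df.set_index("date")
      .groupby("id")["one"]
      .rolling("7D", closed="left")  # window [date - 7 days, date)
      .sum()
)
# An empty window yields NaN, which here means "no earlier entries".
# The result is ordered by id, then date, matching the sorted df.
df["last_week"] = rolled.fillna(0).astype(int).to_numpy()
df = df.drop(columns="one")

print(df)
```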

2 answers:

Answer 0 (score: 0)

year_range = ["2018"]
month_range = ["06"]
day_range = [str(x) for x in range(18, 25)]
date_range = [year_range, month_range, day_range]

# df = your dataframe
your_result = df[df.date.apply(lambda x: sum([x.split("-")[i] in date_range[i] for i in range(3)]) == 3)].groupby("id").size().reset_index(name="counts")
print(your_result[:5])

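I'm not sure I understood you correctly, but are you looking for something like this?
On a "test" data frame with 15 million rows, it takes about 15 seconds: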

   id  counts
0   0  454063
1   1  454956
2   2  454746
3   3  455317
4   4  454312
Wall time: 14.5 s

The "test" data frame:

   id        date
0   4  2018-06-06
1   2  2018-06-18
2   4  2018-06-06
3   3  2018-06-18
4   5  2018-06-06
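The filter in this answer splits every date string and checks year, month, and day separately. A possible vectorized variant, sketched here under the assumption that the intended range is 2018-06-18 through 2018-06-24 inclusive (days 18 to 24 from `range(18, 25)`), parses the column once and compares datetimes directly. The variable names `dates` and `in_week` are illustrative, and no timing is claimed for this version.

```python
import pandas as pd

# The five "test" rows shown above.
df = pd.DataFrame({
    "id": [4, 2, 4, 3, 5],
    "date": ["2018-06-06", "2018-06-18", "2018-06-06", "2018-06-18", "2018-06-06"],
})

dates = pd.to_datetime(df["date"])
# between() includes both endpoints by default.
in_week = dates.between(pd.Timestamp("2018-06-18"), pd.Timestamp("2018-06-24"))

your_result = df[in_week].groupby("id").size().reset_index(name="counts")
print(your_result)
```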

Answer 1 (score: 0)

import pandas as pd                                                                               
src = "path/data.csv"                                                        
df = pd.read_csv(src, sep=",")                                                                    
print(df)
#    id        date                                                                               
# 0   1  2015-05-19                                                                               
# 1   1  2015-05-22                                                                               
# 2   2  2018-02-21                                                                               
# 3   2  2018-02-23                                                                               
# 4   2  2018-02-27                                                                               

# Convert date column to a datetime                                                               
df['date'] = pd.to_datetime(df['date'])                                                           

# Retrieve rows in the date range                                                                 

date_ini = '2015-05-18'                                                                           
date_end = '2016-05-18'                                                                           

filtered_rows = df.loc[(df['date'] > date_ini) & (df['date'] <= date_end)]                        
print(filtered_rows)
#    id       date                                                                                
# 0   1 2015-05-19                                                                                
# 1   1 2015-05-22                                                                                

# Group rows by id                                                                                
grouped_by_id = filtered_rows.groupby(['id']).agg(['count'])                                      
print(grouped_by_id)
#    count                                                                                        
# id                                                                                              
# 1      2