Pandas: speeding up a count of occurrences within a time period

Date: 2017-07-19 11:59:41

Tags: python pandas numpy

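In my data, each row represents a case with the attributes Agence, GT and Agent; DateA is the start date and DateB is the end date. I have found a way to count the number of cases running per month and per year for each Agence, GT and Agent.

My problem is that it is very slow: the real data has only 16,000 rows, but computing one column takes about 5 minutes and I have to do this 7 times. Users will not be happy if they have to wait an hour to get what they want...

How can I improve it and still get the same output?

(Versions: python: 3.3.5 | pandas: 0.15.2 | numpy: 1.9.1)

Here is my code: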

import pandas as pd
import numpy as np
import time

def getListeMonthYearBetween (catA,catB,catC,mA,yA,mB,yB) :
    mA = int(mA)
    yA=int(yA)
    mB=int(mB)
    yB=int(yB)
    df = pd.DataFrame([[catA,catB,catC,mA,yA]],columns=['Agence', 'GT','Agent','Mois','Année'])

    for i in range(0, 12*(yB-yA) +(mB-mA)):
        df2 = pd.DataFrame([[catA,catB,catC,((mA+i)%12+1),(yA+((mA+i)//12))]],columns=['Agence', 'GT','Agent','Mois','Année'])
        df=df.append(df2)

    return df

def getStatTwoDates(df, DateA, DateB, nomNewColumn):

    df[DateA] = pd.to_datetime(df[DateA])

    df[DateB] = np.where(pd.isnull(df[DateB]),pd.to_datetime('today'),df[DateB])

    df[DateB]=df[DateB].apply(getBackToDateTime)

    df=df[(~pd.isnull(df[DateA]))&(df[DateA]<df[DateB])] 

    df['YearA'], df['MonthA'],df['YearB'], df['MonthB'] = df[DateA].dt.year, df[DateA].dt.month , df[DateB].dt.year, df[DateB].dt.month 

    df=df[['Agence', 'GT','Agent','YearA','MonthA','YearB','MonthB']]
    dfStat = pd.DataFrame(columns=['Agence', 'GT','Agent','Mois','Année'])

    for row in df.itertuples() :
        data = getListeMonthYearBetween (row[1],row[2],row[3],row[5],row[4],row[7],row[6])
        dfStat=dfStat.append(data)

    dfStat = pd.DataFrame(dfStat.groupby(['Agence', 'GT','Agent','Mois','Année']).size().reset_index(name=nomNewColumn))

    return dfStat

def getBackToDateTime(x):
    if type(x) is type(pd.to_datetime('today')):
        return x
    else :
        return pd.to_datetime(x)

df = pd.DataFrame([['Agence1','A1','B1',pd.to_datetime('11/08/2016', format='%d/%m/%Y'),pd.to_datetime('21/09/2016', format='%d/%m/%Y')], 
                   ['Agence1','A1','B1',pd.to_datetime('27/02/2016', format='%d/%m/%Y'),pd.to_datetime('21/08/2016', format='%d/%m/%Y')],
                   ['Agence1','A2','B2',pd.to_datetime('11/09/2016', format='%d/%m/%Y'),pd.to_datetime('14/01/2017', format='%d/%m/%Y')],
                   ['Agence1','A3','B3',pd.to_datetime('05/10/2016', format='%d/%m/%Y'),pd.to_datetime('09/10/2016', format='%d/%m/%Y')],
                   ['Agence1','A1','B2',pd.to_datetime('08/01/2016', format='%d/%m/%Y'),pd.to_datetime('10/06/2016', format='%d/%m/%Y')],
                   ['Agence1','A2','B2',pd.to_datetime('09/11/2016', format='%d/%m/%Y'),pd.to_datetime('10/12/2016', format='%d/%m/%Y')],
                   ['Agence1','A3','B3',pd.to_datetime('02/09/2016', format='%d/%m/%Y'),pd.to_datetime('01/02/2017', format='%d/%m/%Y')]],
                   columns=['Agence', 'GT','Agent','DateA','DateB'])

newDf=getStatTwoDates(df, 'DateA', 'DateB', 'Count')

What I have:

Agence      GT     Agent      DateA            DateB

Agence1     A1      B1      2016-08-11      2016-09-21
Agence1     A1      B1      2016-02-27      2016-08-21
Agence1     A2      B2      2016-09-11      2017-01-14
Agence1     A3      B3      2016-10-05      2016-10-09
Agence1     A1      B2      2016-01-08      2016-06-10
Agence1     A2      B2      2016-11-09      2016-12-10
Agence1     A3      B3      2016-09-02      2017-02-01

What I get:

Agence      GT      Agent   Mois    Année   Count

Agence1     A1       B1      2      2016      1
Agence1     A1       B1      3      2016      1
Agence1     A1       B1      4      2016      1
Agence1     A1       B1      5      2016      1
Agence1     A1       B1      6      2016      1
Agence1     A1       B1      7      2016      1
Agence1     A1       B1      8      2016      2
Agence1     A1       B1      9      2016      1
Agence1     A1       B2      1      2016      1
Agence1     A1       B2      2      2016      1
Agence1     A1       B2      3      2016      1
Agence1     A1       B2      4      2016      1
Agence1     A1       B2      5      2016      1
Agence1     A1       B2      6      2016      1
Agence1     A2       B2      1      2017      1
Agence1     A2       B2      9      2016      1
Agence1     A2       B2     10      2016      1
Agence1     A2       B2     11      2016      2
Agence1     A2       B2     12      2016      2
Agence1     A3       B3      1      2017      1
Agence1     A3       B3      2      2017      1
Agence1     A3       B3      9      2016      1
Agence1     A3       B3     10      2016      2
Agence1     A3       B3     11      2016      1
Agence1     A3       B3     12      2016      1

1 answer:

Answer 0 (score: 1)

The way you generate the list of months between two datetimes could probably be more efficient:

def gen_montly_list(start, end):
    # Snap to the first day of the start month and of the month after `end`
    start = pd.Timestamp(start.year, start.month, 1)
    end = beginning_of_next_month(end)
    # pd.date_range replaces DatetimeIndex(start=..., end=...), which newer
    # pandas versions no longer accept ('M' is spelled 'ME' from pandas 2.2 on)
    return pd.date_range(start=start, end=end, freq='M')

def beginning_of_next_month(date):
    month = (date.month) % 12 + 1
    year = date.year if date.month != 12 else date.year + 1
    # print(year, month)
    return pd.Timestamp(year, month, 1)

gen_montly_list(pd.to_datetime('11/08/2016', format='%d/%m/%Y'),pd.to_datetime('21/12/2016', format='%d/%m/%Y'))
DatetimeIndex(['2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
               '2016-12-31'],
              dtype='datetime64[ns]', freq='M')
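
As a variation on the same idea, here is a minimal sketch that lists the months with pd.period_range instead of month-end timestamps. The helper name months_between is hypothetical and not part of the answer above; it assumes start is not later than end.

```python
import pandas as pd

def months_between(start, end):
    """Return every calendar month touched by the interval [start, end].

    Hypothetical helper: monthly periods avoid the manual
    "first day of next month" arithmetic.
    """
    return pd.period_range(start=start, end=end, freq="M")

months = months_between(pd.Timestamp("2016-08-11"), pd.Timestamp("2016-12-21"))
print([str(p) for p in months])
# ['2016-08', '2016-09', '2016-10', '2016-11', '2016-12']
```

Each element is a Period, so .year and .month are available directly on the resulting index.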

Then, for the counting, you can use collections.Counter:

import collections

def count_occurences(df):
    # One Counter per group: each month a case is running adds 1 to that month
    c = collections.Counter()
    for row in df.itertuples():
        # print(row)
        c.update(gen_montly_list(row.DateA, row.DateB))
    return c
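
To see what Counter.update does here without pandas, the following sketch counts (year, month) tuples for the two A1/B1 cases from the question. The generator month_keys is a hypothetical stand-in for the month-list helper and assumes the start month is not after the end month.

```python
import collections

def month_keys(year_a, month_a, year_b, month_b):
    """Yield (year, month) tuples from the start month to the end month, inclusive."""
    first = year_a * 12 + (month_a - 1)   # linear month index of the start
    last = year_b * 12 + (month_b - 1)    # linear month index of the end
    for idx in range(first, last + 1):
        yield idx // 12, idx % 12 + 1

counts = collections.Counter()
counts.update(month_keys(2016, 8, 2016, 9))   # case running 2016-08-11 .. 2016-09-21
counts.update(month_keys(2016, 2, 2016, 8))   # case running 2016-02-27 .. 2016-08-21

print(counts[(2016, 8)])   # 2, both cases are running in August 2016
print(counts[(2016, 2)])   # 1
```

update adds one to the count of every element of the iterable it receives, so overlapping cases accumulate in the same key.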

So now we have to do a groupby, pass each group to this function, and aggregate the results:

frames = []

for group in df.groupby(['Agence', 'GT','Agent']):
    # group is a (key tuple, sub-DataFrame) pair
    # sort_index keeps the months in chronological order on every pandas version
    res = pd.Series(count_occurences(group[1])).sort_index()
    res = pd.DataFrame({'year':res.index.year, 'month' : res.index.month, 'count':res})
    for k, v in zip(['Agence', 'GT','Agent'], group[0]):
        res[k] = v
    frames.append(res.reset_index(drop=True))

# DataFrame.append was removed in pandas 2.0; pd.concat also works on older versions
results = pd.concat(frames)
results.reindex(columns=['Agence', 'GT','Agent', 'year', 'month', 'count']).reset_index(drop=True)
    Agence   GT  Agent  year  month  count
0   Agence1  A1  B1     2016  2      1
1   Agence1  A1  B1     2016  3      1
2   Agence1  A1  B1     2016  4      1
3   Agence1  A1  B1     2016  5      1
4   Agence1  A1  B1     2016  6      1
5   Agence1  A1  B1     2016  7      1
6   Agence1  A1  B1     2016  8      2
7   Agence1  A1  B1     2016  9      1
8   Agence1  A1  B2     2016  1      1
9   Agence1  A1  B2     2016  2      1
10  Agence1  A1  B2     2016  3      1
11  Agence1  A1  B2     2016  4      1
12  Agence1  A1  B2     2016  5      1
13  Agence1  A1  B2     2016  6      1
14  Agence1  A2  B2     2016  9      1
15  Agence1  A2  B2     2016  10     1
16  Agence1  A2  B2     2016  11     2
17  Agence1  A2  B2     2016  12     2
18  Agence1  A2  B2     2017  1      1
19  Agence1  A3  B3     2016  9      1
20  Agence1  A3  B3     2016  10     2
21  Agence1  A3  B3     2016  11     1
22  Agence1  A3  B3     2016  12     1
23  Agence1  A3  B3     2017  1      1
24  Agence1  A3  B3     2017  2      1
results.set_index(['Agence', 'GT','Agent', 'year', 'month'])

This produces a DataFrame with a MultiIndex.
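
If the per-row Python loop is still too slow, one possible alternative (not the method used in this answer) is to expand every case into one row per month with NumPy and let a single groupby do the counting. The sketch below is written against a recent pandas/NumPy rather than the versions listed in the question; the function name count_running_cases is hypothetical, and it assumes DateA and DateB are already datetime columns with no missing values and DateA not later than DateB.

```python
import numpy as np
import pandas as pd

KEYS = ["Agence", "GT", "Agent"]

def count_running_cases(df, date_a="DateA", date_b="DateB", name="Count"):
    """Count running cases per (Agence, GT, Agent, year, month) without a row loop."""
    # Linear month index: year * 12 + (month - 1)
    start = (df[date_a].dt.year * 12 + df[date_a].dt.month - 1).values
    end = (df[date_b].dt.year * 12 + df[date_b].dt.month - 1).values
    n_months = end - start + 1                      # months touched by each case

    # Repeat each row position once per month the case is running
    row_pos = np.repeat(np.arange(len(df)), n_months)
    # Offsets 0..n-1 inside each case, built without a Python loop
    first_slot = np.cumsum(n_months) - n_months
    offsets = np.arange(n_months.sum()) - np.repeat(first_slot, n_months)
    month_idx = start[row_pos] + offsets

    expanded = df[KEYS].iloc[row_pos].reset_index(drop=True)
    expanded["Année"] = month_idx // 12
    expanded["Mois"] = month_idx % 12 + 1
    return expanded.groupby(KEYS + ["Année", "Mois"]).size().reset_index(name=name)

# Sample data from the question
rows = [
    ("Agence1", "A1", "B1", "2016-08-11", "2016-09-21"),
    ("Agence1", "A1", "B1", "2016-02-27", "2016-08-21"),
    ("Agence1", "A2", "B2", "2016-09-11", "2017-01-14"),
    ("Agence1", "A3", "B3", "2016-10-05", "2016-10-09"),
    ("Agence1", "A1", "B2", "2016-01-08", "2016-06-10"),
    ("Agence1", "A2", "B2", "2016-11-09", "2016-12-10"),
    ("Agence1", "A3", "B3", "2016-09-02", "2017-02-01"),
]
df = pd.DataFrame(rows, columns=KEYS + ["DateA", "DateB"])
df["DateA"] = pd.to_datetime(df["DateA"])
df["DateB"] = pd.to_datetime(df["DateB"])

out = count_running_cases(df)
print(out)
```

The result is grouped by year before month, so rows come out in chronological order within each Agence/GT/Agent combination; the counts themselves should match the tables shown earlier.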