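In my data, each row represents a case with the attributes Agence, GT and Agent; DateA is the start date and DateB is the end date. I have found a way to count, for each Agence, GT and Agent, how many cases are running in each month and year.
My problem is that it is very slow. The real data has only 16,000 rows, but I have to run this 7 times and building one column takes about 5 minutes. Users will not be happy if they have to wait an hour to get what they need...
How can I improve it and still get the same output?
(Versions: python 3.3.5 | pandas 0.15.2 | numpy 1.9.1)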
import pandas as pd
import numpy as np
import time
def getListeMonthYearBetween(catA, catB, catC, mA, yA, mB, yB):
    mA = int(mA)
    yA = int(yA)
    mB = int(mB)
    yB = int(yB)
    df = pd.DataFrame([[catA,catB,catC,mA,yA]],columns=['Agence', 'GT','Agent','Mois','Année'])
    for i in range(0, 12*(yB-yA) +(mB-mA)):
        df2 = pd.DataFrame([[catA,catB,catC,((mA+i)%12+1),(yA+((mA+i)//12))]],columns=['Agence', 'GT','Agent','Mois','Année'])
        df = df.append(df2)
    return df
def getStatTwoDates(df, DateA, DateB, nomNewColumn):
    df[DateA] = pd.to_datetime(df[DateA])
    df[DateB] = np.where(pd.isnull(df[DateB]),pd.to_datetime('today'),df[DateB])
    df[DateB] = df[DateB].apply(getBackToDateTime)
    df = df[(~pd.isnull(df[DateA]))&(df[DateA]<df[DateB])]
    df['YearA'], df['MonthA'],df['YearB'], df['MonthB'] = df[DateA].dt.year, df[DateA].dt.month , df[DateB].dt.year, df[DateB].dt.month
    df = df[['Agence', 'GT','Agent','YearA','MonthA','YearB','MonthB']]
    dfStat = pd.DataFrame(columns=['Agence', 'GT','Agent','Mois','Année'])
    for row in df.itertuples():
        data = getListeMonthYearBetween(row[1],row[2],row[3],row[5],row[4],row[7],row[6])
        dfStat = dfStat.append(data)
    dfStat = pd.DataFrame(dfStat.groupby(['Agence', 'GT','Agent','Mois','Année']).size().reset_index(name=nomNewColumn))
    return dfStat
def getBackToDateTime(x):
    if type(x) is type(pd.to_datetime('today')):
        return x
    else:
        return pd.to_datetime(x)
df = pd.DataFrame([['Agence1','A1','B1',pd.to_datetime('11/08/2016', format='%d/%m/%Y'),pd.to_datetime('21/09/2016', format='%d/%m/%Y')],
['Agence1','A1','B1',pd.to_datetime('27/02/2016', format='%d/%m/%Y'),pd.to_datetime('21/08/2016', format='%d/%m/%Y')],
['Agence1','A2','B2',pd.to_datetime('11/09/2016', format='%d/%m/%Y'),pd.to_datetime('14/01/2017', format='%d/%m/%Y')],
['Agence1','A3','B3',pd.to_datetime('05/10/2016', format='%d/%m/%Y'),pd.to_datetime('09/10/2016', format='%d/%m/%Y')],
['Agence1','A1','B2',pd.to_datetime('08/01/2016', format='%d/%m/%Y'),pd.to_datetime('10/06/2016', format='%d/%m/%Y')],
['Agence1','A2','B2',pd.to_datetime('09/11/2016', format='%d/%m/%Y'),pd.to_datetime('10/12/2016', format='%d/%m/%Y')],
['Agence1','A3','B3',pd.to_datetime('02/09/2016', format='%d/%m/%Y'),pd.to_datetime('01/02/2017', format='%d/%m/%Y')]],
columns=['Agence', 'GT','Agent','DateA','DateB'])
newDf=getStatTwoDates(df, 'DateA', 'DateB', 'Count')
Input data:
Agence GT Agent DateA DateB
Agence1 A1 B1 2016-08-11 2016-09-21
Agence1 A1 B1 2016-02-27 2016-08-21
Agence1 A2 B2 2016-09-11 2017-01-14
Agence1 A3 B3 2016-10-05 2016-10-09
Agence1 A1 B2 2016-01-08 2016-06-10
Agence1 A2 B2 2016-11-09 2016-12-10
Agence1 A3 B3 2016-09-02 2017-02-01
Output:
Agence GT Agent Mois Année Count
Agence1 A1 B1 2 2016 1
Agence1 A1 B1 3 2016 1
Agence1 A1 B1 4 2016 1
Agence1 A1 B1 5 2016 1
Agence1 A1 B1 6 2016 1
Agence1 A1 B1 7 2016 1
Agence1 A1 B1 8 2016 2
Agence1 A1 B1 9 2016 1
Agence1 A1 B2 1 2016 1
Agence1 A1 B2 2 2016 1
Agence1 A1 B2 3 2016 1
Agence1 A1 B2 4 2016 1
Agence1 A1 B2 5 2016 1
Agence1 A1 B2 6 2016 1
Agence1 A2 B2 1 2017 1
Agence1 A2 B2 9 2016 1
Agence1 A2 B2 10 2016 1
Agence1 A2 B2 11 2016 2
Agence1 A2 B2 12 2016 2
Agence1 A3 B3 1 2017 1
Agence1 A3 B3 2 2017 1
Agence1 A3 B3 9 2016 1
Agence1 A3 B3 10 2016 2
Agence1 A3 B3 11 2016 1
Agence1 A3 B3 12 2016 1
Answer 0 (score: 1)
This way of generating the list of months between two datetimes is probably more efficient:
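def gen_montly_list(start, end):
    # Snap the start to the first day of its month and the end to the
    # first day of the following month, so both end months are included.
    start = pd.Timestamp(start.year, start.month, 1)
    end = beginning_of_next_month(end)
    # date_range returns one month-end timestamp per month in the span.
    # Recent pandas releases (2.2+) spell the month-end alias 'ME'.
    return pd.date_range(start=start, end=end, freq='M')
def beginning_of_next_month(date):
    month = (date.month) % 12 + 1
    year = date.year if date.month != 12 else date.year + 1
    return pd.Timestamp(year, month, 1)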
gen_montly_list(pd.to_datetime('11/08/2016', format='%d/%m/%Y'),pd.to_datetime('21/12/2016', format='%d/%m/%Y'))
DatetimeIndex(['2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30',
'2016-12-31'],
dtype='datetime64[ns]', freq='M')
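As a possible alternative to the two helpers above, `pd.period_range` can enumerate the calendar months directly. The following is only a minimal sketch: the function name `months_between` and the choice to return `(year, month)` tuples are assumptions made for illustration, not part of the answer's code.

```python
import pandas as pd

def months_between(start, end):
    """Return (year, month) tuples for every calendar month
    from start's month to end's month, both inclusive."""
    # Each timestamp is truncated to its monthly period, so the
    # day of the month does not matter.
    periods = pd.period_range(start=start, end=end, freq='M')
    return [(p.year, p.month) for p in periods]

print(months_between(pd.Timestamp('2016-08-11'), pd.Timestamp('2016-12-21')))
# [(2016, 8), (2016, 9), (2016, 10), (2016, 11), (2016, 12)]
```

Working with periods avoids computing the first day of the next month by hand, at the cost of returning periods rather than month-end timestamps.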
Then, for the counting, you can use collections.Counter:
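import collections
def count_occurences(df):
    # Tally how many cases are running in each month.
    c = collections.Counter()
    for row in df.itertuples():
        # Every month covered by this case adds 1 to that month's count.
        c.update(gen_montly_list(row.DateA, row.DateB))
    return c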
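To see why `Counter.update` fits here, it helps to run it on plain data. The sketch below uses only the standard library and hand-written `(year, month)` tuples, which stand in for the timestamps produced by `gen_montly_list`. The two lists correspond to the two sample cases of agent B1 (August to September 2016, and February to August 2016).

```python
import collections

c = collections.Counter()

# First case: running in August and September 2016.
c.update([(2016, 8), (2016, 9)])

# Second case: running from February to August 2016.
c.update([(2016, m) for m in range(2, 9)])

# August is covered by both cases, so it is counted twice.
print(c[(2016, 8)])  # 2
print(c[(2016, 2)])  # 1
```

Passing an iterable to `update` increments the count of each element by one, so no explicit bookkeeping per month is needed.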
So now we have to do a groupby, pass each group to this function, and aggregate the results:
frames = []
for keys, group in df.groupby(['Agence', 'GT','Agent']):
    # Month-end timestamp -> number of running cases, sorted by date.
    res = pd.Series(count_occurences(group)).sort_index()
    res = pd.DataFrame({'year':res.index.year, 'month' : res.index.month, 'count':res})
    # Attach the group keys as ordinary columns.
    for k, v in zip(['Agence', 'GT','Agent'], keys):
        res[k] = v
    frames.append(res.reset_index(drop=True))
# Concatenate once at the end instead of appending inside the loop.
results = pd.concat(frames, ignore_index=True)
results.reindex(columns=['Agence', 'GT','Agent', 'year', 'month', 'count'])
Agence GT Agent year month count
0 Agence1 A1 B1 2016 2 1
1 Agence1 A1 B1 2016 3 1
2 Agence1 A1 B1 2016 4 1
3 Agence1 A1 B1 2016 5 1
4 Agence1 A1 B1 2016 6 1
5 Agence1 A1 B1 2016 7 1
6 Agence1 A1 B1 2016 8 2
7 Agence1 A1 B1 2016 9 1
8 Agence1 A1 B2 2016 1 1
9 Agence1 A1 B2 2016 2 1
10 Agence1 A1 B2 2016 3 1
11 Agence1 A1 B2 2016 4 1
12 Agence1 A1 B2 2016 5 1
13 Agence1 A1 B2 2016 6 1
14 Agence1 A2 B2 2016 9 1
15 Agence1 A2 B2 2016 10 1
16 Agence1 A2 B2 2016 11 2
17 Agence1 A2 B2 2016 12 2
18 Agence1 A2 B2 2017 1 1
19 Agence1 A3 B3 2016 9 1
20 Agence1 A3 B3 2016 10 2
21 Agence1 A3 B3 2016 11 1
22 Agence1 A3 B3 2016 12 1
23 Agence1 A3 B3 2017 1 1
24 Agence1 A3 B3 2017 2 1
Calling
results.set_index(['Agence', 'GT','Agent', 'year', 'month'])
produces a DataFrame with a MultiIndex.
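If the per-row Python loop is still too slow, the expansion into months can also be done with NumPy arrays. The following is a minimal sketch and not part of the answer above: the function name `count_running_cases`, its parameters, and the month-ordinal encoding (`year * 12 + month - 1`) are assumptions made for illustration. It was written with a recent pandas in mind and has not been checked against pandas 0.15.2.

```python
import numpy as np
import pandas as pd

def count_running_cases(df, start_col='DateA', end_col='DateB', name='Count'):
    keys = ['Agence', 'GT', 'Agent']
    start = pd.to_datetime(df[start_col])
    # As in the question, a missing end date means "still running today".
    end = pd.to_datetime(df[end_col]).fillna(pd.to_datetime('today'))
    valid = start.notnull() & (start < end)
    df, start, end = df[valid], start[valid], end[valid]

    # Encode each month as a single integer so spans are plain ranges.
    first = (start.dt.year * 12 + start.dt.month - 1).values
    last = (end.dt.year * 12 + end.dt.month - 1).values
    n_months = last - first + 1

    # Repeat each row once per month it covers.
    row_idx = np.repeat(np.arange(len(df)), n_months)
    # Offsets 0..n-1 within each row's block of repeated entries.
    block_start = np.cumsum(n_months) - n_months
    offsets = np.arange(n_months.sum()) - np.repeat(block_start, n_months)
    ordinal = first[row_idx] + offsets

    expanded = df[keys].iloc[row_idx].reset_index(drop=True)
    expanded['Année'] = ordinal // 12
    expanded['Mois'] = ordinal % 12 + 1
    return (expanded.groupby(keys + ['Mois', 'Année'])
                    .size().reset_index(name=name))

# The sample data from the question, with ISO-formatted dates.
sample = pd.DataFrame(
    [['Agence1', 'A1', 'B1', '2016-08-11', '2016-09-21'],
     ['Agence1', 'A1', 'B1', '2016-02-27', '2016-08-21'],
     ['Agence1', 'A2', 'B2', '2016-09-11', '2017-01-14'],
     ['Agence1', 'A3', 'B3', '2016-10-05', '2016-10-09'],
     ['Agence1', 'A1', 'B2', '2016-01-08', '2016-06-10'],
     ['Agence1', 'A2', 'B2', '2016-11-09', '2016-12-10'],
     ['Agence1', 'A3', 'B3', '2016-09-02', '2017-02-01']],
    columns=['Agence', 'GT', 'Agent', 'DateA', 'DateB'])
result = count_running_cases(sample)
print(result)
```

The only Python-level work left is a handful of array operations and one groupby, so the cost should grow with the total number of case-months rather than with the number of DataFrame appends.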