I am trying to iterate over a huge pandas DataFrame (more than 370,000 rows) by its index.
For each row, the code should look back at the last 12 entries for that index (if they exist) and summarize them by (consecutive) quarter, half-year, and year.
If there is no information, or not enough (e.g. only the past 3 months), the code should treat the missing months/quarters as 0.
Here is a sample of my DataFrame:
This is the expected output:
So looking at the row for DateID "1", there is no other information. DateID "1" is the last month in this case (month 12, so to speak), so it falls into Q4 and H2. None of the earlier months exist, so they are not considered.
I have found a working solution, but it is very inefficient and takes a huge amount of time, which is unacceptable.
Here is a sample of my code:
for company_name, c in df.groupby('Account Name'):
    for i, row in c.iterrows():
        i += 1
        if i < 4:
            q4 = c.iloc[:i]['Value$'].sum()
            q3 = 0
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 3 < i < 7:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[:i-3]['Value$'].sum()
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 6 < i < 10:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[:i-6]['Value$'].sum()
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 9 < i < 13:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        else:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[i-12:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        new_df = new_df.append({'Account Name': row['Account Name'], 'DateID': row['DateID'], 'Q4': q4, 'Q3': q3, 'Q2': q2, 'Q1': q1, 'H1': h1, 'H2': h2, 'Year': year}, ignore_index=True)
As I said, I am looking for a more efficient way to calculate these figures, since there are almost 10,000 account names with about 30 DateIDs each.
Thanks a lot in advance!
Answer 0 (score: 3)
If I got you right, this should calculate your figures:
# trailing 3/6/9/12-month sums per account; min_periods=1 pads
# missing history with whatever rows are available
grouped = df.groupby('Account Name')['Value$']
last_3 = grouped.apply(lambda ser: ser.rolling(window=3, min_periods=1).sum())
last_6 = grouped.apply(lambda ser: ser.rolling(window=6, min_periods=1).sum())
last_9 = grouped.apply(lambda ser: ser.rolling(window=9, min_periods=1).sum())
last_12 = grouped.apply(lambda ser: ser.rolling(window=12, min_periods=1).sum())

# differences of the nested windows isolate each 3-month bucket
df['Q4'] = last_3
df['Q3'] = last_6 - last_3
df['Q2'] = last_9 - last_6
df['Q1'] = last_12 - last_9
df['H1'] = df['Q1'] + df['Q2']
df['H2'] = df['Q3'] + df['Q4']
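Since the full year is just Q1 + Q2 + Q3 + Q4, the Year column from the question falls out of the trailing 12-month sum that is already computed. This one-liner is an addition, not part of the original answer:

# Year = Q1 + Q2 + Q3 + Q4, i.e. the trailing 12-month sum
df['Year'] = last_12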
This outputs:
Out[19]:
Account Name DateID Value$ Q4 Q3 Q2 Q1 H1 H2
0 A 0 33 33.0 0.0 0.0 0.0 0.0 33.0
1 A 1 20 53.0 0.0 0.0 0.0 0.0 53.0
2 A 2 24 77.0 0.0 0.0 0.0 0.0 77.0
3 A 3 21 65.0 33.0 0.0 0.0 0.0 98.0
4 A 4 22 67.0 53.0 0.0 0.0 0.0 120.0
5 A 5 31 74.0 77.0 0.0 0.0 0.0 151.0
6 A 6 30 83.0 65.0 33.0 0.0 33.0 148.0
7 A 7 23 84.0 67.0 53.0 0.0 53.0 151.0
8 A 8 11 64.0 74.0 77.0 0.0 77.0 138.0
9 A 9 35 69.0 83.0 65.0 33.0 98.0 152.0
10 A 10 32 78.0 84.0 67.0 53.0 120.0 162.0
11 A 11 31 98.0 64.0 74.0 77.0 151.0 162.0
12 A 12 32 95.0 69.0 83.0 65.0 148.0 164.0
13 A 13 20 83.0 78.0 84.0 67.0 151.0 161.0
14 A 14 15 67.0 98.0 64.0 74.0 138.0 165.0
15 B 0 44 44.0 0.0 0.0 0.0 0.0 44.0
16 B 1 43 87.0 0.0 0.0 0.0 0.0 87.0
17 B 2 31 118.0 0.0 0.0 0.0 0.0 118.0
18 B 3 10 84.0 44.0 0.0 0.0 0.0 128.0
19 B 4 13 54.0 87.0 0.0 0.0 0.0 141.0
20 B 5 20 43.0 118.0 0.0 0.0 0.0 161.0
21 B 6 28 61.0 84.0 44.0 0.0 44.0 145.0
22 B 7 14 62.0 54.0 87.0 0.0 87.0 116.0
23 B 8 20 62.0 43.0 118.0 0.0 118.0 105.0
24 B 9 41 75.0 61.0 84.0 44.0 128.0 136.0
25 B 10 39 100.0 62.0 54.0 87.0 141.0 162.0
26 B 11 46 126.0 62.0 43.0 118.0 161.0 188.0
27 B 12 26 111.0 75.0 61.0 84.0 145.0 186.0
28 B 13 24 96.0 100.0 62.0 54.0 116.0 196.0
29 B 14 34 84.0 126.0 62.0 43.0 105.0 210.0
32 C 2 12 12.0 0.0 0.0 0.0 0.0 12.0
33 C 3 15 27.0 0.0 0.0 0.0 0.0 27.0
34 C 4 45 72.0 0.0 0.0 0.0 0.0 72.0
35 C 5 22 82.0 12.0 0.0 0.0 0.0 94.0
36 C 6 48 115.0 27.0 0.0 0.0 0.0 142.0
37 C 7 45 115.0 72.0 0.0 0.0 0.0 187.0
38 C 8 11 104.0 82.0 12.0 0.0 12.0 186.0
39 C 9 27 83.0 115.0 27.0 0.0 27.0 198.0
For the following test data:
import pandas as pd

data = {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
        'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]
}
df = pd.DataFrame(data)
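As a side note, the same trailing sums can be computed by calling rolling directly on the groupby object rather than going through apply with a lambda. A minimal sketch, assuming a reasonably recent pandas (the result carries a (group, original index) MultiIndex, so the group level is dropped before assigning back):

# equivalent to the apply/lambda version above
last_3 = (df.groupby('Account Name')['Value$']
            .rolling(window=3, min_periods=1)
            .sum()
            .droplevel(0))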
EDIT: If you want to count the unique categories over the same periods, you can do the following:
import numpy as np

def get_nunique(np_array):
    # count the distinct values in the window
    unique, counts = np.unique(np_array, return_counts=True)
    return len(unique)

df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)
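Note that this last line rolls over the whole column, so a window can straddle two accounts. A grouped variant in the same style as the rolling sums above would be the sketch below (the unique_3 column name is just illustrative; raw=True hands each window to the function as a NumPy array):

# hypothetical grouped variant: keep each window inside one account
df['unique_3'] = df.groupby('Account Name')['Category'].apply(
    lambda ser: ser.rolling(window=3, min_periods=1).apply(get_nunique, raw=True)
)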
Answer 1 (score: 1)
I didn't want to overload the answer above completely, so I am adding a new answer for the second part of your question:
# define a function that
# creates the unique counts
# by aggregating period_length times
# so 3 times for the quarter mapping
# and 6 times for the half year
# it's basically doing something like
# a sliding window aggregation
def get_mapping(df, period_length=3):
    df_mapping = None
    for offset in range(period_length):
        quarter = (df['DateID'] + offset) // period_length
        aggregated = df.groupby([quarter, df['Account Name']]).agg({'DateID': max, 'Category': lambda ser: len(set(ser))})
        incomplete_data = ((aggregated['DateID'] + offset + 1) // period_length <= aggregated.index.get_level_values(0)) & (aggregated.index.get_level_values(0) >= period_length)
        aggregated.drop(aggregated.index[incomplete_data].to_list(), inplace=True)
        aggregated.set_index('DateID', append=True, inplace=True)
        aggregated = aggregated.droplevel(0, axis='index')
        if df_mapping is None:
            df_mapping = aggregated
        else:
            df_mapping = pd.concat([df_mapping, aggregated], axis='index')
    return df_mapping
# apply it for 3 months and merge it to the source df
df_mapping = get_mapping(df, period_length=3)
df_mapping.columns = ['unique_3_months']
df_with_3_months = df.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)

# do the same for 6 months and merge it again
df_mapping = get_mapping(df, period_length=6)
df_mapping.columns = ['unique_6_months']
df_with_6_months = df_with_3_months.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)
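To see why the offset loop behaves like a sliding window: for a period length of 3, bucketing months by (DateID + offset) // 3 gives a different 3-month alignment per offset, and across the three offsets a window ends at every possible DateID. A small illustration with hypothetical DateIDs 0 through 5:

# each offset shifts the 3-month bucket boundaries by one month
import pandas as pd
date_ids = pd.Series(range(6))
for offset in range(3):
    print(offset, ((date_ids + offset) // 3).tolist())
# offset 0: [0, 0, 0, 1, 1, 1]  -> full windows end at DateID 2 and 5
# offset 1: [0, 0, 1, 1, 1, 2]  -> a full window ends at DateID 4
# offset 2: [0, 1, 1, 1, 2, 2]  -> a full window ends at DateID 3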
The result is:
Out[305]:
Account Name DateID Value$ Category unique_3_months unique_6_months
0 A 0 10 1 1 1
1 A 1 12 2 2 2
2 A 1 38 1 2 2
3 A 2 20 3 3 3
4 A 3 25 3 3 3
5 A 4 24 4 2 4
6 A 5 27 8 3 5
7 A 6 30 5 3 6
8 A 7 47 7 3 5
9 A 8 30 4 3 5
10 A 9 17 7 2 4
11 A 10 20 8 3 4
12 A 11 33 8 2 4
13 A 12 45 9 2 4
14 A 13 19 2 3 5
15 A 14 24 10 3 3
15 A 14 24 10 3 4
15 A 14 24 10 3 4
15 A 14 24 10 3 5
15 A 14 24 10 3 1
15 A 14 24 10 3 2
16 B 0 41 2 1 1
17 B 1 13 9 2 2
18 B 2 17 6 3 3
19 B 3 45 7 3 4
20 B 4 11 6 2 4
21 B 5 38 8 3 5
22 B 6 44 8 2 4
23 B 7 15 8 1 3
24 B 8 50 2 2 4
25 B 9 27 7 3 4
26 B 10 38 10 3 4
27 B 11 25 6 3 5
28 B 12 25 8 3 5
29 B 13 14 7 3 5
30 B 14 25 9 3 3
30 B 14 25 9 3 4
30 B 14 25 9 3 5
30 B 14 25 9 3 5
30 B 14 25 9 3 1
30 B 14 25 9 3 2
31 C 2 31 9 1 1
32 C 3 31 7 2 2
33 C 4 26 5 3 3
34 C 5 11 2 3 4
35 C 6 15 8 3 5
36 C 7 22 2 2 5
37 C 8 33 2 2 4
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 3
38 C 9 16 5 2 1
38 C 9 16 5 2 2
38 C 9 16 5 2 2
The output is based on the following input data:
data = {
    'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'DateID': [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
    'Value$': [10, 12, 38, 20, 25, 24, 27, 30, 47, 30, 17, 20, 33, 45, 19, 24, 41, 13, 17, 45, 11, 38, 44, 15, 50, 27, 38, 25, 25, 14, 25, 31, 31, 26, 11, 15, 22, 33, 16],
    'Category': [1, 2, 1, 3, 3, 4, 8, 5, 7, 4, 7, 8, 8, 9, 2, 10, 2, 9, 6, 7, 6, 8, 8, 8, 2, 7, 10, 6, 8, 7, 9, 9, 7, 5, 2, 8, 2, 2, 5]
}
df = pd.DataFrame(data)