Iterating over a multi-index pandas DataFrame

Time: 2019-09-12 06:56:01

Tags: python-3.x pandas iteration pandas-groupby multi-index

I am trying to iterate over a huge pandas DataFrame (more than 370,000 rows) based on its index.

For each row, the code should look back at the last 12 entries for that index (if any) and sum them up by (consecutive) quarter / half-year / year.

If there is no information, or not enough of it (say, only the past 3 months), the code should treat the remaining months/quarters as 0.

Here is a sample of my DataFrame:

(image: sample of the source DataFrame)

And this is the expected output:

(image: expected output)

So, looking at the row with DateID "1": there is no other information for it. DateID "1" is the last month in this case (month 12, so to speak), hence Q4 and H2. Nothing exists for any earlier month, so nothing else is taken into account.

I have already found a working solution, but it is extremely inefficient and takes a huge amount of time, which is not acceptable.

Here is my code sample:

new_df = pd.DataFrame()

for company_name, c in df.groupby('Account Name'):
    for i, row in c.iterrows():
        i += 1
        if i < 4:              
            q4 = c.iloc[:i]['Value$'].sum()
            q3 = 0
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1

        elif 3 < i < 7:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[:i-3]['Value$'].sum()
            q2 = 0
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1

        elif 6 < i < 10:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[:i-6]['Value$'].sum()
            q1 = 0
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        elif 9 < i < 13:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1
        else:
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[i-12:i-9]['Value$'].sum()
            h2 = q4 + q3
            h1 = q2 + q1
            year = q4 + q3 + q2 + q1

        new_df = new_df.append({'Account Name':row['Account Name'], 'DateID': row['DateID'],'Q4':q4,'Q3':q3,'Q2':q2,'Q1':q1,'H1':h1,'H2':h2,'Year':year},ignore_index=True)

As I said, I am looking for a more efficient way to calculate these figures, since there are almost 10,000 account names and 30 DateIDs for each account.

Thanks a lot in advance!

2 answers:

Answer 0 (score: 3)

If I got you right, this should calculate your figures:

grouped= df.groupby('Account Name')['Value$']
last_3= grouped.apply(lambda ser: ser.rolling(window=3, min_periods=1).sum())
last_6= grouped.apply(lambda ser: ser.rolling(window=6, min_periods=1).sum())
last_9= grouped.apply(lambda ser: ser.rolling(window=9, min_periods=1).sum())
last_12= grouped.apply(lambda ser: ser.rolling(window=12, min_periods=1).sum())

df['Q4']= last_3
df['Q3']= last_6  - last_3
df['Q2']= last_9  - last_6
df['Q1']= last_12 - last_9
df['H1']= df['Q1'] + df['Q2']
df['H2']= df['Q3'] + df['Q4']

This outputs:

Out[19]: 
   Account Name  DateID  Value$     Q4     Q3     Q2     Q1     H1     H2
0             A       0      33   33.0    0.0    0.0    0.0    0.0   33.0
1             A       1      20   53.0    0.0    0.0    0.0    0.0   53.0
2             A       2      24   77.0    0.0    0.0    0.0    0.0   77.0
3             A       3      21   65.0   33.0    0.0    0.0    0.0   98.0
4             A       4      22   67.0   53.0    0.0    0.0    0.0  120.0
5             A       5      31   74.0   77.0    0.0    0.0    0.0  151.0
6             A       6      30   83.0   65.0   33.0    0.0   33.0  148.0
7             A       7      23   84.0   67.0   53.0    0.0   53.0  151.0
8             A       8      11   64.0   74.0   77.0    0.0   77.0  138.0
9             A       9      35   69.0   83.0   65.0   33.0   98.0  152.0
10            A      10      32   78.0   84.0   67.0   53.0  120.0  162.0
11            A      11      31   98.0   64.0   74.0   77.0  151.0  162.0
12            A      12      32   95.0   69.0   83.0   65.0  148.0  164.0
13            A      13      20   83.0   78.0   84.0   67.0  151.0  161.0
14            A      14      15   67.0   98.0   64.0   74.0  138.0  165.0
15            B       0      44   44.0    0.0    0.0    0.0    0.0   44.0
16            B       1      43   87.0    0.0    0.0    0.0    0.0   87.0
17            B       2      31  118.0    0.0    0.0    0.0    0.0  118.0
18            B       3      10   84.0   44.0    0.0    0.0    0.0  128.0
19            B       4      13   54.0   87.0    0.0    0.0    0.0  141.0
20            B       5      20   43.0  118.0    0.0    0.0    0.0  161.0
21            B       6      28   61.0   84.0   44.0    0.0   44.0  145.0
22            B       7      14   62.0   54.0   87.0    0.0   87.0  116.0
23            B       8      20   62.0   43.0  118.0    0.0  118.0  105.0
24            B       9      41   75.0   61.0   84.0   44.0  128.0  136.0
25            B      10      39  100.0   62.0   54.0   87.0  141.0  162.0
26            B      11      46  126.0   62.0   43.0  118.0  161.0  188.0
27            B      12      26  111.0   75.0   61.0   84.0  145.0  186.0
28            B      13      24   96.0  100.0   62.0   54.0  116.0  196.0
29            B      14      34   84.0  126.0   62.0   43.0  105.0  210.0
32            C       2      12   12.0    0.0    0.0    0.0    0.0   12.0
33            C       3      15   27.0    0.0    0.0    0.0    0.0   27.0
34            C       4      45   72.0    0.0    0.0    0.0    0.0   72.0
35            C       5      22   82.0   12.0    0.0    0.0    0.0   94.0
36            C       6      48  115.0   27.0    0.0    0.0    0.0  142.0
37            C       7      45  115.0   72.0    0.0    0.0    0.0  187.0
38            C       8      11  104.0   82.0   12.0    0.0   12.0  186.0
39            C       9      27   83.0  115.0   27.0    0.0   27.0  198.0

for the following test data:

data= {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
 'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
 'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]
}

df= pd.DataFrame(data)
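As a sanity check, the rolling-difference idea (trailing 3-month sum is Q4; the 6-month rolling sum minus the 3-month rolling sum isolates months 4-6 back, i.e. Q3) can be verified on the first seven 'A' values from the test data:

```python
import pandas as pd

# first seven 'Value$' entries for account 'A' from the test data above
ser = pd.Series([33, 20, 24, 21, 22, 31, 30])

last_3 = ser.rolling(window=3, min_periods=1).sum()
last_6 = ser.rolling(window=6, min_periods=1).sum()

# Q4 is the trailing 3-month sum; Q3 falls out as the difference
q4 = last_3
q3 = last_6 - last_3

print(q4.tolist())  # matches the Q4 column for account A above
print(q3.tolist())  # matches the Q3 column for account A above
```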

Edit: In case you want to count the unique values over the same periods, you can do the following:

def get_nunique(np_array):
    # count the distinct values in the rolling window
    return len(np.unique(np_array))

df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)
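Note that the snippet above applies the rolling window over the whole column. To keep windows from crossing account boundaries you would presumably combine it with the same groupby as before; a small sketch with made-up data (the accounts and categories below are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

def get_nunique(np_array):
    # number of distinct values in the window
    return len(np.unique(np_array))

# hypothetical mini example with two accounts
df = pd.DataFrame({
    'Account Name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category':     [1,   2,   1,   3,   5,   5],
})

# trailing 3-entry distinct-category count, computed per account
uniq = (df.groupby('Account Name')['Category']
          .apply(lambda ser: ser.rolling(window=3, min_periods=1)
                                .apply(get_nunique, raw=True)))
print(uniq.tolist())
```

`raw=True` hands the window to `get_nunique` as a plain NumPy array, which avoids building a Series per window.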

Answer 1 (score: 1)

I didn't want to overload the answer above completely, so I added a new answer for the second part of your question:

# define a function that
# creates the unique counts
# by aggregating period_length times
# so 3 times for the quarter mapping
# and 6 times for the half year
# it's basically doing something like
# a sliding window aggregation
def get_mapping(df, period_lenght=3):
    df_mapping= None
    for offset in range(period_lenght):
        quarter= (df['DateID']+offset) // period_lenght
        aggregated= df.groupby([quarter, df['Account Name']]).agg({'DateID': max, 'Category': lambda ser: len(set(ser))})
        incomplete_data= ((aggregated['DateID']+offset+1)//period_lenght <= aggregated.index.get_level_values(0)) & (aggregated.index.get_level_values(0) >= period_lenght)
        aggregated.drop(aggregated.index[incomplete_data].to_list(), inplace=True)
        aggregated.set_index('DateID', append=True, inplace=True)
        aggregated= aggregated.droplevel(0, axis='index')
        if df_mapping is None:
            df_mapping= aggregated
        else:
            df_mapping= pd.concat([df_mapping, aggregated], axis='index')
    return df_mapping

# apply it for 3 months and merge it to the source df
df_mapping= get_mapping(df, period_lenght=3)
df_mapping.columns= ['unique_3_months']    
df_with_3_months= df.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)

# do the same for 6 months and merge it again
df_mapping= get_mapping(df, period_lenght=6)
df_mapping.columns= ['unique_6_months']
df_with_6_months= df_with_3_months.merge(df_mapping, left_on=['Account Name', 'DateID'], how='left', right_index=True)

The result is:

Out[305]: 
   Account Name  DateID  Value$  Category  unique_3_months  unique_6_months
0             A       0      10         1                1                1
1             A       1      12         2                2                2
2             A       1      38         1                2                2
3             A       2      20         3                3                3
4             A       3      25         3                3                3
5             A       4      24         4                2                4
6             A       5      27         8                3                5
7             A       6      30         5                3                6
8             A       7      47         7                3                5
9             A       8      30         4                3                5
10            A       9      17         7                2                4
11            A      10      20         8                3                4
12            A      11      33         8                2                4
13            A      12      45         9                2                4
14            A      13      19         2                3                5
15            A      14      24        10                3                3
15            A      14      24        10                3                4
15            A      14      24        10                3                4
15            A      14      24        10                3                5
15            A      14      24        10                3                1
15            A      14      24        10                3                2
16            B       0      41         2                1                1
17            B       1      13         9                2                2
18            B       2      17         6                3                3
19            B       3      45         7                3                4
20            B       4      11         6                2                4
21            B       5      38         8                3                5
22            B       6      44         8                2                4
23            B       7      15         8                1                3
24            B       8      50         2                2                4
25            B       9      27         7                3                4
26            B      10      38        10                3                4
27            B      11      25         6                3                5
28            B      12      25         8                3                5
29            B      13      14         7                3                5
30            B      14      25         9                3                3
30            B      14      25         9                3                4
30            B      14      25         9                3                5
30            B      14      25         9                3                5
30            B      14      25         9                3                1
30            B      14      25         9                3                2
31            C       2      31         9                1                1
32            C       3      31         7                2                2
33            C       4      26         5                3                3
34            C       5      11         2                3                4
35            C       6      15         8                3                5
36            C       7      22         2                2                5
37            C       8      33         2                2                4
38            C       9      16         5                2                3
38            C       9      16         5                2                3
38            C       9      16         5                2                3
38            C       9      16         5                2                1
38            C       9      16         5                2                2
38            C       9      16         5                2                2

The output is based on the following input data:

data= {
       'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
       'DateID': [0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
       'Value$': [10, 12, 38, 20, 25, 24, 27, 30, 47, 30, 17, 20, 33, 45, 19, 24, 41, 13, 17, 45, 11, 38, 44, 15, 50, 27, 38, 25, 25, 14, 25, 31, 31, 26, 11, 15, 22, 33, 16],
       'Category': [1, 2, 1, 3, 3, 4, 8, 5, 7, 4, 7, 8, 8, 9, 2, 10, 2, 9, 6, 7, 6, 8, 8, 8, 2, 7, 10, 6, 8, 7, 9, 9, 7, 5, 2, 8, 2, 2, 5]
}

df= pd.DataFrame(data)
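The core trick in `get_mapping` is that adding each `offset` to `DateID` before integer-dividing by the period length produces every possible alignment of the 3-month buckets, so the union over all offsets emulates a sliding window. A minimal illustration of the bucketing (the DateIDs below are made up):

```python
# Each offset shifts the bucket boundaries by one month, so taking
# all offsets together covers every trailing 3-month window.
period_length = 3
date_ids = [0, 1, 2, 3, 4, 5]

alignments = []
for offset in range(period_length):
    buckets = [(d + offset) // period_length for d in date_ids]
    alignments.append(buckets)
    print(offset, buckets)
```

Each printed row is one grouping key per DateID; aggregating once per alignment and keeping only the complete windows is exactly what the answer's loop does.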