Question

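I have a DataFrame that shows the amount of each fee charged for an item. The DataFrame has about ten million rows. I want to create a new DataFrame that holds, for each item, the count of values in each column that are not equal to zero.

Basically, I'm trying to build a frequency of fees charged, to see whether I can detect a pattern that helps with better predictions.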

      Item   Fee1    Fee2    Fee3  Fee4  Fee5   Fee6  Fee7  Fee8  Fee9  Fee10  
0    10520      0     -25    -500     0   -50    -67   -99     0   -10     -5   
1    11111    -25       0     -55    -5   -20    -15  -201   -15   -50    -15   
2    85558   -100      -2       0   -35     0      0     0     0     0      0   
3    99999      0       0       0     0     0      0     0     0     0      0   
4    10000   -105       0       0     0     0     -4   -41     0     0      0   
5    66666      0       0       0     0     0      0     0     0     0      0   
6    88888     -5      -5      -4    -5    -3     -5     0    -1    -2      0   
7   125651     -1       0       0     0     0      0     0     0     0      0   
8   678923      0       0       0     0     0   -564     0     0     0      0   
9    10520     -1     -20   -2105     0     0      0     0     0     0      0   
10   11111      0      -5       0    -3     0    -15     0  -516  -351   -684   
11   85558   -151    -561       0  -516  -561 -31554 -5646 -5468 -3546   -684   
12   99999      0       0       0     0     0      0     0     0     0      0   
13   10000      0   -9681    -651  -654  -651   -651  -561  -561  -651   -561   
14   66666      0       0       0     0     0      0  -644   -65   -65    -65   
15   88888 -11651 -651615    -684     0     0      0     0     0     0      0   
16  125651 -84941  -68481 -685464 -6846   -84   -684   -11   -51     0   -888   
17  678923      0       0       0     0     0      0     0     0     0      0   

    Fee11  Fee12  Fee13  
0     -67      0      0  
1     -50      0      0  
2       0      0      0  
3       0      0   -900  
4       0      0      0  
5       0      0      0  
6      -8     -3  -7777  
7       0      0  -8888  
8       0 -85161      0  
9       0      0      0  
10   -654    -64      0  
11   -654   -654   -654  
12      0      0    -22  
13   -561   -561   -651  
14    -65    -65      0  
15      0      0      0  
16 -87984   -894      0  
17      0      0      0

I'm looking for a result like the DataFrame below.

     Item  Fee1  Fee2  Fee3  Fee4  Fee5  Fee6  Fee7  Fee8  Fee9  Fee10  Fee11  
0   10520     1     2     2     0     1     1     1     0     1      1      1   
1   11111     1     1     1     2     1     2     1     2     2      2      2   
2   85558     2     2     0     2     1     1     1     1     1      1      1   
3   99999     0     0     0     0     0     0     0     0     0      0      0   
4   10000     1     1     1     1     1     2     2     1     1      1      1   
5   66666     0     0     0     0     0     0     1     1     1      1      1   
6   88888     2     2     2     1     1     1     0     1     1      0      1   
7  125651     2     1     1     1     1     1     1     1     0      1      1   
8  678923     0     0     0     0     0     1     0     0     0      0      0   

   Fee12  Fee13  
0      0      0  
1      1      0  
2      1      1  
3      0      2  
4      1      1  
5      1      0  
6      1      1  
7      1      1  
8      1      0

I tried the code below, but it never finished. I let it run for an hour and then killed the script.

dfcounted = df.groupby('Item')['Fee1', 'Fee2', 'Fee3', 'Fee4', 'Fee5', 'Fee6', 'Fee7', 'Fee8', 'Fee9', 
               'Fee10', 'Fee11', 'Fee12', 'Fee13'].agg({'Fee1': lambda x: (x<0).count(), 
               'Fee2': lambda x: (x<0).count(), 'Fee3': lambda x: (x<0).count(), 
               'Fee4': lambda x: (x<0).count(), 'Fee5': lambda x: (x<0).count(), 
               'Fee6': lambda x: (x<0).count(), 'Fee7': lambda x: (x<0).count(), 
               'Fee8': lambda x: (x<0).count(), 'Fee9': lambda x: (x<0).count(), 
               'Fee10': lambda x: (x<0).count(), 'Fee11': lambda x: (x<0).count(),
               'Fee12': lambda x: (x<0).count(), 'Fee13': lambda x: (x<0).count()})

However, with this sample data it returned the following DataFrame. I also tried changing the count to a sum and got a DataFrame of all zeros.

        Fee1  Fee2  Fee3  Fee4  Fee5  Fee6  Fee7  Fee8  Fee9  Fee10  Fee11  
Item                                                                         
10000      2     2     2     2     2     2     2     2     2      2      2   
10520      2     2     2     2     2     2     2     2     2      2      2   
11111      2     2     2     2     2     2     2     2     2      2      2   
66666      2     2     2     2     2     2     2     2     2      2      2   
85558      2     2     2     2     2     2     2     2     2      2      2   
88888      2     2     2     2     2     2     2     2     2      2      2   
99999      2     2     2     2     2     2     2     2     2      2      2   
125651     2     2     2     2     2     2     2     2     2      2      2   
678923     2     2     2     2     2     2     2     2     2      2      2   

        Fee12  Fee13  
Item                  
10000       2      2  
10520       2      2  
11111       2      2  
66666       2      2  
85558       2      2  
88888       2      2  
99999       2      2  
125651      2      2  
678923      2      2

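I'm new to Pandas and would appreciate some help. The file grows in size every month as the year goes on.

I'm not sure what else to try, as I need the fee frequencies to help find a pattern.

Thanks.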

Answer 1

You can drop the dict and simplify this with groupby:

df.groupby('Item').apply(lambda x: (x < 0).sum()).drop('Item', axis=1)

output:

        Fee1  Fee2  Fee3  Fee4  Fee5  Fee6  Fee7  Fee8  Fee9  Fee10
Item
10000      1     1     1     1     1     2     2     1     1      1
10520      1     2     2     0     1     1     1     0     1      1
11111      1     1     1     2     1     2     1     2     2      2
66666      0     0     0     0     0     0     1     1     1      1
85558      2     2     0     2     1     1     1     1     1      1
88888      2     2     2     1     1     1     0     1     1      0
99999      0     0     0     0     0     0     0     0     0      0
125651     2     1     1     1     1     1     1     1     0      1
678923     0     0     0     0     0     1     0     0     0      0

Edit

To optimize, we can first convert the DataFrame to booleans (True if the value is less than zero) and then apply groupby.
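As a side note on why the attempt in the question returned 2 everywhere: `count()` on a boolean Series counts every non-null entry, False included, while `sum()` treats True as 1. Here is a minimal sketch, assuming two rows that mirror Fee1 for Item 10520 in the sample data (the variable names are just for illustration):

```python
import pandas as pd

# Two Fee1 values for a single item, mirroring Item 10520 in the sample data.
fees = pd.Series([0, -1])

mask = fees < 0            # boolean Series: [False, True]

n_count = mask.count()     # counts non-null entries, so False is counted too
n_sum = mask.sum()         # True is treated as 1, so only matches are counted

print(n_count, n_sum)
```

So `(x < 0).count()` only reports how many rows each item has, and `(x < 0).sum()` is what actually counts the charged fees.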

Performance test using timeit

%timeit df.groupby('Item').apply(lambda x: (x < 0).sum())

5.91 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using the new method:

coluns = [colum for colum in df.columns if 'Fee' in colum]
df[coluns] = df[coluns].lt(0)
df.groupby('Item').sum()

1.25 ms ± 53.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
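One thing to watch with the boolean approach: assigning the mask back into `df` overwrites the original fee amounts. Below is a rough sketch of a variant that leaves `df` untouched. It is not the code that was timed above, the tiny frame is made up for illustration, and `fee_cols` and `counts` are names I picked. It uses `ne(0)` to match the "not equal to zero" wording in the question; with the sample data, where every fee is zero or negative, `lt(0)` gives the same counts.

```python
import pandas as pd

# Small made-up frame (hypothetical values, not the asker's real data).
df = pd.DataFrame({
    "Item": [10520, 11111, 10520, 11111],
    "Fee1": [0, -25, -1, 0],
    "Fee2": [-25, 0, -20, -5],
})

fee_cols = [c for c in df.columns if c.startswith("Fee")]

# ne(0) returns a new boolean DataFrame, so df itself is not modified.
# Grouping that mask by the Item column and summing counts the True values.
counts = df[fee_cols].ne(0).groupby(df["Item"]).sum()

print(counts)
```

The result is indexed by Item with one count column per fee, and the original amounts are still available in `df` for any later analysis.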
