Pandas iterrows: can't skip a row during iteration when a condition is met

Asked: 2019-01-16 20:09:49

Tags: python pandas loops

I am using iterrows to iterate over a dataframe and compare row n with row n+1. The algorithm is as follows:

    if columns 0,1,2 of row_n != columns 0,1,2 of row_n+1
        output row = row_n
        then check row_n+1 vs row_n+2...

    if columns 0,1,2 of row_n == columns 0,1,2 of row_n+1
        output row columns 0,1,2,3 = row_n columns 0,1,2,3
        output row column 4 = (row_n column 4 + row_n+1 column 4)
        then "skip one row" and check row_n+2 vs row_n+3...

My current code works for the first comparison, but then it stops working. My guess is that the problem happens when I try to "skip one row". I am trying to use index = index + 1, but the output does not look as expected. How can I fix this?

    row_iterator = TSG_table_sorted.iterrows()
    _, row_n1 = row_iterator.__next__()

    for index, row_n0 in row_iterator:
        Terminal_ID_n0 = row_n0['Terminal_ID']
        TSG_n0 = row_n0['TSG']
        Date_n0 = row_n0['Date']
        Vol_n0 = row_n0['Vol']
        Terminal_no_n0 = row_n0['Terminal_no']

        Terminal_ID_n1 = row_n1['Terminal_ID']
        TSG_n1 = row_n1['TSG']
        Date_n1 = row_n1['Date']
        Vol_n1 = row_n1['Vol']

        if (Terminal_ID_n0 == Terminal_ID_n1 and TSG_n0 == TSG_n1 and Date_n0 == Date_n1):
            new_vol = Vol_n0 + Vol_n1
            output_table.loc[i] = [Terminal_ID_n0, TSG_n0, Date_n0, Terminal_no_n0, new_vol]
            i = i + 1
        else:
            output_table.loc[i] = [Terminal_ID_n0, TSG_n0, Date_n0, Terminal_no_n0, Vol_n0]
            i = i + 1
            index = index + 1      # attempt to "skip one row"



    input
          Terminal_ID                TSG        Date Terminal_no  Vol
    508     t_tel_003          CashCheck   10/1/2018         003   61
    9605    t_tel_003          CashCheck   10/1/2018         003    3
    2309    t_tel_003  CommercialDeposit   10/1/2018         003   12
    4439    t_tel_003  CommercialDeposit   10/1/2018         003   10
    9513    t_tel_003  CommercialDeposit   10/1/2018         003  122
    12282   t_tel_003  CommercialDeposit   10/1/2018         003    1

    current output
          Terminal_ID                TSG        Date Terminal_no  Vol
    0       t_tel_003          CashCheck   10/1/2018         003   64
    1       t_tel_003  CommercialDeposit   10/1/2018         003   12
    2       t_tel_003  CommercialDeposit   10/1/2018         003   10
    3       t_tel_003  CommercialDeposit   10/1/2018         003  122
    4       t_tel_003  CommercialDeposit   10/1/2018         003    1

    expected output
          Terminal_ID                TSG        Date Terminal_no  Vol
    0       t_tel_003          CashCheck   10/1/2018         003   64
    1       t_tel_003  CommercialDeposit   10/1/2018         003   22
    3       t_tel_003  CommercialDeposit   10/1/2018         003  123

1 Answer:

Answer 0 (score: 0)

Assuming your dataframe looks like this (I added 2 rows at the bottom, because your example had nothing that would hit the else part of your code):

    Terminal_ID TSG                 Date       Terminal_no  Vol
0   t_tel_003   CashCheck           2018-01-10  3           61
1   t_tel_003   CashCheck           2018-01-10  3           3
2   t_tel_003   CommercialDeposit   2018-01-10  3           12
3   t_tel_003   CommercialDeposit   2018-01-10  3           10
4   t_tel_003   CommercialDeposit   2018-01-10  3           122
5   t_tel_003   CommercialDeposit   2018-01-10  3           1
6   t_tel_004   CommercialDeposit   2018-01-10  3           1
7   t_tel_003   CommercialDeposit   2018-01-10  4           1
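
For reference, that assumed frame can be built directly as follows (only a sketch; Date is kept as a plain string and Terminal_no as an integer, which may differ from your real dtypes):

    import pandas as pd

    # Sample frame assumed in this answer.
    df = pd.DataFrame({
        'Terminal_ID': ['t_tel_003'] * 6 + ['t_tel_004', 't_tel_003'],
        'TSG': ['CashCheck'] * 2 + ['CommercialDeposit'] * 6,
        'Date': ['2018-01-10'] * 8,
        'Terminal_no': [3, 3, 3, 3, 3, 3, 3, 4],
        'Vol': [61, 3, 12, 10, 122, 1, 1, 1],
    })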

As you can see, the last two rows are completely different: neither of them matches on all 4 key columns, so the output should keep these two rows as they are.

Use the following:

    import numpy as np
    import pandas as pd

    # Group every 2 rows together (df.index // 2) along with the key columns,
    # sum 'Vol' for groups whose key columns are duplicated, and drop the
    # all-unique groups (their sum comes back as 0 and is replaced by NaN).
    df_dup = (
        df.groupby([df.index // 2, 'Terminal_ID', 'TSG', 'Date', 'Terminal_no'])[df.columns]
          .apply(lambda x: x[x[x.columns[:-1]].duplicated(keep=False)]['Vol'].sum())
          .reset_index()
          .rename(columns={0: 'Vol'})
          .drop('level_0', axis=1)
          .replace(0, np.nan)
          .dropna()
    )

    # Rows whose key columns (everything except 'Vol') are completely unique.
    df_uniq = df[~df[df.columns[:-1]].duplicated(keep=False)]

    pd.concat([df_dup, df_uniq], ignore_index=True)

Output

    Terminal_ID TSG                 Date       Terminal_no  Vol
0   t_tel_003   CashCheck           2018-01-10  3           64.0
1   t_tel_003   CommercialDeposit   2018-01-10  3           22.0
2   t_tel_003   CommercialDeposit   2018-01-10  3           123.0
3   t_tel_004   CommercialDeposit   2018-01-10  3           1.0
4   t_tel_003   CommercialDeposit   2018-01-10  4           1.0

Explanation: df_dup uses df.index//2 inside the groupby to group the frame two rows at a time, then applies a function to each group that checks whether the rows of the group (here, 2 rows) are identical once the last column, Vol, is excluded, and sums the Vol column when they are.
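
The pairing only works because the frame has a default 0-based RangeIndex: integer-dividing that index by 2 yields one group label per consecutive pair of rows. A tiny illustration (not part of the solution itself):

    import pandas as pd

    # With a default 0-based RangeIndex, integer division by 2 yields one
    # group label per consecutive pair of rows.
    idx = pd.RangeIndex(8)
    print((idx // 2).tolist())    # [0, 0, 1, 1, 2, 2, 3, 3]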

df_uniq: filters the rows that are completely unique. Finally, concatenate the two to get the desired output.
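
The key call there is duplicated(keep=False), which marks every member of a duplicate group rather than only the later repeats, so negating it keeps exactly the rows whose key columns occur once. A small made-up example:

    import pandas as pd

    small = pd.DataFrame({'key': ['a', 'a', 'b'], 'Vol': [1, 2, 3]})
    mask = small[['key']].duplicated(keep=False)   # True, True, False
    print(small[~mask])   # only the row whose key appears exactly once survives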

Hope this helps. Let me know if it does.
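
As an aside, if you would rather stay close to your original row-by-row approach: reassigning index inside a for loop over iterrows has no effect on the iteration, so the skip is easier to express with an explicit positional counter. The following is only a sketch (merge_adjacent is a made-up name, and the key columns are taken from your table):

    import pandas as pd

    def merge_adjacent(df):
        """Compare row i with row i+1; when Terminal_ID, TSG and Date match,
        emit one combined row with the summed Vol and skip the next row."""
        key = ['Terminal_ID', 'TSG', 'Date']
        rows, i = [], 0
        while i < len(df):
            cur = df.iloc[i].copy()
            if i + 1 < len(df) and (cur[key] == df.iloc[i + 1][key]).all():
                cur['Vol'] = cur['Vol'] + df.iloc[i + 1]['Vol']
                rows.append(cur)
                i += 2   # skip the row that was merged in
            else:
                rows.append(cur)
                i += 1
        return pd.DataFrame(rows).reset_index(drop=True)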