Question

I am using iterrows to iterate over a DataFrame and compare row n with row n + 1. The algorithm is as follows:

if columns 0,1,2 of row_n != columns 0,1,2 of row_n+1
output row = row_n 
then check row_n+1 vs row_n+2...

if columns 0,1,2 of row_n == columns 0,1,2 of row_n+1
output row columns 0,1,2,3 = row_n columns 0,1,2,3
output row column 4 = (row_n column 4 + row_n+1 column 4)
then "skip one row" and check row_n+2 vs row_n+3...

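My current code works for the first comparison but then stops merging. My guess is that the problem occurs when I try to "skip one row". I am trying to use index = index + 1, but the output is not what I expect. How can I fix this?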

    row_iterator = TSG_table_sorted.iterrows()
    _, row_n1 = row_iterator.__next__()

    for index, row_n0 in row_iterator:
        Terminal_ID_n0 = row_n0['Terminal_ID'];
        TSG_n0 = row_n0['TSG'];
        Date_n0 = row_n0['Date'];
        Vol_n0 = row_n0['Vol'];     

        Terminal_no_n0 = row_n0['Terminal_no'];

        Terminal_ID_n1 = row_n1['Terminal_ID'];
        TSG_n1 = row_n1['TSG'];
        Date_n1 = row_n1['Date'];
        Vol_n1 = row_n1['Vol'];        

        if (Terminal_ID_n0==Terminal_ID_n1 and TSG_n0==TSG_n1 and Date_n0==Date_n1):
            new_vol=Vol_n0+Vol_n1;
            output_table.loc[i]=[Terminal_ID_n0,TSG_n0,Date_n0,Terminal_no_n0,new_vol]
            i=i+1;
        else:
            output_table.loc[i]=[Terminal_ID_n0,TSG_n0,Date_n0,Terminal_no_n0,Vol_n0]    
            i=i+1;
            index=index+1;



    input
          Terminal_ID                TSG        Date Terminal_no  Vol
    508     t_tel_003          CashCheck   10/1/2018         003   61
    9605    t_tel_003          CashCheck   10/1/2018         003    3
    2309    t_tel_003  CommercialDeposit   10/1/2018         003   12
    4439    t_tel_003  CommercialDeposit   10/1/2018         003   10
    9513    t_tel_003  CommercialDeposit   10/1/2018         003  122
    12282   t_tel_003  CommercialDeposit   10/1/2018         003    1

    current output
          Terminal_ID                TSG        Date Terminal_no  Vol
    0       t_tel_003          CashCheck   10/1/2018         003   64
    1       t_tel_003  CommercialDeposit   10/1/2018         003   12
    2       t_tel_003  CommercialDeposit   10/1/2018         003   10
    3       t_tel_003  CommercialDeposit   10/1/2018         003  122
    4       t_tel_003  CommercialDeposit   10/1/2018         003    1

    expected output
          Terminal_ID                TSG        Date Terminal_no  Vol
    0       t_tel_003          CashCheck   10/1/2018         003   64
    1       t_tel_003  CommercialDeposit   10/1/2018         003   22
    3       t_tel_003  CommercialDeposit   10/1/2018         003  123

Answer 1

Assuming your DataFrame looks like this (I added 2 rows at the bottom, because your example has nothing that exercises the else branch of the code):

        Terminal_ID TSG                 Date       Terminal_no  Vol
    0   t_tel_003   CashCheck           2018-01-10  3           61
    1   t_tel_003   CashCheck           2018-01-10  3           3
    2   t_tel_003   CommercialDeposit   2018-01-10  3           12
    3   t_tel_003   CommercialDeposit   2018-01-10  3           10
    4   t_tel_003   CommercialDeposit   2018-01-10  3           122
    5   t_tel_003   CommercialDeposit   2018-01-10  3           1
    6   t_tel_004   CommercialDeposit   2018-01-10  3           1
    7   t_tel_003   CommercialDeposit   2018-01-10  4           1

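As you can see, the last two rows differ from all the others: taking all 4 key columns into account, neither has a match, so the output should contain these two rows unchanged.

Use the following: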

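    import numpy as np
    import pandas as pd

    # Pair up rows (0,1), (2,3), ... and sum Vol where both rows of a pair share the key columns.
    df_dup = df.groupby([df.index//2,'Terminal_ID','TSG','Date','Terminal_no'])[df.columns].apply(lambda x : x[x[x.columns[:-1]].duplicated(keep=False)]['Vol'].sum()).reset_index().rename(columns={0:'Vol'}).drop('level_0',axis=1).replace(0,np.nan).dropna()
    # Keep rows whose key columns occur only once in the whole frame.
    df_uniq = df[~df[df.columns[:-1]].duplicated(keep=False)]

    pd.concat([df_dup,df_uniq],ignore_index=True)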

Output:

        Terminal_ID TSG                 Date       Terminal_no  Vol
    0   t_tel_003   CashCheck           2018-01-10  3           64.0
    1   t_tel_003   CommercialDeposit   2018-01-10  3           22.0
    2   t_tel_003   CommercialDeposit   2018-01-10  3           123.0
    3   t_tel_004   CommercialDeposit   2018-01-10  3           1.0
    4   t_tel_003   CommercialDeposit   2018-01-10  4           1.0

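Explanation: df_dup groups the rows two at a time by passing df.index//2 to groupby, then applies a function to each group that checks whether its rows (2 rows here) are identical apart from the last column, Vol, and if so sums the Vol column.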
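
To make the pairing step concrete, here is a minimal sketch of what `df.index//2` produces. The tiny one-column frame is a hypothetical stand-in for the real data, and it assumes a default 0..n-1 index; a frame that still carries its original labels (such as 508, 9605, ... in the question) would probably need `reset_index(drop=True)` first.

```python
import pandas as pd

# Hypothetical miniature of the Vol column, with a default 0..n-1 index.
df = pd.DataFrame({"Vol": [61, 3, 12, 10, 122, 1]})

# Integer division by 2 maps rows 0,1 -> 0, rows 2,3 -> 1, rows 4,5 -> 2.
pair_id = df.index // 2
pair_sums = df.groupby(pair_id)["Vol"].sum()

print(list(pair_id))       # [0, 0, 1, 1, 2, 2]
print(pair_sums.tolist())  # [64, 22, 123]
```

Note that the pairs are fixed by position, so rows 1 and 2 are never compared with each other under this scheme.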

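df_uniq: keeps only the rows whose key columns are completely unique. Finally, the two frames are concatenated to get the desired output.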
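
For comparison, the sequential rule from the question (merge row n with row n + 1, then skip a row) can also be written as a plain loop. Reassigning `index` inside a `for` loop only rebinds the local name; the iterator still yields the next row, which is why `index = index + 1` has no effect. The sketch below uses an explicit counter instead. It is not the groupby approach above, and the helper name `merge_adjacent_pairs` and its parameters are assumptions made for illustration.

```python
import pandas as pd

def merge_adjacent_pairs(df, key_cols, sum_col):
    """Walk the rows in order; when row n and row n+1 agree on key_cols,
    emit row n with both sum_col values added and skip row n+1."""
    records = df.to_dict("records")
    merged = []
    n = 0
    while n < len(records):
        row = dict(records[n])
        nxt = records[n + 1] if n + 1 < len(records) else None
        if nxt is not None and all(row[c] == nxt[c] for c in key_cols):
            row[sum_col] += nxt[sum_col]
            n += 2  # skip the row that was just merged
        else:
            n += 1  # no match: move on to the next row
        merged.append(row)
    return pd.DataFrame(merged, columns=df.columns)

# Sample data copied from the question's input table.
df = pd.DataFrame({
    "Terminal_ID": ["t_tel_003"] * 6,
    "TSG": ["CashCheck"] * 2 + ["CommercialDeposit"] * 4,
    "Date": ["10/1/2018"] * 6,
    "Terminal_no": ["003"] * 6,
    "Vol": [61, 3, 12, 10, 122, 1],
})

result = merge_adjacent_pairs(df, ["Terminal_ID", "TSG", "Date"], "Vol")
print(result["Vol"].tolist())  # [64, 22, 123]
```

Unlike fixed pairing by position, this version only skips a row after an actual merge, so an unmatched row does not shift the pairing of the rows that follow it.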

Hope this helps. Let me know.

Pandas iterrows cannot skip rows during iteration when a condition is met
