I am using iterrows to iterate over a DataFrame and compare row n with row n + 1. The algorithm is:
if columns 0,1,2 of row_n != columns 0,1,2 of row_n+1
    output row = row_n
    then check row_n+1 vs row_n+2...
if columns 0,1,2 of row_n == columns 0,1,2 of row_n+1
    output row columns 0,1,2,3 = row_n columns 0,1,2,3
    output row column 4 = (row_n column 4 + row_n+1 column 4)
    then "skip one row" and check row_n+2 vs row_n+3...
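My current code works for the first comparison but then stops. My guess is that the problem occurs when I try to "skip one row". I am trying to use index = index + 1, but the output is not what I expect. How can I fix this?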
row_iterator = TSG_table_sorted.iterrows()
_, row_n1 = next(row_iterator)
for index, row_n0 in row_iterator:
    Terminal_ID_n0 = row_n0['Terminal_ID']
    TSG_n0 = row_n0['TSG']
    Date_n0 = row_n0['Date']
    Vol_n0 = row_n0['Vol']
    Terminal_no_n0 = row_n0['Terminal_no']
    Terminal_ID_n1 = row_n1['Terminal_ID']
    TSG_n1 = row_n1['TSG']
    Date_n1 = row_n1['Date']
    Vol_n1 = row_n1['Vol']
    if (Terminal_ID_n0 == Terminal_ID_n1 and TSG_n0 == TSG_n1 and Date_n0 == Date_n1):
        new_vol = Vol_n0 + Vol_n1
        output_table.loc[i] = [Terminal_ID_n0, TSG_n0, Date_n0, Terminal_no_n0, new_vol]
        i = i + 1
    else:
        output_table.loc[i] = [Terminal_ID_n0, TSG_n0, Date_n0, Terminal_no_n0, Vol_n0]
        i = i + 1
    index = index + 1
Input:
       Terminal_ID                TSG       Date  Terminal_no  Vol
508      t_tel_003          CashCheck  10/1/2018          003   61
9605     t_tel_003          CashCheck  10/1/2018          003    3
2309     t_tel_003  CommercialDeposit  10/1/2018          003   12
4439     t_tel_003  CommercialDeposit  10/1/2018          003   10
9513     t_tel_003  CommercialDeposit  10/1/2018          003  122
12282    t_tel_003  CommercialDeposit  10/1/2018          003    1
Current output:
   Terminal_ID                TSG       Date  Terminal_no  Vol
0    t_tel_003          CashCheck  10/1/2018          003   64
1    t_tel_003  CommercialDeposit  10/1/2018          003   12
2    t_tel_003  CommercialDeposit  10/1/2018          003   10
3    t_tel_003  CommercialDeposit  10/1/2018          003  122
4    t_tel_003  CommercialDeposit  10/1/2018          003    1
Expected output:
   Terminal_ID                TSG       Date  Terminal_no  Vol
0    t_tel_003          CashCheck  10/1/2018          003   64
1    t_tel_003  CommercialDeposit  10/1/2018          003   22
3    t_tel_003  CommercialDeposit  10/1/2018          003  123
Answer 0 (score: 0)
Assume your DataFrame looks like this (I added 2 rows at the bottom because your example has nothing that exercises the else branch of the code):
   Terminal_ID                TSG        Date  Terminal_no  Vol
0    t_tel_003          CashCheck  2018-01-10            3   61
1    t_tel_003          CashCheck  2018-01-10            3    3
2    t_tel_003  CommercialDeposit  2018-01-10            3   12
3    t_tel_003  CommercialDeposit  2018-01-10            3   10
4    t_tel_003  CommercialDeposit  2018-01-10            3  122
5    t_tel_003  CommercialDeposit  2018-01-10            3    1
6    t_tel_004  CommercialDeposit  2018-01-10            3    1
7    t_tel_003  CommercialDeposit  2018-01-10            4    1
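As you can see, the last two rows differ from all the others: taking all 4 columns into account, they have no match (so the output should contain these two rows unchanged).
Use the following: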
import numpy as np
import pandas as pd
# df_dup: pairs of rows (index // 2) that match on every column except Vol, with Vol summed; df_uniq: rows with no duplicate at all
df_dup = df.groupby([df.index//2,'Terminal_ID','TSG','Date','Terminal_no'])[df.columns].apply(lambda x : x[x[x.columns[:-1]].duplicated(keep=False)]['Vol'].sum()).reset_index().rename(columns={0:'Vol'}).drop('level_0',axis=1).replace(0,np.nan).dropna()
df_uniq = df[~df[df.columns[:-1]].duplicated(keep=False)]
pd.concat([df_dup,df_uniq],ignore_index=True)
Output:
   Terminal_ID                TSG        Date  Terminal_no    Vol
0    t_tel_003          CashCheck  2018-01-10            3   64.0
1    t_tel_003  CommercialDeposit  2018-01-10            3   22.0
2    t_tel_003  CommercialDeposit  2018-01-10            3  123.0
3    t_tel_004  CommercialDeposit  2018-01-10            3    1.0
4    t_tel_003  CommercialDeposit  2018-01-10            4    1.0
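Explanation
df_dup: passing df.index//2 to groupby puts every 2 consecutive rows into one group. A function is then applied to each group to check whether its rows (2 rows here) are identical once the last column, Vol, is excluded; if they are, the Vol column is summed.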
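To make the pairing concrete, here is a small sketch of what integer-dividing the index by 2 produces. It assumes a default 0..n-1 integer index, as in the frame above; the toy frame and the names `toy`, `pair_label` and `pair_sums` are illustrative only and do not come from the original code.

```python
import pandas as pd

# Toy frame with a default RangeIndex (0..5); the values are illustrative only.
toy = pd.DataFrame({"Vol": [61, 3, 12, 10, 122, 1]})

# Integer division by 2 gives rows 0-1 the label 0, rows 2-3 the label 1, and so on.
pair_label = toy.index // 2
print(pair_label.tolist())   # [0, 0, 1, 1, 2, 2]

# Grouping on that label sums each consecutive pair of rows.
pair_sums = toy.groupby(pair_label)["Vol"].sum()
print(pair_sums.tolist())    # [64, 22, 123]
```

Note that the pairs are fixed at rows (0, 1), (2, 3), ...; with a non-default index, such as the 508, 9605, ... labels in the question, the frame would presumably need a reset_index(drop=True) first.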
df_uniq: keeps only the rows that are completely unique, that is, rows with no duplicate once Vol is excluded.
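As a rough illustration of the uniqueness filter, the sketch below shows how duplicated(keep=False) marks rows. The three-row frame and the names `key_cols`, `has_twin` and `unique_rows` are made up for the example.

```python
import pandas as pd

# Illustrative data: the first two rows agree on everything except Vol.
toy = pd.DataFrame({
    "TSG": ["CashCheck", "CashCheck", "CommercialDeposit"],
    "Terminal_no": [3, 3, 4],
    "Vol": [61, 3, 1],
})

key_cols = toy.columns[:-1]          # every column except the last one (Vol)

# keep=False flags every member of a duplicated set, not only the later repeats.
has_twin = toy[key_cols].duplicated(keep=False)
print(has_twin.tolist())             # [True, True, False]

# Negating the mask keeps only rows that have no twin anywhere in the frame.
unique_rows = toy[~has_twin]
print(unique_rows["Vol"].tolist())   # [1]
```

This check spans the whole frame rather than a single pair, so a row is dropped from the unique set whenever a matching row exists at any position.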
Finally, concatenate the two to get the desired output.
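For comparison, the row-by-row rule from the question can also be written directly. This is a minimal sketch, assuming the frame is already sorted and has the five columns shown in the question; `merge_adjacent_pairs` is a hypothetical helper name, not a pandas function. It uses an explicit position counter because rebinding the loop variable inside a for loop (as with index = index + 1) does not advance the underlying iterator, so no row is actually skipped.

```python
import pandas as pd

def merge_adjacent_pairs(df, key_cols=("Terminal_ID", "TSG", "Date"), value_col="Vol"):
    """Compare row n with row n+1; merge matching pairs, keep other rows unchanged."""
    key_cols = list(key_cols)
    records = df.to_dict("records")
    merged = []
    pos = 0
    while pos < len(records):
        current = dict(records[pos])
        has_next = pos + 1 < len(records)
        if has_next and all(current[c] == records[pos + 1][c] for c in key_cols):
            # Keys match: add the next row's value and skip over that row.
            current[value_col] += records[pos + 1][value_col]
            pos += 2
        else:
            # No match: emit the row unchanged and move on by one.
            pos += 1
        merged.append(current)
    return pd.DataFrame(merged, columns=df.columns)

# Sample data copied from the question's input table.
sample = pd.DataFrame({
    "Terminal_ID": ["t_tel_003"] * 6,
    "TSG": ["CashCheck"] * 2 + ["CommercialDeposit"] * 4,
    "Date": ["10/1/2018"] * 6,
    "Terminal_no": ["003"] * 6,
    "Vol": [61, 3, 12, 10, 122, 1],
})
result = merge_adjacent_pairs(sample)
print(result["Vol"].tolist())   # [64, 22, 123]
```

Unlike the fixed index//2 pairing, this version moves forward by a single row when a pair does not match, which is the behavior described in the question.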
Hope this helps. Let me know.