Question

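I have a list of invoices that were sent to customers. Sometimes, however, a bad invoice is sent and later cancelled. My Pandas DataFrame looks like this, except much larger (about 3 million rows):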

index | customer | invoice_nr | amount | date
---------------------------------------------------
0     | 1        | 1          | 10     | 01-01-2016
1     | 1        | 1          | -10    | 01-01-2016
2     | 1        | 1          | 11     | 01-01-2016
3     | 1        | 2          | 10     | 02-01-2016
4     | 2        | 3          | 7      | 01-01-2016
5     | 2        | 4          | 12     | 02-01-2016
6     | 2        | 4          | 8      | 02-01-2016
7     | 2        | 4          | -12    | 02-01-2016
8     | 2        | 4          | 4      | 02-01-2016
...   | ...      | ...        | ...    | ...
...   | ...      | ...        | ...    | ...

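Now I want to drop all rows for which customer, invoice_nr and date are identical but amount has the opposite value.
Corrections of invoices always take place on the same day and with the same invoice number. The invoice number is uniquely bound to the customer and always corresponds to one transaction (which can consist of multiple components, for example customer = 2, invoice_nr = 4). An invoice is only ever corrected by changing the amount charged or by splitting the amount into smaller components. Hence a cancelled value is never repeated on the same invoice_nr.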

Any help on how to program this would be much appreciated.
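
For anyone who wants to reproduce this, the sample rows above can be built roughly as follows. This is only a sketch: the dates are kept as plain strings (an assumption, the real column may well be a datetime), and `expected_index` is a hypothetical helper name for the rows that should survive the clean-up.

```python
import pandas as pd

# Sample rows copied from the table in the question.
# Assumption: dates are stored as strings in DD-MM-YYYY or MM-DD-YYYY form.
df = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount": [10, -10, 11, 10, 7, 12, 8, -12, 4],
        "date": ["01-01-2016", "01-01-2016", "01-01-2016", "02-01-2016",
                 "01-01-2016", "02-01-2016", "02-01-2016", "02-01-2016",
                 "02-01-2016"],
    }
)
df.index.name = "index"

# Rows 0/1 (10 and -10) and rows 5/7 (12 and -12) cancel each other,
# so these are the index labels that should remain afterwards.
expected_index = [2, 3, 4, 6, 8]
print(df.loc[expected_index])
```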

Answer 1

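def remove_cancelled_transactions(df):
    # Within one group every row has the same absolute amount.
    trans_neg = df.amount < 0
    # Drop each negative row and the row directly before it;
    # fill_value=False keeps the shifted mask boolean (no NaN in the last slot).
    return df.loc[~(trans_neg | trans_neg.shift(-1, fill_value=False))]

# Group rows that share customer, invoice number, date and absolute amount.
groups = [df.customer, df.invoice_nr, df.date, df.amount.abs()]
result = df.groupby(groups, as_index=False, group_keys=False) \
  .apply(remove_cancelled_transactions)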

Answer 2

You can use filter to select every group in which the sum of amount is 0 and the number of rows modulo 2 is 0:

EDIT by comment:

If your real data has no duplicates for one invoice, one customer and one date, you can use it this way:

print (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
        .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0)))

       customer  invoice_nr  amount        date
index                                          
0             1           1      10  01-01-2016
1             1           1     -10  01-01-2016
5             2           4      12  02-01-2016
6             2           4     -12  02-01-2016

idx = (df.groupby([df.customer, df.invoice_nr, df.date, df.amount.abs()])
         .filter(lambda x: (len(x.amount.abs()) % 2 == 0 ) and (x.amount.sum() == 0))
         .index)

print (idx)      
Int64Index([0, 1, 5, 6], dtype='int64', name='index')

print (df.drop(idx))  
       customer  invoice_nr  amount        date
index                                          
2             1           1      11  01-01-2016
3             1           2      10  02-01-2016
4             2           3       7  01-01-2016
7             2           4       8  02-01-2016
8             2           4       4  02-01-2016
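
With around 3 million rows, calling a Python lambda once per group can get slow. A possible vectorized variant is sketched below. It is not the filter approach above but a different technique: each negative row is paired with one positive row of the same absolute amount via `cumcount`, and both are dropped. The helper column names (`abs_amount`, `positive`, `pair_nr`) are made up for this illustration, and the sample rows are the ones from the question.

```python
import pandas as pd

# Sample rows from the question (dates kept as strings for simplicity).
df = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount": [10, -10, 11, 10, 7, 12, 8, -12, 4],
        "date": ["01-01-2016"] * 3 + ["02-01-2016", "01-01-2016"]
                + ["02-01-2016"] * 4,
    }
)

keys = ["customer", "invoice_nr", "date"]

# Illustrative helper columns: absolute amount and sign of each row.
work = df.assign(abs_amount=df["amount"].abs(), positive=df["amount"] > 0)

# Number the rows inside each (keys, abs_amount, sign) bucket: the first
# +12 gets 0, the first -12 also gets 0, a second +12 would get 1, ...
work["pair_nr"] = work.groupby(keys + ["abs_amount", "positive"]).cumcount()

# pair_nr is unique inside a bucket, so two rows can only share
# (keys, abs_amount, pair_nr) when they have opposite signs.
cancelled = work.duplicated(subset=keys + ["abs_amount", "pair_nr"], keep=False)

result = df[~cancelled]
print(result)
```

Unlike the even-count check, this pairs rows one by one, so an unmatched third row such as 10, -10, 10 would leave a single 10 behind. Whether that is the desired behaviour depends on the data.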

Answer 3

What if you simply group by all 3 fields? The resulting amount will be net of any cancelled invoices:

df2 = df.groupby(['customer','invoice_nr','date']).sum()

Result

customer invoice_nr date
1        1          2016/01/01      11
         2          2016/02/01      10
2        3          2016/01/01       7
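
As a self-contained sketch of the same idea, here is the grouping applied to the sample rows from the question. The variable name `totals` and the string dates are assumptions made for this example, and only the amount column is summed.

```python
import pandas as pd

# Sample rows from the question (dates kept as strings for simplicity).
df = pd.DataFrame(
    {
        "customer": [1, 1, 1, 1, 2, 2, 2, 2, 2],
        "invoice_nr": [1, 1, 1, 2, 3, 4, 4, 4, 4],
        "amount": [10, -10, 11, 10, 7, 12, 8, -12, 4],
        "date": ["01-01-2016"] * 3 + ["02-01-2016", "01-01-2016"]
                + ["02-01-2016"] * 4,
    }
)

# One row per invoice; cancelled amounts net out to zero inside the sum.
totals = (
    df.groupby(["customer", "invoice_nr", "date"], as_index=False)["amount"]
      .sum()
)
print(totals)
```

Note that this keeps only one total per invoice: for customer 2, invoice 4 the components 8 and 4 are merged into a single amount of 12, so this fits only if the individual component rows are not needed.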

Removing cancelled rows from a Pandas DataFrame
