Question

I have a huge dataframe in pandas in the following format:

period  from_       to_        value
2020-07 Jonny       Karl       15.00
2020-08 Matt        Jonny      5.00
2020-08 Matt        Karl       5.00
2020-08 Matt        Karl       10.00
2020-08 Jonny       Matt       10.00

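Each row holds a value that one person needs to pay another over the course of a year. The same names recur throughout the dataset.

So I want to see how much each person owes another during the year. For that, I could simply use: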

sum_df = df.groupby(["period", "from_", "to_"]).agg({"value": "sum"})
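For reference, here is a minimal sketch that rebuilds the sample rows shown above and runs this groupby. The literal `pd.DataFrame` constructor is an assumption for illustration; the real data would be loaded from elsewhere.

```python
import pandas as pd

# Sample rows copied from the table in the question (illustrative only)
df = pd.DataFrame({
    "period": ["2020-07", "2020-08", "2020-08", "2020-08", "2020-08"],
    "from_": ["Jonny", "Matt", "Matt", "Matt", "Jonny"],
    "to_": ["Karl", "Jonny", "Karl", "Karl", "Matt"],
    "value": [15.0, 5.0, 5.0, 10.0, 10.0],
})

# Plain aggregation: sums repeated rows, but treats A -> B and B -> A as unrelated
sum_df = df.groupby(["period", "from_", "to_"]).agg({"value": "sum"})
print(sum_df)
```

In the result, Jonny -> Matt and Matt -> Jonny remain two separate rows for 2020-08, so opposite debts are summed per direction but never offset against each other.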

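But this is where my problem lies: I am trying to find an efficient way to perform this aggregation so that it can "recognize" offsetting debts. If person A owes person B $5 and person B owes person A $10, the result should show only that person B owes person A $5 for that period. The result would be the following dataframe: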

period  from_       to_        value
2020-07 Jonny       Karl       15.00
2020-08 Matt        Karl       15.00
2020-08 Jonny       Matt       5.00

Can someone point me in a direction I can follow?

Answer 1

Let me post a solution here for you to explore. I will add an explanation later.

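import numpy as np

# Sort each (from_, to_) pair alphabetically so A->B and B->A share one key
pairs = df[['from_','to_']]
sorted_pairs = np.sort(pairs.values, axis=1)

# Flip the sign of rows whose names had to be swapped, then sum per period and pair
(df['value'].mul(np.where((pairs==sorted_pairs).all(1), 1, -1))
     .groupby([df['period'],sorted_pairs[:,0], sorted_pairs[:,1]])
     .sum()
     .reset_index(name='value')
)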

Output:

    period level_1 level_2  value
0  2020-07   Jonny    Karl   15.0
1  2020-08   Jonny    Matt    5.0
2  2020-08    Karl    Matt  -15.0
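In this output a negative value means the debt runs from the second name to the first (here Matt owes Karl 15). The sketch below is one possible way to turn that signed result into the from_/to_ layout the question asks for. It is an illustration rather than part of the original answer: the sample frame, the `net` and `flip` names, and the choice to drop pairs that cancel out exactly are all assumptions.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (illustrative only)
df = pd.DataFrame({
    "period": ["2020-07", "2020-08", "2020-08", "2020-08", "2020-08"],
    "from_": ["Jonny", "Matt", "Matt", "Matt", "Jonny"],
    "to_": ["Karl", "Jonny", "Karl", "Karl", "Matt"],
    "value": [15.0, 5.0, 5.0, 10.0, 10.0],
})

pairs = df[["from_", "to_"]].to_numpy()
sorted_pairs = np.sort(pairs, axis=1)
# +1 if the row is already in alphabetical order, -1 if the names were swapped
sign = np.where((pairs == sorted_pairs).all(axis=1), 1, -1)

net = (
    (df["value"] * sign)
    .groupby([df["period"], sorted_pairs[:, 0], sorted_pairs[:, 1]])
    .sum()
    .reset_index(name="value")
)
net.columns = ["period", "from_", "to_", "value"]

# A negative net value means the debt flows the other way: swap the two names
flip = net["value"] < 0
net.loc[flip, ["from_", "to_"]] = net.loc[flip, ["to_", "from_"]].to_numpy()
net["value"] = net["value"].abs()

# Assumption: pairs whose debts cancel exactly are dropped
net = net[net["value"] > 0].reset_index(drop=True)
print(net)
```

On the sample data this yields the same three debts as the expected frame in the question, though the row order may differ.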

Answer 2

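My suggestion is a bit tricky. First, merge the grouped result with itself, matching the from_ column against to_ and the to_ column against from_. Subtract value_y from value in the merged result and store it in a variable. With this variable, you can then use update to update the column in the original DF.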

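# Total owed per (period, from_, to_)
df1 = df.groupby(['period','from_','to_'])['value'].sum().reset_index()

# Self-merge: pair each from_ -> to_ row with its reverse row in the same period;
# reset_index() keeps the original row label in an 'index' column
temp = df1.reset_index().merge(df1,
                               left_on=['period', 'from_', 'to_'],
                               right_on=['period', 'to_', 'from_'],
                               suffixes=('', '_y'))

# Net amount: what this direction owes minus what the reverse direction owes
temp['value'] = temp['value'] - temp['value_y']
temp = temp[['index','period', 'from_', 'to_', 'value']]

# Write the netted values back onto the matching rows of df1
temp.set_index('index', inplace=True)
df1.update(temp)

df1.head()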
    period  from_   to_     value
0   2020-07 Jonny   Karl    15.0
1   2020-08 Jonny   Matt    5.0
2   2020-08 Matt    Jonny   -5.0
3   2020-08 Matt    Karl    15.0

Here you can decide how to handle rows for people who end up owing nothing: either remove them from the DF or set their value column to zero.

#remove rows where value is equal to or less than zero
df1.loc[df1['value'] > 0]
#output:
    period  from_   to_     value
0   2020-07 Jonny   Karl    15.0
1   2020-08 Jonny   Matt    5.0
3   2020-08 Matt    Karl    15.0

#setting the value column to zero where it is negative
df1.loc[df1['value'] < 0, 'value'] = 0
#output:
    period  from_   to_     value
0   2020-07 Jonny   Karl    15.0
1   2020-08 Jonny   Matt    5.0
2   2020-08 Matt    Jonny   0.0
3   2020-08 Matt    Karl    15.0
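
To make the self-merge step in this answer more concrete, the following sketch rebuilds the sample rows from the question (the literal constructor is an assumption for illustration) and prints the intermediate `temp` frame before the subtraction. Only rows that have a reverse counterpart in the same period survive the inner merge.

```python
import pandas as pd

# Sample rows copied from the question (illustrative only)
df = pd.DataFrame({
    "period": ["2020-07", "2020-08", "2020-08", "2020-08", "2020-08"],
    "from_": ["Jonny", "Matt", "Matt", "Matt", "Jonny"],
    "to_": ["Karl", "Jonny", "Karl", "Karl", "Matt"],
    "value": [15.0, 5.0, 5.0, 10.0, 10.0],
})

# Total owed per (period, from_, to_)
df1 = df.groupby(["period", "from_", "to_"])["value"].sum().reset_index()

# Match every from_ -> to_ row with the to_ -> from_ row of the same period.
# 'value' is the amount in this direction, 'value_y' the amount in the reverse one.
temp = df1.reset_index().merge(
    df1,
    left_on=["period", "from_", "to_"],
    right_on=["period", "to_", "from_"],
    suffixes=("", "_y"),
)
print(temp[["index", "period", "from_", "to_", "value", "value_y"]])
```

Rows without a reverse pair, such as Jonny -> Karl and Matt -> Karl, are absent from `temp`, so `update` leaves their values in `df1` unchanged.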

Pandas: finding pairs in a groupby function
