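I have a large DataFrame (150,000 x 25) of financial transactions. This DataFrame represents a kind of financial holding account, so transactions usually "pass through" this ledger. For example (below), the row at position 0 shows a -$123.21 transaction. The row at position 2 is the corresponding (or "coupled") transaction of +$123.21, and it matches on category, type, and source.
My goal is to create a new column that identifies the key of the "coupled" transaction. So the "coupling key" for row 0 is the key of row 2, and vice versa.
Note that the rows in positions 9-14 rule out a solution that searches for min and max matches (@David Erickson previously provided an excellent answer along those lines). The row at position 9 shows a +$10 transaction. It is coupled with the first -$10, found at position 11 (not with the transaction found at position 14). In this way, each transaction is coupled with either zero or one other transaction (but never more than one).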
import pandas as pd
d_in = {'key' : ['80000001', '80000002', '80000003', '80000004', '80000005', '80000006', '80000007', '80000008', '80000009', '80000010', '80000011', '80000012', '80000013', '80000014', '80000015'],
'date' : ['20200901', '20200901', '20200902', '20200902', '20200902','20200903', '20200904', '20200905', '20200905', '20200906', '20200906', '20200906', '20200906', '20200906', '20200906'],
'category' : ['Z293', 'B993', 'Z293', 'B993', 'W884', 'C123', 'V332', 'C123', 'V332', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213'],
'type' : ['tools', 'supplies', 'tools', 'supplies', 'repairs', 'custom', 'misc', 'custom', 'misc', 'technology', 'technology', 'technology', 'technology', 'technology', 'technology'],
'source' : ['Q112', 'E443', 'Q112', 'E443', 'P443', 'B334', 'E449', 'B334', 'E449', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32'],
'amount' : [-123.21, 3.12, 123.21, -3.12, 9312.00, 312.23, -13.23, -312.23, 13.23, 10, 10, -10, -10, 10, -10]}
df_in = pd.DataFrame(data=d_in)
d_out = {'key' : ['80000001', '80000002', '80000003', '80000004', '80000005', '80000006', '80000007', '80000008', '80000009', '80000010', '80000011', '80000012', '80000013', '80000014', '80000015'],
'date' : ['20200901', '20200901', '20200902', '20200902', '20200902','20200903', '20200904', '20200905', '20200905', '20200906', '20200906', '20200906', '20200906', '20200906', '20200906'],
'category' : ['Z293', 'B993', 'Z293', 'B993', 'W884', 'C123', 'V332', 'C123', 'V332', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213'],
'type' : ['tools', 'supplies', 'tools', 'supplies', 'repairs', 'custom', 'misc', 'custom', 'misc', 'technology', 'technology', 'technology', 'technology', 'technology', 'technology'],
'source' : ['Q112', 'E443', 'Q112', 'E443', 'P443', 'B334', 'E449', 'B334', 'E449', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32'],
'amount' : [-123.21, 3.12, 123.21, -3.12, 9312.00, 312.23, -13.23, -312.23, 13.23, 10, 10, -10, -10, 10, -10],
'coupling_key' : ['80000003', '80000004', '80000001', '80000002', 'none', '80000008', '80000009', '80000006', '80000007', '80000012', '80000013', '80000010', '80000011', '80000015', '80000014']}
df_out = pd.DataFrame(data=d_out)
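The question states two properties of a valid result: a coupling is mutual, and the two coupled amounts cancel. A minimal sketch of a check for those properties, in plain Python, might look like the following. The function name `check_coupling` and the `missing` marker are hypothetical and only serve this illustration.

```python
def check_coupling(keys, amounts, coupling_keys, missing='none'):
    """Return True if every coupling is mutual and the coupled amounts cancel."""
    partner = dict(zip(keys, coupling_keys))
    amount = dict(zip(keys, amounts))
    for key, other in partner.items():
        if other == missing:
            continue  # uncoupled transaction, nothing to check
        if other == key:
            return False  # a transaction cannot be coupled with itself
        if partner.get(other) != key:
            return False  # the coupling is not mutual
        if amount[key] != -amount[other]:
            return False  # the amounts do not cancel
    return True


# Three rows taken from the expected output (positions 0, 2 and 4).
keys = ['80000001', '80000003', '80000005']
amounts = [-123.21, 123.21, 9312.00]
coupling_keys = ['80000003', '80000001', 'none']
print(check_coupling(keys, amounts, coupling_keys))  # True
```

It does not verify the "first match wins" ordering, only that the pairs are consistent.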
Most of the solutions I have researched involve the pandas groupby function. I am currently looking at groupby(...).nth(...). I suspect a solution might also involve .mask or .duplicated().
Answer 0 (score: 2)
You could do the following:
Step 1: Define the transform function:
def coupling(ser):
    keys = ser.index
    values = ser.values
    couples = [None] * len(ser)
    free = {*range(len(ser))}  # positions not yet coupled
    while free:
        i = min(free)  # earliest uncoupled position
        j = i + 1
        # look for the first later uncoupled position with the opposite amount
        while j < len(ser):
            if (values[j] == -values[i]
                    and j in free):
                couples[i], couples[j] = keys[j], keys[i]
                free.remove(j)
                break
            j += 1
        free.remove(i)
    return couples
Step 2: Apply it to the groups:
df_out = df_in.set_index('key')
group = ['category', 'type', 'source']
df_out['coupling_key'] = (df_out[group + ['amount']]
                          .groupby(group)
                          .transform(coupling))
df_out.reset_index(drop=False, inplace=True)
Result:
key date category type source amount coupling_key
0 80000001 20200901 Z293 tools Q112 -123.21 80000003
1 80000002 20200901 B993 supplies E443 3.12 80000004
2 80000003 20200902 Z293 tools Q112 123.21 80000001
3 80000004 20200902 B993 supplies E443 -3.12 80000002
4 80000005 20200902 W884 repairs P443 9312.00 None
5 80000006 20200903 C123 custom B334 312.23 80000008
6 80000007 20200904 V332 misc E449 -13.23 80000009
7 80000008 20200905 C123 custom B334 -312.23 80000006
8 80000009 20200905 V332 misc E449 13.23 80000007
9 80000010 20200906 Z213 technology QQ32 10.00 80000012
10 80000011 20200906 Z213 technology QQ32 10.00 80000013
11 80000012 20200906 Z213 technology QQ32 -10.00 80000010
12 80000013 20200906 Z213 technology QQ32 -10.00 80000011
13 80000014 20200906 Z213 technology QQ32 10.00 80000015
14 80000015 20200906 Z213 technology QQ32 -10.00 80000014
(I am assuming that the date column is sorted as in the example.)
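The nested loop in `coupling` rescans the rest of the group for every row, which may become slow for large groups with many repeated amounts. As a hedged alternative (not the answerer's method), the same "earliest unmatched opposite amount" rule can be sketched in a single pass with one queue per amount. The name `couple_fifo` is hypothetical, and the function takes plain sequences rather than a Series to keep the sketch self-contained.

```python
from collections import defaultdict, deque


def couple_fifo(keys, amounts):
    """Return, for each position, the key of its coupled partner (or None)."""
    waiting = defaultdict(deque)  # amount -> positions still waiting for a partner
    couples = [None] * len(keys)
    for pos, amount in enumerate(amounts):
        partners = waiting[-amount]
        if partners:
            first = partners.popleft()  # earliest unmatched opposite amount
            couples[first], couples[pos] = keys[pos], keys[first]
        else:
            waiting[amount].append(pos)  # no partner yet, wait for a later row
    return couples


# The Z213 / technology / QQ32 group from the question (positions 9-14).
keys = ['80000010', '80000011', '80000012', '80000013', '80000014', '80000015']
amounts = [10, 10, -10, -10, 10, -10]
print(couple_fifo(keys, amounts))
# ['80000012', '80000013', '80000010', '80000011', '80000015', '80000014']
```

On this group it yields the same pairs as the result table above. Like `coupling`, it pairs zeros with each other, because `-0 == 0`.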
Answer 1 (score: 1)
Another solution, trying to stick to "pure pandas" functionality (whatever that means!).
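To understand the code below, follow these steps:
1. Group the transactions by category, type, source, and absolute amount.
2. Within each group, match the first positive amount with the first negative amount, the second with the second, and so on (using cumcount()).
3. This gives a list of matched pairs (plus any unmatched singletons).
4. Duplicate each pair in reverse order (using reversed(..)).
5. Convert the result to a DataFrame and join it to the original DataFrame.
Step 5 could probably be done more elegantly, but this works.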
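The cumcount() matching idea in this answer can be imitated in plain Python to see what it does: each row is numbered by how many times its signed amount has already occurred, and rows with the same absolute amount and the same occurrence number end up together. This is only an illustrative sketch. The name `pair_by_occurrence` is hypothetical, and it does not use pandas.

```python
from collections import defaultdict


def pair_by_occurrence(keys, amounts):
    """Group keys by (absolute amount, occurrence number of the signed amount)."""
    seen = defaultdict(int)    # signed amount -> occurrences so far (like cumcount)
    pairs = defaultdict(list)  # (abs amount, occurrence number) -> keys
    for key, amount in zip(keys, amounts):
        pairs[(abs(amount), seen[amount])].append(key)
        seen[amount] += 1
    return list(pairs.values())


# The Z213 / technology / QQ32 group from the question (positions 9-14).
keys = ['80000010', '80000011', '80000012', '80000013', '80000014', '80000015']
amounts = [10, 10, -10, -10, 10, -10]
print(pair_by_occurrence(keys, amounts))
# [['80000010', '80000012'], ['80000011', '80000013'], ['80000014', '80000015']]
```

An amount with no opposite counterpart comes out as a one-element list, which corresponds to the singletons mentioned in the steps.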
match = []
# one group per (category, type, source, absolute amount)
for _, df2 in df_in.groupby([df_in['category'], df_in['type'], df_in['source'], df_in['amount'].abs()], as_index=False):
    # pair the n-th occurrence of +x with the n-th occurrence of -x
    group_match = df2.groupby(df2.groupby(['amount']).cumcount())['key'].apply(list)
    match.extend(group_match)
    # add each pair in reverse order so both rows get a coupling_key
    match.extend([list(reversed(m)) for m in group_match])
match_df = pd.DataFrame(data = match, columns = ['key', 'coupling_key']).drop_duplicates()
df_out = df_in.merge(match_df, on='key')
This produces the desired df_out:
key date category type source amount coupling_key
0 80000001 20200901 Z293 tools Q112 -123.21 80000003
1 80000002 20200901 B993 supplies E443 3.12 80000004
2 80000003 20200902 Z293 tools Q112 123.21 80000001
3 80000004 20200902 B993 supplies E443 -3.12 80000002
4 80000005 20200902 W884 repairs P443 9312.00 None
5 80000006 20200903 C123 custom B334 312.23 80000008
6 80000007 20200904 V332 misc E449 -13.23 80000009
7 80000008 20200905 C123 custom B334 -312.23 80000006
8 80000009 20200905 V332 misc E449 13.23 80000007
9 80000010 20200906 Z213 technology QQ32 10.00 80000012
10 80000011 20200906 Z213 technology QQ32 10.00 80000013
11 80000012 20200906 Z213 technology QQ32 -10.00 80000010
12 80000013 20200906 Z213 technology QQ32 -10.00 80000011
13 80000014 20200906 Z213 technology QQ32 10.00 80000015
14 80000015 20200906 Z213 technology QQ32 -10.00 80000014
If the amount column contains zeros and they should be matched, as discussed in the comments below, we can modify the loop as follows:
for _, df2 in df_in.groupby([df_in['category'], df_in['type'], df_in['source'], df_in['amount'].abs()], as_index=False):
    if (df2['amount'].iloc[0] == 0):
        # zeros have no sign, so pair consecutive rows: (0, 1), (2, 3), ...
        group_match = df2.groupby([i//2 for i in range(len(df2))])['key'].apply(list)
    else:
        group_match = df2.groupby(df2.groupby(['amount']).cumcount())['key'].apply(list)
    match.extend(group_match)
    match.extend([list(reversed(m)) for m in group_match])
With df_in extended as follows (note the last three rows, which have amount 0):
d_in = {'key' : ['80000001', '80000002', '80000003', '80000004', '80000005', '80000006', '80000007', '80000008', '80000009', '80000010', '80000011', '80000012', '80000013', '80000014', '80000015', '1', '2', '3'],
'date' : ['20200901', '20200901', '20200902', '20200902', '20200902','20200903', '20200904', '20200905', '20200905', '20200906', '20200906', '20200906', '20200906', '20200906', '20200906', '20200906', '20200906', '20200906'],
'category' : ['Z293', 'B993', 'Z293', 'B993', 'W884', 'C123', 'V332', 'C123', 'V332', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213', 'Z213'],
'type' : ['tools', 'supplies', 'tools', 'supplies', 'repairs', 'custom', 'misc', 'custom', 'misc', 'technology', 'technology', 'technology', 'technology', 'technology', 'technology','technology', 'technology', 'technology'],
'source' : ['Q112', 'E443', 'Q112', 'E443', 'P443', 'B334', 'E449', 'B334', 'E449', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32', 'QQ32'],
'amount' : [-123.21, 3.12, 123.21, -3.12, 9312.00, 312.23, -13.23, -312.23, 13.23, 10, 10, -10, -10, 10, -10,0,0,0]}
we get (omitting the rows that are the same as before):
key date category type source amount coupling_key
15 1 20200906 Z213 technology QQ32 0.00 2
16 2 20200906 Z213 technology QQ32 0.00 1
17 3 20200906 Z213 technology QQ32 0.00 None
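The zero branch relies on the labels produced by `i//2`: consecutive rows share a label, so they are paired two at a time and an odd row is left over. A small plain-Python sketch of that labelling, using the three zero-amount keys from the extended example (the variable names are only for illustration):

```python
keys = ['1', '2', '3']  # the three zero-amount rows
labels = [i // 2 for i in range(len(keys))]  # one label per consecutive pair
groups = {}
for key, label in zip(keys, labels):
    groups.setdefault(label, []).append(key)
print(labels)                 # [0, 0, 1]
print(list(groups.values()))  # [['1', '2'], ['3']]
```

Key '3' ends up alone, which is why its coupling_key is None in the output above.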