我必须将摘要/小计表与明细表匹配,同时在明细表中标识不匹配的项目。
由于详细信息记录与小计记录之间的时间差异显着(+/-),所以我使用pandas merge_asof()应用的方法虽然即使基于给定的timedelta值进行匹配,但仍然不够好,因此不会检查是否两个表之间的数量相等。 有没有一种编码方法,可以考虑小计表中的每个值,从明细表中计算小计,如果匹配,则移至小计表中的下一项,然后从明细表中的下一项开始小计。 感谢有人可以在这个问题上提供帮助。
import pandas as pd
import datetime as dt
subtotal = pd.DataFrame(data = {'Date':['21/09/2018 17:45:27','21/09/2018 19:10:24','21/09/2018 21:42:03'],
'Amount':[2000,3000,6000],
'Ref':[1,2,3]},columns=['Date', 'Amount', 'Ref'])
detail = pd.DataFrame(data = {'Date':['21/09/2018 17:37:05','21/09/2018 17:56:22','21/09/2018 17:56:53','21/09/2018 18:54:56','21/09/2018 19:12:56','21/09/2018 19:15:30 ','21/09/2018 21:35:59','21/09/2018 21:36:20','21/09/2018 21:43:32 '],
'Amount':[1000,500,500,1000,3000,12000,1000,2000,3000]},
columns=['Date', 'Amount'])
subtotal['Date'] = pd.to_datetime(subtotal['Date'])
detail['Date'] = pd.to_datetime(detail['Date'])
# Code i tried with pandas .merge_asof()
subtotal_sorted = subtotal.sort_values(by='Date')
detail_sorted = detail.sort_values(by='Date')
subtotal_sorted.index = subtotal_sorted['Date']
detail_sorted.index = detail_sorted['Date']
tol = pd.Timedelta('15 minute')
result = pd.merge_asof(left=detail_sorted,right=subtotal_sorted, right_index=True,left_index=True,direction='nearest',tolerance=tol)
“我期望结果表与此相似。”但是,仅使用熊猫mergeasof()不能匹配小计值。所以我不得不考虑另一种方法。
Ref DateTime Value Result Ref_1 DateTime_1 Value_1
1 09/21/2018 17:37 1,000.00 Index1 1 09/21/2018 17:45 2000
2 09/21/2018 17:56 500.00 Index1 1 09/21/2018 17:45 2000
3 09/21/2018 17:56 500.00 Index1 1 09/21/2018 17:45 2000
4 09/21/2018 18:54 1,000.00 Index2 2 09/21/2018 19:10 3000
5 09/21/2018 19:12 2,000.00 Index2 2 09/21/2018 19:10 3000
6 09/21/2018 19:15 12,000.00 No Match
7 09/21/2018 21:35 1,000.00 Index3 3 09/21/2018 21:42 6000
8 09/21/2018 21:36 2,000.00 Index3 3 09/21/2018 21:42 6000
9 09/21/2018 21:43 3,000.00 Index3 3 09/21/2018 21:42 6000"
答案 0 :(得分:1)
我怀疑这是要在所有内容上实现精确匹配都不容易的问题之一。无论如何,我都为此开枪。
首先让我们定义一个执行合并的函数。这几乎与您已经做的一样,只是对所有匹配的小计加了Amount_detail
的总和,仅保持总和匹配的行。
def merge(subtotal, detail, tol):
subtotal.sort_values(by='Date', inplace=True)
detail.sort_values(by='Date', inplace=True)
# We merge using merge_asof as before
result = pd.merge_asof(left=detail,right=subtotal, on='Date',
direction='nearest',tolerance=tol)
# We total amount_detail over the matching ref
result['sum_amount_detail'] = result.groupby(['Ref'])['Amount_detail'].transform('sum')
# If sum_amount_detail == Amount_subtotal we have a match!!
match = result[result['sum_amount_detail'] == result['Amount_subtotal']]
# Otherwise... no
no_match = result[result['sum_amount_detail'] != result['Amount_subtotal']]
detail_match = match[['Date', 'Amount_detail', 'Ref']].copy()
detail_no_match = no_match[['Date', 'Amount_detail']].copy()
subtotal_match = subtotal[subtotal['Ref'].isin(detail_match['Ref'].unique())].copy()
subtotal_no_match = subtotal[~subtotal['Ref'].isin(detail_match['Ref'].unique())].copy()
return detail_match, subtotal_match, detail_no_match, subtotal_no_match
现在将此功能与您的原始标准配合使用(容忍15分钟)
import pandas as pd
subtotal = pd.DataFrame(data = {'Date':['21/09/2018 17:45:27','21/09/2018 19:10:24','21/09/2018 21:42:03'],
'Amount_subtotal':[2000,3000,6000],
'Ref':[1,2,3]},columns=['Date', 'Amount_subtotal', 'Ref'])
detail = pd.DataFrame(data = {'Date':['21/09/2018 17:37:05','21/09/2018 17:56:22','21/09/2018 17:56:53','21/09/2018 18:54:56','21/09/2018 19:12:56','21/09/2018 19:15:30 ','21/09/2018 21:35:59','21/09/2018 21:36:20','21/09/2018 21:43:32 '],
'Amount_detail':[1000,500,500,1000,3000,12000,1000,2000,3000]},
columns=['Date', 'Amount_detail'])
subtotal['Date'] = pd.to_datetime(subtotal['Date'])
detail['Date'] = pd.to_datetime(detail['Date'])
tol = pd.Timedelta('15 minute')
detail_match, subtotal_match, detail_no_match, subtotal_no_match = merge(subtotal, detail, tol)
这很好,除了有一个明显的缺陷,就是不包括发生在2018-09-21 19:10:24
(3000)上的小计。这是因为它也用另一个值进行了数学运算,因此总超调量小计。
一种解决方法是在循环中进行合并,在此循环中我们不断提高容忍度……这样,我们首先获得最接近的匹配项……然后匹配项就变得更远了。它不漂亮,但是可以用。
tolerances = [pd.Timedelta('5 minute'), pd.Timedelta('10 minute'), pd.Timedelta('15 minute')]
subtotal_no_match = subtotal.copy()
detail_no_match = detail.copy()
detail_list = []
subtotal_list = []
for tol in tolerances:
detail_match, subtotal_match, detail_no_match, subtotal_no_match = merge(subtotal_no_match, detail_no_match, tol)
if len(detail_match) > 0:
detail_list.append(detail_match)
subtotal_list.append(subtotal_match)
if len(subtotal_no_match)==0:
# We have matched everything in subtotal
break
detail_final = pd.concat(detail_list)
subtotal_final = pd.concat(subtotal_list)
detail_final
Out[5]:
Date Amount_detail Ref
4 2018-09-21 19:12:56 3000 2.0
5 2018-09-21 21:35:59 1000 3.0
6 2018-09-21 21:36:20 2000 3.0
7 2018-09-21 21:43:32 3000 3.0
0 2018-09-21 17:37:05 1000 1.0
1 2018-09-21 17:56:22 500 1.0
2 2018-09-21 17:56:53 500 1.0
subtotal_final
Out[6]:
Date Amount_subtotal Ref
1 2018-09-21 19:10:24 3000 2
2 2018-09-21 21:42:03 6000 3
0 2018-09-21 17:45:27 2000 1