I have the following DataFrames:
import pandas as pd

df1 = pd.DataFrame(columns=["DATE", "QTY1", "QTY2", "ID", "CODE"])
df1["DATE"] = ["2000-05-01", "2000-05-03", "2001-01-15", "2001-02-20", "2001-02-22"]
df1["QTY1"] = [10, 11, 12, 5, 4]
df1["QTY2"] = [100, 101, 102, 15, 14]
df1["ID"] = [1, 2, 3, 4, 5]
df1["CODE"] = ["A", "B", "C", "D", "E"]
df2 = pd.DataFrame(columns=["DATE", "QTY1", "QTY2", "ID", "CODE"])
df2["DATE"] = ["2000-05-02", "2000-05-04", "2001-01-12", "2001-03-28", "2001-08-21", "2005-07-01"]
df2["QTY1"] = [9, 101, 11, 5.1, 100, 10]
df2["QTY2"] = [99, 12, 1000, 6, 3, 1]
df2["ID"] = [1, 2, 3, 8, 5, 9]
df2["CODE"] = ["F", "G", "H", "I", "L", "M"]
df1:
         DATE  QTY1  QTY2  ID CODE
0  2000-05-01    10   100   1    A
1  2000-05-03    11   101   2    B
2  2001-01-15    12   102   3    C
3  2001-02-20     5    15   4    D
4  2001-02-22     4    14   5    E
df2:
         DATE   QTY1  QTY2  ID CODE
0  2000-05-02    9.0    99   1    F
1  2000-05-04  101.0    12   2    G
2  2001-01-12   11.0  1000   3    H
3  2001-03-28    5.1     6   8    I
4  2001-08-21  100.0     3   5    L
5  2005-07-01   10.0     1   9    M
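My goal is to build a signature for each row, with a tolerance on each field, and to match the rows of the two DataFrames whose values fall within those tolerances. The signature of each row is structured as follows:
So, for the DataFrames above, the match would return the following rows grouped by signature (the first row of each DataFrame):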
Signature1  2000-05-01  10   100  1  A
            2000-05-02  9.0  99   1  F
All other rows fall outside one or more of the tolerances.
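I am currently using a plain for loop with iterrows() and checking every field, but performance is very poor on large DataFrames. Is there a more pandas-idiomatic way to speed this up? Thanks.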
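
To make the goal concrete, here is a minimal sketch of the kind of vectorized match I have in mind. It is only an illustration under assumptions: the tolerance values (1 day on DATE, 1 unit on QTY1 and QTY2), the exact match on ID, and the names DATE_TOL, QTY1_TOL, QTY2_TOL and match_with_tolerance are all made up for this example and are not the real signature definition. The idea is to pair rows with merge() on ID and then filter the pairs with a boolean mask instead of looping.

```python
import pandas as pd

# Hypothetical tolerances, chosen only for illustration (assumptions).
DATE_TOL = pd.Timedelta(days=1)
QTY1_TOL = 1.0
QTY2_TOL = 1.0

df1 = pd.DataFrame({
    "DATE": ["2000-05-01", "2000-05-03", "2001-01-15", "2001-02-20", "2001-02-22"],
    "QTY1": [10, 11, 12, 5, 4],
    "QTY2": [100, 101, 102, 15, 14],
    "ID": [1, 2, 3, 4, 5],
    "CODE": ["A", "B", "C", "D", "E"],
})
df2 = pd.DataFrame({
    "DATE": ["2000-05-02", "2000-05-04", "2001-01-12", "2001-03-28", "2001-08-21", "2005-07-01"],
    "QTY1": [9, 101, 11, 5.1, 100, 10],
    "QTY2": [99, 12, 1000, 6, 3, 1],
    "ID": [1, 2, 3, 8, 5, 9],
    "CODE": ["F", "G", "H", "I", "L", "M"],
})


def match_with_tolerance(left, right):
    """Pair rows with equal ID whose DATE, QTY1 and QTY2 differ by at most the tolerances."""
    # Parse the date strings so that differences are Timedelta values.
    left = left.assign(DATE=pd.to_datetime(left["DATE"]))
    right = right.assign(DATE=pd.to_datetime(right["DATE"]))

    # Assumption: ID must match exactly, so it can serve as the join key.
    pairs = left.merge(right, on="ID", suffixes=("_1", "_2"))

    # Vectorized tolerance checks on every candidate pair.
    within = (
        ((pairs["DATE_1"] - pairs["DATE_2"]).abs() <= DATE_TOL)
        & ((pairs["QTY1_1"] - pairs["QTY1_2"]).abs() <= QTY1_TOL)
        & ((pairs["QTY2_1"] - pairs["QTY2_2"]).abs() <= QTY2_TOL)
    )
    return pairs[within].reset_index(drop=True)


matches = match_with_tolerance(df1, df2)
print(matches)
```

With these assumed tolerances, only the pair with CODE A and CODE F survives the filter, which is the match I expect. If ID were not required to be equal, the join key would have to change (for example a cross join followed by the same mask), which is much more expensive on large DataFrames.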