Question

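Goal:

I want to merge two DataFrames, df1 and df2, efficiently in Python. df1 has shape (l, 2) and df2 has shape (p, 13), where l < m < p. My target DataFrame df3, with shape (m, 13), should contain all matches within a tolerance, not only the closest match.

I want to merge Col0 of df1 with Col2 of df2, using a tolerance "tolerance".

Example:

df1: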

Index, Col0, Col1
0, 1008.5155, n01

df2:

Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 0, 0, 510.0103, k03, 0, k05, k06, ... 
1, 0, 0, 1007.6176, k13, 0, k15, k16, ...
2, 0, 0, 1008.6248, k123, 0, k25, k26, ...

df3:

Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 1008.5155, 0.8979, 1007.6176, k03, n01, k05, k06, ...
1, 1008.5155, 0.1093, 1008.6248, k13, n01, k15, k16, ...

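For illustration, Col1 of df3 gives the difference between the matched values of df1 and df2. It therefore has to be smaller than the tolerance.

My current solution takes a lot of time and needs a lot of memory.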

# Create empty list to collect matches
df3_list = []
df3_array = np.asarray(df3_list)

# loops to find matches. Fills array with matches
df3_row = np.asarray([0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0])

for n in range(len(df1)):
    for k in range(len(df2)):
        if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance:
            df3_row[0] = df1.iloc[n,0]
            df3_row[1] = abs(df1.iloc[n,0]-df2.iloc[k,2])
            df3_row[2] = df2.iloc[k,2]
            df3_row[3] = df2.iloc[k,3]
            df3_row[4] = df1.iloc[n,1]
            df3_row[5] = df2.iloc[k,5]
                       .
                       .
                       .

            df3_array = np.append(df3_array, df3_row)

# convert list into dataframe
df3 = pd.DataFrame(df3_array.T.reshape(-1,13), columns = header)

I also tried to get both indices at once using

[[n, k]  for n, k in zip(range(len(df1)), range(len(df2))) if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance]

However, this only gives me an empty array, so I am doing something wrong.

For the respective arrays, I also tried using

np.nonzero(np.isclose(df2_array[:, 2], df1_array[:,:,None], atol=tolerance))[-1]

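However, np.isclose + np.nonzero only gave me the indices of df2, and many more of them than my loop-based approach. Without the corresponding df1 indices, I am a bit lost. I think this last approach is the most promising, but I can't seem to merge the datasets, because the values do not match exactly and the closest match is not always the correct solution. Any ideas on how to overcome this?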

Answer 1

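You need to divide this problem into parts:

Find the corresponding close indices
Join the DataFrames on those indices
Do the additional calculations

Finding the indices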

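Using np.isclose, this is a very simple generator function: for each row of df1, it yields a DataFrame with the indices of df1 and df2 whose values are close.

import numpy as np
import pandas as pd

def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    # For each value in df1[df1_col], yield the (idx1, idx2) pairs within tolerance
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        s = pd.DataFrame(data={'idx1': index, 'idx2': indices.values})
        yield s

Then we can easily concatenate these into a helper DataFrame that contains the matching index pairs.

df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)

To test this, I added a second row to df1: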

df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''
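
This test can be reproduced with a short, self-contained sketch. It assumes that df2 is the three-row sample from the question, truncated to Col0 through Col5, and that the text is parsed with pd.read_csv. The names df2_str and read_frame are illustrative and not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question; df2 is truncated to six columns (assumption).
df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''
df2_str = '''Index, Col0, Col1, Col2, Col3, Col4, Col5
0, 0, 0, 510.0103, k03, 0, k05
1, 0, 0, 1007.6176, k13, 0, k15
2, 0, 0, 1008.6248, k123, 0, k25'''


def read_frame(text):
    # Hypothetical helper: skipinitialspace drops the blank after each comma.
    return pd.read_csv(io.StringIO(text), index_col=0, skipinitialspace=True)


def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    # Same generator as in the answer: one small frame of index pairs per df1 row.
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        yield pd.DataFrame(data={'idx1': index, 'idx2': indices.values})


df1 = read_frame(df1_str)
df2 = read_frame(df2_str)
df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)
print(df_idx)
```

One thing to keep in mind: np.isclose also applies a relative tolerance (rtol, 1e-5 by default) in addition to atol, so if the tolerance has to be strictly absolute, passing rtol=0 may be preferable.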

With this test data, df_idx is:

   idx1  idx2
0     0     1
1     0     2
2     1     0

Joining the DataFrames

Using pd.merge:

df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)
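
To see what these three merges produce, here is a small sketch on the sample data. The frames and df_idx are written out by hand (an assumption, to keep the example self-contained), and df2 is truncated to Col0 through Col5.

```python
import pandas as pd

# Hand-written sample frames (assumption: df2 truncated to Col0..Col5).
df1 = pd.DataFrame({'Col0': [1008.5155, 510.0], 'Col1': ['n01', 'n03']})
df2 = pd.DataFrame({
    'Col0': [0, 0, 0],
    'Col1': [0, 0, 0],
    'Col2': [510.0103, 1007.6176, 1008.6248],
    'Col3': ['k03', 'k13', 'k123'],
    'Col4': [0, 0, 0],
    'Col5': ['k05', 'k15', 'k25'],
})
# Index pairs for a tolerance of 1, as in the helper DataFrame above.
df_idx = pd.DataFrame({'idx1': [0, 0, 1], 'idx2': [1, 2, 0]})

# Both helper frames keep df_idx's row labels, so the last merge aligns on the index.
df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)

print(df_merged.columns.tolist())
print(df_merged)
```

Because Col0 and Col1 exist in both frames, pandas adds its default suffixes: the df1 columns come out as Col0_x and Col1_x, and the df2 columns as Col0_y and Col1_y.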

Doing the additional calculations

You need to rename a few columns and assign the difference between them, but that should be trivial.
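
One way this last step might look, as a sketch only. It assumes df_merged carries the default _x/_y suffixes for the overlapping Col0 and Col1 columns. The output names df1_Col0, df1_Col1 and abs_diff are made up for illustration. The frame below is written by hand to stand in for the merge result.

```python
import pandas as pd

# Stand-in for df_merged (assumption: default merge suffixes, only a few df2 columns).
df_merged = pd.DataFrame({
    'Col0_x': [1008.5155, 1008.5155, 510.0],
    'Col1_x': ['n01', 'n01', 'n03'],
    'Col0_y': [0, 0, 0],
    'Col1_y': [0, 0, 0],
    'Col2': [1007.6176, 1008.6248, 510.0103],
    'Col3': ['k13', 'k123', 'k03'],
})

df3 = (
    df_merged
    # Absolute difference between the matched df1 and df2 values.
    .assign(abs_diff=lambda d: (d['Col0_x'] - d['Col2']).abs())
    # Hypothetical, more descriptive names for the columns that came from df1.
    .rename(columns={'Col0_x': 'df1_Col0', 'Col1_x': 'df1_Col1'})
)
print(df3)
```

Every value in abs_diff should stay within the tolerance used to build the index pairs, which makes it a cheap sanity check on the result.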

How to efficiently merge two DataFrames with a tolerance in Python
