Question

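I have two dataframes for a classification problem: df_x (the data: unfinished puzzles, with zeros in the unfilled positions) and df_y (the labels: the finished puzzles).

The dataframes have several hundred thousand rows, so efficiency matters.

The problem is that I cannot guarantee that the i-th index of df_x corresponds to the i-th index of df_y. I want to fix the dataframes so that their indices match.

I have a very inefficient implementation, but I cannot afford to keep it.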

x2y = []
no_label = []
for i in df_x.index:
    a = df_x[i:i+1] #receives one line of df_x at a time.
    a = a.loc[:, (a != 0).any(axis=0)] #excludes the zeros (unfilled parts of the puzzle)
    match = True 
    for j in df_y.index: #loops over all lines of df_y
        for a_i in a:
            if (a[0:1][a_i].item() != df_y[j:j+1][a_i].item()):
                match = False #if one element is not present in the final solution, then it goes to the next line in df_y
                break
        if match:
            x2y.append((i,j)) 
            df_y[i:i+1] = df_y[j:j+1] #replace label at the position of interest
            break
    if not match:
        no_label.append(i) #unsolved puzzles with no label

This is what the dataframes look like:

df_x.head()
Out[58]: 
    0    1    2      3    4      5   ...   75   76     77     78   79     80
0  0.0  0.0  0.0    0.0  0.0  168.0  ...  0.0  0.0  886.0    0.0  0.0  973.0
1  0.0  0.0  0.0    0.0  0.0  168.0  ...  0.0  0.0  886.0  899.0  0.0  973.0
2  0.0  0.0  0.0    0.0  0.0  168.0  ...  0.0  0.0  886.0  899.0  0.0  973.0
3  0.0  0.0  0.0    0.0  0.0  168.0  ...  0.0  0.0  886.0  899.0  0.0  973.0
4  0.0  0.0  0.0  149.0  0.0  168.0  ...  0.0  0.0  886.0  899.0  0.0  973.0

[5 rows x 81 columns]

df_y.head()
Out[59]: 
      0      1      2      3      4   ...     76     77     78     79     80
0  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0
1  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0
2  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0
3  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0
4  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0

[5 rows x 81 columns]

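I'm just starting out with pandas, so please be gentle!

Edit: one of the comments asked for an example of what matching dataframes would look like. So here is a handcrafted example: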

df_x.head()
Out[59]: 
      0      1      2      3      4   ...     76     77     78     79     80
0    0.0  126.0    0.0  149.0    0.0  ...    0.0    0.0    0.0    0.0  997.0
1  111.0    0.0    0.0    0.0  152.0  ...  953.0    0.0    0.0  984.0    0.0
2  112.0    0.0  137.0    0.0    0.0  ...    0.0  961.0    0.0    0.0  997.0
3    0.0  121.0    0.0    0.0    0.0  ...    0.0  962.0  973.0  984.0    0.0
4    0.0    0.0  133.0  144.0  155.0  ...  956.0    0.0  978.0    0.0    0.0

df_y.head()
Out[59]: 
      0      1      2      3      4   ...     76     77     78     79     80
0  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0
1  111.0  123.0  139.0  147.0  152.0  ...  955.0  968.0  973.0  984.0  991.0
2  112.0  126.0  137.0  149.0  154.0  ...  956.0  961.0  973.0  982.0  997.0
3  119.0  121.0  138.0  147.0  156.0  ...  959.0  962.0  973.0  984.0  995.0
4  116.0  127.0  133.0  144.0  155.0  ...  956.0  962.0  978.0  989.0  992.0

Answer 1

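Welcome to pandas! This is a pretty tricky problem, because it looks like you want to do 1e5 * 1e5 comparisons, and that will not be fast no matter what we do, so let's try to limit it as much as possible. First, consider whether it is reasonable to expect matching indices to be close to each other. Second, here is some code that should make your matching easier.

For two Series x_row and y_row: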

> x_row = pd.Series([1, 2, 0, 4])
> y_row = pd.Series([1, 2, 3, 4])
> ((x_row == y_row) | (x_row == 0)).all()
True

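The last line is a bitwise OR (|) between two checks: first, whether each value matches the corresponding value in the other Series (T, T, F, T), or whether the value in x_row is zero (F, F, T, F). The bitwise OR of these two boolean Series is T, T, T, T, so .all() returns True.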
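To see the check fail as well as succeed, here is a minimal sketch that wraps it in a helper. The name row_matches and the sample values are made up for illustration; they are not part of pandas or of your data.

```python
import pandas as pd

def row_matches(x_row: pd.Series, y_row: pd.Series) -> bool:
    """True if every non-zero entry of x_row equals y_row at the same position."""
    return bool(((x_row == y_row) | (x_row == 0)).all())

y_row = pd.Series([1, 2, 3, 4])
good = row_matches(pd.Series([1, 2, 0, 4]), y_row)  # the zero acts as a wildcard
bad = row_matches(pd.Series([1, 5, 0, 4]), y_row)   # 5 != 2 and 5 is not zero
print(good, bad)
```

Note that both Series need the same index labels for == to compare them element by element.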

Here is an example of using that check in context. It tries to limit the number of comparisons by taking a row of df_y out of the running once it has been matched:

x2y = []
unmatched_x = []
unmatched_y = df_y.index.tolist()
for x_idx, x_row in df_x.iterrows():
    match = False
    for y_idx in unmatched_y:
        if ((x_row == df_y.loc[y_idx]) | (x_row == 0)).all():
            match = True
            break
    if match:
        unmatched_y.remove(y_idx)
        x2y.append((x_idx, y_idx))
    else:
        unmatched_x.append(x_idx)

In the ideal case, this will only run as many times as there are rows.
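As a rough, self-contained sketch of that loop on toy data (the three-column frames below are invented, not your puzzles), followed by one possible way to use the resulting pairs to reorder df_y so that it lines up with df_x. The names x_order, y_order and df_y_aligned are my own.

```python
import pandas as pd

# Toy data (made up): df_y holds the solutions in a shuffled order.
df_x = pd.DataFrame([[0, 126, 0], [111, 0, 139], [0, 0, 133]])
df_y = pd.DataFrame([[111, 123, 139], [116, 127, 133], [112, 126, 137]])

x2y = []
unmatched_x = []
unmatched_y = df_y.index.tolist()
for x_idx, x_row in df_x.iterrows():
    for y_idx in unmatched_y:
        if ((x_row == df_y.loc[y_idx]) | (x_row == 0)).all():
            unmatched_y.remove(y_idx)  # safe: we leave the inner loop right away
            x2y.append((x_idx, y_idx))
            break
    else:
        # The inner loop finished without a break, so no label was found.
        unmatched_x.append(x_idx)

# Reorder df_y by the matched pairs and give it the df_x labels.
x_order = [x for x, _ in x2y]
y_order = [y for _, y in x2y]
df_y_aligned = df_y.loc[y_order]
df_y_aligned.index = x_order
print(x2y)
```

This takes the first row of df_y that fits, so if a sparse puzzle is compatible with several solutions the pairing is not necessarily unique.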

If you think most of the rows already match, you can check them all at once by running

matches = ((df_x == df_y) | (df_x == 0)).all(axis=1)

This does the same thing, but on the whole dataframe at once. It returns a Series of booleans indicating whether each row of df_y matches the corresponding row of df_x. You can then sort out the ones that do not.
df_x[matches] is just the rows that match, and df_x[~matches] is the rows that do not.
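A small worked example of the whole-frame check, on invented data where only the first row is already in the right place. It is a sketch under the assumption that df_x and df_y carry identical row and column labels, which the == comparison between two dataframes requires.

```python
import pandas as pd

# Toy data (made up): row 0 is aligned, rows 1 and 2 are swapped in df_y.
df_x = pd.DataFrame([[0, 126, 0], [111, 0, 139], [0, 0, 133]])
df_y = pd.DataFrame([[112, 126, 137], [116, 127, 133], [111, 123, 139]])

# One boolean per row: True where every non-zero value of df_x agrees with df_y.
matches = ((df_x == df_y) | (df_x == 0)).all(axis=1)

matched_rows = df_x[matches]     # rows already paired with the right label
unmatched_rows = df_x[~matches]  # rows that still need the slower search
print(matches.tolist())
```

Only the rows in unmatched_rows then need to go through the row-by-row loop, which can cut the number of comparisons considerably when most rows are already aligned.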

Sorting an incomplete dataframe based on a complete one
