Question

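I have a large DataFrame containing a lot of information from different devices, along with their IDs. I want to filter this DataFrame using the IDs in another DataFrame. I know I can do this easily with a join, but I would like to try it with filter.

I am also trying this because I have read that filter is more efficient than join. Can anyone shed some light on this?

Thanks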

I have tried this:

    val DfFiltered = DF1.filter(col("Id").isin(DF2.rdd.map(r => r(0)).collect()))

but it fails with an error.

Answer 1

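I have assumed that the data in the Id column is of Integer type.

    // Assumes a SparkSession named spark; needed for .as[Int] and $"..."
    import spark.implicits._

    // Collect the IDs from DF2 to the driver as an Array[Int]
    val list = DF2.select("Id").as[Int].collect()

    // Expand the array into varargs for isin
    val DfFiltered = DF1.filter($"Id".isin(list: _*))

    DfFiltered.collect()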

Answer 2

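The explanation in the book High Performance Spark is:

Joining data is an important part of many of our pipelines, and both Spark Core and SQL support the same fundamental types of joins. While joins are very common and powerful, they warrant special performance consideration, as they may require large network transfers or even create datasets beyond our capability to handle. In core Spark it can be more important to think about the ordering of operations, since the DAG optimizer, unlike the SQL optimizer, isn't able to re-order or push down filters.

So choosing filter over join seems like a good option.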
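To make the comparison concrete, here is a minimal sketch that produces the same result both ways and prints each physical plan with `explain()`. The object name `FilterVsSemiJoin`, the helper names, the sample rows, and the local `SparkSession` are assumptions for illustration; only `DF1`, `DF2`, and the `Id` column come from the question. Note that the filter route collects every ID to the driver, so it is best suited to a small `DF2`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object FilterVsSemiJoin {
  // Filter route: collect the IDs to the driver, then pass them to isin as varargs.
  def viaFilter(df1: DataFrame, df2: DataFrame): DataFrame = {
    val ids = df2.select("Id").collect().map(_.get(0)) // Array[Any]
    df1.filter(col("Id").isin(ids: _*))
  }

  // Join route: a left semi join keeps the rows of df1 whose Id appears in df2
  // and returns only df1's columns. broadcast() is a hint that df2 is small.
  def viaSemiJoin(df1: DataFrame, df2: DataFrame): DataFrame =
    df1.join(broadcast(df2.select("Id")), Seq("Id"), "left_semi")

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-vs-semi-join")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for DF1 and DF2
    val DF1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("Id", "value")
    val DF2 = Seq(1, 3).toDF("Id")

    // Print the physical plans so the two approaches can be compared
    viaFilter(DF1, DF2).explain()
    viaSemiJoin(DF1, DF2).explain()

    spark.stop()
  }
}
```

Which route is faster depends on the sizes involved, so it is worth inspecting the plans on your own data.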

Answer 3

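You just need to add `: _*` to your code and it will work fine.

    val DfFiltered = DF1.filter(col("Id").isin(DF2.rdd.map(r => r(0)).collect(): _*))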
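As a rough illustration of why `: _*` matters: `Column.isin` takes a variable number of arguments (`Any*`), and Scala does not unpack an array automatically. The sketch below uses a hypothetical stand-in method, `countArgs`, in plain Scala (no Spark needed) to show the difference; the object and method names are made up for this example.

```scala
object VarargsSketch {
  // Hypothetical stand-in for a varargs method such as Column.isin(list: Any*)
  def countArgs(values: Any*): Int = values.length

  def main(args: Array[String]): Unit = {
    val ids: Array[Any] = Array(1, 2, 3)

    // Without ": _*" the whole array is passed as a single argument.
    println(countArgs(ids)) // prints 1

    // With ": _*" each element of the array becomes its own argument.
    println(countArgs(ids: _*)) // prints 3
  }
}
```

In the first call `isin` would receive one value, the array itself, instead of the individual IDs.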

How to filter a DataFrame using information from another DataFrame with the filter command
