Filter a dataframe based on another dataframe in Scala

Date: 2018-01-16 22:35:43

Tags: sql scala apache-spark dataframe apache-spark-sql

Currently I am doing this:


val DF = sqlSession.sql("select itemIdDig as itemId, " +
  "title, " +
  "timestamp as time " +
  "from itemTable ")

val tempDF = sqlSession.sql("select itemIdDig as itemId " +
  "from itemTable " +
  "group by itemIdDig HAVING count(*) >= 10 ").rdd.map(r => r(0)).collect()

// keep rows whose itemId is not in tempDF
DF.filter(!col("itemId").isin(tempDF: _*)).toDF

But this is slow. Can anyone suggest a better way to achieve this? Basically I am selecting the rows whose itemId is not in tempDF (I tried using a group by, but that gives me only the unique itemIds, and I want to keep the duplicate rows).
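For reference, a minimal sketch of the setup the snippets above assume: the table and column names come from the question, while the SparkSession configuration and the sample rows are invented for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical local session; the question's sqlSession is assumed
// to be a SparkSession.
val sqlSession = SparkSession.builder()
  .appName("filter-by-other-dataframe")
  .master("local[*]")
  .getOrCreate()

import sqlSession.implicits._

// Invented sample data with the columns the question selects from itemTable.
Seq(
  (1L, "first title",  1516000000L),
  (1L, "second title", 1516000001L),  // itemId 1 occurs twice
  (2L, "third title",  1516000002L)
).toDF("itemIdDig", "title", "timestamp")
  .createOrReplaceTempView("itemTable")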

1 answer:

Answer 0 (score: 2)

Use a left anti join:

// tempDF must stay a DataFrame here, not the collected Array from the question
DF.join(tempDF, Seq("itemId"), "leftanti")
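Spelled out end to end, a sketch of this approach, assuming tempDF is kept as a DataFrame rather than collected to the driver (DF and sqlSession reused from the question):

// Same aggregation as in the question, but left as a DataFrame.
val tempDF = sqlSession.sql("select itemIdDig as itemId " +
  "from itemTable " +
  "group by itemIdDig HAVING count(*) >= 10 ")

// Keeps every row of DF (duplicates included) whose itemId has no match
// in tempDF. The join executes distributed across the cluster, with no
// driver-side collect() and no huge isin(...) predicate to evaluate per row.
val result = DF.join(tempDF, Seq("itemId"), "leftanti")

Spark also accepts "left_anti" as the join type string; either way this should be much faster than collecting the itemIds and filtering with isin.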