Filter a dataframe based on conditions on two columns

Time: 2017-08-29 08:18:22

Tags: scala apache-spark apache-spark-sql

I want to filter a dataframe on two specific columns: I need to check whether the rows of my dataframe contain a "range_id" from the dataframe range_id_Test or a "family_id" from the dataframe familyid_Test.

val range_id_Test = newArticlesGold.select("range_id").except(article_ranges.select("id").distinct())

val familyid_Test = newArticlesGold.select("family_id").except(article_family.select("id").distinct())

val addedData = newArticlesGold.filter($"range_id" === range_id_Test("range_id") || $"family_id" === familyid_Test("family_id"))

Here is some sample data.

Range_Test

+--------+
|range_id|
+--------+
|      -1|
+--------+

Family_test

+---------+
|family_id|
+---------+
|       -1|
+---------+

And newArticlesGold:

+-----------+-------------+--------------------------------------------------+--------+---+------+------+--------+---+--------+---------+
|CODEARTICLE|STRUCTURE    |DES                                               |TYPEMARK|TYP|IMPLOC|MARQUE|GAMME   |TAR|range_id|family_id|
+-----------+-------------+--------------------------------------------------+--------+---+------+------+--------+---+--------+---------+
|662180137  |1173201099902|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|2       |9  |Local |      |        |   |1173    |1173201  |
|662180717  |1173201099902|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|2       |9  |Local |      |        |   |1173    |1173201  |
|435160050  |1443609010306|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|7       |7  |Local |      |60900010|   |1443    |1443609  |
|435160060  |1443609010306|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|7       |7  |Local |      |60900010|   |1443    |1443609  |
|553260040  |1428659020203|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|7       |7  |Local |      |        |   |-1      |-1       |
+-----------+-------------+--------------------------------------------------+--------+---+------+------+--------+---+--------+---------+

I want to get rid of the last row. Any help would be greatly appreciated.

2 answers:

Answer 0 (score: 0)

You are trying to do a kind of "nested loop", which is not allowed in Spark. Instead, you need to use a join, like this:

import spark.implicits._

// range_id values in newArticlesGold that have no match in article_ranges
val range_id_Test = newArticlesGold
  .select('range_id as "r_id")
  .except(article_ranges.select("id").distinct())

// family_id values in newArticlesGold that have no match in article_family
val familyid_Test = newArticlesGold
  .select('family_id as "f_id")
  .except(article_family.select("id").distinct())

// rows whose range_id AND family_id both appear in the exclusion sets
val excludedData = newArticlesGold
  .join(range_id_Test, 'range_id === 'r_id)
  .join(familyid_Test, 'family_id === 'f_id)
  .drop("r_id", "f_id")

val result = newArticlesGold.except(excludedData)
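
As a side note, if the goal is the || semantics from the question, i.e. dropping every row whose range_id or family_id appears in either exclusion set, Spark's "left_anti" join type avoids the intermediate except. A minimal sketch, reusing the range_id_Test and familyid_Test dataframes with the r_id/f_id columns defined above:

// a left_anti join keeps only the left-side rows with NO match on the right,
// so chaining two of them removes rows matching either exclusion set
val resultOr = newArticlesGold
  .join(range_id_Test, 'range_id === 'r_id, "left_anti")
  .join(familyid_Test, 'family_id === 'f_id, "left_anti")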

Answer 1 (score: 0)

From what I understand of your question, an inner join of range_id_Test and familyid_Test with newArticlesGold should get you the output you want, as shown below (note that I have renamed the columns of the range_id_Test and familyid_Test tables):

import org.apache.spark.sql.functions.col
import spark.implicits._

val range_id_Test = newArticlesGold.select($"range_id".as("range_id_1")).except(article_ranges.select("id").distinct())

val familyid_Test = newArticlesGold.select($"family_id".as("family_id_1")).except(article_family.select("id").distinct())

val addedData = newArticlesGold.join(range_id_Test, $"range_id" =!= range_id_Test("range_id_1"))
  .join(familyid_Test, $"family_id" =!= familyid_Test("family_id_1"))
  .select(newArticlesGold.columns.map(col): _*)
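
One caveat worth noting: the =!= join conditions only act as an exclusion filter here because each test dataframe holds a single value (-1). If they contained several values, every article row would satisfy =!= against at least one of them, so the joins would keep unwanted rows and duplicate matching ones; in that case an anti-join like the one sketched under answer 0 is the safer choice.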

I hope this is what you are looking for.