如果不相等则更新数据帧

时间:2017-09-10 15:37:30

标签: scala apache-spark apache-spark-sql

我有两个数据帧,其中我想知道它们是否根据列作为键是不同的,否则将更新不同的数据帧以使它们变得相等

val TMP_SITE = spark.load("jdbc", Map("url" -> "jdbc:oracle:thin:System/maher@//localhost:1521/XE", "dbtable" -> "IPTECH.TMP_SITE"))
      .withColumn("SITE",'SITE.cast(LongType))

val local_pos = spark.load("jdbc", Map("url" -> url, "dbtable" -> "pos")).select("id","name")

TMP_SITE.printSchema()
local_pos.printSchema()

val join = TMP_SITE.join(local_pos, 'SITE === 'id, "inner")
root
 |-- SITE: long (nullable = true)
 |-- LIBELLE: string (nullable = false)

root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)

加入的结果是

  |id |name                  |SITE|LIBELLE               |
+---+----------------------+----+----------------------+
|51 |Ezzahra               |51  |Ezzahra               |
|7  |BENIKHALLED           |7   |BENIKHALLED           |
|15 |Kram                  |15  |Kram                  |
|54 |El Mourouj            |54  |El Mourouj            |
|11 |LE BARDO              |11  |LE BARDO              |
|29 |Mini M Ksar said      |29  |Mini M Ksar said      |
|69 |ZAGHOUAN              |69  |ZAGHOUAN              |
|42 |BEB EL KHADHRA        |42  |BEB EL KHADHRA        |
|73 |Zaouit Kontech        |73  |Zaouit Kontech        |
|87 |Aouina                |87  |Aouina                |
|64 |Sousse I I            |64  |Sousse I I            |
|3  |SAHRA CONFORT : KORBA |3   |SAHRA CONFORT : KORBA |
|34 |SOUKRA SQUARE         |34  |SOUKRA SQUARE         |
|59 |SAHRA CONFORT : ZARZIS|59  |SAHRA CONFORT : ZARZIS|
|8  |Jerba                 |8   |Jerba                 |
|22 |Moknine               |22  |Moknine               |
|28 |RDAYEF                |28  |RDAYEF                |
|85 |MONASTIR ABSORBA      |85  |MONASTIR ABSORBA      |
|16 |BARDO HANAYA          |16  |BARDO HANAYA          |
|35 |Mini M Agba           |35  |Mini M Agba           |
+---+----------------------+----+----------------------+

我做了这个

val temp = join.withColumn("changes", when($"LIBELLE" === $"name", lit("nothing")).otherwise("need an update"))

我得到了这个

|id |name                  |SITE|LIBELLE               |changes       |
+---+----------------------+----+----------------------+--------------+
|51 |Ezzahra               |51  |Ezzahra               |nothing       |
|7  |BENIKHALLED           |7   |BENIKHALLED           |nothing       |
|15 |Kram                  |15  |Kram                  |nothing       |
|54 |El Mourouj            |54  |El Mourouj            |nothing       |
|11 |LE BARDO              |11  |LE BARDO              |nothing       |
|29 |Mini M Ksar said      |29  |Mini M Ksar said      |nothing       |
|69 |ZAGHOUAN              |69  |ZAGHOUAN              |nothing       |
|42 |BEB EL KHADHRA        |42  |BEB EL KHADHRA        |nothing       |
|73 |Zaouit Kontech        |73  |Zaouit Kontech        |need an update|
|87 |Aouina                |87  |Aouina                |nothing       |
|64 |Sousse I I            |64  |Sousse I I            |nothing       |
|3  |SAHRA CONFORT : KORBA |3   |SAHRA CONFORT : KORBA |nothing       |
|34 |SOUKRA SQUARE         |34  |SOUKRA SQUARE         |nothing       |
|59 |SAHRA CONFORT : ZARZIS|59  |SAHRA CONFORT : ZARZIS|nothing       |
|8  |Jerba                 |8   |Jerba                 |nothing       |
|22 |Moknine               |22  |Moknine               |need an update|
|28 |RDAYEF                |28  |RDAYEF                |nothing       |
|85 |MONASTIR ABSORBA      |85  |MONASTIR ABSORBA      |nothing       |
|16 |BARDO HANAYA          |16  |BARDO HANAYA          |nothing       |
|35 |Mini M Agba           |35  |Mini M Agba           |nothing       |
+---+----------------------+----+----------------------+--------------+

我不知道为什么它们需要更新,因为它们是相同的。 虽然它应该对所有人说什么,因为它们是平等的

1 个答案:

答案 0 :(得分:1)

获得dataframe后,使用columnsrows即可轻松玩。

所以您在dataframe

后跟随join
+----+---------------------+----+---------------------+
|SITE|LIBELLE              |id  |name                 |
+----+---------------------+----+---------------------+
|48  |Mini M Boumhel       |48  |Mini M Boumhel       |
|67  |Lac                  |67  |Lac                  |
|992 |test2                |992 |test                 |
|44  |KAIROUAN             |44  |KAIROUAN             |
|61  |Tunis                |61  |Tunis                |
|9001|MONOPRIX             |9001|MONOPRIX             |
|3   |SAHRA CONFORT : KORBA|3   |SAHRA CONFORT : KORBA|
|37  |Mini M Borj Lozir    |37  |Mini M Borj Lozir    |
|83  |Jendouba             |83  |Jendouba             |
|12  |Bigro                |12  |Bigro                |
+----+---------------------+----+---------------------+

您可以使用您编写的逻辑创建另一列,但使用when函数作为

import org.apache.spark.sql.functions._
val temp = join.withColumn("changes", when($"LIBELLE" === $"name", lit("nothing")).otherwise("need an update"))

temp dataframe将是

+----+---------------------+----+---------------------+--------------+
|SITE|LIBELLE              |id  |name                 |changes       |
+----+---------------------+----+---------------------+--------------+
|48  |Mini M Boumhel       |48  |Mini M Boumhel       |nothing       |
|67  |Lac                  |67  |Lac                  |nothing       |
|992 |test2                |992 |test                 |need an update|
|44  |KAIROUAN             |44  |KAIROUAN             |nothing       |
|61  |Tunis                |61  |Tunis                |nothing       |
|9001|MONOPRIX             |9001|MONOPRIX             |nothing       |
|3   |SAHRA CONFORT : KORBA|3   |SAHRA CONFORT : KORBA|nothing       |
|37  |Mini M Borj Lozir    |37  |Mini M Borj Lozir    |nothing       |
|83  |Jendouba             |83  |Jendouba             |nothing       |
|12  |Bigro                |12  |Bigro                |nothing       |
+----+---------------------+----+---------------------+--------------+

现在您可以在filter上使用dataframe方法作为

temp.filter($"changes" === "need an update").show(false)

应该给你

+----+-------+---+----+--------------+
|SITE|LIBELLE|id |name|changes       |
+----+-------+---+----+--------------+
|992 |test2  |992|test|need an update|
+----+-------+---+----+--------------+

您只需使用selectgroupByaggregationsfilters和其他内置函数或使用udf函数等来使用列。您甚至可以像在示例中一样转换为rddtuples