Get ids based on joining two DataFrames on two columns

Time: 2017-09-04 11:06:21

Tags: apache-spark dataframe

We have two DataFrames:

val pos_articles_Gold = spark.load("jdbc", Map("url" -> "jdbc:oracle:thin:System/maher@//localhost:1521/XE", "dbtable" -> "IPTECH.TMP_PRIXVENTEPERM")).select("SITE", "REFART", "PRIXV", "CTVA").limit(20)
val pos_articles = spark.load("jdbc", Map("url" -> url, "dbtable" -> "pos_articles")).select("id","article_id","pos_id")

pos_articles_Gold.printSchema()
pos_articles.printSchema()

root
 |-- SITE: decimal(5,0) (nullable = false)
 |-- REFART: string (nullable = false)
 |-- PRIXV: decimal(13,3) (nullable = false)
 |-- CTVA: decimal(5,2) (nullable = false)

root
 |-- id: integer (nullable = false)
 |-- article_id: long (nullable = true)
 |-- pos_id: long (nullable = false)

pos_articles

id,article_id,pos_id
17,434004740,96
18,395090520,12
19,395090520,1
20,395090520,10
21,395090520,7
24,20100160,2

pos_articles_Gold

SITE,REFART,PRIXV,CTVA
96,434004740,1.250,18.00
12,395090520,999.000,18.00
1,395090520,999.000,18.00
10,395090520,999.000,18.00
7,395090520,999.000,18.00

The result should be:

id,article_id,pos_id
24,20100160,2

What I want to do is:

select id from pos_articles where article_id != REFART and pos_id != SITE

Here is what I have tried so far, to avoid doing one select and then another:

val exluded_Id = pos_articles.join(pos_articles_Gold, $"article_id" === $"REFART" && $"pos_id" === $"SITE","left")
val deletedrows=pos_articles.select("id").except(exluded_Id)

I think I need to join the pos_articles_Gold DataFrame with the pos_articles DataFrame. Any help would be appreciated.

2 Answers:

Answer 0: (score: 0)

One option is to create temporary views on the DataFrames and run a SQL query to get the desired result, as follows:

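import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().setAppName("Test")
sparkConf.set("spark.sql.crossJoin.enabled", "true")

val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
// Import the implicits only after the session exists.
import sparkSession.implicits._

// Register both DataFrames as temporary views so they can be queried with SQL.
pos_articles_Gold.createOrReplaceTempView("pos_articles_Gold")
pos_articles.createOrReplaceTempView("pos_articles")

// Keep the rows whose id has no (article_id, pos_id) match in pos_articles_Gold.
val dataFrame = sparkSession.sql("SELECT * FROM pos_articles WHERE id NOT IN (SELECT id FROM pos_articles, pos_articles_Gold WHERE article_id = REFART AND pos_id = SITE)")
dataFrame.show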

Output:

+---+----------+------+
| id|article_id|pos_id|
+---+----------+------+
| 24|  20100160|     2|
+---+----------+------+
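
The same filter can also be written as a correlated NOT EXISTS subquery instead of NOT IN over a comma join. The following is only a sketch, not part of the original answer: the local session settings and the inline sample rows (copied from the question) are assumptions, SITE is simplified to an integer, only the two join columns of pos_articles_Gold are kept, and the explicit CAST on REFART is added because that column is a string in the question's schema.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder().master("local[1]").appName("not-exists-sketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the question.
Seq(
  (17, 434004740L, 96L),
  (18, 395090520L, 12L),
  (19, 395090520L, 1L),
  (20, 395090520L, 10L),
  (21, 395090520L, 7L),
  (24, 20100160L, 2L)
).toDF("id", "article_id", "pos_id").createOrReplaceTempView("pos_articles")

// Simplified: only the join columns, with SITE as an integer instead of decimal(5,0).
Seq(
  (96, "434004740"),
  (12, "395090520"),
  (1, "395090520"),
  (10, "395090520"),
  (7, "395090520")
).toDF("SITE", "REFART").createOrReplaceTempView("pos_articles_Gold")

// Keep the rows of pos_articles that have no matching (REFART, SITE) pair.
val remaining = spark.sql(
  """SELECT p.id, p.article_id, p.pos_id
    |FROM pos_articles p
    |WHERE NOT EXISTS (
    |  SELECT 1 FROM pos_articles_Gold g
    |  WHERE p.article_id = CAST(g.REFART AS BIGINT) AND p.pos_id = g.SITE
    |)""".stripMargin)

remaining.show()
val remainingRows = remaining.as[(Int, Long, Long)].collect().toSeq
```

With the sample rows above, only the row with id 24 should remain, which matches the result the question asks for.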

Answer 1: (score: 0)

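Your approach is workable. Use an inner join instead of a left join, and select the same columns on both sides so that except can compare them: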

val exluded_Id = pos_articles.join(pos_articles_Gold, pos_articles("article_id") === pos_articles_Gold("REFART") && pos_articles("pos_id") === pos_articles_Gold("SITE"))
  .select("id", "article_id", "pos_id")
pos_articles.except(exluded_Id).show(false)

Another way is:

pos_articles.except(
  pos_articles.join(pos_articles_Gold, pos_articles("article_id") === pos_articles_Gold("REFART") && pos_articles("pos_id") === pos_articles_Gold("SITE"))
  .select("id", "article_id", "pos_id")
)

You should get the result you want.
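
A related option, not mentioned in the answers, is the "left_anti" join type, which keeps only the left-side rows that have no match on the right and so combines the join and the except into one step. A minimal sketch under stated assumptions: the local session, the camelCase names posArticles and posArticlesGold, and the inline sample rows (copied from the question) are illustrative; SITE is simplified to an integer, only the join columns of the Gold table are included, and the explicit cast on REFART is an added precaution because that column is a string.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder().master("local[1]").appName("anti-join-sketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the question.
val posArticles = Seq(
  (17, 434004740L, 96L),
  (18, 395090520L, 12L),
  (19, 395090520L, 1L),
  (20, 395090520L, 10L),
  (21, 395090520L, 7L),
  (24, 20100160L, 2L)
).toDF("id", "article_id", "pos_id")

// Simplified: only the join columns, with SITE as an integer instead of decimal(5,0).
val posArticlesGold = Seq(
  (96, "434004740"),
  (12, "395090520"),
  (1, "395090520"),
  (10, "395090520"),
  (7, "395090520")
).toDF("SITE", "REFART")

// left_anti returns the rows of posArticles with no match in posArticlesGold,
// and only the left-side columns, so no select or except is needed afterwards.
val missing = posArticles.join(
  posArticlesGold,
  posArticles("article_id") === posArticlesGold("REFART").cast("long") &&
    posArticles("pos_id") === posArticlesGold("SITE"),
  "left_anti"
)

missing.show()
val result = missing.as[(Int, Long, Long)].collect().toSeq
```

On the sample data this should leave only the row with id 24. Unlike except, which behaves like a set difference and drops duplicate rows, an anti join keeps duplicate left-side rows as they are; which behavior is preferable depends on the data.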