We have two DataFrames:
val pos_articles_Gold = spark.load("jdbc", Map("url" -> "jdbc:oracle:thin:System/maher@//localhost:1521/XE", "dbtable" -> "IPTECH.TMP_PRIXVENTEPERM")).select("SITE", "REFART", "PRIXV", "CTVA").limit(20)
val pos_articles = spark.load("jdbc", Map("url" -> url, "dbtable" -> "pos_articles")).select("id","article_id","pos_id")
pos_articles_Gold.printSchema()
pos_articles.printSchema()
root
|-- SITE: decimal(5,0) (nullable = false)
|-- REFART: string (nullable = false)
|-- PRIXV: decimal(13,3) (nullable = false)
|-- CTVA: decimal(5,2) (nullable = false)
root
|-- id: integer (nullable = false)
|-- article_id: long (nullable = true)
|-- pos_id: long (nullable = false)
pos_articles
id,article_id,pos_id
17,434004740,96
18,395090520,12
19,395090520,1
20,395090520,10
21,395090520,7
24,20100160,2
pos_articles_Gold
SITE,REFART,PRIXV,CTVA
96,434004740,1.250,18.00
12,395090520,999.000,18.00
1,395090520,999.000,18.00
10,395090520,999.000,18.00
7,395090520,999.000,18.00
The result should be:
id,article_id,pos_id
24,20100160,2
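What I want to do is:
Select id from pos_articles where article_id != REFART and pos_id != SITE. Here is what I have tried so far, hoping to avoid doing a select and then another operation on top of it: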
val exluded_Id = pos_articles.join(pos_articles_Gold, $"article_id" === $"REFART" && $"pos_id" === $"SITE","left")
val deletedrows=pos_articles.select("id").except(exluded_Id)
I think I need to join the pos_articles_Gold DataFrame with the pos_articles DataFrame. Any help would be appreciated.
Answer 0 (score: 0)
One option is to create temporary views on the DataFrames and run a SQL query to get the desired result, as shown below:
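import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf().setAppName("Test")
sparkConf.set("spark.sql.crossJoin.enabled", "true")
val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
// The implicits can only be imported once sparkSession exists.
import sparkSession.sqlContext.implicits._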
pos_articles_Gold.createOrReplaceTempView("pos_articles_Gold")
pos_articles.createOrReplaceTempView("pos_articles")
val dataFrame = sparkSession.sql("SELECT * FROM pos_articles WHERE id NOT IN (SELECT id FROM pos_articles, pos_articles_Gold WHERE article_id =REFART AND pos_id=SITE)")
dataFrame.show
Output:
+---+----------+------+
| id|article_id|pos_id|
+---+----------+------+
| 24| 20100160| 2|
+---+----------+------+
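The NOT IN subquery above can also be written as a left anti join, which keeps only the rows of the left DataFrame that have no match on the right. The following is a minimal sketch rather than a drop-in replacement. It assumes a local SparkSession and in-memory copies of the sample rows from the question instead of the JDBC tables, reduces pos_articles_Gold to the two columns used in the condition, and casts REFART to long explicitly instead of relying on implicit casting.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; the question loads both tables over JDBC.
val spark = SparkSession.builder().master("local[*]").appName("AntiJoinSketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the question.
val pos_articles = Seq(
  (17, 434004740L, 96L),
  (18, 395090520L, 12L),
  (19, 395090520L, 1L),
  (20, 395090520L, 10L),
  (21, 395090520L, 7L),
  (24, 20100160L, 2L)
).toDF("id", "article_id", "pos_id")

// Assumption: only SITE and REFART are kept, because PRIXV and CTVA play no part in the match.
val pos_articles_Gold = Seq(
  (96, "434004740"),
  (12, "395090520"),
  (1, "395090520"),
  (10, "395090520"),
  (7, "395090520")
).toDF("SITE", "REFART")

// "left_anti" returns the pos_articles rows for which no pos_articles_Gold row
// satisfies the join condition; only the left-hand columns are returned.
val kept = pos_articles.join(
  pos_articles_Gold,
  pos_articles("article_id") === pos_articles_Gold("REFART").cast("long") &&
    pos_articles("pos_id") === pos_articles_Gold("SITE"),
  "left_anti"
)

kept.show()
```

With the sample data this should leave only the row with id 24, matching the output above. Unlike NOT IN, the anti join needs no temporary views and no subquery.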
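Answer 1 (score: 0)
Your approach works if you use an inner join instead of a left join, and select the pos_articles columns from the join result so that except compares rows with the same schema: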
val exluded_Id = pos_articles.join(pos_articles_Gold, pos_articles("article_id") === pos_articles_Gold("REFART") && pos_articles("pos_id") === pos_articles_Gold("SITE"))
.select("id", "article_id", "pos_id")
pos_articles.except(exluded_Id).show(false)
Another way is to write the same thing as a single expression:
pos_articles.except(
pos_articles.join(pos_articles_Gold, pos_articles("article_id") === pos_articles_Gold("REFART") && pos_articles("pos_id") === pos_articles_Gold("SITE"))
.select("id", "article_id", "pos_id")
)
You should get the result you want.
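To see why the join type matters, the sketch below runs both variants on the sample rows from the question. It is an illustration under assumptions: a local SparkSession, in-memory data instead of the JDBC tables, pos_articles_Gold reduced to SITE and REFART, and an explicit cast of REFART to long. The names matchCond, matched, leftJoined, remaining and remainingWithLeft are introduced here for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; the question loads both tables over JDBC.
val spark = SparkSession.builder().master("local[*]").appName("ExceptSketch").getOrCreate()
import spark.implicits._

// Sample rows copied from the question.
val pos_articles = Seq(
  (17, 434004740L, 96L),
  (18, 395090520L, 12L),
  (19, 395090520L, 1L),
  (20, 395090520L, 10L),
  (21, 395090520L, 7L),
  (24, 20100160L, 2L)
).toDF("id", "article_id", "pos_id")

// Assumption: only the two columns used in the join condition are kept.
val pos_articles_Gold = Seq(
  (96, "434004740"),
  (12, "395090520"),
  (1, "395090520"),
  (10, "395090520"),
  (7, "395090520")
).toDF("SITE", "REFART")

val matchCond =
  pos_articles("article_id") === pos_articles_Gold("REFART").cast("long") &&
    pos_articles("pos_id") === pos_articles_Gold("SITE")

// Inner join: only the pos_articles rows that have a match in pos_articles_Gold.
val matched = pos_articles.join(pos_articles_Gold, matchCond)
  .select("id", "article_id", "pos_id")

// Left join: every pos_articles row survives, matched or not.
val leftJoined = pos_articles.join(pos_articles_Gold, matchCond, "left")
  .select("id", "article_id", "pos_id")

// except is a set difference, so both sides must have the same columns.
val remaining = pos_articles.except(matched)            // the unmatched rows
val remainingWithLeft = pos_articles.except(leftJoined) // nothing is left over

remaining.show(false)
println(remainingWithLeft.count())
```

Because the left join returns all six pos_articles rows, subtracting it removes everything, while the inner join removes only the five matched rows. Note that except also drops duplicate rows, which does not matter here as long as id is unique.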