I have the following Spark SQL test query:
Seq("france").toDF.createOrReplaceTempView("countries")
SELECT CASE WHEN country = 'italy' THEN 'Italy'
ELSE ( CASE WHEN country IN (FROM countries) THEN upperCase(country) ELSE country END )
END AS country FROM users
It throws the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
IN/EXISTS predicate sub-queries can only be used in a Filter
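The cause is this part of the query:
CASE WHEN country IN (FROM countries)
Is there any workaround in Spark SQL to emulate country IN (FROM countries) inside the SELECT clause? I am interested in a pure SQL implementation, not one that goes through the API.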
Answer 0: (score: 2)
Here is a SQL query that works:
import sparkSession.implicits._
Seq("france").toDF("country").createOrReplaceTempView("countries")
Seq(("user1", "france"), ("user2", "italy"), ("user2", "usa"))
.toDF("user", "country").createOrReplaceTempView("users")
val query =
s"""
|SELECT
| CASE
| WHEN u.country = 'italy' THEN 'Italy'
| ELSE (
| CASE
| WHEN u.country = c.country THEN upper(u.country)
| ELSE u.country
| END
| ) END AS country
|FROM users u
|LEFT JOIN countries c
| ON u.country = c.country
""".stripMargin
sparkSession.sql(query).show()
Result:
+-------+
|country|
+-------+
| FRANCE|
|  Italy|
|    usa|
+-------+
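The reason IN/EXISTS can only be used in a predicate is this: logic in the projection (CASE WHEN in our example) is evaluated for every row the selection returns.
With that in mind, it is better not to run the equivalent of CASE WHEN country IN (SELECT * FROM countries) for every row of the users table. Spark therefore rejects the query up front, during analysis (hence the AnalysisException), rather than executing it.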
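One caveat with the LEFT JOIN above: if countries contains the same value more than once, each matching users row is returned once per duplicate. A minimal sketch of a variant that guards against this joins against a de-duplicated subquery and tests c.country IS NOT NULL in place of the nested CASE. The object name InSubqueryWorkaround, the userId column, the user3 row, and the duplicated "france" entry are illustrative assumptions, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch self-contained.
object InSubqueryWorkaround {

  // Pure SQL: LEFT JOIN against the de-duplicated lookup table, then treat
  // "a match was found" (c.country IS NOT NULL) as the IN test.
  val query: String =
    """
      |SELECT
      |  u.userId,
      |  CASE
      |    WHEN u.country = 'italy' THEN 'Italy'
      |    WHEN c.country IS NOT NULL THEN upper(u.country)
      |    ELSE u.country
      |  END AS country
      |FROM users u
      |LEFT JOIN (SELECT DISTINCT country FROM countries) c
      |  ON u.country = c.country
    """.stripMargin

  def run(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._
    // "france" appears twice on purpose: without DISTINCT, user1 would be returned twice.
    Seq("france", "france").toDF("country").createOrReplaceTempView("countries")
    Seq(("user1", "france"), ("user2", "italy"), ("user3", "usa"))
      .toDF("userId", "country").createOrReplaceTempView("users")
    // Sort on the driver so the result order does not depend on the join plan.
    spark.sql(query).as[(String, String)].collect().toSeq.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("in-subquery-workaround")
      .getOrCreate()
    try run(spark).foreach(println)
    finally spark.stop()
  }
}
```

The single CASE with three branches is equivalent to the nested CASE in the answer, since the 'italy' branch is still checked first.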
Answer 1: (score: 0)
You can also use withColumn() together with the when() function (from org.apache.spark.sql.functions):
// Assumes a SparkSession named sparkSession is in scope, as in the previous answer.
import org.apache.spark.sql.functions.{col, upper, when}
import sparkSession.implicits._
val users = Seq(("1", "france"), ("2", "Italy"), ("3", "italy")).toDF("userId", "country")
val countriesList = Seq("france", "italy", "germany")
val result = users.withColumn("country",
  when(col("country") === "italy", "Italy")
    .when(col("country").isin(countriesList: _*), upper(col("country")))
    .otherwise(col("country")))
result.show()
Result:
+------+-------+
|userId|country|
+------+-------+
|     1| FRANCE|
|     2|  Italy|
|     3|  Italy|
+------+-------+