Hi, I am trying to filter a dataframe based on a condition, apply a schema to the rows that match, and otherwise leave them as they are.
val schema = ArrayType(StructType(StructField("packQty",FloatType,true):: StructField("gtin",StringType,true) :: Nil))
+--------------+---------+-----------+------------+-------------------------------------------+
|orderupcnumber|enrichqty|allocoutqty|allocatedqty|gtins                                      |
+--------------+---------+-----------+------------+-------------------------------------------+
|5203754       |15.0     |1.0        |5.0         |[{"packQty":120.0,"gtin":"00052000042276"}]|
|5203754       |15.0     |1.0        |2.0         |[{"packQty":120.0,"gtin":"00052000042276"}]|
|5243700       |25.0     |1.0        |2.0         |na                                         |
+--------------+---------+-----------+------------+-------------------------------------------+
I am trying to add a column parsed with the schema when the gtins column is not "na", and 0 when it is, but it throws an error:
df.withColumn("jsonData",when($"gtins"=!="na",from_json($"gtins",schema)).otherwise(0))
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'CASE
WHEN contains(`gtins`, 'na') THEN 0 ELSE jsontostructs(`gtins`) END' due to data type
mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
df.select($"orderupcnumber",$"enrichqty",$"allocoutqty",$"allocatedqty",explode($"jsonData").as("jsonData"))
+--------------+---------+-----------+------------+-------------------------------------------+-----------------------+
|orderupcnumber|enrichqty|allocoutqty|allocatedqty|gtins                                      |jsonData               |
+--------------+---------+-----------+------------+-------------------------------------------+-----------------------+
|5203754       |15.0     |1.0        |5.0         |[{"packQty":120.0,"gtin":"00052000042276"}]|[120.0, 00052000042276]|
|5203754       |15.0     |1.0        |2.0         |[{"packQty":120.0,"gtin":"00052000042276"}]|[120.0, 00052000042276]|
|5243700       |25.0     |1.0        |2.0         |na                                         |null                   |
+--------------+---------+-----------+------------+-------------------------------------------+-----------------------+
df.select($"orderupcnumber",$"enrichqty",$"allocoutqty",$"allocatedqty",$"jsonData.packQty".as("packQty"),$"jsonData.gtin".as("gtin"))
This select only returns the rows where jsonData is not null:
+--------------+---------+-----------+------------+-------+--------------+
|orderupcnumber|enrichqty|allocoutqty|allocatedqty|packQty|gtin          |
+--------------+---------+-----------+------------+-------+--------------+
|5203754       |15.0     |1.0        |5.0         |120.0  |00052000042276|
|5203754       |15.0     |1.0        |5.0         |144.0  |00052000042283|
|5243700       |25.0     |1.0        |5.0         |       |              |
+--------------+---------+-----------+------------+-------+--------------+
How can I also include the rows where jsonData is null?
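(A likely culprit, sketched here as an addition rather than part of the original post, assuming jsonData was built with from_json as above: `explode` drops rows whose array is null, while `explode_outer` keeps them as a row of nulls.)

```scala
// Sketch: explode drops rows where the array column is null;
// explode_outer keeps them, emitting a single null element instead,
// so the "na" order survives with null packQty/gtin.
import org.apache.spark.sql.functions.explode_outer

df.select($"orderupcnumber", $"enrichqty", $"allocoutqty", $"allocatedqty",
    explode_outer($"jsonData").as("jsonData"))
  .select($"orderupcnumber", $"enrichqty", $"allocoutqty", $"allocatedqty",
    $"jsonData.packQty".as("packQty"), $"jsonData.gtin".as("gtin"))
```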
Answer 0 (score: 1)
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'CASE
WHEN contains(`gtins`, 'na') THEN 0 ELSE jsontostructs(`gtins`) END' due to data type
mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
To fix the above exception, you have to convert the "na" value to a JSON array type so it matches the type of the other branch. Please check the code below.
scala> df.withColumn("gtins",when($"gtins" === "na",to_json(array($"gtins"))).otherwise($"gtins")).withColumn("jsonData",from_json($"gtins",schema)).show(false)
+-------------------------------------------+-------------------------+
|gtins |jsonData |
+-------------------------------------------+-------------------------+
|[{"packQty":120.0,"gtin":"00052000042276"}]|[[120.0, 00052000042276]]|
|[{"packQty":120.0,"gtin":"00052000042276"}]|[[120.0, 00052000042276]]|
|["na"] |null |
+-------------------------------------------+-------------------------+
scala> df.withColumn("gtins",when($"gtins" === "na",to_json(array($"gtins"))).otherwise($"gtins")).withColumn("jsonData",from_json($"gtins",schema)).select($"gtins",$"jsonData.packQty".as("packQty"),$"jsonData.gtin".as("gtin")).show(false)
+-------------------------------------------+-------+----------------+
|gtins |packQty|gtin |
+-------------------------------------------+-------+----------------+
|[{"packQty":120.0,"gtin":"00052000042276"}]|[120.0]|[00052000042276]|
|[{"packQty":120.0,"gtin":"00052000042276"}]|[120.0]|[00052000042276]|
|["na"] |null |null |
+-------------------------------------------+-------+----------------+
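Since the schema above is an ArrayType, packQty and gtin come back as single-element arrays. A small sketch (my addition, assuming each gtins string holds exactly one object) pulls the scalar out with `getItem`:

```scala
// Sketch: getItem(0) extracts the first element of the parsed array,
// turning the array-typed packQty/gtin columns into scalars.
// Rows whose jsonData is null (the "na" case) stay null.
df.withColumn("gtins", when($"gtins" === "na", to_json(array($"gtins"))).otherwise($"gtins"))
  .withColumn("jsonData", from_json($"gtins", schema))
  .select($"gtins",
    $"jsonData.packQty".getItem(0).as("packQty"),
    $"jsonData.gtin".getItem(0).as("gtin"))
  .show(false)
```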
Answer 1 (score: 0)
If your input data looks like below, where multiple gtins sit inside an array of strings, then you can explode it first and apply the schema and withColumn accordingly:
+--------------+---------+-----------+------------+--------------------------------------------------------------------------------------+
|orderupcnumber|enrichqty|allocoutqty|allocatedqty|gtins |
+--------------+---------+-----------+------------+--------------------------------------------------------------------------------------+
|5243754 |15.0 |1.0 |5.0 |[{"packQty":120.0,"gtin":"00052000042276"}, {"packQty":250.0,"gtin":"00052000012345"}]|
|5243700 |25.0 |1.0 |2.0 |[na] |
+--------------+---------+-----------+------------+--------------------------------------------------------------------------------------+
Then use the below:
val schema = StructType(StructField("packQty",FloatType,true):: StructField("gtin",StringType,true) :: Nil)
df.withColumn("gtins",explode($"gtins")).withColumn("jsonData",from_json($"gtins",schema)).withColumn("packQty",$"jsonData.packQty").withColumn("gtin",$"jsonData.gtin").show(false)
+--------------+---------+-----------+------------+-----------------------------------------+-----------------------+-------+--------------+
|orderupcnumber|enrichqty|allocoutqty|allocatedqty|gtins |jsonData |packQty|gtin |
+--------------+---------+-----------+------------+-----------------------------------------+-----------------------+-------+--------------+
|5243754 |15.0 |1.0 |5.0 |{"packQty":120.0,"gtin":"00052000042276"}|[120.0, 00052000042276]|120.0 |00052000042276|
|5243754 |15.0 |1.0 |5.0 |{"packQty":250.0,"gtin":"00052000012345"}|[250.0, 00052000012345]|250.0 |00052000012345|
|5243700 |25.0 |1.0 |2.0 |na |null |null |null |
+--------------+---------+-----------+------------+-----------------------------------------+-----------------------+-------+--------------+
Answer 2 (score: 0)
The problem is with your when and otherwise clauses: both branches are expected to return the same type as from_json. That is only possible if you also call from_json (with the same schema) in the otherwise branch.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, FloatType, StringType, StructField, StructType}

object ApplySchema {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[*]").appName("ApplySchema").getOrCreate()
    import spark.implicits._

    // Create a dataframe from the list and map the columns to names
    val sampleDf = List(
      (5203754, 15.0, 1.0, 5.0, """[{"packQty":120.0,"gtin":"00052000042276"}]"""),
      (5203754, 15.0, 1.0, 2.0, """[{"packQty":120.0,"gtin":"00052000042276"}]"""),
      (5203754, 25.0, 1.0, 2.0, "na")
    ).toDF("orderupcnumber", "enrichqty", "allocoutqty", "allocatedqty", "gtins")

    // JSON schema
    val schema = ArrayType(StructType(StructField("packQty", FloatType, true) ::
      StructField("gtin", StringType, true) :: Nil))

    // Add the parsed JSON column "jsonData"
    sampleDf.withColumn("jsonData",
      when($"gtins" =!= "na", from_json($"gtins", schema)) // Parse the JSON when the value is not "na"
        .otherwise(from_json(lit("[]"), schema)))          // Otherwise parse an empty JSON array
      .show()
  }
}
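A hedged alternative for the otherwise branch (my addition, reusing sampleDf and schema from the block above): a null literal cast to the array schema also satisfies the CASE/WHEN type check, without parsing a placeholder string:

```scala
// Alternative sketch: a typed null keeps both branches the same type,
// so "na" rows end up with a null jsonData instead of an empty array.
sampleDf.withColumn("jsonData",
  when($"gtins" =!= "na", from_json($"gtins", schema))
    .otherwise(lit(null).cast(schema)))
  .show()
```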