My problem is that I have a JSON file containing struct-typed data in col3. I can extract the rows, but I cannot find the minimum value of col3, whose nested columns have dynamic names.
The input data is:
"result": { "data" :
[ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},
{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},
{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}]
The expected output is:
col1      col2      col3      col4      col5 (min value of col3)
value1    value2    [3,5]     value4    3
value11   value22   [6,9,2]   value44   2
value12   value23   [6]       value43   6
I can read the file and explode it into records, but I cannot find the minimum value of col3.
val bestseller_df1 = bestseller_json.withColumn("extractedresult", explode(col("result.data")))
Could you please help me write code to find the minimum value of col3 in Spark/Scala?
My JSON file is:
{"success":true, "result": { "data": [ {"col1": "value1", "col2": "value2", "col3" : { "dyno" : 3, "aeio": 5 }, "col4": "value4"},{"col1": "value11", "col2": "value22", "col3" : { "abc" : 6, "def": 9 , "aero": 2}, "col4": "value44"},{"col1": "value12", "col2": "value23", "col3" : { "ddc" : 6}, "col4": "value43"}],"total":3}}
Answer 0 (score: 1)
Here is how you can do it:
scala> val df = spark.read.json("/tmp/stack/pathi.json")
df: org.apache.spark.sql.DataFrame = [result: struct<data: array<struct<col1:string,col2:string,col3:struct<abc:bigint,aeio:bigint,aero:bigint,ddc:bigint,def:bigint,dyno:bigint>,col4:string>>, total: bigint>, success: boolean]
scala> df.printSchema
root
|-- result: struct (nullable = true)
| |-- data: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- col1: string (nullable = true)
| | | |-- col2: string (nullable = true)
| | | |-- col3: struct (nullable = true)
| | | | |-- abc: long (nullable = true)
| | | | |-- aeio: long (nullable = true)
| | | | |-- aero: long (nullable = true)
| | | | |-- ddc: long (nullable = true)
| | | | |-- def: long (nullable = true)
| | | | |-- dyno: long (nullable = true)
| | | |-- col4: string (nullable = true)
| |-- total: long (nullable = true)
|-- success: boolean (nullable = true)
scala> df.show(false)
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|result |success|
+-------------------------------------------------------------------------------------------------------------------------------+-------+
|[[[value1, value2, [, 5,,,, 3], value4], [value11, value22, [6,, 2,, 9,], value44], [value12, value23, [,,, 6,,], value43]], 3]|true |
+-------------------------------------------------------------------------------------------------------------------------------+-------+
scala> df.select(explode($"result.data")).show(false)
+-----------------------------------------+
|col |
+-----------------------------------------+
|[value1, value2, [, 5,,,, 3], value4] |
|[value11, value22, [6,, 2,, 9,], value44]|
|[value12, value23, [,,, 6,,], value43] |
+-----------------------------------------+
By looking at the schema, we now know the list of possible columns inside "col3", so we can compute the minimum across all of them by hard-coding the column names as shown below:
scala> df.select(explode($"result.data")).select(least($"col.col3.abc",$"col.col3.aeio",$"col.col3.aero",$"col.col3.ddc",$"col.col3.def",$"col.col3.dyno")).show(false)
+--------------------------------------------------------------------------------------------+
|least(col.col3.abc, col.col3.aeio, col.col3.aero, col.col3.ddc, col.col3.def, col.col3.dyno)|
+--------------------------------------------------------------------------------------------+
|3 |
|2 |
|6 |
+--------------------------------------------------------------------------------------------+
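For intuition, `least` returns the smallest non-null value among its arguments, and null only when every argument is null. A minimal plain-Scala model of that per-row semantics (the names here are illustrative; this is not the Spark API):

```scala
// Plain-Scala model of Spark's `least`: minimum of the non-null inputs,
// None when every input is missing. Illustrative only, not Spark code.
def leastOf(values: Option[Long]*): Option[Long] = {
  val present = values.flatten
  if (present.isEmpty) None else Some(present.min)
}

// The first two rows above, with absent struct fields modeled as None:
val row1 = Seq(None, Some(5L), None, None, None, Some(3L))
val row2 = Seq(Some(6L), None, Some(2L), None, Some(9L), None)

println(leastOf(row1: _*)) // Some(3)
println(leastOf(row2: _*)) // Some(2)
```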
Handling it dynamically:
I am assuming the structure is fixed up to col.col3, so we start by creating another DataFrame:
scala> val df2 = df.withColumn("res_data",explode($"result.data")).select(col("success"),col("res_data"),$"res_data.col3.*")
df2: org.apache.spark.sql.DataFrame = [success: boolean, res_data: struct<col1: string, col2: string ... 2 more fields> ... 6 more fields]
scala> df2.show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|
+-------+-----------------------------------------+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|
+-------+-----------------------------------------+----+----+----+----+----+----+
Apart from "success" and "res_data", the remaining columns are the dynamic columns from "col3".
scala> val p = df2.columns
p: Array[String] = Array(success, res_data, abc, aeio, aero, ddc, def, dyno)
Filter those two out and map the rest to Spark columns:
scala> val m = p.filter(_!="success").filter(_!="res_data").map(col(_))
m: Array[org.apache.spark.sql.Column] = Array(abc, aeio, aero, ddc, def, dyno)
Now pass m: _* as the argument to the least function, and you will get the result below:
scala> df2.withColumn("minv",least(m:_*)).show(false)
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|success|res_data |abc |aeio|aero|ddc |def |dyno|minv|
+-------+-----------------------------------------+----+----+----+----+----+----+----+
|true |[value1, value2, [, 5,,,, 3], value4] |null|5 |null|null|null|3 |3 |
|true |[value11, value22, [6,, 2,, 9,], value44]|6 |null|2 |null|9 |null|2 |
|true |[value12, value23, [,,, 6,,], value43] |null|null|null|6 |null|null|6 |
+-------+-----------------------------------------+----+----+----+----+----+----+----+
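The `m: _*` syntax is ordinary Scala varargs expansion: it spreads a collection's elements into a variadic parameter list, which is what lets `least(m: _*)` accept a column list built at runtime. A quick non-Spark illustration (the function name is made up for the example):

```scala
// `xs: _*` passes a collection's elements as individual varargs arguments.
def smallest(nums: Int*): Int = nums.min

val xs = Array(6, 2, 9)
println(smallest(xs: _*)) // 2
```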
Hope this helps.
Answer 1 (score: 0)
dbutils.fs.put("/tmp/test.json", """
{"col1": "value1", "col2": "value2", "col3": {"dyno": 3, "aeio": 5}, "col4": "value4"}
{"col1": "value11", "col2": "value22", "col3": {"abc": 6, "def": 9, "aero": 2}, "col4": "value44"}
{"col1": "value12", "col2": "value23", "col3": {"ddc": 6}, "col4": "value43"}
""", true)
val df_json = spark.read.json("/tmp/test.json")
val tf = df_json.withColumn("col3", explode(array($"col3.*"))).toDF
val tmp_group = tf.groupBy("col1").agg(min(tf.col("col3")).alias("col3"))
val top_rows = tf.join(tmp_group, Seq("col3", "col1"), "inner")
top_rows.select("col1", "col2", "col3", "col4").show()
Wrote 282 bytes.
+-------+-------+----+-------+
|   col1|   col2|col3|   col4|
+-------+-------+----+-------+
| value1| value2|   3| value4|
|value11|value22|   2|value44|
|value12|value23|   6|value43|
+-------+-------+----+-------+
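Outside Spark, the row-wise minimum over dynamic keys is just the minimum of each record's col3 map values. A plain-Scala sketch of that logic, with the case class and sample data made up to mirror the JSON above:

```scala
// Each record's col3 is a map with dynamic key names; the per-row minimum is
// simply the minimum of its values. Illustrative sketch, not Spark code.
case class Rec(col1: String, col2: String, col3: Map[String, Long], col4: String)

val data = Seq(
  Rec("value1",  "value2",  Map("dyno" -> 3L, "aeio" -> 5L),             "value4"),
  Rec("value11", "value22", Map("abc" -> 6L, "def" -> 9L, "aero" -> 2L), "value44"),
  Rec("value12", "value23", Map("ddc" -> 6L),                            "value43")
)

// (col1, all col3 values, min of col3) per record
val withMin = data.map(r => (r.col1, r.col3.values.toSeq, r.col3.values.min))
withMin.foreach(println)
```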