I am trying to transform a DataFrame df that contains the following data:
+---+----------------------------------------------------+
|ka |readingsWFreq |
+---+----------------------------------------------------+
|列 |[[[列,つ],220], [[列,れっ],353], [[列,れつ],47074]] |
|制 |[[[制,せい],235579]] |
It has the following schema:
root
|-- ka: string (nullable = true)
|-- readingsWFreq: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- furigana: struct (nullable = true)
| | | |-- _1: string (nullable = true)
| | | |-- _2: string (nullable = true)
| | |-- Occ: long (nullable = true)
My goal is to split the values of readingsWFreq into three separate columns. To do that, I tried using UDFs, as shown below:
val uExtractK = udf((kWFreq:Seq[((String, String), Long)]) => kWFreq.map(_._1._1))
val uExtractR = udf((kWFreq:Seq[((String, String), Long)]) => kWFreq.map(_._1._2))
val uExtractN = udf((kWFreq:Seq[((String, String), Long)]) => kWFreq.map(_._2))
val df2 = df.withColumn("K", uExtractK('readingsWFreq))
.withColumn("R", uExtractR('readingsWFreq))
.withColumn("N", uExtractN('readingsWFreq))
.drop('readingsWFreq)
However, I get the following error related to the UDFs:
[error] (run-main-0) org.apache.spark.sql.AnalysisException: cannot resolve
'UDF(readingsWFreq)' due to data type mismatch: argument 1 requires
array<struct<_1:struct<_1:string,_2:string>,_2:bigint>> type, however,
'`readingsWFreq`' is of
array<struct<furigana:struct<_1:string,_2:string>,Occ:bigint>> type.;;
My question is: how can I manipulate the DataFrame so that it produces the following result?
+---+----------------------------------------------------+
|ka |K |R |N |
+---+----------------------------------------------------+
|列 |[列, 列, 列] | [つ, れっ, れつ] | [220, 353, 47074] |
|制 |[制] | [せい] | [235579] |
Answer 0 (score: 5)
DataFrame API approach:
You don't need a UDF. Just do:
df.select(
$"readingsWFreq.furigana._1".as("K"),
$"readingsWFreq.furigana._2".as("R"),
$"readingsWFreq.Occ".as("N")
)
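The trick here is that the dot operator (.) on a column of type array also acts as a mapping/projection operator: it is applied to every element and returns an array of the results. On a column of type struct, the same operator selects a field.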
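To make the projection behaviour concrete, here is a minimal sketch that rebuilds the question's data and applies the dot paths. The case classes Reading and Entry, the object name, and the local SparkSession settings are assumptions chosen only to reproduce the schema shown in the question; they are not part of the original post. The ka column is also selected so that the output matches the requested layout.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes, used only to reproduce the schema from the question:
// readingsWFreq: array<struct<furigana: struct<_1, _2>, Occ: bigint>>
case class Reading(furigana: (String, String), Occ: Long)
case class Entry(ka: String, readingsWFreq: Seq[Reading])

object DotProjectionSketch {
  // Builds the two sample rows shown in the question.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Entry("列", Seq(Reading(("列", "つ"), 220L), Reading(("列", "れっ"), 353L), Reading(("列", "れつ"), 47074L))),
      Entry("制", Seq(Reading(("制", "せい"), 235579L)))
    ).toDF()
  }

  // A dot path through an array of structs projects the field out of every
  // element, so each selected column is itself an array.
  def project(df: DataFrame): DataFrame =
    df.select(
      df("ka"),
      df("readingsWFreq.furigana._1").as("K"),
      df("readingsWFreq.furigana._2").as("R"),
      df("readingsWFreq.Occ").as("N")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dot-projection-sketch").getOrCreate()
    val result = project(sampleDf(spark))
    result.printSchema() // K and R should be array<string>, N array<bigint>
    result.show(false)
    spark.stop()
  }
}
```

Because no UDF is involved, Spark can see the whole expression, which is generally preferable when a built-in operator covers the case.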
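UDF approach:
You cannot pass tuples into a UDF; struct values arrive as Row objects instead, so the UDF has to accept them as Row. See, for example, "Using Spark UDFs with struct sequences".
In your case the tuples are nested, so you need to unpack the Row twice: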
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
val uExtractK = udf((kWFreq:Seq[Row]) => kWFreq.map(r => r.getAs[Row](0).getAs[String](0)))
val uExtractR = udf((kWFreq:Seq[Row]) => kWFreq.map(r => r.getAs[Row](0).getAs[String](1)))
val uExtractN = udf((kWFreq:Seq[Row]) => kWFreq.map(r => r.getAs[Long](1)))
Or with pattern matching on Row:
val uExtractK = udf((kWFreq:Seq[Row]) => kWFreq.map{case Row(kr:Row,n:Long) => kr match {case Row(k:String,r:String) => k}})
val uExtractR = udf((kWFreq:Seq[Row]) => kWFreq.map{case Row(kr:Row,n:Long) => kr match {case Row(k:String,r:String) => r}})
val uExtractN = udf((kWFreq:Seq[Row]) => kWFreq.map{case Row(kr:Row,n:Long) => n})
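As a rough end-to-end sketch, the Row-based UDFs can be plugged into the withColumn pipeline from the question as follows. The case classes FreqReading and KanjiEntry, the object name, and the local SparkSession settings are assumptions introduced only to make the example self-contained; the UDF bodies use the positional getAs variant shown above.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical case classes, used only to reproduce the schema from the question.
case class FreqReading(furigana: (String, String), Occ: Long)
case class KanjiEntry(ka: String, readingsWFreq: Seq[FreqReading])

object RowUdfSketch {
  // Each array element arrives as Row(furigana: Row(_1, _2), Occ),
  // so the nested struct is unpacked in two steps.
  val uExtractK = udf((kWFreq: Seq[Row]) => kWFreq.map(_.getAs[Row](0).getAs[String](0)))
  val uExtractR = udf((kWFreq: Seq[Row]) => kWFreq.map(_.getAs[Row](0).getAs[String](1)))
  val uExtractN = udf((kWFreq: Seq[Row]) => kWFreq.map(_.getAs[Long](1)))

  // Builds the two sample rows shown in the question.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      KanjiEntry("列", Seq(FreqReading(("列", "つ"), 220L), FreqReading(("列", "れっ"), 353L), FreqReading(("列", "れつ"), 47074L))),
      KanjiEntry("制", Seq(FreqReading(("制", "せい"), 235579L)))
    ).toDF()
  }

  // Same shape as the pipeline in the question, with the Row-based UDFs swapped in.
  def split(df: DataFrame): DataFrame =
    df.withColumn("K", uExtractK(col("readingsWFreq")))
      .withColumn("R", uExtractR(col("readingsWFreq")))
      .withColumn("N", uExtractN(col("readingsWFreq")))
      .drop("readingsWFreq")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("row-udf-sketch").getOrCreate()
    split(sampleDf(spark)).show(false)
    spark.stop()
  }
}
```

Positional access depends on the field order in the schema (furigana first, Occ second); if that order changes, the indices have to change with it.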
Answer 1 (score: 1)
You can first explode the outer array, then select each value, group by ka, and collect the values back into lists with collect_list.
import org.apache.spark.sql.functions.{collect_list, explode}
val df1 = df.withColumn("readingsWFreq", explode($"readingsWFreq"))
df1.select("ka", "readingsWFreq.furigana.*", "readingsWFreq.Occ")
.groupBy("ka").agg(collect_list("_1").as("K"),
collect_list("_2").as("R"),
collect_list("Occ").as("N")
)
Hope this helps!