The dataframe should be exploded on the SPC column, one row per character. Here is an example.
My input dataframe:
ID Name Level SPC Rating Salary
23 sam 3 HBS 3.5 4000
43 Nair 4 KSTk 4 5000
56 Rom 5 MNC 3 3000
My output should be:
ID Name Level SPC Rating Salary
23 sam 3 H 3.5 4000
23 sam 3 B 3.5 4000
23 sam 3 S 3.5 4000
43 Nair 4 K 4 5000
43 Nair 4 S 4 5000
43 Nair 4 T 4 5000
43 Nair 4 k 4 5000
How can I solve this in Scala or Java?
Answer 0 (score: 0)
If you have a dataframe/dataset such as
+---+----+-----+----+------+------+
|ID |Name|Level|SPC |Rating|salary|
+---+----+-----+----+------+------+
|23 |sam |3 |HBS |3.5 |4000 |
|43 |Nair|4 |KSTk|4.0 |5000 |
|56 |Rom |5 |MNC |3.0 |3000 |
+---+----+-----+----+------+------+
then you can write a udf function that converts each SPC string into an array of single-character strings, and pass the result to the explode function:
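import org.apache.spark.sql.functions._
// Turn a string such as "HBS" into List("H", "B", "S")
val flattenStringUdf = udf((spc: String) => spc.toList.map(_.toString))
// explode emits one row per list element, replacing the SPC column
df.withColumn("SPC", explode(flattenStringUdf(col("SPC")))).show(false)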
which should give you
+---+----+-----+---+------+------+
|ID |Name|Level|SPC|Rating|salary|
+---+----+-----+---+------+------+
|23 |sam |3 |H |3.5 |4000 |
|23 |sam |3 |B |3.5 |4000 |
|23 |sam |3 |S |3.5 |4000 |
|43 |Nair|4 |K |4.0 |5000 |
|43 |Nair|4 |S |4.0 |5000 |
|43 |Nair|4 |T |4.0 |5000 |
|43 |Nair|4 |k |4.0 |5000 |
|56 |Rom |5 |M |3.0 |3000 |
|56 |Rom |5 |N |3.0 |3000 |
|56 |Rom |5 |C |3.0 |3000 |
+---+----+-----+---+------+------+
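The conversion done inside the udf can be tried on plain Scala strings, without Spark. The sketch below is not part of the answer: `splitChars` and `SplitChars` are hypothetical names, and the null handling is an added assumption, since `spc.toList` would throw on a null SPC value.

```scala
// Hypothetical helper (not from the answer): splits a string into
// one-character strings, treating null as "no characters".
object SplitChars {
  def splitChars(spc: String): List[String] =
    Option(spc).map(_.toList.map(_.toString)).getOrElse(Nil)

  def main(args: Array[String]): Unit = {
    println(splitChars("KSTk")) // List(K, S, T, k)
    println(splitChars(null))   // List()
  }
}
```

If a helper like this were wrapped in udf in place of the lambda, rows with a null or empty SPC would produce an empty array, and explode would then drop those rows from the result.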
I hope the answer is helpful.
Answer 1 (score: 0)
Try using the flatMap method.
Example (I have not checked that it compiles):
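// Assumes input is a Dataset[MyRow], MyRow is a case class matching the columns,
// and spark.implicits._ is imported so an encoder for MyRow is available.
val output = input.flatMap(row =>
  row.SPC.toList.map(ch =>
    new MyRow(row.ID, row.Name, row.Level, ch.toString, row.Rating, row.Salary)))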
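To see the flatMap idea in isolation, here is a minimal sketch on an ordinary Scala Seq rather than a Spark Dataset. `MyRow`, its field names and types, and `explodeSpc` are assumptions made for illustration; the answer does not define them.

```scala
// MyRow is a hypothetical case class mirroring the question's columns.
final case class MyRow(ID: Int, Name: String, Level: Int, SPC: String, Rating: Double, Salary: Int)

object FlatMapSketch {
  // One output row per character of SPC; all other fields are copied unchanged.
  def explodeSpc(rows: Seq[MyRow]): Seq[MyRow] =
    rows.flatMap(row => row.SPC.toList.map(ch => row.copy(SPC = ch.toString)))

  def main(args: Array[String]): Unit = {
    val input = Seq(
      MyRow(23, "sam", 3, "HBS", 3.5, 4000),
      MyRow(43, "Nair", 4, "KSTk", 4.0, 5000),
      MyRow(56, "Rom", 5, "MNC", 3.0, 3000)
    )
    // Prints ten rows: three for "HBS", four for "KSTk", three for "MNC".
    explodeSpc(input).foreach(println)
  }
}
```

Using `row.copy(SPC = ...)` avoids repeating every field, so a misspelled field name cannot slip in when the row is rebuilt.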