How do I flatten (explode) a DataFrame into one row per character of a column's value?

Asked: 2018-07-01 05:02:10

Tags: apache-spark dataframe

The DataFrame should be exploded on the SPC column. An example follows.

My input DataFrame:

ID Name Level SPC  Rating salary
23 sam  3     HBS  3.5    4000
43 Nair 4     KSTk 4      5000
56 Rom  5     MNC  3      3000

My output should be:

ID Name Level SPC Rating salary
23 sam  3     H   3.5    4000
23 sam  3     B   3.5    4000
23 sam  3     S   3.5    4000
43 Nair 4     K   4      5000
43 Nair 4     S   4      5000
43 Nair 4     T   4      5000
43 Nair 4     k   4      5000

How can I solve this with Scala or Java code?

2 Answers:

Answer 0: (score: 0)

If you have a DataFrame/Dataset such as

+---+----+-----+----+------+------+
|ID |Name|Level|SPC |Rating|salary|
+---+----+-----+----+------+------+
|23 |sam |3    |HBS |3.5   |4000  |
|43 |Nair|4    |KSTk|4.0   |5000  |
|56 |Rom |5    |MNC |3.0   |3000  |
+---+----+-----+----+------+------+

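then you can write a udf function that converts the SPC string into an array of single-character strings, and then apply the explode function to that array: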

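import org.apache.spark.sql.functions._
// UDF: turn the SPC string into a list of one-character strings
val flattenStringUdf = udf((spc: String) => spc.toList.map(_.toString))

// explode emits one output row per element of the array
df.withColumn("SPC", explode(flattenStringUdf(col("SPC")))).show(false)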

which should give you

+---+----+-----+---+------+------+
|ID |Name|Level|SPC|Rating|salary|
+---+----+-----+---+------+------+
|23 |sam |3    |H  |3.5   |4000  |
|23 |sam |3    |B  |3.5   |4000  |
|23 |sam |3    |S  |3.5   |4000  |
|43 |Nair|4    |K  |4.0   |5000  |
|43 |Nair|4    |S  |4.0   |5000  |
|43 |Nair|4    |T  |4.0   |5000  |
|43 |Nair|4    |k  |4.0   |5000  |
|56 |Rom |5    |M  |3.0   |3000  |
|56 |Rom |5    |N  |3.0   |3000  |
|56 |Rom |5    |C  |3.0   |3000  |
+---+----+-----+---+------+------+
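
One thing the approach above leaves open is a null SPC value, which would reach the function as null and fail on `spc.toList`. The following is a minimal sketch of a null-safe splitting function in plain Scala, so it can be tried without a Spark session. The names `SplitSketch` and `splitChars` are hypothetical and not part of the original answer.

```scala
object SplitSketch {
  // Hypothetical helper: splits a string into one-character strings, like the
  // UDF body in this answer, but a null input yields an empty list instead of
  // throwing a NullPointerException.
  def splitChars(spc: String): List[String] =
    if (spc == null) Nil else spc.toList.map(_.toString)

  def main(args: Array[String]): Unit = {
    println(splitChars("KSTk")) // List(K, S, T, k)
    println(splitChars(null))   // List()
  }
}
```

A function like this could presumably be wrapped with `udf(SplitSketch.splitChars _)` in place of the inline lambda. Keep in mind that `explode` emits no rows for an empty array, so rows with a null or empty SPC would vanish from the result; `explode_outer` keeps such rows with a null SPC instead.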

I hope the answer is helpful.

Answer 1: (score: 0)

Try the flatMap method.

Example (I have not checked whether it compiles):

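// Assumes input is a Dataset[MyRow] with spark.implicits._ in scope, where
// case class MyRow(ID: Int, Name: String, Level: Int, SPC: String, Rating: Double, salary: Int)
val output = input.flatMap(row =>
    row.SPC.toList.map(ch =>
        MyRow(row.ID, row.Name, row.Level, ch.toString, row.Rating, row.salary)))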
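
To make the flatMap idea concrete, here is a minimal sketch that runs on ordinary Scala collections, with no Spark session needed. The case class `MyRow`, its field types, and the object `FlatMapSketch` are assumptions chosen to mirror the sample columns; they are not taken from the question.

```scala
// Hypothetical row type; field names follow the sample DataFrame's columns,
// field types are guesses based on the sample values.
final case class MyRow(ID: Int, Name: String, Level: Int, SPC: String,
                       Rating: Double, salary: Int)

object FlatMapSketch {
  // Emit one output row per character of SPC; all other fields are copied.
  def explodeSpc(rows: Seq[MyRow]): Seq[MyRow] =
    rows.flatMap(row => row.SPC.toList.map(ch => row.copy(SPC = ch.toString)))

  // Sample data from the question.
  val input: Seq[MyRow] = Seq(
    MyRow(23, "sam", 3, "HBS", 3.5, 4000),
    MyRow(43, "Nair", 4, "KSTk", 4.0, 5000),
    MyRow(56, "Rom", 5, "MNC", 3.0, 3000)
  )

  def main(args: Array[String]): Unit =
    explodeSpc(input).foreach(println)
}
```

With Spark, the same lambda should carry over to `Dataset.flatMap`, assuming the DataFrame is first converted with `df.as[MyRow]` (which requires matching column names and types) and `spark.implicits._` is imported to supply the encoder.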