Splitting a master DataFrame into a child DataFrame with specific columns using substring

Time: 2017-04-29 14:59:37

Tags: scala apache-spark spark-dataframe

I am trying to split a master DataFrame into a child DataFrame with multiple columns, but after the split I still only get the single masterDF column instead of the extra columns I am trying to create.

ChildDF=
K0059122016022YU165754000000  000100000 L0000026009011    00020000           00007020149600001050000000N                         
K0059122016022YU100000000000  000200000 90800035433174    00010000           00009390150200001410000000N                         
K0059122016022YU160000000000  000100000 90800034921015    000100000000000000000014600000000000000000000N                         
K0059122016022YU165752000000  000100000 90800028370118    00020000           00011110000000000000000000N                         
K0059122016022YU100000000000  920161206083824VS122400000000000000000000000000000000000000020161206083824
K0059122016022YU165000000000  0001IVASQ S0000931025555    00020000           00004460000000000000000000N

listIs = List(
  Map("type" -> "A", "value1" -> 1,  "value2" -> 1),
  Map("type" -> "B", "value1" -> 2,  "value2" -> 6),
  Map("type" -> "C", "value1" -> 8,  "value2" -> 7),
  Map("type" -> "D", "value1" -> 15, "value2" -> 2),
  Map("type" -> "E", "value1" -> 17, "value2" -> 8),
  Map("type" -> "F", "value1" -> 25, "value2" -> 8)
)


listIs.foreach(iteam =>
  ChildDF.withColumn(
    iteam("type"),
    substring(ChildDF("masterDF"), iteam("value1").asInstanceOf[Int], iteam("value2").asInstanceOf[Int])
  )
)
ChildDF.createOrReplaceTempView("ChildTable")
val queryDF = "SELECT * from ChildTable"
sparkSession.sql(queryDF).cache().toDF().show()

Output

masterDF
K0059122016022YU165754....
K0059122016022YU100000....
K0059122016022YU160000....
K0059122016022YU165752....
K0059122016022YU100000....
K0059122016022YU165000....

Expected output (XXXXXX are the split values)

    masterDF                   A          B        C
K0059122016022YU165754....   XXXXXX     XXXXXX  XXXXXX
K0059122016022YU100000....   XXXXXX     XXXXXX  XXXXXX
K0059122016022YU160000....   XXXXXX     XXXXXX  XXXXXX
K0059122016022YU165752....   XXXXXX     XXXXXX  XXXXXX
K0059122016022YU100000....   XXXXXX     XXXXXX  XXXXXX
K0059122016022YU165000....   XXXXXX     XXXXXX  XXXXXX

1 Answer:

Answer 0 (score: 0)

Use foldLeft instead of foreach. withColumn returns a new DataFrame rather than mutating ChildDF in place, so foreach silently discards every result; each intermediate DataFrame must be carried into the next call (a plain map would only give you a list of single-column variants, not one DataFrame with all the columns).

import org.apache.spark.sql.functions.substring

val newChildDF = listIs.foldLeft(ChildDF)((df, iteam) =>
  df.withColumn(
    iteam("type").toString,
    substring(df("masterDF"), iteam("value1").asInstanceOf[Int], iteam("value2").asInstanceOf[Int])
  )
)

newChildDF.createOrReplaceTempView("ChildTable")

val queryDF = "SELECT * from ChildTable"

sparkSession.sql(queryDF).cache().toDF().show()
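A note on the offsets: Spark's substring(col, pos, len) uses 1-based positions, following the SQL convention. A minimal plain-Scala sketch (no Spark required; the slice helper, sample row, and trimmed-down spec list are hypothetical stand-ins for the question's data) shows what each generated column would contain for one row:

```scala
// Sketch of what each withColumn computes, assuming Spark's
// substring(col, pos, len) semantics: 1-based start position, fixed length.
object SubstringSketch {
  // hypothetical sample row (first record from the question, truncated)
  val row = "K0059122016022YU165754000000  000100000 L0000026009011"

  // first three column specs from listIs
  val specs = List(
    Map("type" -> "A", "value1" -> 1, "value2" -> 1),
    Map("type" -> "B", "value1" -> 2, "value2" -> 6),
    Map("type" -> "C", "value1" -> 8, "value2" -> 7)
  )

  // mirrors substring(df("masterDF"), pos, len) on a plain String
  def slice(s: String, pos: Int, len: Int): String =
    s.substring(pos - 1, math.min(pos - 1 + len, s.length))

  def main(args: Array[String]): Unit =
    specs.foreach { iteam =>
      val v = slice(row, iteam("value1").asInstanceOf[Int], iteam("value2").asInstanceOf[Int])
      println(s"${iteam("type")} -> $v")
    }
}
```

For the sample row this prints A -> K, B -> 005912, C -> 2016022, which is exactly what the corresponding columns of the expected output would hold.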