Splitting a Spark dataframe into multiple columns in Spark 1.6

Asked: 2018-08-10 18:04:57

Tags: pyspark

I have a Spark dataframe in the following format:

     +--------------------+
     |value               |
     +--------------------+
     |Id,date             |
     |000027,2017-11-14   |
     |000045,2017-11-15   |
     |000056,2018-09-09   |
     |C000056,2018-07-01  |
     +--------------------+

I need to iterate over each row, split the value on the comma (,), and put the parts into separate columns (Id and date as two separate columns).

I am new to Spark and not sure whether this can be done with a lambda function. Any suggestions would be appreciated.
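For reference, the per-row transformation being asked for is just a comma split, with the first row treated as the header. A minimal sketch in plain Python (outside Spark, assuming none of the values themselves contain commas):

```python
# The rows as they appear in the single "value" column of the dataframe.
rows = [
    "Id,date",
    "000027,2017-11-14",
    "000045,2017-11-15",
    "000056,2018-09-09",
    "C000056,2018-07-01",
]

# Split every line on the comma; the first result is the header row.
header, *data = [line.split(",") for line in rows]
print(header)   # ['Id', 'date']
print(data[0])  # ['000027', '2017-11-14']
```

Inside Spark the same split would be the body of the lambda passed to `rdd.map`.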

1 Answer:

Answer 0 (score: -2)

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession

// Note: SparkSession was introduced in Spark 2.0; on Spark 1.6 use SQLContext instead.
val spark = SparkSession.builder().appName("Demo").getOrCreate()
import spark.implicits._  // needed for toDF

val df = Seq("a,b,c,f", "d,f,g,h").toDF("value")
df.show()  // show the DataFrame
+-------+
|  value|
+-------+
|a,b,c,f|
|d,f,g,h|
+-------+

// split each row on the "," delimiter and create an RDD[Row]
val rdd = df.rdd.map(x => Row(x.getString(0).split(","): _*))
val schema = StructType(Array("name", "class", "rank", "grade").map(x => StructField(x, StringType, true)))
spark.createDataFrame(rdd, schema).show()
+----+-----+----+-----+
|name|class|rank|grade|
+----+-----+----+-----+
|   a|    b|   c|    f|
|   d|    f|   g|    h|
+----+-----+----+-----+