I have a Spark DataFrame in the following format:
+--------------------+
|value |
+--------------------+
|Id,date |
|000027,2017-11-14 |
|000045,2017-11-15 |
|000056,2018-09-09 |
|C000056,2018-07-01 |
+--------------------+
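I need to go through each row, split it on the comma (,), and put the values into separate columns (Id and date as two separate columns).
I am new to Spark and not sure whether this can be done with a lambda function. Any suggestions would be appreciated.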
Answer 0 (score: -2)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Demo").getOrCreate()
// needed for toDF on a local Seq
import spark.implicits._
val df = Seq("a,b,c,f", "d,f,g,h").toDF("value")
df.show // show the DataFrame
+-------+
| value|
+-------+
|a,b,c,f|
|d,f,g,h|
+-------+
// split each value on the "," delimiter and build an RDD[Row]
// (every line must yield exactly as many fields as the schema has columns)
val rdd = df.rdd.map(x => Row(x.getString(0).split(","): _*))
// all four columns are nullable strings
val schema = StructType(Array("name", "class", "rank", "grade").map(x => StructField(x, StringType, true)))
spark.createDataFrame(rdd, schema).show
+----+-----+----+-----+
|name|class|rank|grade|
+----+-----+----+-----+
| a| b| c| f|
| d| f| g| h|
+----+-----+----+-----+
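The snippet above uses made-up data and goes through an RDD. For the Id,date rows in the question, the same split can probably be done without leaving the DataFrame API. Below is a minimal sketch, assuming the header line "Id,date" is stored as an ordinary data row (as the question's output suggests) and that every other row contains exactly one comma. The app name, the local master setting, and the variable names raw, parts, and result are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split, trim}

// Assumption: a local session for experimenting; in spark-shell the existing session is reused.
val spark = SparkSession.builder().appName("SplitDemo").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data mirroring the question, with the header stored as a normal row.
val raw = Seq(
  "Id,date",
  "000027,2017-11-14",
  "000045,2017-11-15",
  "000056,2018-09-09",
  "C000056,2018-07-01"
).toDF("value")

// split produces an array column; getItem picks one element of it.
val parts = split(col("value"), ",")

val result = raw
  .filter(trim(col("value")) =!= "Id,date") // drop the header row
  .select(
    trim(parts.getItem(0)).as("Id"),        // keep Id as a string so leading zeros survive
    trim(parts.getItem(1)).as("date")
  )

result.show()
```

If a real date type is needed, the date column could be converted afterwards with to_date. When the data comes from a file rather than an existing DataFrame, reading it with spark.read.option("header", "true").csv(...) may avoid the manual split entirely.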