I have a text file in the following format.
id##name##subjects$$$
1##a##science
english$$$
2##b##social
mathematics$$$
I want to create a DataFrame like this:
id | name | subject
1  | a    | science
   |      | english
When I execute the Scala below, I only get an RDD[String]. How do I convert the RDD[String] into a DataFrame?
val rdd = sc.textFile(fileLocation)
val a = rdd.reduce((a, b) => a + " " + b).split("\\$\\$\\$").map(f => f.replaceAll("##", ""))
Answer 0 (score: 0)
Given the text file you provided, and assuming you want to transform it into the following (put the sample text into a file named example.txt):
+---+----+-----------+
| id|name| subjects|
+---+----+-----------+
| 1| a| science|
| | | english|
| 2| b| social|
| | |mathematics|
+---+----+-----------+
you can run the following code (Spark 2.3.2):
val fileLocation="example.txt"
val rdd = sc.textFile(fileLocation)
def format(x : (String, String, String)) : String = {
val a = if ("".equals(x._1)) "| " else x._1 + " | "
val b = if ("".equals(x._2)) "| " else x._2 + " | "
val c = if ("".equals(x._3)) "" else x._3
return a + b + c
}
var rdd2 = rdd.filter(x => x.length != 0).map(s => s.split("##")).map(a => {
a match {
case Array(x) =>
("", "", x.split("\\$\\$\\$")(0))
case Array(x, y, z) =>
(x, y, z.split("\\$\\$\\$")(0))
}
})
rdd2.foreach(x => println(format(x)))
val header = rdd2.first()
val df = rdd2.filter(row => row != header).toDF(header._1, header._2, header._3)
df.show
val ds = rdd2.filter(row => row != header).toDS.withColumnRenamed("_1", header._1).withColumnRenamed("_2", header._2).withColumnRenamed("_3", header._3)
ds.show
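If you prefer not to rely on the implicit tuple-to-DataFrame conversion, a minimal alternative sketch (untested, reusing rdd2 and header from the snippet above and the spark session available in the shell) is to build the DataFrame from an RDD[Row] with an explicit schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Explicit schema: three nullable string columns named after the header tuple.
val schema = StructType(Seq(
  StructField(header._1, StringType, nullable = true),
  StructField(header._2, StringType, nullable = true),
  StructField(header._3, StringType, nullable = true)
))

// Convert each (id, name, subject) tuple to a Row and apply the schema.
val rowRdd = rdd2.filter(row => row != header).map { case (id, name, subject) => Row(id, name, subject) }
val df2 = spark.createDataFrame(rowRdd, schema)
df2.show

This avoids the withColumnRenamed calls, since the column names are fixed up front in the schema.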