I am learning Scala/Spark and would like to know how to extract the required columns, by column name, from unordered data. Details below.
Input data: RDD[Array[String]]
id=1,country=USA,age=20,name=abc
name=def,country=USA,id=2,age=30
name=ghi,id=3,age=40,country=USA
Required output:
Name,id
abc,1
def,2
ghi,3
Any help would be appreciated. Thanks!
Answer 0 (score: 1)
If you have an RDD[Array[String]], you can convert it to a DataFrame like this. First, define a case class:
case class Data(Name: String, Id: Long)
Then parse each line into the case class:
val df = rdd.map { row =>
  // split the line on "," and build a Map so fields can be looked up by name
  val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
}
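To see what that split/toMap step produces, here is a quick check on the first sample line from the question (plain Scala, no Spark needed):

val sample = "id=1,country=USA,age=20,name=abc"
val parsed = sample.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
// parsed: Map(id -> 1, country -> USA, age -> 20, name -> abc)
Data(parsed("name"), parsed("id").toLong)  // Data(abc,1)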
Convert it to a DataFrame and display it:
df.toDF().show(false)
Output:
+----+---+
|Name|Id |
+----+---+
|abc |1 |
|def |2 |
|ghi |3 |
+----+---+
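One caveat about the lookup: data("name") and data("id") will throw a NoSuchElementException if a line is missing either key. If that can happen in your input, a slightly more defensive variant of the same map step (just a sketch, with placeholder defaults) would be:

val safeDf = rdd.map { row =>
  val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
  // getOrElse avoids a NoSuchElementException when a key is absent
  Data(data.getOrElse("name", ""), data.getOrElse("id", "-1").toLong)
}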
Here is the complete code, reading the input from a file:
import org.apache.spark.sql.SparkSession

case class Data(Name: String, Id: Long)

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder().appName("xyz").master("local[*]").getOrCreate()
  import spark.implicits._

  // read the file as an RDD[String], one record per line
  val rdd = spark.sparkContext.textFile("path to file ")

  val df = rdd.map { row =>
    // split each line into key=value pairs and pick the required fields
    val data = row.split(",").map(x => (x.split("=")(0), x.split("=")(1))).toMap
    Data(data("name"), data("id").toLong)
  }
  df.toDF().show(false)
}
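For completeness, since the question mentions an RDD[Array[String]] (records already split into key=value tokens) rather than an RDD[String] read from a file, the same idea adapts directly. The sketch below assumes a hypothetical rddArr of that type and that spark.implicits._ is in scope, as in the main method above:

// rddArr: RDD[Array[String]], each element like Array("id=1", "country=USA", "age=20", "name=abc")
val dfFromArrays = rddArr.map { tokens =>
  // each token is already a "key=value" pair, so only the "=" split is needed
  val data = tokens.map(x => (x.split("=")(0), x.split("=")(1))).toMap
  Data(data("name"), data("id").toLong)
}.toDF()
dfFromArrays.show(false)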