I have a sequence of tuples from which I create an RDD and convert it to a DataFrame, as shown below.
val rdd = sc.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
import spark.implicits._
val df = rdd.toDF("Id", "firstname")
Now I want to create a Dataset from df. How can I do that?
Answer 0 (score: 2)
Simply df.as[(Int, String)] will do the job. See the complete example below.
package com.examples

import org.apache.log4j.Level
import org.apache.spark.sql.{Dataset, SparkSession}

object SeqTuplesToDataSet {
  org.apache.log4j.Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName(this.getClass.getName).config("spark.master", "local").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    val rdd = spark.sparkContext.parallelize(Seq((1, "User1"), (2, "user2"), (3, "user3")))
    import spark.implicits._
    val df = rdd.toDF("Id", "firstname")
    // as[(Int, String)] turns the untyped DataFrame into a typed Dataset of tuples
    val myds: Dataset[(Int, String)] = df.as[(Int, String)]
    myds.show()
  }
}
Result:
+---+---------+
| Id|firstname|
+---+---------+
| 1| User1|
| 2| user2|
| 3| user3|
+---+---------+
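If you prefer named fields over a tuple, a case class works the same way. A minimal sketch, assuming a hypothetical User case class whose field names match the DataFrame column names (the case class must be defined at top level, outside main, so Spark can derive an Encoder for it):

```scala
// Hypothetical case class for illustration; field names must match the columns "Id" and "firstname"
case class User(Id: Int, firstname: String)

// Inside main, after building df as above:
import spark.implicits._
val userDs: Dataset[User] = df.as[User]
userDs.show()

// Alternatively, skip the RDD/DataFrame step and build the typed Dataset directly:
val directDs: Dataset[User] =
  Seq(User(1, "User1"), User(2, "user2"), User(3, "user3")).toDS()
```

Column-to-field matching is by name, not position, so df.as[User] fails at analysis time if a field name has no corresponding column.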