Converting an RDD to a DataFrame in Spark 2.0

Date: 2016-11-16 19:24:44

Tags: apache-spark apache-spark-sql spark-dataframe

I am trying to convert an RDD to a DataFrame in Spark 2.0:

val conf = new SparkConf().setAppName("dataframes").setMaster("local")
val sc = new SparkContext(conf)
val sqlCon = new SQLContext(sc)
import sqlCon.implicits._

val rdd = sc.textFile("/home/cloudera/alpha.dat").persist()
val row = rdd.first()
val data = rdd.filter { x => !x.contains(row) }

data.foreach { x => println(x) }


case class person(name: String, age: Int, city: String)
val rdd2 = data.map { x => x.split(",") }
val rdd3 = rdd2.map { x => person(x(0), x(1).toInt, x(2)) }
val df = rdd3.toDF()

df.printSchema()
df.registerTempTable("alpha")
val df1 = sqlCon.sql("select * from alpha")
df1.foreach { x => println(x) }

But I get the following error at toDF() ---> "val df = rdd3.toDF()":

Multiple markers at this line:
- Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case 
 classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
- Implicit conversion found: rdd3 ⇒ rddToDatasetHolder(rdd3): (implicit evidence$4: 
 org.apache.spark.sql.Encoder[person])org.apache.spark.sql.DatasetHolder[person]

How do I use toDF() to convert the above into a DataFrame?

2 Answers:

Answer 0 (score: 1)

Cloudera & Spark 2.0? Hmm, I don't think we support that :)

Anyway, first of all, you don't need to call .persist() on your RDD, so you can remove that bit. Secondly, since Person is a case class, you should capitalize its name.

Finally, in Spark 2.0 you no longer call import sqlContext.implicits._ to implicitly build the DataFrame schema; you now call import spark.implicits._ (where spark is a SparkSession). Your error message hints at this.
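Putting those fixes together, a minimal sketch of the Spark 2.0 pattern might look like the following. The file path and column layout come from the question; the object name `ToDfSketch` is made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Capitalized case class, defined at the top level (not inside main),
// so Spark can find an implicit Encoder for it.
case class Person(name: String, age: Int, city: String)

object ToDfSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the Spark 2.0 entry point, replacing SQLContext
    val spark = SparkSession.builder()
      .master("local")
      .appName("dataframes")
      .getOrCreate()
    import spark.implicits._ // brings toDF() into scope

    val rdd = spark.sparkContext.textFile("/home/cloudera/alpha.dat")
    val header = rdd.first()
    val df = rdd.filter(_ != header)  // drop the header line
      .map(_.split(","))
      .map(a => Person(a(0), a(1).toInt, a(2)))
      .toDF()

    df.printSchema()
    spark.stop()
  }
}
```

Note that comparing each line against the header with `_ != header` is a slightly tighter version of the question's `!x.contains(row)` filter.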

Answer 1 (score: 0)

It was a simple mistake: I had defined the case class inside the main method. After moving it out, I was able to convert the RDD to a DataFrame.

package sparksql

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object asw {

  case class Person(name: String, age: Int, city: String)

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local").setAppName("Dataframe")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val rdd1 = sc.textFile("/home/cloudera/alpha.dat")
    val row = rdd1.first()
    val data = rdd1.filter { x => !x.contains(row) }
    val rdd2 = data.map { x => x.split(",") }
    val df = rdd2.map { x => Person(x(0), x(1).toInt, x(2)) }.toDF()
    df.createOrReplaceTempView("rdd21")
    spark.sql("select * from rdd21").show()
  }
}