Converting a HadoopRDD to a DataFrame

Date: 2016-05-07 23:24:56

Tags: scala apache-spark apache-spark-sql rdd spark-dataframe

In Spark on EMR, I have a HadoopRDD:

org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable)] = HadoopRDD[0] at hadoopRDD

I would like to convert it to a DataFrame (org.apache.spark.sql.DataFrame).

Does anyone know how to do this?

1 Answer:

Answer 0 (score: 2)

First convert it to a simple type. Suppose your DynamoDBItemWritable holds just a single string column:

val simple: RDD[(String, String)] = rdd.map {
  // Pattern-match each (Text, DynamoDBItemWritable) pair into plain Strings.
  // getString(0) is illustrative; extract the value however your writable exposes it.
  case (text, dbwritable) => (text.toString, dbwritable.getString(0))
}

Then you can use toDF to get a DataFrame:

import org.apache.spark.sql.DataFrame
// toDF is brought in by the SQLContext implicits (Spark 1.x API).
import sqlContext.implicits._
val df: DataFrame = simple.toDF()
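The map step above is an ordinary Scala pattern match, so it can be sanity-checked without a Spark cluster. Below is a minimal, Spark-free sketch of the same conversion, using StringBuilder as a stand-in for Hadoop's Text (both yield their content via toString) and a Map as a stand-in for DynamoDBItemWritable; the names raw and simple are illustrative:

```scala
// Stand-ins for (Text, DynamoDBItemWritable) pairs: the key needs a
// toString call, the value needs a lookup to extract one string column.
val raw: Seq[(StringBuilder, Map[String, String])] = Seq(
  (new StringBuilder("key1"), Map("col" -> "value1")),
  (new StringBuilder("key2"), Map("col" -> "value2"))
)

// Same shape as the rdd.map in the answer: pattern-match each pair
// down to a plain (String, String) tuple.
val simple: Seq[(String, String)] = raw.map {
  case (text, item) => (text.toString, item("col"))
}
```

Once every element is a plain tuple of simple types like this, toDF (or sqlContext.createDataFrame) can infer the schema.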