How to convert a Kafka stream to a Spark RDD or Spark DataFrame

Asked: 2016-02-03 13:05:30

Tags: scala apache-spark apache-kafka spark-streaming

I am trying to load data from Kafka. Loading succeeds, but I cannot convert the result to a Spark RDD:

val kafkaParams = Map("metadata.broker.list" -> "IP:6667,IP:6667")
val offsetRanges = Array(
  OffsetRange("first_topic", 0, 1, 1000)
)
val ssc = new StreamingContext(new SparkConf, Seconds(60))
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

Now, how do I read from this stream object? I mean, how do I convert it to a Spark DataFrame and run some computations on it?

I have tried converting it to a DataFrame:

stream.foreachRDD { rdd =>
  println("Hello")
  import sqlContext.implicits._
  val dataFrame = rdd.map { case (key, value) => Row(key, value) }.toDf()
}

But toDf does not work. Error: value toDf is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
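Two things are going on here: the method is spelled `toDF` (capital F), and even with the right spelling it is not defined on an `RDD[Row]` — the implicit conversion only covers RDDs of case classes and tuples. An `RDD[Row]` needs an explicit schema via `createDataFrame`. A minimal sketch of one fix, continuing from the snippet above and assuming `stream` (a DStream of string key/value pairs) and `sqlContext` are in scope:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

stream.foreachRDD { rdd =>
  // RDD[Row] has no toDF; pair the rows with an explicit schema instead
  val schema = StructType(Seq(
    StructField("key", StringType, nullable = true),
    StructField("value", StringType, nullable = true)
  ))
  val rowRDD = rdd.map { case (key, value) => Row(key, value) }
  val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
  dataFrame.show()
}
```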

2 Answers:

Answer 0 (score: 1)

It's an old question, but I think you forgot to add the schema when creating the DataFrame from Rows:

val df = sc.parallelize(List(1, 2, 3)).toDF("a")
val someRDD = df.rdd
val newDF = spark.createDataFrame(someRDD, df.schema)

(tested in spark-shell 2.2.0)
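For the question's particular case there is also a way to sidestep `Row` and the explicit schema entirely: map the stream's `(key, value)` pairs to plain tuples, on which `toDF` *is* available through the implicits. A sketch, assuming the question's `stream` and `sqlContext` are in scope:

```scala
import sqlContext.implicits._

stream.foreachRDD { rdd =>
  // rdd is an RDD[(String, String)] — an RDD of tuples — so toDF works here
  val dataFrame = rdd.toDF("key", "value")
  dataFrame.show()
}
```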

Answer 1 (score: 1)

val kafkaParams = Map("metadata.broker.list" -> "IP:6667,IP:6667")
val offsetRanges = Array(
  OffsetRange("first_topic", 0, 1, 1000)
)
val ssc = new StreamingContext(new SparkConf, Seconds(60))
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

// This createDirectStream overload yields a DStream of (key, value) tuples,
// so extract the message body with _._2 rather than a .value accessor
val lines = stream.map(_._2)
val words = lines.flatMap(_.split(" ")).print()   // def createDataFrame(words: RDD[Row], schema: StructType)

// Then start the computation and wait for it to finish
ssc.start()
ssc.awaitTermination()
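The commented-out `createDataFrame(words: RDD[Row], schema: StructType)` signature hints at the remaining step. A sketch of wiring the word stream into a DataFrame per batch, assuming an SQLContext built from the StreamingContext's SparkContext (and replacing the `.print()` call, which returns Unit):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
val schema = StructType(Seq(StructField("word", StringType, nullable = true)))

lines.flatMap(_.split(" ")).foreachRDD { rdd =>
  // Wrap each word in a Row and attach the one-column schema
  val wordsDF = sqlContext.createDataFrame(rdd.map(Row(_)), schema)
  wordsDF.show()
}
```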