Question

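I found two ways to consume a Kafka topic in Spark Streaming (Spark 2.0):

1) Use KafkaUtils.createDirectStream to get a DStream with a new batch every k seconds; see this document

2) Use kafka: sqlContext.read.format("json").stream("kafka://KAFKA_HOST") to create an unbounded DataFrame with Structured Streaming, a new feature in Spark 2.0; the related documentation is here

Method 1) works, but 2) does not; I get the following error: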

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.DataFrameReader.stream(Ljava/lang/String;)Lorg/apache/spark/sql/Dataset;
...
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:729)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

My questions are:
What does "kafka://KAFKA_HOST" refer to? How can I fix this problem?

Thanks in advance!

Answer 1

Spark 2.0 does not yet support Kafka as a source for infinite DataFrames/Datasets. Support is planned to be added in 2.1.
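
As for the NoSuchMethodError itself: DataFrameReader.stream(...) belongs to the Spark 2.0 preview API, and the final 2.0.0 release moved streaming reads to spark.readStream followed by load(). Below is a minimal sketch of that form using the file-based JSON source rather than Kafka. The directory /tmp/json-input, the one-column schema, and the object name are assumptions made up for illustration; none of them come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

object ReadStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadStreamSketch")
      .getOrCreate()

    // Hypothetical schema: by default, streaming file sources need one declared up front.
    val schema = new StructType().add("message", StringType)

    // readStream replaces the preview-era read.format(...).stream(...).
    val df = spark.readStream
      .format("json")
      .schema(schema)
      .load("/tmp/json-input") // hypothetical directory, not from the question

    // Print each micro-batch to stdout and block until the query stops.
    df.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```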

Edit: (6.12.2016)

Kafka 0.10 is now experimentally supported in Spark 2.0.2:

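// Assumes `spark` is an existing SparkSession; needed for .as[(String, String)] below.
import spark.implicits._

// Subscribe to a single topic through the Kafka source.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// Kafka keys and values arrive as binary, so cast both to strings.
ds1
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]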
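
The snippet above only defines the stream; nothing runs until a query is started on it. Here is a rough sketch of a complete application that prints the messages to the console, assuming the job is submitted with the Kafka source on the classpath (for example via --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.0.2). The object name and the choice of the console sink are illustrative assumptions; the host and topic values are the same placeholders as above.

```scala
import org.apache.spark.sql.SparkSession

object KafkaStructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaStructuredStreamingSketch")
      .getOrCreate()
    import spark.implicits._ // encoders for .as[(String, String)]

    // Placeholder brokers and topic; replace with real values.
    val ds1 = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")
      .load()

    // Kafka keys and values arrive as binary, so cast both to strings.
    val messages = ds1
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .as[(String, String)]

    // Console sink chosen only for illustration; append mode suits a
    // query without aggregations.
    val query = messages.writeStream
      .format("console")
      .outputMode("append")
      .start()

    // Block the driver until the streaming query is stopped or fails.
    query.awaitTermination()
  }
}
```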
