Why does my Spark Streaming application shut down immediately (rather than processing any Kafka records)?

Asked: 2016-11-26 22:19:02

Tags: apache-spark pyspark apache-kafka spark-streaming

I created a Spark application in Python, following the example described in the Spark Streaming + Kafka Integration Guide (Kafka broker version 0.8.2.1 or higher), to stream Kafka messages with Apache Spark, but it shuts down before I get a chance to send any messages.

Here is where the shutdown portion of the output begins:

16/11/26 17:11:06 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 1********6, 58045)
16/11/26 17:11:06 INFO VerifiableProperties: Verifying properties
16/11/26 17:11:06 INFO VerifiableProperties: Property group.id is overridden to 
16/11/26 17:11:06 INFO VerifiableProperties: Property zookeeper.connect is overridden to 
16/11/26 17:11:07 INFO SparkContext: Invoking stop() from shutdown hook
16/11/26 17:11:07 INFO SparkUI: Stopped Spark web UI at http://192.168.1.16:4040
16/11/26 17:11:07 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/26 17:11:07 INFO MemoryStore: MemoryStore cleared
16/11/26 17:11:07 INFO BlockManager: BlockManager stopped
16/11/26 17:11:07 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/26 17:11:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/26 17:11:07 INFO SparkContext: Successfully stopped SparkContext
16/11/26 17:11:07 INFO ShutdownHookManager: Shutdown hook called
16/11/26 17:11:07 INFO ShutdownHookManager: Deleting directory /private/var/folders/yn/t3pvrk7s231_11ff2lqr4jhr0000gn/T/spark-1876feee-9b71-413e-a505-99c414aafabf/pyspark-1d97c3dd-0889-42ed-b559-d0fd473faa22
16/11/26 17:11:07 INFO ShutdownHookManager: Deleting directory /private/var/folders/yn/t3pvrk7s231_11ff2lqr4jhr0000gn/T/spark-1876feee-9b71-413e-a505-99c414aafabf

Is there a way to tell it to wait, or is there something I'm missing?

Full code:

from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "TwitterWordCount")
ssc = StreamingContext(sc, 1)

directKafkaStream = KafkaUtils.createDirectStream(ssc, ["next"], {"metadata.broker.list": "localhost:9092"})

offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("Printing! %s %s %s %s" % o.topic, o.partition, o.fromOffset, o.untilOffset)

directKafkaStream\
    .transform(storeOffsetRanges)\
    .foreachRDD(printOffsetRanges)

Here is the command used to run it, in case that's helpful:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 producer.py

2 Answers:

Answer 0 (score: 1):

You also need to start the streaming context. Take a look at this example: http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
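
Applied to the question's script, a minimal sketch of the fix might look like this (reusing the same topic, broker, and app name as above; pprint() is added here just to show each batch):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("local[2]", "TwitterWordCount")
ssc = StreamingContext(sc, 1)

# Same direct stream as in the question
directKafkaStream = KafkaUtils.createDirectStream(
    ssc, ["next"], {"metadata.broker.list": "localhost:9092"})
directKafkaStream.pprint()  # print a sample of each batch's records

ssc.start()             # start the streaming computation
ssc.awaitTermination()  # block the driver; without this the app shuts down immediately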

Answer 1 (score: 0):

For Scala, when submitting to YARN in cluster mode, I had to use awaitAnyTermination:

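As a rough PySpark equivalent of that idea (the answer refers to Scala, but the same call exists in PySpark's Structured Streaming API), the sketch below assumes a SparkSession, a Kafka source on localhost:9092 reading the "next" topic, and the spark-sql-kafka package on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StructuredKafkaSketch").getOrCreate()

# Structured Streaming source (assumed settings, mirroring the question)
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "next")
      .load())

query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())

# Block until any active streaming query terminates; with multiple queries
# in one application, awaitAnyTermination() is used instead of awaitTermination().
spark.streams.awaitAnyTermination()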

This is more or less documented in the Structured Streaming Guide.