KafkaUtils.createStream stops capturing data after a while

Date: 2018-08-29 06:12:34

Tags: apache-spark pyspark apache-kafka spark-streaming

I have built a Kafka consumer that takes data from Kafka and writes it to Elasticsearch. The program runs as expected for a day or two, and then Spark stops capturing data. Kafka logs are still being generated and the Spark streaming job is still running, but no data is captured. Below is the code used:

# For Spark
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

# For Kafka
from pyspark.streaming.kafka import KafkaUtils

# Name of Spark App
conf = SparkConf().setAppName("test_topic")

# Spark and Spark streaming configuration
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

# Kafka endpoints
zkQuorum = '192.0.23.1:2181'
topic = 'test_topic'


# Elastic Search write endpoint
es_write_conf = {
    "es.nodes" : "192.000.0.1",
    "es.port" : "9200",
    "es.resource" : "test_index/test_type",
    "es.input.json": "true",
    "es.nodes.ingest.only": "true"
}


# Create a Kafka stream
kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "cyd-pcs-bro-streaming-consumer", {topic: 1})

# Print stream to console
kafkaStream_json = kafkaStream.map(lambda x: x[1])
kafkaStream_json.pprint()

# Write stream to Elasticsearch
kafkaStream.foreachRDD(lambda rdd: rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)
)


# Start the stream and keep it running unless terminated
ssc.start()
ssc.awaitTermination()
  1. Is there anything else my code needs to be doing, or a way to dig deeper into the problem (the logs don't indicate anything)? A listener sketch follows the log output below.
  2. Also, given that I can have one Spark app per topic, is there any other reason for me to use KafkaUtils.createDirectStream? I would rather not manage offsets myself. (A direct-stream sketch follows this list.)
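For reference, a minimal sketch of the direct-stream alternative mentioned in question 2. The broker address and port (192.0.23.1:9092) and the checkpoint directory are assumptions here; the direct API talks to the Kafka brokers rather than to ZooKeeper. With checkpointing enabled, Spark records the consumed offsets itself, so no manual offset management is needed:

# For Spark
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

# For Kafka (direct API, no receiver)
from pyspark.streaming.kafka import KafkaUtils

checkpoint_dir = '/tmp/test_topic_checkpoint'  # hypothetical path

def create_context():
    # Only called when no checkpoint exists yet
    sc = SparkContext(conf=SparkConf().setAppName("test_topic"))
    ssc = StreamingContext(sc, 1)

    # The direct stream asks the brokers for offsets each batch;
    # the broker address/port is an assumption
    directStream = KafkaUtils.createDirectStream(
        ssc,
        ['test_topic'],
        {"metadata.broker.list": "192.0.23.1:9092"})

    directStream.map(lambda x: x[1]).pprint()

    # Offsets are stored in the checkpoint, so a restart resumes
    # where the previous run stopped
    ssc.checkpoint(checkpoint_dir)
    return ssc

ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()

Because there is no receiver in this model, a dead receiver cannot silently stall the stream; a connection problem surfaces as a failed batch in the logs instead.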

Language used: PySpark

The code is run with:

sudo $SPARK_HOME/spark-submit --master local[2] --jars /home/user/jars/elasticsearch-hadoop-6.3.2.jar,/home/user/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar /home/user/code/test_stream.py

This is the output of the stream when no data is being captured:

-------------------------------------------
Time: 2018-08-29 12:23:46
-------------------------------------------

18/08/29 12:23:46 INFO JobScheduler: Finished job streaming job 1535525626000 ms.0 from job set of time 1535525626000 ms
18/08/29 12:23:46 INFO JobScheduler: Total delay: 0.030 s for time 1535525626000 ms (execution: 0.007 s)
18/08/29 12:23:46 INFO PythonRDD: Removing RDD 115 from persistence list
18/08/29 12:23:46 INFO BlockManager: Removing RDD 115
18/08/29 12:23:46 INFO BlockRDD: Removing RDD 114 from persistence list
18/08/29 12:23:46 INFO BlockManager: Removing RDD 114
18/08/29 12:23:46 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[114] at createStream at NativeMethodAccessorImpl.java:0 of time 1535525626000 ms
18/08/29 12:23:46 INFO ReceivedBlockTracker: Deleting batches: 1535525624000 ms
18/08/29 12:23:46 INFO InputInfoTracker: remove old batch metadata: 1535525624000 ms
18/08/29 12:23:47 INFO JobScheduler: Added jobs for time 1535525627000 ms
18/08/29 12:23:47 INFO JobScheduler: Starting job streaming job 1535525627000 ms.0 from job set of time 1535525627000 ms
-------------------------------------------
Time: 2018-08-29 12:23:47
-------------------------------------------

18/08/29 12:23:47 INFO JobScheduler: Finished job streaming job 1535525627000 ms.0 from job set of time 1535525627000 ms
18/08/29 12:23:47 INFO JobScheduler: Total delay: 0.025 s for time 1535525627000 ms (execution: 0.005 s)
18/08/29 12:23:47 INFO PythonRDD: Removing RDD 117 from persistence list
18/08/29 12:23:47 INFO BlockRDD: Removing RDD 116 from persistence list
18/08/29 12:23:47 INFO BlockManager: Removing RDD 117
18/08/29 12:23:47 INFO BlockManager: Removing RDD 116
18/08/29 12:23:47 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[116] at createStream at NativeMethodAccessorImpl.java:0 of time 1535525627000 ms
18/08/29 12:23:47 INFO ReceivedBlockTracker: Deleting batches: 1535525625000 ms
18/08/29 12:23:47 INFO InputInfoTracker: remove old batch metadata: 1535525625000 ms
18/08/29 12:23:48 INFO JobScheduler: Added jobs for time 1535525628000 ms
18/08/29 12:23:48 INFO JobScheduler: Starting job streaming job 1535525628000 ms.0 from job set of time 1535525628000 ms
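One way to dig deeper than the INFO log above is to register a StreamingListener before ssc.start() and log the per-batch record counts and receiver lifecycle events: a long run of zero-record batches while Kafka is still producing, or a ReceiverError/ReceiverStopped event, points at the receiver rather than at the processing logic. A minimal sketch; note that the callbacks receive Java objects through Py4J, so the fields below are accessed as method calls:

from pyspark.streaming.listener import StreamingListener

class StreamHealthListener(StreamingListener):

    def onBatchCompleted(self, batchCompleted):
        # numRecords() stays at 0 for every batch once the stream goes quiet
        info = batchCompleted.batchInfo()
        print("Batch %s: %d records" % (info.batchTime(), info.numRecords()))

    def onReceiverError(self, receiverError):
        print("Receiver error: %s" % receiverError.receiverInfo().lastErrorMessage())

    def onReceiverStopped(self, receiverStopped):
        # A stopped receiver is the usual cause of a stream that silently goes quiet
        print("Receiver stopped: stream %d" % receiverStopped.receiverInfo().streamId())

# Register before ssc.start()
ssc.addStreamingListener(StreamHealthListener())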

0 Answers