Spark cannot consume an AWS Kinesis stream

Time: 2019-02-15 13:23:48

Tags: apache-spark pyspark spark-streaming amazon-emr amazon-kinesis

Environment: EMR
Source: AWS Kinesis Stream
Language: PySpark

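I have an incoming AWS Kinesis stream, and I can consume it with plain Python, so the EMR cluster is able to reach the stream. When I try to consume it through PySpark Streaming, however, no records arrive and only log lines are printed. I am not applying any transformation; I am just trying to read the stream and print it. Can anyone point me in the right direction?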

from __future__ import print_function
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
appName = 'kinesis_myreal_time_stream'
streamName = 'kinesis_myreal_time_stream'
endpointUrl = 'apigateway.us-east-1.amazonaws.com'
regionName = 'us-east-1'
sc = SparkContext()
ssc = StreamingContext(sc, 10)
lines = KinesisUtils.createStream(ssc = ssc, kinesisAppName = appName, streamName = streamName,
                                  endpointUrl = endpointUrl, regionName = regionName,
                                  initialPositionInStream = InitialPositionInStream.LATEST, checkpointInterval = 2)
# counts = lines.flatMap(lambda line: line.split("}{")) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a+b)
# counts.pprint()
lines.pprint()
ssc.start()
ssc.awaitTermination()

I get the following logs (every batch is empty):

-------------------------------------------
Time: 2019-02-15 13:17:10
-------------------------------------------

19/02/15 13:17:10 INFO JobScheduler: Finished job streaming job 1550236630000 ms.0 from job set of time 1550236630000 ms
19/02/15 13:17:10 INFO PythonRDD: Removing RDD 59 from persistence list
19/02/15 13:17:10 INFO JobScheduler: Total delay: 0.014 s for time 1550236630000 ms (execution: 0.002 s)
19/02/15 13:17:10 INFO BlockManager: Removing RDD 59
19/02/15 13:17:10 INFO KinesisBackedBlockRDD: Removing RDD 58 from persistence list
19/02/15 13:17:10 INFO BlockManager: Removing RDD 58
19/02/15 13:17:10 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[58] at createStream at NativeMethodAccessorImpl.java:0 of time 1550236630000 ms
19/02/15 13:17:10 INFO ReceivedBlockTracker: Deleting batches: 1550236610000 ms
19/02/15 13:17:10 INFO InputInfoTracker: remove old batch metadata: 1550236610000 ms
19/02/15 13:17:20 INFO JobScheduler: Added jobs for time 1550236640000 ms
19/02/15 13:17:20 INFO JobScheduler: Starting job streaming job 1550236640000 ms.0 from job set of time 1550236640000 ms
-------------------------------------------
Time: 2019-02-15 13:17:20
-------------------------------------------

19/02/15 13:17:20 INFO JobScheduler: Finished job streaming job 1550236640000 ms.0 from job set of time 1550236640000 ms
19/02/15 13:17:20 INFO PythonRDD: Removing RDD 61 from persistence list
19/02/15 13:17:20 INFO JobScheduler: Total delay: 0.018 s for time 1550236640000 ms (execution: 0.001 s)
19/02/15 13:17:20 INFO BlockManager: Removing RDD 61
19/02/15 13:17:20 INFO KinesisBackedBlockRDD: Removing RDD 60 from persistence list
19/02/15 13:17:20 INFO BlockManager: Removing RDD 60
19/02/15 13:17:20 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[60] at createStream at NativeMethodAccessorImpl.java:0 of time 1550236640000 ms
19/02/15 13:17:20 INFO ReceivedBlockTracker: Deleting batches: 1550236620000 ms
19/02/15 13:17:20 INFO InputInfoTracker: remove old batch metadata: 1550236620000 ms
19/02/15 13:17:30 INFO JobScheduler: Added jobs for time 1550236650000 ms
19/02/15 13:17:30 INFO JobScheduler: Starting job streaming job 1550236650000 ms.0 from job set of time 1550236650000 ms
-------------------------------------------
Time: 2019-02-15 13:17:30
-------------------------------------------

1 answer:

Answer 0 (score: 0)

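I think you pasted the wrong endpoint URL into your application: you are passing the API Gateway service URL (apigateway.us-east-1.amazonaws.com) instead of a Kinesis one. I also don't think you always have to pass it explicitly.

It should look similar to the example in the Spark source: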

@param endpointUrl  Url of Kinesis service (e.g., https://kinesis.us-east-1.amazonaws.com)

https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala#L90
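
As a minimal sketch of the fix described above, the endpoint can be derived from the region instead of being typed by hand. The helpers `kinesis_endpoint` and `is_kinesis_endpoint` are hypothetical names introduced here for illustration, not part of Spark or the AWS SDK, and the URL pattern assumes the standard AWS partition (other partitions, such as the China regions, use a different domain suffix).

```python
from urllib.parse import urlparse


def kinesis_endpoint(region_name):
    """Build the regional Kinesis endpoint URL (assumes the standard AWS partition)."""
    return "https://kinesis.{}.amazonaws.com".format(region_name)


def is_kinesis_endpoint(url):
    """Rough sanity check: the host should be kinesis.<region>.amazonaws.com."""
    # urlparse only fills in the hostname when the URL carries a scheme,
    # so a bare host such as the one in the question is rejected here too.
    host = urlparse(url).hostname or ""
    return host.startswith("kinesis.") and host.endswith(".amazonaws.com")


regionName = "us-east-1"
endpointUrl = kinesis_endpoint(regionName)

# The value would then be passed to the same call used in the question:
# lines = KinesisUtils.createStream(
#     ssc=ssc, kinesisAppName=appName, streamName=streamName,
#     endpointUrl=endpointUrl, regionName=regionName,
#     initialPositionInStream=InitialPositionInStream.LATEST,
#     checkpointInterval=2)
```

Running the check against the value from the question, `is_kinesis_endpoint('apigateway.us-east-1.amazonaws.com')`, returns False, which is the mismatch this answer points at.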