PySpark Streaming 1.5.0 with Kinesis: jar missing

Time: 2015-10-22 11:57:50

Tags: python apache-spark pyspark spark-streaming amazon-kinesis

I am using EMR (release EMR-4.1.0), which includes Spark 1.5.0.

I am trying to consume data from Kinesis with Spark Streaming (Python), using the sample code from GitHub (https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py).

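For some reason, I get an error saying the Spark Streaming Kinesis jar is not available, even though I can see it in /usr/lib/spark/extras/lib alongside all the other streaming jars. (See attachment.)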

-----------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-477c4a5455a1> in <module>()
     86     regionName= 'eu-west-1'
     87     lines = KinesisUtils.createStream(
---> 88         ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, 2)
     89 
     90     words.foreachRDD(process)

/usr/lib/spark/python/pyspark/streaming/kinesis.py in createStream(ssc, kinesisAppName, streamName, endpointUrl, regionName, initialPositionInStream, checkpointInterval, storageLevel, awsAccessKeyId, awsSecretKey, decoder)
     85             if 'ClassNotFoundException' in str(e.java_exception):
     86                 KinesisUtils._printErrorMsg(ssc.sparkContext)
---> 87             raise e
     88         stream = DStream(jstream, ssc, NoOpSerializer())
     89         return stream.map(lambda v: decoder(v))

Py4JJavaError: An error occurred while calling o35.loadClass.
: java.lang.ClassNotFoundException: org.apache.spark.streaming.kinesis.KinesisUtilsPythonHelper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

When I try to add the jar (which I downloaded from Maven) to spark-submit (spark-submit --jars

I get the following error:

"Must specify a primary resource (JAR or Python or R file)"

Is there a way to solve this?

Thanks,

1 Answer:

Answer 0 (score: 1)

/usr/bin/spark-submit --jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar may work better, since AFAIK that path links to the latest version of the jar.
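
The "Must specify a primary resource" message most likely means spark-submit was given options such as --jars but no application file after them. The sketch below assembles a command line that puts the script after the options. The jar path comes from the answer above; the script name, the helper `build_submit_command`, and the example arguments (app name, stream name, endpoint URL) are hypothetical placeholders, not values from the question.

```python
import shlex

# Jar path as given in the answer; assumed to exist on the EMR master node.
KINESIS_JAR = "/usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar"


def build_submit_command(script, script_args=(), jars=(KINESIS_JAR,)):
    """Return a spark-submit argv list with options placed before the script.

    `script` is the primary resource (here, a Python file). Leaving it out
    is what triggers the "Must specify a primary resource" error.
    """
    if not script:
        raise ValueError("a primary resource (JAR, Python or R file) is required")
    cmd = ["/usr/bin/spark-submit"]
    if jars:
        # --jars takes a single comma-separated list of jar paths.
        cmd += ["--jars", ",".join(jars)]
    cmd.append(script)        # primary resource comes after all options
    cmd.extend(script_args)   # everything after it is passed to the script
    return cmd


# Hypothetical arguments, for illustration only.
cmd = build_submit_command(
    "kinesis_wordcount_asl.py",
    ["myAppName", "myStreamName", "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can be pasted into a shell on the cluster, or the list can be passed to `subprocess.call` directly. Note that this only affects jobs launched through spark-submit; a notebook session that is already running (as in the traceback) would need the jar on its classpath when the session starts.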