Why can't I run my Spark Python script with DataStax Enterprise?

Asked: 2016-02-18 08:42:21

Tags: python apache-spark pyspark datastax datastax-enterprise

This is my test code, and I simply can't figure out why I can't run it with DSE, while without DSE it doesn't seem to be a problem.

Here is my Python code:

from __future__ import print_function
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)

    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

When I run the code like this, it just doesn't work, and I don't understand why. Here is the error output from DSE:

# dse pyspark --jars /root/spark-streaming-kafka-assembly_2.10-1.4.1.jar /root/kafka_wordcount.py localhost:2181 wordcount
WARNING: Running python applications through 'pyspark' is deprecated as of Spark 1.0.
Use ./bin/spark-submit <python file>
Traceback (most recent call last):
  File "/root/kafka_wordcount.py", line 43, in <module>
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
  File "/usr/share/dse/resources/spark/python/lib/pyspark.zip/pyspark/context.py", line 113, in init
  File "/usr/share/dse/resources/spark/python/lib/pyspark.zip/pyspark/context.py", line 165, in _do_init
  File "/usr/share/dse/resources/spark/python/lib/pyspark.zip/pyspark/context.py", line 219, in _initialize_context
  File "/usr/share/dse/resources/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in call
  File "/usr/share/dse/resources/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.datastax.bdp.spark.DseSparkContext.apply.
: java.lang.ExceptionInInitializerError
 at org.apache.spark.util.Utils$.createTempDir(Utils.scala:225)
 at org.apache.spark.util.Utils$$anonfun$getOrCreateLocalRootDirsImpl$2.apply(Utils.scala:653)
 at org.apache.spark.util.Utils$$anonfun$getOrCreateLocalRootDirsImpl$2.apply(Utils.scala:649)
 at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getOrCreateLocalRootDirsImpl(Utils.scala:649)
 at org.apache.spark.util.Utils$.getOrCreateLocalRootDirs(Utils.scala:626)
 at org.apache.spark.storage.DiskBlockManager.createLocalDirs(DiskBlockManager.scala:128)
 at org.apache.spark.storage.DiskBlockManager.<init>(DiskBlockManager.scala:45)
 at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:75)
 at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:173)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:338)
 at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188)
 at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
 at com.datastax.bdp.spark.DseSparkContext$.apply(DseSparkContext.scala:42)
 at com.datastax.bdp.spark.DseSparkContext.apply(DseSparkContext.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
 at py4j.Gateway.invoke(Gateway.java:259)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY
 at java.lang.Class.getField(Class.java:1703)
 at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:222)
 at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
 at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
 at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
 at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
 at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
 ... 32 more

Edit:

After applying your suggestion, I now get a NoClassDefFoundError:

java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
 at java.lang.Class.privateGetPublicMethods(Class.java:2902)
 at java.lang.Class.getMethods(Class.java:1615)
 at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:365)
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:317)
 at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
 at py4j.Gateway.invoke(Gateway.java:251)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: kafka.common.TopicAndPartition
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 12 more

Thanks for your help!

1 Answer:

Answer 0 (score: 2):

At its core, the error you are seeing is:

Caused by: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY

This tells us that Spark has found Hadoop 2 libraries on the classpath. DSE 4.8 does not support Hadoop 2; it ships only Hadoop 1 libraries.

Your use of Kafka makes me suspect you have the spark-streaming-kafka assembly on the classpath somewhere:

--jars /root/spark-streaming-kafka-assembly_2.10-1.4.1.jar

That assembly jar bundles Hadoop 2 libs, which causes the classpath problem. Try the non-assembly spark-streaming-kafka jar instead and everything should be fine:

http://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka_2.10/1.4.1
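
For reference, an invocation could then look roughly like the sketch below. The jar names, versions, and paths here are assumptions and need to match your own Spark and Kafka builds; since the non-assembly jar does not bundle the Kafka client classes (hence the kafka.common.TopicAndPartition error in your edit), the Kafka broker jar and its metrics-core dependency would typically need to be passed on --jars as well:

# dse spark-submit --jars /root/spark-streaming-kafka_2.10-1.4.1.jar,/root/kafka_2.10-0.8.2.1.jar,/root/metrics-core-2.2.0.jar /root/kafka_wordcount.py localhost:2181 wordcount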