How to include external Spark libraries when using PySpark in a Jupyter Notebook

Asked: 2018-06-29 15:53:42

Tags: python apache-spark pyspark jupyter-notebook jupyter

I am trying to run the following PySpark-Kafka streaming example in a Jupyter Notebook. This is the first part of the code I am using in the notebook:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(master='local[*]', appName="PySpark streaming")
ssc = StreamingContext(sc, 2)  # streaming context with a 2-second batch interval

topic = "my-topic"
brokers = "localhost:9092"
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

If I run the cell, I get the following error/instructions:

Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the
 spark-submit command as

$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
Then, include the jar in the spark-submit command as

$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

My question is: how can I pass the --jars or --packages parameters to the Jupyter Notebook? Alternatively, can I download the package and link it permanently to Python/Jupyter (perhaps via an export in .bashrc)?

1 Answer:

Answer 0 (score: 0):

There are at least two ways, roughly corresponding to the two options suggested in the error message:

The first way is to update your respective Jupyter kernel accordingly (if you are not already using Jupyter kernels for PySpark, you should be; see this answer for detailed general information on using kernels with PySpark in Jupyter).

More specifically, you should update your PySpark kernel's kernel.json configuration file with the following entry under env (modify accordingly if you are using something other than --master local):

"PYSPARK_SUBMIT_ARGS": "--master local --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 pyspark-shell"

The second way is to put the following entry in your spark-defaults.conf file:

spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0
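
If you do not have a spark-defaults.conf yet, it lives in the conf directory of your Spark installation; a common approach is to copy the shipped template and append the setting (a sketch assuming a standard Spark layout with SPARK_HOME set):

cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
echo "spark.jars.packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0" >> $SPARK_HOME/conf/spark-defaults.conf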

In both cases, you do not need to download anything manually; the first time you run PySpark with the updated configuration, the necessary files will be downloaded and placed in the appropriate directories.
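
As a quick sanity check from a fresh notebook (a minimal sketch, assuming one of the two configurations above is in place), you can verify that the package was picked up: --packages is translated into the spark.jars.packages setting, so it should show up in the SparkContext's configuration:

from pyspark import SparkContext

# Assumes the Kafka package has been configured via kernel.json or spark-defaults.conf
sc = SparkContext(master="local[*]", appName="config check")
print(sc.getConf().get("spark.jars.packages", "not set"))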