PySpark 2.4: Programmatically Adding Maven JAR Coordinates Stopped Working

Date: 2019-01-17 01:27:03

Tags: python maven apache-spark pyspark apache-kafka

Below is my PySpark startup snippet, which has been very reliable (I've used it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). That would normally trigger the dependency downloads (which Spark performs automatically):

import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as sFn
from pyspark.sql.types import *
from pyspark.sql.types import Row
  # ------------------------------------------
  # Note: Row() in .../pyspark/sql/types.py
  # isn't included in '__all__' list(), so
  # we must import it by name here.
  # ------------------------------------------

num_cpus = multiprocessing.cpu_count()        # Number of CPUs for SPARK Local mode.
os.environ.pop('SPARK_MASTER_HOST', None)     # Since we're using pip/pySpark these three ENVs
os.environ.pop('SPARK_MASTER_PORT', None)     # aren't needed; and we ensure pySpark doesn't
os.environ.pop('SPARK_HOME',        None)     # get confused by them, should they be set.
os.environ.pop('PYTHONSTARTUP',     None)     # Just in case pySpark 2.x attempts to read this.
os.environ['PYSPARK_PYTHON'] = sys.executable # Make SPARK Workers use same Python as Master.
os.environ['JAVA_HOME'] = '/usr/lib/jvm/jre'  # Oracle JAVA for our pip/python3/pySpark 2.4 (CDH's JRE won't work).
JARS_IVE_REPO = '/home/jdoe/SPARK.JARS.REPO.d/'

# ======================================================================
# Maven Coordinates for JARs (and their dependencies) needed to plug
# extra functionality into Spark 2.x (e.g. Kafka SQL and Streaming)
# A one-time internet connection is necessary for Spark to automatically
# download JARs specified by the coordinates (and dependencies).
# ======================================================================
spark_jars_packages = ','.join(['org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.0',
                                'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0',])
# ======================================================================
spark_conf = SparkConf()
spark_conf.setAll([('spark.master', 'local[{}]'.format(num_cpus)),
                   ('spark.app.name', 'myApp'),
                   ('spark.submit.deployMode', 'client'),
                   ('spark.ui.showConsoleProgress', 'true'),
                   ('spark.eventLog.enabled', 'false'),
                   ('spark.logConf', 'false'),
                   ('spark.jars.repositories', 'file://' + JARS_IVE_REPO),
                   ('spark.jars.ivy', JARS_IVE_REPO),
                   ('spark.jars.packages', spark_jars_packages), ])

spark_sesn            = SparkSession.builder.config(conf = spark_conf).getOrCreate()
spark_ctxt            = spark_sesn.sparkContext
spark_reader          = spark_sesn.read
spark_streamReader    = spark_sesn.readStream
spark_ctxt.setLogLevel("WARN")

However, when I ran the snippet (e.g. ./python -i init_spark.py), the plugins were not downloaded and/or loaded as they should have been.

This mechanism used to work, but then it stopped. What am I missing?

Thanks in advance!

1 Answer:

Answer 0 (score: 0)

In a post like this, the QUESTION is worth more than the ANSWER, because the code above works yet cannot be found anywhere in the Spark 2.x documentation or examples.

The above is how I programmatically add functionality to Spark 2.x via Maven coordinates. I had it working, but then it stopped working. Why?

When I ran the above code in a Jupyter notebook, the notebook had already, behind the scenes, run that very same snippet via my PYTHONSTARTUP script. That PYTHONSTARTUP script contains the same code as above, but omits the Maven coordinates (by intent).

Here, then, is how this subtle problem emerges:

spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()

Because a Spark session already existed, the statement above (.getOrCreate()) simply reused that existing session, which did not have the JARs/libraries loaded (again, because my PYTHONSTARTUP script intentionally omitted them). This is why it's a good idea to place print statements in PYTHONSTARTUP scripts (which are otherwise silent).
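The reuse semantics can be sketched with a minimal pure-Python stand-in (a hypothetical Session class, not PySpark itself): once an active instance exists, get_or_create() returns it and silently ignores any new configuration passed in.

```python
class Session:
    """Toy stand-in for SparkSession, illustrating getOrCreate() reuse."""
    _active = None  # module-wide singleton, like Spark's active session

    def __init__(self, conf):
        self.conf = conf

    @classmethod
    def get_or_create(cls, conf):
        # If a session already exists, return it unchanged; the new
        # conf (e.g. your Maven coordinates) is silently discarded.
        if cls._active is None:
            cls._active = cls(conf)
        return cls._active

first  = Session.get_or_create({'spark.jars.packages': ''})        # PYTHONSTARTUP's session
second = Session.get_or_create({'spark.jars.packages': 'kafka'})   # your snippet's "new" session

print(second is first)                          # True -- same object reused
print(second.conf['spark.jars.packages'])       # '' -- the coordinates never arrived
```

This is exactly why the notebook session ended up without the Kafka JARs: the second builder call never got the chance to apply its conf.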

In the end, I had simply forgotten to run $ unset PYTHONSTARTUP before starting the JupyterLab/Notebook daemon.
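Concretely, the fix is a one-liner in the shell that launches the daemon (the jupyter command below is whatever launcher you normally use):

```shell
# Clear PYTHONSTARTUP before starting the notebook daemon, so kernels
# don't silently replay a startup script that already creates a bare
# SparkSession:
unset PYTHONSTARTUP
echo "PYTHONSTARTUP=${PYTHONSTARTUP-<unset>}"   # prints: PYTHONSTARTUP=<unset>
# ...then launch the daemon as usual, e.g.:
#   jupyter lab
```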

I hope this question helps others, since it shows how to programmatically add functionality to Spark 2.x (Kafka in this case). Note that a one-time internet connection is needed to download the specified JARs and their recursive dependencies from Maven Central.
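As an aside, a commonly used alternative to setting spark.jars.packages in SparkConf is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark's launcher reads when it starts the JVM (so it must be set before the session is created). A minimal sketch, under the assumption that your session is built afterwards; the trailing pyspark-shell token is required by the launcher:

```python
import os

# Same Kafka coordinates as in the snippet above.
packages = ','.join([
    'org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.0',
    'org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0',
])

# Must be set BEFORE SparkSession/SparkContext creation, since the
# launcher reads it when forking the JVM.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages {} pyspark-shell'.format(packages)
```

Either mechanism suffers from the same getOrCreate() caveat: if a session already exists in the process, neither the conf nor the environment variable will retrofit JARs onto it.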