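I am using the graphframes package in pyspark. The job runs normally for a while (by then it has already used the graphframes module), but after some time I get the error "No module named 'graphframes'".
The error is intermittent: sometimes the job runs to completion, and sometimes it does not.
pyspark version: 2.2.1
graphframes: 0.6
Error: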
19/06/05 02:22:17 ERROR Executor: Exception in task 641.3 in stage 216.0 (TID 123244)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
command = serializer._read_with_length(file)
File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
return self.loads(obj)
File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
return pickle.loads(obj, encoding=encoding)
File "/data/data08/nm-local-dir/usercache/hduser0011/appcache/application_1547810698423_82435/container_1547810698423_82435_02_000041/ares_detect.zip/ares_detect/task/communication_detect.py", line 11, in <module>
from graphframes import GraphFrame
ModuleNotFoundError: No module named 'graphframes'
Command:
spark-submit --master yarn-cluster \
--name ad_com_detect_${app_arr[$i]}_${scenario_arr[$i]}_${txParameter_app_arr[$i]} \
--executor-cores 4 \
--num-executors 8 \
--executor-memory 35g \
--driver-memory 2g \
--conf spark.sql.shuffle.partitions=800 \
--conf spark.default.parallelism=1000 \
--conf spark.yarn.executor.memoryOverhead=2048 \
--conf spark.sql.execution.arrow.enabled=true \
--jars org.scala-lang_scala-reflect-2.10.4.jar,\
org.slf4j_slf4j-api-1.7.7.jar,\
com.typesafe.scala-logging_scala-logging-api_2.10-2.1.2.jar,\
com.typesafe.scala-logging_scala-logging-slf4j_2.10-2.1.2.jar,\
graphframes-0.6.0-spark2.2-s_2.11.jar \
--py-files ***.zip \
***/***/****.py &
When pyspark runs short of memory, does it delete these jars?
Answer 0 (score: 0)
Try adding the jar with the --packages option.
spark-submit \
--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 \
my_py_script.py
It also works with both options together:
spark-submit \
--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 \
--jars path_to_your_jars/graphframes-0.7.0-spark2.4-s_2.11.jar \
my_py_script.py
This solved the problem for me.
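Since the error only shows up on some tasks, it may help to check which executor hosts can actually see the module. The following is a minimal sketch, not a definitive diagnostic: probe_module and probe_partition are hypothetical helper names, and the usage comment assumes an existing SparkContext called sc.

```python
import importlib.util
import socket


def probe_module(module_name="graphframes"):
    """Return (hostname, found) for the Python process that runs this call."""
    try:
        found = importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or half-initialised packages.
        found = False
    return socket.gethostname(), found


def probe_partition(_rows):
    # mapPartitions expects an iterable back, so yield one result per partition.
    yield probe_module("graphframes")


# Usage on a live cluster (assumes an existing SparkContext named `sc`):
#   results = sc.parallelize(range(100), 100).mapPartitions(probe_partition).distinct().collect()
#   missing = [host for host, found in results if not found]
```

If only some hosts end up in missing, the module is probably reaching the Python path of some containers and not others, which would match an error that comes and goes.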
There are generally four options for adding files to Spark. For a description of each, see
spark-submit --help
--jars JARS Comma-separated list of jars to include on the driver and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
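The difference between --jars and --py-files matters here because a jar is a zip archive, and Python can import a package straight from a zip archive once that archive is on sys.path. The sketch below shows only that mechanism, using a throwaway archive. The names demo.jar and demo_pkg are made up for the illustration, and it is an assumption (not confirmed in the question) that the graphframes jar bundles its Python package in the same way.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway archive that mimics a jar bundling a Python package.
# "demo_pkg" and "demo.jar" are hypothetical names used only for this illustration.
workdir = tempfile.mkdtemp()
jar_path = os.path.join(workdir, "demo.jar")
with zipfile.ZipFile(jar_path, "w") as archive:
    archive.writestr("demo_pkg/__init__.py", "GREETING = 'hello from inside the jar'\n")

# Before the archive is on sys.path, the package cannot be found.
try:
    importlib.import_module("demo_pkg")
    importable_before = True
except ModuleNotFoundError:
    importable_before = False

# Putting the archive on sys.path is roughly what --py-files arranges on each worker.
sys.path.insert(0, jar_path)
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

print(importable_before)
print(demo_pkg.GREETING)
```

Under that assumption, a jar that is only on the JVM classpath would not be enough for a Python worker to import the package. The jar would also have to be on the Python path of every executor, for example by listing it under --py-files as well.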