Question

I am trying to run a Spark job. This is my shell script, located at /home/full/path/to/file/shell/my_shell_script.sh:

confLocation=../conf/my_config_file.conf &&
executors=8 &&
memory=2G &&
entry_function=my_function_in_python &&
dos2unix $confLocation &&
spark-submit \
        --master yarn-client \
        --num-executors $executors \
        --executor-memory $memory \
        --py-files /home/full/path/to/file/python/my_python_file.py $entry_function $confLocation

When I run it, I get this error message:

Error: Cannot load main class from JAR file: /home/full/path/to/file/shell/my_function_in_python

My impression is that it is looking in the wrong place (the python file is in the python directory, not the shell directory).

Answer 1

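The --py-files flag is for additional python file dependencies used by your program. You can see in SparkSubmit.scala that it uses the so-called "primary argument", meaning the first non-flag argument, to decide whether to run in "submit jar file" mode or "submit python main" mode.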

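That is why you see it trying to load your "$entry_function" as a jar file that does not exist: it only assumes you are running Python if the primary argument ends with ".py", and otherwise defaults to assuming you have a .jar file.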
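To make the rule concrete, here is a simplified Python model of it. This is only an illustration, not Spark's actual Scala code: the helper names `find_primary` and `submit_mode` are hypothetical, and `VALUE_FLAGS` lists only the flags that appear in the question's command.

```python
# Simplified model of how the "primary argument" is picked and classified.
# Hypothetical helpers for illustration; Spark's real logic lives in Scala.

# Flags from the question's command that each consume the next token as a value.
VALUE_FLAGS = {"--master", "--num-executors", "--executor-memory", "--py-files"}


def find_primary(argv):
    """Return the first argument that is neither a flag nor a flag's value."""
    i = 0
    while i < len(argv):
        if argv[i] in VALUE_FLAGS:
            i += 2  # skip the flag and its value
        elif argv[i].startswith("--"):
            i += 1  # skip a flag that takes no value
        else:
            return argv[i]
    return None


def submit_mode(primary):
    """Python mode only when the primary argument ends with '.py'."""
    if primary is not None and primary.endswith(".py"):
        return "python"
    return "jar"


# The arguments from the question: --py-files swallows the .py file,
# so the function name becomes the primary argument.
asker_args = [
    "--master", "yarn-client",
    "--num-executors", "8",
    "--executor-memory", "2G",
    "--py-files", "/home/full/path/to/file/python/my_python_file.py",
    "my_function_in_python", "../conf/my_config_file.conf",
]
primary = find_primary(asker_args)
print(primary, "->", submit_mode(primary))
```

Under this model the .py file is consumed as the value of --py-files, so the function name is treated as a jar to load, which matches the error in the question.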

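Instead of using --py-files, just pass /home/full/path/to/file/python/my_python_file.py as the primary argument. Then you can either do some fancy python to take the "entry function" as a program argument, or simply call your entry function from the main function inside the python file itself.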
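One possible shape for the "take the entry function as a program argument" option is a small dispatch table inside my_python_file.py. This is a minimal sketch under assumptions: the body of `my_function_in_python` is a placeholder, and `ENTRY_POINTS` and `main` are hypothetical names, not something from the original script.

```python
import sys


def my_function_in_python(conf_location):
    """Placeholder for the real job; the actual body is not shown in the question."""
    return "running with " + conf_location


# Hypothetical registry mapping command-line names to callables.
ENTRY_POINTS = {
    "my_function_in_python": my_function_in_python,
}


def main(argv):
    """Dispatch to the entry function named in argv[1], passing argv[2] as the config path."""
    if len(argv) != 3:
        raise SystemExit("usage: my_python_file.py <entry_function> <conf_location>")
    entry_name, conf_location = argv[1], argv[2]
    if entry_name not in ENTRY_POINTS:
        raise SystemExit("unknown entry function: " + entry_name)
    return ENTRY_POINTS[entry_name](conf_location)
```

In the real file you would finish with the usual `if __name__ == "__main__":` guard that calls `main(sys.argv)`, so that `spark-submit ... my_python_file.py $entry_function $confLocation` reaches the named function. An explicit table is used here rather than looking names up in `globals()` so that only intended functions can be invoked from the command line.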

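Alternatively, you can still use --py-files, then create a new main .py file that calls your entry function, and pass that main .py file as the primary argument instead.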
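The argument order for this alternative can be sketched by building the command as a list in Python. This is only an illustration: `build_submit_command` is a hypothetical helper, and the wrapper path `/home/full/path/to/file/python/main.py` is an assumed name for the new main file, which is not part of the question.

```python
def build_submit_command(main_file, py_files, app_args,
                         master="yarn-client", executors=8, memory="2G"):
    """Assemble spark-submit arguments: flags first, then the main .py, then program args."""
    cmd = [
        "spark-submit",
        "--master", master,
        "--num-executors", str(executors),
        "--executor-memory", memory,
    ]
    if py_files:
        # Dependencies go in one comma-separated value, with no spaces.
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(main_file)   # primary argument: must end with .py for Python mode
    cmd.extend(app_args)    # everything after it is passed to the program
    return cmd


cmd = build_submit_command(
    main_file="/home/full/path/to/file/python/main.py",  # hypothetical wrapper file
    py_files=["/home/full/path/to/file/python/my_python_file.py"],
    app_args=["../conf/my_config_file.conf"],
)
print(" ".join(cmd))
```

The wrapper file itself would just import `my_python_file` and call the entry function. The resulting list could be handed to `subprocess.run(cmd)` if you prefer launching from Python rather than from the shell script.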

Answer 2

What worked for me was simply passing in the python file without the --py-files option. It looks like this:

confLocation=../conf/my_config_file.conf &&
executors=8 &&
memory=2G &&
entry_function=my_function_in_python &&
dos2unix $confLocation &&
spark-submit \
        --master yarn-client \
        --num-executors $executors \
        --executor-memory $memory \
        /home/full/path/to/file/python/my_python_file.py $entry_function $confLocation

Answer 3

When adding elements to --py-files, separate them with commas and do not leave any spaces. Try this:

confLocation=../conf/my_config_file.conf &&
executors=8 &&
memory=2G &&
entry_function=my_function_in_python &&
dos2unix $confLocation &&
spark-submit \
        --master yarn-client \
        --num-executors $executors \
        --executor-memory $memory \
        --py-files /home/full/path/to/file/python/my_python_file.py,$entry_function,$confLocation
