How do I use an external (custom) package in pyspark?

Date: 2017-11-03 17:28:29

Tags: apache-spark pyspark yarn

I am trying to reproduce the solution given here https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_python.html for importing an external package in pyspark, but it fails.

My code:

spark_distro.py

from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external.fun(x)

conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_my_special_package(x)).collect()

external_package.py

class external:

    def __init__(self, val):
        self.val = val

    def fun(self, val):
        return self.val * 3

spark-submit command:

spark-submit \
   --master yarn \
  /path to script/spark_distro.py  \
  --py-files /path to script/external_package.py \
  1000

Expected output:

[3, 6, 9, 12]

Actual error:
  vs = list(itertools.islice(iterator, batch))
  File "/home/gsurapur/pyspark_examples/spark_distro.py", line 13, in <lambda>
  File "/home/gsurapur/pyspark_examples/spark_distro.py", line 6, in import_my_special_package
ImportError: No module named external_package

I also tried the sc.addPyFile option and ran into the same problem. Please help me find the issue. Thanks in advance.

2 Answers:

Answer 0 (score: 3)

I know that, in hindsight, this sounds silly, but the order of the arguments to spark-submit is generally not interchangeable: all Spark-related arguments, including --py-files, must come before the script to be executed:

# your case:
spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro.py --py-files /home/ctsats/scripts/SO/external_package.py
[...]
ImportError: No module named external_package

# correct usage:
spark-submit --master yarn-client --py-files /home/ctsats/scripts/SO/external_package.py /home/ctsats/scripts/SO/spark_distro.py
[...]
[3, 6, 9, 12]
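
The reason the first invocation fails silently is that spark-submit stops parsing its own options at the application script: everything that follows is handed to the script itself via sys.argv, so a misplaced --py-files is never processed and the dependency is never shipped to the executors. A minimal sketch that makes this visible (check_args.py is a hypothetical helper, not one of the original scripts):

# check_args.py -- hypothetical helper illustrating why argument order matters:
# anything placed after the application script is delivered to the script as a
# plain argument instead of being parsed by spark-submit itself.
import sys

from pyspark import SparkContext

sc = SparkContext()
# Submitted with --py-files placed after the script, this prints something like
# ['check_args.py', '--py-files', 'external_package.py', '1000'],
# i.e. --py-files is swallowed as a script argument and never takes effect.
print(sys.argv)
sc.stop()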

Tested with the scripts modified as follows:

spark_distro.py

from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external(x)

conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
print(int_rdd.map(lambda x: import_my_special_package(x)).collect())

external_package.py

def external(x):
    return x * 3

Arguably, these modifications do not change the essence of the question...
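
In case it is not obvious why the simplification helps: in the original external_package.py, fun is an instance method, so external.fun(x) in spark_distro.py would not work as written even once the import succeeds. A small sketch of how the class-based form could be kept instead, assuming the intent is simply to triple the input (this is not part of the original question or answer):

# external_package.py -- sketch keeping the class form; fun is made a
# staticmethod so that external.fun(x) works without creating an instance.
class external:

    @staticmethod
    def fun(x):
        return x * 3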

Answer 1 (score: 2)

And here is the sc.addPyFile case:

spark_distro2.py

from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external(x)

conf = SparkConf()
sc = SparkContext()
sc.addPyFile("/home/ctsats/scripts/SO/external_package.py") # added
int_rdd = sc.parallelize([1, 2, 3, 4])
print(int_rdd.map(lambda x: import_my_special_package(x)).collect())

Test:

spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro2.py
[...]
[3, 6, 9, 12]
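
As a side note, both --py-files and sc.addPyFile also accept .zip and .egg archives, which is handy when the dependency is a whole package rather than a single module. A rough sketch, using a hypothetical package directory external_pkg/ (with a module external_module.py) and a hypothetical driver spark_distro3.py, neither of which comes from the question:

# spark_distro3.py -- hypothetical driver sketching the archive-based approach.
#
# Package the dependency first (external_pkg/ must contain an __init__.py), e.g.:
#   zip -r external_pkg.zip external_pkg/
# then either submit with
#   spark-submit --master yarn-client --py-files external_pkg.zip spark_distro3.py
# or add the archive programmatically as below.
from pyspark import SparkContext

sc = SparkContext()
sc.addPyFile("external_pkg.zip")   # .py, .zip and .egg are all accepted

def triple(x):
    # import inside the function so it is resolved on the executors
    from external_pkg.external_module import external
    return external(x)

print(sc.parallelize([1, 2, 3, 4]).map(triple).collect())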