I am trying to submit a Python script on Spark that needs to use pos_tag, but whenever I do, I get this error:
File "/hdata/dev/sdb1/hadoop/yarn/local/usercache/harshdee/appcache/application_1551632819863_0554/container_e36_1551632819863_0554_01_000009/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hdata/dev/sdb1/hadoop/yarn/local/usercache/harshdee/appcache/application_1551632819863_0554/container_e36_1551632819863_0554_01_000009/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hdata/dev/sdb1/hadoop/yarn/local/usercache/harshdee/appcache/application_1551632819863_0554/container_e36_1551632819863_0554_01_000009/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named nltk.tag
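For context, the failing call is essentially this pattern (a minimal, self-contained sketch; tag_partition and the sample input are placeholders, not my actual script):

from pyspark import SparkContext
import nltk
from nltk.tag import pos_tag  # the import that fails on the executors

def tag_partition(texts):
    # Runs on the executors, so the worker Python must be able to
    # import nltk and locate the tokenizer/tagger models in nltk_data.
    nltk.data.path.append("./")  # point NLTK at the unpacked archives
    for text in texts:
        yield pos_tag(nltk.word_tokenize(text))

sc = SparkContext()
print(sc.parallelize(["A simple sentence to tag."]).mapPartitions(tag_partition).collect())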
I am running it with the following command:
spark-submit --master yarn --driver-memory 32G --num-executors 20 --executor-memory 16G --executor-cores 6 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python --conf spark.executorEnv.PYTHON_EGG_CACHE="./.python-eggs/" --conf spark.executorEnv.PYTHON_EGG_DIR="./.python-eggs/" --conf spark.driverEnv.PYTHON_EGG_CACHE="./.python-eggs/" --conf spark.driverEnv.PYTHON_EGG_DIR="./.python-eggs/" --conf spark.yarn.appMasterEnv.NLTK_DATA=./ --conf spark.executorEnv.NLTK_DATA=./ --archives nltk_env.zip#NLTK,tokenizers.zip#tokenizers,taggers.zip#taggers --py-files helpers.py,const.py --packages com.databricks:spark-xml_2.10:0.3.5,com.databricks:spark-csv_2.10:1.5.0 <name-of-script>.py
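For reference, my understanding of how YARN lays out the --archives in each container's working directory (an assumption based on the zip#alias syntax; the exact contents depend on how the zips were built) is:

./NLTK/nltk_env/bin/python   # from nltk_env.zip#NLTK, matching PYSPARK_PYTHON above
./tokenizers/...             # from tokenizers.zip#tokenizers
./taggers/...                # from taggers.zip#taggers

which is why NLTK_DATA is set to ./ in the submit command.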
I suspect that in this setup the third-party libraries are installed on the worker nodes, but the nltk_data (e.g. the tokenizers and taggers) may not be accessible there.
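A small diagnostic along these lines (a sketch; sc is the active SparkContext) should show which Python binary and working-directory contents each executor actually sees:

def check_worker_env(_):
    import sys, os
    # Report the interpreter and the files YARN unpacked into the container.
    return [(sys.executable, sorted(os.listdir(".")))]

print(sc.parallelize(range(4), 4).mapPartitions(check_worker_env).collect())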
What do I need to change in the submit command to make the nltk_data accessible so that the job runs on the cluster?