I have set up PySpark on a 2-node EC2 cluster. I am launching PySpark with the command:
pyspark --master spark://10.0.1.13:7077 --driver-memory 5G --executor-memory 12G --total-executor-cores 10
My Python script fails only when it executes a UDF; every other part of the script runs fine. How can I debug why just the UDF step fails while the rest of the script works?
Paths:
(base) [ec2-user@ip-10-0-1-13 ~]$ which pyspark
~/anaconda2/bin/pyspark
(base) [ec2-user@ip-10-0-1-13 ~]$ which python
~/anaconda2/bin/python
Python script:
from pyspark.sql.functions import udf

def getDateObjectYear(dateString):
    # strip surrounding whitespace from the raw date string
    dateString = dateString.strip()
    return dateString

dateObjectUDFYear = udf(getDateObjectYear)
checkin_date_yelp_df = checkin_date_yelp_df.withColumn('year', dateObjectUDFYear(checkin_date_yelp_df.date))
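For context, here is a minimal self-contained version of that snippet, a sketch assuming a local SparkSession; the sample rows are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-in for the Yelp check-in DataFrame
checkin_date_yelp_df = spark.createDataFrame(
    [(" 2016-07-08 ",), ("2014-09-09",)], ["date"])

def getDateObjectYear(dateString):
    # runs inside the Python worker process on each executor
    return dateString.strip()

dateObjectUDFYear = udf(getDateObjectYear)  # returns StringType by default
checkin_date_yelp_df.withColumn("year", dateObjectUDFYear("date")).show(5)

Note that the UDF body executes in a separate Python worker process spawned on each executor, which is why a broken worker-side Python path only surfaces when a UDF is actually evaluated.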
Running checkin_date_yelp_df.show(5) produces this error:
Py4JJavaError: An error occurred while calling o98.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 230, 10.0.1.13, executor 0): java.io.IOException: Cannot run program "~/anaconda2/bin/python": error=2, No such file or directory
...
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
Answer (score: 1):
It turned out I had two misconfigured paths in my .bashrc: the executors were being told to launch the literal path ~/anaconda2/bin/python, and because Spark spawns the Python worker process directly rather than through a shell, the ~ is never expanded, which produces the "No such file or directory" error. The correct way is to use absolute paths:
export PYTHONPATH=/home/ec2-user/anaconda/bin/python
export PYSPARK_PYTHON=/home/ec2-user/anaconda/bin/python
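As an alternative to editing .bashrc, the worker interpreter can also be pinned per application. A sketch, assuming Spark 2.1+ where the spark.pyspark.python and spark.pyspark.driver.python config keys are available:

from pyspark.sql import SparkSession

# Pin both driver- and worker-side Python to an absolute path so executors
# never try to exec an unexpanded or missing interpreter path.
spark = (SparkSession.builder
         .master("spark://10.0.1.13:7077")
         .config("spark.pyspark.python", "/home/ec2-user/anaconda/bin/python")
         .config("spark.pyspark.driver.python", "/home/ec2-user/anaconda/bin/python")
         .getOrCreate())

These config keys take precedence over the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables if both are set, which makes them a convenient way to rule out a stale shell environment when debugging this class of failure.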