I am running a PySpark script on AWS EC2. It works fine in a Jupyter notebook, but when I run it from an IPython shell I get an import error. It seems very strange! Can anyone help? Here is the code snippet:
from __future__ import division
from pyspark import SparkContext
from pyspark.sql import SQLContext,SparkSession
from pyspark.sql.functions import lower, col,trim,udf,struct,isnan,when
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, FloatType, ArrayType, Row)
from pyspark.sql.functions import lit
import gc
import time
import pandas as pd
from collections import defaultdict
import numpy as np
sc = SparkContext(appName="Connect Spark with Redshift")
sql_context = SQLContext(sc)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'xyz')
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'pqr')
spark=SparkSession.builder.master("local").appName("Users").getOrCreate()
users=pd.read_pickle(candidate_users_path)
sqlCtx = SQLContext(sc)
users = sqlCtx.createDataFrame(users)
users.count()
The error points at the import statement (the second line). Interestingly, the same code runs perfectly in a Jupyter notebook launched from the same location, and the import statement by itself also works if I execute it directly in IPython. As I understand it, this EC2 instance acts as both worker and master, so how can the module be unavailable in the worker?
Py4JJavaError: An error occurred while calling o57.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker
ImportError: cannot import name 'SparkContext'
PYTHONPATH was:
/home/ubuntu/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip:/home/ubuntu/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/ubuntu/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
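One way to confirm whether the driver and the Python workers are picking up different interpreters is to print the interpreter path and version on both sides. This is only a minimal diagnostic sketch that assumes the sc created above; in a broken setup the worker-side job will fail with the same ImportError, which itself points at the worker interpreter:

import sys

# Interpreter used by the driver (the IPython/Jupyter process itself)
print("driver:", sys.executable, sys.version)

# Interpreter used by the Python workers that execute tasks
print("worker:", sc.parallelize([0], 1)
                   .map(lambda _: (sys.executable, sys.version))
                   .collect())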
Answer 0 (score: 0)
I found that the problem was Spark picking up an old version of Python. I added the following line to my bashrc:
alias python=python3
The other relevant lines in my bashrc are:
export SPARK_HOME="/home/ubuntu/spark-2.4.3-bin-hadoop2.7"
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
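If editing bashrc is not convenient, the interpreter can also, as far as I know, be chosen per session by setting PYSPARK_PYTHON in the driver process before the SparkContext is created. A minimal sketch, assuming /usr/bin/python3 is the interpreter you want (PYSPARK_DRIVER_PYTHON only affects how the pyspark launcher script starts the driver, so it is omitted here):

import os

# Must be set before the SparkContext is created; the driver hands this
# interpreter path to the Python workers it spawns.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

from pyspark import SparkContext
sc = SparkContext(appName="Connect Spark with Redshift")

# Quick sanity check (pythonExec is an internal attribute, but handy here)
print(sc.pythonExec)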