I'm trying to build unit tests for a pyspark package, so I need to run the tests locally. Some tests work fine, but others give me a strange error where the JVM can't find "python". A minimal example and part of the stack trace are below.
I'm running pyspark through anaconda with Python 3.5, Spark 2.4.0, and Debian on the Windows Subsystem for Linux. I've also tried setting the SPARK_HOME, PYTHONPATH, and PYSPARK_PYTHON environment variables to every option I can think of. Any ideas?
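For reference, this is roughly the kind of thing I've been doing before creating the session; the paths are just one of the combinations I tried, not a known-good configuration:
import os
import sys

# One of the combinations tried: point the Spark workers at the conda
# interpreter before the SparkSession is created. Paths are examples only.
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['SPARK_HOME'] = '/home/ross/anaconda3/envs/py35/lib/python3.5/site-packages/pyspark'
os.environ['PYTHONPATH'] = os.path.join(os.environ['SPARK_HOME'], 'python')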
Example
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').appName('testAPP').getOrCreate()

df_test = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'B': ['a', 'b', 'c', 'd', 'e', 'f'],
    'C': [1.23, 4.56, 7.89, 1.01, 11.2, 13.4],
    'D': [True, True, False, False, True, False]
})

df_spark = spark.createDataFrame(df_test)
print(df_spark.count())
Part of the stack trace
File "/tmp/nht_core_project/spark_test.py", line 15, in <module>
print(df_spark.count())
File "/home/ross/anaconda3/envs/py35/lib/python3.5/site-packages/pyspark/sql/dataframe.py", line 455, in count
return int(self._jdf.count())
File "/home/ross/anaconda3/envs/py35/lib/python3.5/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/ross/anaconda3/envs/py35/lib/python3.5/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/ross/anaconda3/envs/py35/lib/python3.5/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o44.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:174)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:100)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:74)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)