I'm running Windows 10 with Python 3.7 and Spark 2.4.
I'm new to Spark and the Hadoop ecosystem, but we're moving our stack in that direction and need some Spark tooling to process Parquet files.
I successfully set up Spark on the machine using this tutorial. When I run bin\pyspark from the %SPARK_HOME% directory at the command prompt, I see:
C:\spark\spark-2.4.3-bin-hadoop2.7>bin\pyspark
Python 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)] ::
Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
19/06/06 12:48:51 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.3
/_/
Using Python version 3.7.1 (default, Dec 10 2018 22:54:23)
SparkSession available as 'spark'.
>>>
which indicates that it is running successfully. I need to be able to establish a SparkContext with PySpark in the Spyder environment for development. I don't currently have a Hadoop cluster, so I'm trying to run in standalone mode on my local machine.
I've been testing with the following script:
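I have also wondered whether, with no cluster running, I should be pointing the master at local mode instead of a `spark://` URL. A minimal sketch of what I believe that would look like (untested on my part, and the `local[*]` master string is my assumption):

```python
from pyspark import SparkConf, SparkContext

# 'local[*]' runs Spark in-process using all available cores,
# so no standalone master listening on spark://localhost:7077 is needed.
conf = SparkConf().setMaster('local[*]').setAppName('spark-basic')
sc = SparkContext(conf=conf)
```
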
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('spark://localhost:7077')
conf.setAppName('spark-basic')
sc = SparkContext(conf=conf)

def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

rdd = sc.parallelize(range(1000)).map(mod).take(10)
print(rdd)
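For reference, the output I expect from `.take(10)` can be computed with plain Python, no Spark involved, since `x % 2` matches `np.mod(x, 2)` for non-negative integers:

```python
# Plain-Python equivalent of the first 10 elements the Spark job should return.
expected = [(x, x % 2) for x in range(10)]
print(expected)
# [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]
```
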
which produces the following error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:64)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:248)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:510)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Does anyone have any insight into this error, or into what I might be doing wrong in getting PySpark to run in Spyder?
Thanks.