So I have been trying to figure out how to develop code on my local machine (Ubuntu 16.04 in my case), using the IPython console in the Spyder IDE (which ships with Anaconda), and have it processed on a cluster (for example, one created on Azure HDInsight). I can run PySpark locally without any problems (both via spark-shell and via Spyder), but I would like to know whether I can run the code on a Spark/YARN(?) cluster to speed up processing, with the results shown in the IPython console in Spyder. I found a post on Stack Overflow (Running PySpark on and IDE like Spyder?) that sounded like it would solve the problem, but I get an error. The error appears both when I start Spyder "normally" and when I start it via the spark-submit spyder.py command.
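Concretely, these are the two ways I launch it (spyder.py here is the Spyder launcher script, as in the linked post):

spyder                    # "normal" launch
spark-submit spyder.py    # launch through spark-submit, as the linked post suggests

In both cases the IPython console shows the same traceback: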
sc = SparkContext(conf=conf)
Traceback (most recent call last):
  File "<ipython-input-3-6b825dbb354c>", line 1, in <module>
    sc = SparkContext(conf=conf)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 172, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 235, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 1062, in __call__
    answer = self._gateway_client.send_command(command)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 631, in send_command
    response = self.send_command(command)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 624, in send_command
    connection = self._get_connection()
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 579, in _get_connection
    connection = self._create_connection()
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 585, in _create_connection
    connection.start()
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 697, in start
    raise Py4JNetworkError(msg, e)
Py4JNetworkError: An error occurred while trying to connect to the Java server
Here is my code:
import os
import sys

# Point PySpark at the local Java and Spark installations
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-oracle/"
os.environ["SPARK_HOME"] = "/usr/local/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
os.environ["PYSPARK_PYTHON"] = "python2.7"

# Make the bundled pyspark and py4j packages importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
############################################################################
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Connect to the cluster master (IP and port redacted)
conf = SparkConf().setMaster('spark://xx.x.x.xx:xxxxx').setAppName("building a warehouse")
sc = SparkContext(conf=conf)  # <-- this is the line that raises the Py4JNetworkError
sqlCtx = SQLContext(sc)

# TF-IDF example from the Spark ML docs, used here as a test job
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = sqlCtx.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

for features_label in rescaledData.select("features", "label").take(3):
    print(features_label)
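For comparison, the same pipeline runs fine when I point the master at local mode instead of the cluster. A minimal sketch of the sanity check I use (only the setMaster value differs from the code above):

from pyspark import SparkConf, SparkContext

# Same setup as above, but against local mode, which works on my machine
conf = SparkConf().setMaster('local[*]').setAppName("building a warehouse")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # simple smoke test; prints 45
sc.stop()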
I created the cluster on Azure HDInsight, and I am not sure whether I retrieved the IP and port from the right place, or whether I have to create an SSH tunnel. It is all very confusing.
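For what it is worth, this is the kind of tunnel I was considering; the hostnames are placeholders and the port is only a guess (7077 is the default Spark standalone master port, and <cluster>-ssh.azurehdinsight.net is the HDInsight SSH endpoint):

ssh -L 7077:<headnode-internal-host>:7077 sshuser@<cluster>-ssh.azurehdinsight.net

with setMaster('spark://localhost:7077') in the code, so the connection would go through the tunnel.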
I hope someone can help me. Thanks in advance!