I am currently using pyspark with Apache Spark (v2.2.1). My IDE is Jupyter Notebook. I am trying to compute the similarities of a row matrix. I followed this documentation: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=rowmatrix#pyspark.mllib.linalg.distributed.RowMatrix
My code is:
import pyspark
import random
from pyspark.mllib.linalg.distributed import CoordinateMatrix, RowMatrix
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAll([('spark.executor.memory', '3g'), ('spark.executor.cores', '8'), ('spark.cores.max', '24'), ('spark.driver.memory', '9g')])
sc = pyspark.SparkContext.getOrCreate()
sc.stop()
sc = pyspark.SparkContext(appName="AugustinJob", master="spark://10.2.48.88:7077", conf=conf)
spark = SparkSession(sc)
#Similarities
rows = sc.parallelize([[1, 2], [1, 5]])
mat = RowMatrix(rows)
sims = mat.columnSimilarities()
sims.entries.first().value
I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-27-69da006f7939> in <module>()
13 mat = RowMatrix(rows)
14 sims = mat.columnSimilarities()
---> 15 sims.entries.first().value
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in first(self)
1359 ValueError: RDD is empty
1360 """
-> 1361 rs = self.take(1)
1362 if rs:
1363 return rs[0]
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/rdd.py in take(self, num)
1341
1342 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1343 res = self.context.runJob(self, takeUpToNumLeft, p)
1344
1345 items += res
/usr/local/apache/spark-2.2.1-bin-hadoop2.6/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
990 # SparkContext#runJob.
991 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 992 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
993 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
994
AttributeError: 'NoneType' object has no attribute 'sc'