Unable to access pyspark

Date: 2017-12-09 22:18:22

Tags: numpy apache-spark pyspark rdd

I am trying to implement K-means from scratch using pyspark. I perform various operations on an RDD, but when I try to display the result of the final processed RDD I get errors like "PipelinedRDD is not iterable" or something similar, and calls like .collect() also fail because of the same PipelinedRDD problem.

from __future__ import print_function
import sys
import numpy as np
def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

data = sc.parallelize([1, 2, 3, 5, 7, 3, 5, 7, 3, 6, 4, 66, 33, 66, 22, 55, 77])

K = 3
convergeDist = 0.1

kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))

    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1]))

    # newPoints is an RDD; printing or iterating it directly fails.
    # collect() is an action that brings the data back to the driver,
    # after which it can be iterated over locally.
    newPointsCollected = newPoints.collect()
    print(newPointsCollected)

    tempDist = sum(np.sum((kPoints[iK] - p) ** 2)
                   for (iK, p) in newPointsCollected)

    for (iK, p) in newPointsCollected:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))

The error I get is:

TypeError: 'PipelinedRDD' object is not iterable

1 Answer:

Answer 0 (score: 1)

You cannot iterate over an RDD; you first need to call an action to bring the data back to the driver. A quick sample:

>>> test = sc.parallelize([1,2,3])
>>> for i in test:
...    print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'RDD' object is not iterable

This does not work because test is an RDD. On the other hand, if you bring the data back to the driver with an action, it becomes an object you can iterate over, for example:

>>> for i in test.collect():
...    print i
1                                                                               
2
3

There you go: call an action and bring the data back to the driver. Just be careful that there is not too much data, or you may get an out-of-memory exception.
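The same collect-then-iterate fix resolves the K-means loop in the question. For reference, here is a minimal plain-NumPy sketch of one iteration of that loop; the function and variable names are illustrative, not from pyspark, and the dict-based aggregation performs locally what the Spark version distributes with map and reduceByKey:

```python
import numpy as np

def closest_point(p, centers):
    """Index of the center nearest to point p (squared Euclidean distance)."""
    dists = [np.sum((p - c) ** 2) for c in centers]
    return int(np.argmin(dists))

# Toy 1-D data, mirroring the question's input.
points = np.array([1, 2, 3, 5, 7, 3, 5, 7, 3, 6, 4,
                   66, 33, 66, 22, 55, 77], dtype=float)
centers = [points[0], points[11], points[14]]  # arbitrary initial centers

# One iteration: assign each point, sum and count per cluster
# (the reduceByKey step), then average to get the new centers.
sums, counts = {}, {}
for p in points:
    i = closest_point(p, centers)
    sums[i] = sums.get(i, 0.0) + p
    counts[i] = counts.get(i, 0) + 1

new_points = [(i, sums[i] / counts[i]) for i in sums]

# How far the centers moved; the loop stops when this drops
# below the convergence threshold.
temp_dist = sum(np.sum((centers[i] - p) ** 2) for (i, p) in new_points)

for (i, p) in new_points:
    centers[i] = p
```

Because new_points here is an ordinary Python list rather than an RDD, iterating over it twice is fine; in the Spark version, collecting newPoints once and reusing the resulting list plays the same role.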