I am trying to implement K-means from scratch using PySpark. I perform various transformations on an RDD, but when I try to display the results of the final processed RDD, I get errors such as "PipelinedRDD object is not iterable", and calls like .collect() also fail because of the same PipelinedRDD issue.
from __future__ import print_function
import sys
import numpy as np

def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

data = sc.parallelize([1, 2, 3, 5, 7, 3, 5, 7, 3, 6, 4, 66, 33, 66, 22, 55, 77])
K = 3
convergeDist = float(0.1)
kPoints = data.takeSample(False, K, 1)
tempDist = 1.0

while tempDist > convergeDist:
    closest = data.map(
        lambda p: (closestPoint(p, kPoints), (p, 1)))
    pointStats = closest.reduceByKey(
        lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
    newPoints = pointStats.map(
        lambda st: (st[0], st[1][0] / st[1][1]))
    print(newPoints)
    tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints).collect()
    # tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
    for (iK, p) in newPoints:
        kPoints[iK] = p

print("Final centers: " + str(kPoints))
The error I get is:

TypeError: 'PipelinedRDD' object is not iterable
Answer 0 (score: 1)
You cannot iterate over an RDD; you first need to call an action to bring the data back to the driver. A quick sample:
>>> test = sc.parallelize([1,2,3])
>>> for i in test:
... print i
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'RDD' object is not iterable
This doesn't work because test is an RDD. On the other hand, if you bring the data back to the driver with an action, it becomes an object you can iterate over, for example:
>>> for i in test.collect():
... print i
1
2
3
There you go: call an action to bring the data back to the driver. Just be careful that the data is not too large, or you may get an out-of-memory exception.
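To make the driver-side iteration explicit, here is a sketch of the same K-means loop written in plain NumPy, with no Spark at all (function names like closest_point and kmeans are my own, not from the question). The per-key summing dictionary plays the role of reduceByKey, and new_points is an ordinary Python list, which is exactly what newPoints.collect() would give you in the PySpark version:

```python
import numpy as np

def closest_point(p, centers):
    # index of the center nearest to point p (squared Euclidean distance)
    return int(np.argmin([np.sum((p - c) ** 2) for c in centers]))

def kmeans(points, k, converge_dist=0.1, seed=1):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # start from k distinct sampled points, like data.takeSample(False, K, 1)
    k_points = list(points[rng.choice(len(points), size=k, replace=False)])
    temp_dist = float("inf")
    while temp_dist > converge_dist:
        # build (index, (sum, count)) per center: the "map + reduceByKey" step
        sums = {}
        for p in points:
            i = closest_point(p, k_points)
            s, c = sums.get(i, (0.0, 0))
            sums[i] = (s + p, c + 1)
        new_points = [(i, s / c) for i, (s, c) in sums.items()]
        # new_points is a plain list, so iterating it here is legal;
        # in PySpark this is the list you get from newPoints.collect()
        temp_dist = sum(np.sum((k_points[i] - p) ** 2) for i, p in new_points)
        for i, p in new_points:
            k_points[i] = p
    return k_points

centers = kmeans([1, 2, 3, 5, 7, 3, 5, 7, 3, 6, 4, 66, 33, 66, 22, 55, 77], 3)
print(sorted(float(c) for c in centers))
```

Applied back to the question's code, the fix is the same idea: compute `collected = newPoints.collect()` once inside the loop, then derive tempDist from `collected` and iterate over `collected` to update kPoints, instead of iterating the RDD itself.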