I want to apply k-means clustering in PySpark, but I get a TypeError: float() argument must be a string or a number. Can anyone help me quickly?
lines = lines.map(lambda line: line.split(" "))
new = lines.map(lambda x: (str(x[2]), str(x[3]), str(x[4]), str(x[5]), str(x[6])))
new.take(4)
Sample input (new):
[('-13', '7', '-0.573824415813', '0', '1'),
('-20', '13', '-0.728721307165', '0', '1'),
('-27', '14', '-1.18661648046', '0', '1'),
('-29', '10', '-0.757241996939', '0', '1')]
k = 10 # number of clusters for k-means
kmeans_iteration = 40000
estimator = KMeans(init='k-means++', n_clusters=k, n_init=10)
estimator.fit(new)
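
One possible cause, assuming `KMeans` here is scikit-learn's `sklearn.cluster.KMeans` (the `init='k-means++'`, `n_clusters` and `n_init` arguments suggest so): `fit` expects an in-memory numeric array, whereas `new` is a Spark RDD whose rows are tuples of strings. A minimal sketch of the conversion step in plain Python, using the sample rows shown above; `to_floats` is a hypothetical helper name, not part of either library:

```python
# Sample rows copied from the new.take(4) output in the question.
sample = [
    ('-13', '7', '-0.573824415813', '0', '1'),
    ('-20', '13', '-0.728721307165', '0', '1'),
    ('-27', '14', '-1.18661648046', '0', '1'),
    ('-29', '10', '-0.757241996939', '0', '1'),
]


def to_floats(row):
    """Convert one tuple of numeric strings into a list of floats.

    Hypothetical helper for illustration; raises ValueError if a field
    is not a valid number.
    """
    return [float(value) for value in row]


# Build a plain list of numeric rows, the kind of input an in-memory
# estimator can consume.
features = [to_floats(row) for row in sample]
print(features[0])
```

With the RDD from the question, the same helper could probably be applied as `new.map(to_floats).collect()` to bring numeric rows to the driver before calling `estimator.fit(...)`; this only works if the data fits in driver memory. If the data has to stay distributed, Spark's own `pyspark.mllib.clustering.KMeans.train` takes an RDD of numeric vectors instead.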