Question

I am having trouble labeling training data in StreamingKMeans from Apache Spark's MLlib module.

The documentation (https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means) has an example with test and training data. I can run that example, but the output I get is either nothing or not very useful.

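I cannot see any value that tells me whether a test sample falls into cluster 1 or cluster 2, and I cannot get any information about the clusters themselves. I would like to know how many objects/samples each cluster contains, along with some labels that let me identify where those samples came from. But if I try this: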

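from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# ssc is an existing StreamingContext
def parse_row(row):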
    label = int(row.split(',', 1)[0])
    vec = Vectors.dense(row[row.find('[') + 1: row.find(']')].split(','))

    return LabeledPoint(label, vec)

testingStream = ssc.textFileStream("./test").map(parse_row)

model = StreamingKMeans(k=2, decayFactor=1.0) \
    .setRandomCenters(4, 100.0, 20)

model.trainOn(testingStream)

I get this error message:
TypeError: Cannot convert type <class 'pyspark.mllib.regression.LabeledPoint'> into Vector

From what I can see in the Spark source code, trainOn() relies on the following function when training the model:

def _convert_to_vector(l):
    if isinstance(l, Vector):
        return l
    elif type(l) in (array.array, np.array, np.ndarray, list, tuple, xrange):
        return DenseVector(l)
    elif _have_scipy and scipy.sparse.issparse(l):
        assert l.shape[1] == 1, "Expected column vector"
        # Make sure the converted csc_matrix has sorted indices.
        csc = l.tocsc()
        if not csc.has_sorted_indices:
            csc.sort_indices()
        return SparseVector(l.shape[0], csc.indices, csc.data)
    else:
        raise TypeError("Cannot convert type %s into Vector" % type(l))
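
Since the conversion above only accepts vectors and plain sequences, one possible workaround is to hand trainOn() just the features of each LabeledPoint. This is a minimal sketch, not a confirmed fix: the helper name train_on_features is hypothetical, model is assumed to be a StreamingKMeans, and labeled_stream is assumed to be a DStream of LabeledPoint objects.

```python
def train_on_features(model, labeled_stream):
    """Train on the feature vectors only, dropping the labels.

    Assumes `labeled_stream` has a .map() method and yields objects with a
    `.features` attribute (as LabeledPoint has), and that `model` exposes
    trainOn(), as StreamingKMeans does.
    """
    # Strip the label so that each element is something the
    # vector conversion accepts.
    features = labeled_stream.map(lambda point: point.features)
    model.trainOn(features)
    return features
```

With the code from the question this would be called as train_on_features(model, testingStream). The labels are lost on the training side, so they would have to be carried separately for prediction.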

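So how can I label the training data so that I can identify it later? For example, what I want is for the sample (A, [10, 10, 10, 10]) to be reported as belonging to cluster #1, which contains: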

(B, [11, 11, 10, 10])
(C, [9, 11, 12, 10])
(D, [13, 8, 9, 10])
...
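
One way the labels might be kept next to the cluster assignment is to key each vector by its label and call predictOnValues(), which takes a DStream of (key, vector) pairs and is what the linked documentation example uses for its test stream. The sketch below rests on assumptions: rows look like "A,[10, 10, 10, 10]", and the names parse_keyed, group_by_cluster and wire are hypothetical helpers, not part of Spark.

```python
from collections import defaultdict


def parse_keyed(row):
    """Parse a row like 'A,[10, 10, 10, 10]' into (label, list of floats)."""
    label = row.split(',', 1)[0].strip()
    values = row[row.find('[') + 1: row.find(']')].split(',')
    return label, [float(v) for v in values]


def group_by_cluster(pairs):
    """Turn (label, cluster_id) pairs into {cluster_id: [labels]}."""
    clusters = defaultdict(list)
    for label, cluster_id in pairs:
        clusters[cluster_id].append(label)
    return dict(clusters)


def wire(model, text_stream):
    """Train on the vectors, predict on (label, vector) pairs.

    Assumes `model` is a StreamingKMeans and `text_stream` is a DStream
    of text rows, e.g. from ssc.textFileStream(...).
    """
    keyed = text_stream.map(parse_keyed)
    # Training sees only the vectors, never the labels.
    model.trainOn(keyed.map(lambda kv: kv[1]))
    # Prediction keeps the key, giving a DStream of (label, cluster_id).
    predictions = model.predictOnValues(keyed)
    # For every batch, print which labels landed in which cluster.
    predictions.foreachRDD(
        lambda rdd: print(group_by_cluster(rdd.collect())))
    return predictions
```

The number of samples per cluster in a batch is then the length of each label list. Calling rdd.collect() pulls the whole batch to the driver, which is only reasonable for small batches.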

Spark: How to label training data in StreamingKMeans
