Question

I am trying to replicate this Python code:

cond_entropy_x = np.array([entropy(x[y == v]) for v in uy])

where x and y are vectors and uy holds the unique values of y, for example 0 and 1.

In Flink, I have:

val uy = y.distinct.collect
val condHx = for (i <- uy)
    yield entropy(x.filterWithBcVariable(y)((_, yy) => yy == i))

However, filterWithBcVariable does not seem to take every value of y; it only uses the first one.

I have also tried:

for (i <- values) yield y.join(x).where(a => a).equalTo(_ => i)

But then I run out of memory.

How can I filter x based on the values in y?

Something like x.zip(y) would do the job, but it is not supported.

Any ideas?

Answer 1

I came up with a solution. It may not be the best one, but at least it works.

Instead of passing x and y as separate DataSets, I now pass a single DataSet[LabeledVector] whose vector contains only one column:

val xy = input.map(lv => LabeledVector(lv.label, DenseVector(lv.vector(0))))

Then I pass xy to the function:

def conditionalEntropy(xy: DataSet[LabeledVector]): Double = {
  // Get the label
  val y = xy.map(_.label)
  // Get probs for the label
  val p = probs(y).toArray.asBreeze
  // Get unique values in label
  val values = y.distinct.collect
  // Compute Conditional Entropy
  val condH = for (i <- values)
    yield entropy(xy.filter(_.label == i))
  p.dot(seq2Breeze(condH))
}
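For checking results on small inputs, here is a minimal plain-Scala sketch (no Flink) of the quantity the function above appears to compute, the label-weighted sum of per-label entropies. It is not the code from this answer: the object name, the (label, feature) tuple layout, and this definition of entropy (Shannon entropy of the empirical distribution, natural log) are all assumptions, since the original entropy, probs and seq2Breeze helpers are not shown.

```scala
// Plain-Scala reference sketch (no Flink). All names here are hypothetical.
object ConditionalEntropySketch {

  // Assumed definition: Shannon entropy (natural log) of the empirical
  // distribution of the values in `xs`.
  def entropy(xs: Seq[Double]): Double = {
    val n = xs.size.toDouble
    xs.groupBy(identity).values
      .map(_.size / n)            // relative frequency of each distinct value
      .map(p => -p * math.log(p))
      .sum
  }

  // H(X | Y) = sum over labels v of P(Y = v) * H(X | Y = v).
  // Each element of `xy` is a (label, feature) pair.
  def conditionalEntropy(xy: Seq[(Double, Double)]): Double = {
    val n = xy.size.toDouble
    xy.groupBy(_._1)              // one group per distinct label
      .values
      .map { group =>
        val weight = group.size / n          // P(Y = v)
        weight * entropy(group.map(_._2))    // weighted entropy of x within the group
      }
      .sum
  }

  def main(args: Array[String]): Unit = {
    val xy = Seq((0.0, 1.0), (0.0, 1.0), (1.0, 1.0), (1.0, 2.0))
    println(conditionalEntropy(xy))
  }
}
```

Grouping by label keeps each probability paired with the entropy of the same label, so the weighted sum does not rely on two separately computed sequences being in the same order.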
