Question

I have the following method to compute the probability of each value in a DataSet:

/**
  * Compute the probabilities of each value on the given [[DataSet]]
  *
  * @param x single column [[DataSet]]
  * @return Sequence of probabilities for each value
  */
private[this] def probs(x: DataSet[Double]): Seq[Double] = {
  // Count the occurrences of each distinct value, then pull the counts to the client
  val counts = x.groupBy(_.doubleValue)
    .reduceGroup(_.size.toDouble)
    .name("X Probs")
    .collect

  val total = counts.sum

  // Normalize the counts into relative frequencies
  counts.map(_ / total)
}

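The problem is that when I submit my Flink job using this method, Flink kills the job because of a task timeout. I am running this method once per attribute on a DataSet with only 40,000 instances and 9 attributes.

Is there a way to make this code more efficient?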
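For checking results on a small sample, here is a minimal Flink-free sketch of what probs is meant to compute. The object name LocalProbs is hypothetical, and unlike the method above it returns a Map so that each probability stays attached to its value.

```scala
// Hypothetical local helper, not part of the original code.
object LocalProbs {
  /** Relative frequency of each distinct value in `xs`. */
  def probs(xs: Seq[Double]): Map[Double, Double] = {
    val total = xs.size.toDouble
    // Group equal values together, then divide each group size by the total count
    xs.groupBy(identity).map { case (value, group) => value -> group.size / total }
  }
}
```

The frequencies sum to 1 for any non-empty input, which makes this a convenient reference when validating the distributed version.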

After a few attempts, I got it working with mapPartition. The method is part of a class, InformationTheory, which performs some computations to calculate entropy, mutual information, etc. For example, SymmetricalUncertainty is computed as follows:

/**
  * Computes 'symmetrical uncertainty' (SU) - a symmetric mutual information measure.
  *
  * It is defined as SU(X, y) = 2 * (IG(X|Y) / (H(X) + H(Y)))
  *
  * @param xy [[DataSet]] with two features
  * @return SU value
  */
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.mapPartitionWith {
    case in ⇒
      val x = in map (_._2)
      val y = in map (_._1)

      val mu = mutualInformation(x, y)
      val Hx = entropy(x)
      val Hy = entropy(y)

      Some(2 * mu / (Hx + Hy))
  }

  su.collect.head.head
}

With this, I can efficiently compute entropy, mutual information, etc. The catch is that it only works with a parallelism of 1; the problem lies in mapPartition.

Is there a way to do something similar to what I did with mapPartition, but that works regardless of the level of parallelism?
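To illustrate why a per-partition computation can go wrong, here is a small sketch in plain Scala that imitates partitions with grouped. The names PartitionDemo, data, global and perPartition are made up for this illustration, and the entropy helper is an assumed implementation (Shannon entropy in bits), not necessarily the one in InformationTheory.

```scala
// Hypothetical demo: partitions are simulated with `grouped`, no Flink involved.
object PartitionDemo {
  /** Shannon entropy (base 2) of the empirical distribution of `xs`. */
  def entropy(xs: Seq[Double]): Double = {
    val n = xs.size.toDouble
    xs.groupBy(identity).values.map { g =>
      val p = g.size / n
      -p * math.log(p) / math.log(2)
    }.sum
  }

  val data: Seq[Double] = Seq(0.0, 0.0, 1.0, 1.0)

  // One "partition": all elements are visible, as with parallelism 1.
  val global: Double = entropy(data)

  // Two "partitions" of two elements each: every partition sees a single distinct value.
  val perPartition: List[Double] = data.grouped(2).map(entropy).toList
}
```

Here the full data set has an entropy of 1 bit, while each simulated partition on its own has an entropy of 0. A function applied per partition therefore yields one partial result per partition, and taking the head of the collected results would only pick one of them.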

Answer 1

I finally got it working. I don't know whether it is the best solution, but it works with n levels of parallelism:

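def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  // reduceGroup on an ungrouped DataSet passes every element to a single call
  val su = xy.reduceGroup { in ⇒
    // Materialize the iterator so it can be traversed more than once
    val invec = in.toVector
    val x = invec map (_._2)
    val y = invec map (_._1)

    val mu = mutualInformation(x, y)
    val Hx = entropy(x)
    val Hy = entropy(y)

    2 * mu / (Hx + Hy)
  }

  su.collect.head
}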

You can check the entire code in InformationTheory.scala, and the tests for it in InformationTheorySpec.scala.
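For readers without those files at hand, the helpers used above could look roughly like the following sketch on plain Scala collections. This is an assumption based on the standard definitions (entropy in bits, mutual information as H(X) + H(Y) - H(X, Y)), not the actual contents of InformationTheory.scala; the object name InfoTheorySketch and the private helpers are hypothetical.

```scala
// Hypothetical stand-in for the helpers referenced in the answer.
object InfoTheorySketch {
  private def log2(v: Double): Double = math.log(v) / math.log(2)

  /** Empirical probability of each distinct element of `xs`. */
  private def probs[A](xs: Seq[A]): Iterable[Double] = {
    val n = xs.size.toDouble
    xs.groupBy(identity).values.map(_.size / n)
  }

  /** Shannon entropy H(X) in bits. */
  def entropy[A](xs: Seq[A]): Double =
    probs(xs).map(p => -p * log2(p)).sum

  /** Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y). */
  def mutualInformation(x: Seq[Double], y: Seq[Double]): Double =
    entropy(x) + entropy(y) - entropy(x.zip(y))

  /** Symmetrical uncertainty SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)). */
  def symmetricalUncertainty(xy: Seq[(Double, Double)]): Double = {
    val x = xy.map(_._2)
    val y = xy.map(_._1)
    2 * mutualInformation(x, y) / (entropy(x) + entropy(y))
  }
}
```

With these definitions, two identical features give an SU of 1 and two independent features give an SU of 0. If both features are constant, H(X) + H(Y) is 0 and the result is NaN, so a real implementation would probably want to guard that case.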
