I am using PySpark to compute PMI (Pointwise Mutual Information). I found Scala code in "Computing Pointwise Mutual Information in Spark".

I have written a Python version based on the original code written by Delip.

How can I translate the Scala code written by emeth into Python?

Here is my code:

import math

# counts:    RDD of (word, count) pairs
# twocounts: RDD of ((word1, word2), count) pairs; the key is indexed as a pair below

def computeMI(pab, pa, pb):
    # log(pab / (pa * pb)), written as a difference of logs
    return math.log(pab) - math.log(pa) - math.log(pb)

MI = twocounts.map(lambda x: (x[0][0], (x[0], x[1]))) \
          .join(counts) \
          .map(lambda x: (x[1][0][0][1], x[1])) \
          .join(counts) \
          .map(lambda x: (x[1][0][0][0], x[1][0][0][1], x[1][0][1], x[1][1])) \
          .map(lambda x: (x[0], computeMI(x[1], x[2], x[3])))
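
For comparison, here is a minimal sketch of the same two-join pipeline with the nested tuple indexing replaced by named unpacking steps. The helper names (`key_by_first`, `key_by_second`, `to_mi`, `pmi`) are made up for illustration. It also assumes the counts should be turned into relative frequencies before taking logs, which the code above does not do, and that `twocounts` is keyed by `(word1, word2)` tuples.

```python
import math


def compute_mi(pab, pa, pb):
    """log(pab / (pa * pb)), written as a difference of logs."""
    return math.log(pab) - math.log(pa) - math.log(pb)


def key_by_first(rec):
    # ((a, b), n_ab) -> (a, ((a, b), n_ab)), so the record can be joined on a
    pair, n_ab = rec
    return (pair[0], (pair, n_ab))


def key_by_second(rec):
    # (a, (((a, b), n_ab), n_a)) -> (b, ((a, b), n_ab, n_a)), ready to join on b
    _, ((pair, n_ab), n_a) = rec
    return (pair[1], (pair, n_ab, n_a))


def to_mi(rec, total_pairs, total_words):
    # (b, (((a, b), n_ab, n_a), n_b)) -> ((a, b), pmi)
    # Assumption: probabilities are estimated as count / total.
    _, ((pair, n_ab, n_a), n_b) = rec
    return (pair, compute_mi(n_ab / total_pairs,
                             n_a / total_words,
                             n_b / total_words))


def pmi(twocounts, counts):
    """twocounts: RDD of ((a, b), n_ab); counts: RDD of (word, n)."""
    # Two extra passes over the data to get the normalising totals.
    total_pairs = twocounts.values().sum()
    total_words = counts.values().sum()
    return (twocounts.map(key_by_first)
            .join(counts)
            .map(key_by_second)
            .join(counts)
            .map(lambda rec: to_mi(rec, total_pairs, total_words)))
```

With two existing RDDs this would be called as `pmi(twocounts, counts).collect()`. The helpers are plain functions on tuples, so each reshaping step can be checked without a Spark cluster.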

How can I write this code in PySpark?

0 answers: