Does KernelDensity.estimate in pyspark.mllib.stat.KernelDensity work when the input data is normally distributed?

Date: 2017-05-19 21:21:17

Tags: pyspark apache-spark-mllib

Does pyspark's KernelDensity.estimate work correctly on a normally distributed dataset? I get an error when I try it. I have filed https://issues.apache.org/jira/browse/SPARK-20803 (KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when the input data is normally distributed, but no error when the data is not normally distributed).

Sample code:

from pyspark.mllib.stat import KernelDensity

# colVec is a list of sample values; sc is the active SparkContext
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
# Find density estimates for the given values
densities = kd.estimate(samplePoints)

When the data is not Gaussian, I get, for example: 5.6654703477e-05, 0.000100010001, 0.000100010001, 0.000100010001, .....

For reference, here is the Scala code for Gaussian data:

import org.apache.spark.mllib.stat.KernelDensity

val vecRDD = sc.parallelize(colVec)
val kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(samplePoints)

and I get: [0.04113814235801906, 1.0994865517293571E-163, 0.0, 0.0, .....

1 Answer:

Answer 0 (score: 0)

I ran into the same problem and was able to track it down to a very small test case. If you use NumPy in Python to generate the data in the RDD, that's the problem!

import numpy as np
from pyspark.mllib.stat import KernelDensity

kd = KernelDensity()
kd.setSample(sc.parallelize([0.0, 1.0, 2.0, 3.0]))  # THIS WORKS
# kd.setSample(sc.parallelize([0.0, np.float32(1.0), 2.0, 3.0]))  # THIS FAILS
kd.setBandwidth(0.35)
kd.estimate([0.0, 1.0])

If this is also your problem, simply convert the NumPy data to Python base types until the Spark issue is fixed. You can use the np.asscalar function to do this.
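
For example, here is a minimal sketch of that workaround, assuming an active SparkContext sc and NumPy-generated sample data (plain float() is used for the conversion; np.asscalar does the same thing but has been removed from recent NumPy releases):

import numpy as np
from pyspark.mllib.stat import KernelDensity

# NumPy-generated samples: each element is an np.float64, not a Python float
colVec = np.random.normal(loc=0.0, scale=1.0, size=1000)

# Convert every element to a native Python float before building the RDD
vecRDD = sc.parallelize([float(x) for x in colVec])

kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
densities = kd.estimate([0.0, 1.0, 2.0])

Calling x.item() on a NumPy scalar is an equivalent conversion.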