How to interpolate and check the correlation of two RDDs with different cardinalities in Spark?

Asked: 2016-10-14 19:09:44

Tags: apache-spark statistics pyspark apache-spark-mllib

I want to check the correlation of two RDDs over time, but they do not have the same cardinality (that is, they contain different numbers of data points, because the data was collected at different timestamps). From the Statistics API I can see that the two inputs must have the same number of partitions and the same cardinality. Below is some example code. I have looked for an interpolation library for Spark that would bring the two RDDs to the same number of partitions and cardinality, but I have not found one. So I would like to ask whether anyone has experience with, or suggestions for, this problem. Thanks.

from pyspark.mllib.stat import Statistics

sc = ... # SparkContext

seriesX = ... # a series
seriesY = ... # must have the same number of partitions and cardinality as seriesX

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
# method is not specified, Pearson's method will be used by default. 
print(Statistics.corr(seriesX, seriesY, method="pearson"))

data = ... # an RDD of Vectors
# Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))
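
One way around the cardinality mismatch, for illustration, is to resample both series onto a common time grid and inner-join on the grid key, so that only buckets present in both series survive. Below is a minimal sketch, assuming each input RDD holds (unix-timestamp, value) pairs; rddX, rddY, and the 60-second bucket width are hypothetical choices:

from pyspark.mllib.stat import Statistics

BUCKET = 60  # bucket width in seconds (an assumption; tune to your sampling rates)

def to_buckets(rdd):
    # Key each (timestamp, value) point by its time bucket and average the
    # values that fall into the same bucket, yielding one point per bucket.
    return (rdd.map(lambda tv: (int(tv[0]) // BUCKET, (tv[1], 1)))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda s: s[0] / s[1]))

bucketsX = to_buckets(rddX)  # rddX, rddY: hypothetical RDDs of (unix_ts, value)
bucketsY = to_buckets(rddY)

# An inner join keeps only the buckets present in both series, so the two
# value RDDs below end up with identical partitioning and cardinality.
joined = bucketsX.join(bucketsY).sortByKey().values().cache()
alignedX = joined.map(lambda xy: xy[0])
alignedY = joined.map(lambda xy: xy[1])

print(Statistics.corr(alignedX, alignedY, method="pearson"))

Because alignedX and alignedY are both projections of the same joined RDD, they satisfy the same-partitions-and-cardinality requirement mentioned above.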

Update: I found a Python package called traces that does what I need. How can I incorporate it into the Spark computation, i.e., apply it to the partitions in parallel?
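
Regarding the update: traces is plain Python, so it can run inside executor-side functions. If the data consists of many independent series keyed by some id, one pattern is to group by key and interpolate each series onto a regular grid on the executors. Below is a minimal sketch, going by the TimeSeries item-assignment and sample() usage shown in the traces README; rdd, the (series_key, (timestamp, value)) record layout, and the one-minute period are all hypothetical:

import datetime
import traces

PERIOD = datetime.timedelta(minutes=1)  # hypothetical resampling period

def interpolate(points):
    # Runs on an executor for one series: build a traces.TimeSeries and
    # resample it onto a regular grid with linear interpolation.
    points = sorted(points)
    ts = traces.TimeSeries()
    for t, v in points:
        ts[t] = v
    return ts.sample(sampling_period=PERIOD,
                     start=points[0][0], end=points[-1][0],
                     interpolate='linear')

# rdd: hypothetical RDD of (series_key, (datetime, value)) records
regular = rdd.groupByKey().mapValues(interpolate)

Note that this parallelizes across series, not within a single series; for one unkeyed series you would either collect it to the driver and interpolate there, or ensure each partition covers a self-contained time range before applying traces per partition with mapPartitions.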

0 Answers:

No answers.