How to interpolate and check the correlation of two RDDs with different cardinalities in Spark?

Asked: 2016-10-14 19:09:44

Tags: apache-spark statistics pyspark apache-spark-mllib

I want to check the correlation of two RDDs over time, but they do not have the same cardinality (that is, they contain different numbers of data points, because the data was collected at different timestamps). From the Statistics API I can see that the two inputs must have the same number of partitions and the same cardinality. Below is some example code. I have looked for an interpolation library for Spark that would bring the two RDDs to the same number of partitions and cardinality, but I have not found one. So I would like to ask whether anyone has experience with, or suggestions for, this problem. Thanks.

from pyspark.mllib.stat import Statistics

sc = ... # SparkContext

seriesX = ... # a series
seriesY = ... # must have the same number of partitions and cardinality as seriesX

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a 
# method is not specified, Pearson's method will be used by default. 
print(Statistics.corr(seriesX, seriesY, method="pearson"))

data = ... # an RDD of Vectors
# Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))
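
One way around the cardinality mismatch, for illustration, is to resample both series onto a common time grid and inner-join on the grid key, so that only buckets present in both series survive. Below is a minimal sketch, assuming each input RDD holds (unix-timestamp, value) pairs; rddX, rddY, and the 60-second bucket width are hypothetical choices:

from pyspark.mllib.stat import Statistics

BUCKET = 60  # bucket width in seconds (an assumption; tune to your sampling rates)

def to_buckets(rdd):
    # Key each (timestamp, value) point by its time bucket and average the
    # values that fall into the same bucket, yielding one point per bucket.
    return (rdd.map(lambda tv: (int(tv[0]) // BUCKET, (tv[1], 1)))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda s: s[0] / s[1]))

bucketsX = to_buckets(rddX)  # rddX, rddY: hypothetical RDDs of (unix_ts, value)
bucketsY = to_buckets(rddY)

# An inner join keeps only the buckets present in both series, so the two
# value RDDs below end up with identical partitioning and cardinality.
joined = bucketsX.join(bucketsY).sortByKey().values().cache()
alignedX = joined.map(lambda xy: xy[0])
alignedY = joined.map(lambda xy: xy[1])

print(Statistics.corr(alignedX, alignedY, method="pearson"))

Because alignedX and alignedY are both projections of the same joined RDD, they satisfy the same-partitions-and-cardinality requirement mentioned above.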

Update: I found a Python package called traces that does what I need. How can I incorporate it into the Spark computation, i.e., apply it to the partitions in parallel?
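
Regarding the update: traces is plain Python, so it can run inside executor-side functions. If the data consists of many independent series keyed by some id, one pattern is to group by key and interpolate each series onto a regular grid on the executors. Below is a minimal sketch, going by the TimeSeries item-assignment and sample() usage shown in the traces README; rdd, the (series_key, (timestamp, value)) record layout, and the one-minute period are all hypothetical:

import datetime
import traces

PERIOD = datetime.timedelta(minutes=1)  # hypothetical resampling period

def interpolate(points):
    # Runs on an executor for one series: build a traces.TimeSeries and
    # resample it onto a regular grid with linear interpolation.
    points = sorted(points)
    ts = traces.TimeSeries()
    for t, v in points:
        ts[t] = v
    return ts.sample(sampling_period=PERIOD,
                     start=points[0][0], end=points[-1][0],
                     interpolate='linear')

# rdd: hypothetical RDD of (series_key, (datetime, value)) records
regular = rdd.groupByKey().mapValues(interpolate)

Note that this parallelizes across series, not within a single series; for one unkeyed series you would either collect it to the driver and interpolate there, or ensure each partition covers a self-contained time range before applying traces per partition with mapPartitions.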

0 Answers:

No answers.