PySpark: how to assess multicollinearity? An equivalent of VIF

Time: 2018-04-08 23:03:53

Tags: python pyspark

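As the title says, I am trying to find a way to assess multicollinearity in PySpark. Normally I would use the VIF (variance inflation factor) from statsmodels, but I cannot find equivalent functionality in PySpark.

Any suggestions on how to calculate multicollinearity would be much appreciated.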

1 answer:

Answer 0: (score: 0)

You can compute the correlation matrix and check it for strongly correlated feature pairs:

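import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

# Reuse the active SparkContext (e.g. in the pyspark shell) or create one.
sc = SparkContext.getOrCreate()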

seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0])  # a series
# seriesY must have the same number of partitions and cardinality as seriesX
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))

data = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([5.0, 33.0, 366.0])]
)  # an RDD of Vectors

# calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))

Documentation: https://spark.apache.org/docs/latest/mllib-statistics.html
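
To get closer to what the question asks for, the correlation matrix can be turned into VIF values. This is only a rough sketch: it relies on the standard identity that the VIF of feature j equals the j-th diagonal entry of the inverse Pearson correlation matrix, which is the same as 1 / (1 - R_j^2). The helper name `vif_from_corr` is made up for illustration and is not part of PySpark or statsmodels, and the sketch assumes the matrix is small enough to invert on the driver with NumPy.

```python
import numpy as np


def vif_from_corr(corr):
    """Return one VIF per feature from a square Pearson correlation matrix.

    Hypothetical helper (not a PySpark API). Uses VIF_j = (R^-1)[j, j].
    """
    corr = np.asarray(corr, dtype=float)
    if corr.ndim != 2 or corr.shape[0] != corr.shape[1]:
        raise ValueError("corr must be a square matrix")
    # With perfectly collinear features the matrix is singular, so the
    # inversion raises LinAlgError or returns extremely large values.
    inverse = np.linalg.inv(corr)
    return np.diag(inverse)


# Two features with correlation 0.6: each VIF is 1 / (1 - 0.6**2),
# i.e. approximately 1.5625.
example_corr = np.array([[1.0, 0.6],
                         [0.6, 1.0]])
print(vif_from_corr(example_corr))
```

With Spark, the input would be the array returned by `Statistics.corr(data, method="pearson")`. Note that the tiny three-row sample in the answer has almost perfectly correlated columns, so its VIFs would be huge or the inversion would fail. A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity.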