Question

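I am trying to find a way to assess multicollinearity in PySpark. Normally I would use the VIF (variance inflation factor) from statsmodels, but I can't find equivalent functionality in PySpark.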

Any suggestions on how to calculate multicollinearity would be much appreciated.

Answer 1

You can compute the correlation matrix:

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext.getOrCreate()  # reuse the active SparkContext, or create one

seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.0, 5.0])  # a series
# seriesY must have the same number of partitions and cardinality as seriesX
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])

# Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))

data = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([5.0, 33.0, 366.0])]
)  # an RDD of Vectors

# Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
# If a method is not specified, Pearson's method will be used by default.
print(Statistics.corr(data, method="pearson"))

Documentation: https://spark.apache.org/docs/latest/mllib-statistics.html
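If you want VIF values rather than pairwise correlations, one option is to derive them from the correlation matrix: for standardized features, the VIF of feature j equals the j-th diagonal entry of the inverse of the Pearson correlation matrix. The sketch below is only an illustration under that assumption. The helper name `vif_from_corr` and the example matrix are made up for this answer, and the inversion happens in NumPy on the driver, so it assumes the number of features is small enough for the matrix to fit in memory.

```python
import numpy as np


def vif_from_corr(corr):
    """Return one VIF per feature: the diagonal of the inverse correlation matrix.

    Hypothetical helper, not part of PySpark or statsmodels.
    """
    corr = np.asarray(corr, dtype=float)
    # With perfectly collinear features the matrix is singular, and
    # np.linalg.inv may raise LinAlgError or return extremely large values.
    return np.diag(np.linalg.inv(corr))


# Made-up correlation matrix for three features, used only for illustration.
# In PySpark this would be the array returned by
# Statistics.corr(data, method="pearson").
example_corr = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

vifs = vif_from_corr(example_corr)
print(vifs)
```

Here the first two features are strongly correlated with each other, so their VIF is 1 / (1 - 0.9^2), roughly 5.26, while the uncorrelated third feature has a VIF of 1. A common rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity.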
