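With pandas, you can specify the minimum number of observations required per pair of columns when building a correlation matrix.
Like this: corrMatrix = df.corr(method='pearson', min_periods=100)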
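
To make the pandas behavior concrete, here is a small sketch on made-up data (the values and the threshold of 3 are assumptions chosen for illustration, not taken from the real dataset). Any pair of columns with fewer than min_periods rows where both values are present comes back as NaN:

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration): col3 has only two non-missing values.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col2": [2.0, 4.0, 6.0, 8.0, 10.0],
    "col3": [1.0, np.nan, np.nan, 3.0, np.nan],
})

# Require at least 3 joint observations per pair of columns.
corr_matrix = df.corr(method="pearson", min_periods=3)
print(corr_matrix)
```

Here col1 and col2 share five observations, so their correlation is computed, while col1 and col3 share only two, so that entry is NaN.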

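I would like to do the same thing with pyspark.

I managed to create the correlation matrix with pyspark, but I don't know how to set the minimum number of observations.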

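import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

vector_col = "corr_features"
col_names = ["col1", "col2", "col3"]

# Pack the input columns into one vector column; "keep" retains rows with nulls (they become NaN).
assembler = VectorAssembler(inputCols=col_names, outputCol=vector_col)
df_vector = assembler.setHandleInvalid("keep").transform(df).select(vector_col)
matrix = Correlation.corr(df_vector, vector_col)

# Pull the Pearson matrix out of the single result row and label it with the column names.
r = matrix.collect()[0]["pearson({})".format(vector_col)].values
corrMatrix = pd.DataFrame(r.reshape(-1, len(col_names)), columns=col_names, index=col_names)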

Minimum number of observations for a correlation matrix in pyspark

0 answers: