Spark 2.2.0 adds correlation support for DataFrames. See the pull request for details.
New algorithms in the MLlib DataFrame-based API:
SPARK-19636: Correlation in the DataFrame-based API (Scala/Java/Python)
However, it is entirely unclear how to use this change, or what has changed compared with the previous version.
What I expected:
df_num = spark.read.parquet('/dataframe')
df_num.printSchema()
df_num.show()
df_num.corr(col1='features', col2='fail_mode_meas')
root
|-- features: vector (nullable = true)
|-- fail_mode_meas: double (nullable = true)
+--------------------+--------------+
| features|fail_mode_meas|
+--------------------+--------------+
|[0.0,0.5,0.0,0.0,...| 22.7|
|[0.9,0.0,0.7,0.0,...| 0.1|
|[0.0,5.1,1.0,0.0,...| 2.0|
|[0.0,0.0,0.0,0.0,...| 3.1|
|[0.1,0.0,0.0,1.7,...| 0.0|
...
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Currently correlation calculation for columns with dataType org.apache.spark.ml.linalg.VectorUDT not supported.'
Can someone explain how to use the new Spark 2.2.0 feature to compute correlations on a DataFrame?
Answer 0 (score: 2):
There is no method that can be used directly to achieve what you want. The new functionality is in pyspark.ml.stat:
from pyspark.ml.stat import Correlation
Correlation.corr(df_num, "features")
but this method computes the correlation matrix of a single Vector column.
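For reference, a minimal sketch of how the result can be read back (assuming df_num from the question; the result is a one-row DataFrame whose single cell holds the matrix):
from pyspark.ml.stat import Correlation

corr_df = Correlation.corr(df_num, "features", method="pearson")
matrix = corr_df.head()[0]       # a pyspark.ml.linalg DenseMatrix
print(matrix.toArray())          # pairwise correlations of the entries of "features"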
You could:
- assemble features and fail_mode_meas into a single vector with VectorAssembler and apply pyspark.ml.stat.Correlation afterwards (a sketch follows this list), but it will compute a number of redundant values;
- expand the vector column and use pyspark.sql.functions.corr, but that gets expensive for a large number of columns and adds significant overhead when used with a Python udf.
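A minimal sketch of the first option, assuming the column names from the question (the output column name all_features is only illustrative):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# merge the existing vector column and the scalar column into one vector
assembler = VectorAssembler(inputCols=["features", "fail_mode_meas"],
                            outputCol="all_features")
assembled = assembler.transform(df_num)

# the last row/column of the matrix corresponds to fail_mode_meas
corr_matrix = Correlation.corr(assembled, "all_features").head()[0].toArray()
print(corr_matrix[-1])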
Answer 1 (score: 0):
Try this to get the correlation between all variables -
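A sketch of one way to do this (not necessarily this answer's original code): assemble every numeric column into a vector and use the DataFrame-based Correlation. The column-selection rule and the _corr_vec name are assumptions.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql.types import NumericType

# collect every numeric column of the DataFrame
num_cols = [f.name for f in df_num.schema.fields if isinstance(f.dataType, NumericType)]

# assemble them into one vector column and compute the full correlation matrix
vec_df = VectorAssembler(inputCols=num_cols, outputCol="_corr_vec").transform(df_num)
corr_matrix = Correlation.corr(vec_df, "_corr_vec", method="pearson").head()[0]
print(corr_matrix.toArray())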
Available from Spark 2.2.0.