I copied and pasted this example from the documentation into the Spark 2.3.0 shell.
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
(7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
(8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
(9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)
val df = spark.createDataset(data).toDF("id", "features", "clicked")
val selector = new ChiSqSelector()
.setNumTopFeatures(1)
.setFeaturesCol("features")
.setLabelCol("clicked")
.setOutputCol("selectedFeatures")
val selectorModel = selector.fit(df)
val result = selectorModel.transform(df)
result.show
+---+------------------+-------+----------------+
| id|          features|clicked|selectedFeatures|
+---+------------------+-------+----------------+
|  7|[0.0,0.0,18.0,1.0]|    1.0|          [18.0]|
|  8|[0.0,1.0,12.0,0.0]|    0.0|          [12.0]|
|  9|[1.0,0.0,15.0,0.1]|    0.0|          [15.0]|
+---+------------------+-------+----------------+
selectorModel.selectedFeatures
res2: Array[Int] = Array(2)
ChiSqSelector incorrectly selects feature 2 instead of feature 3 (based on the documentation and common sense, feature 3 should be the correct choice).
Answer 0 (score: 1)
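Chi-Squared feature selection operates on categorical data. From the documentation: "ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features."
Treated as categories, features 2 and 3 each take three distinct values across the three rows, so both features are equally good (although we should emphasize that both of them, even if used as continuous variables, can be used to derive a trivial perfect classifier). The chi-squared test reports the same p-value for both: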
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
Statistics.chiSqTest(sc.parallelize(data.map {
case (_, v, l) => LabeledPoint(l, OldVectors.fromML(v))
})).slice(2, 4)
Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 3.0
pValue = 0.22313016014843035
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 3.0000000000000004
pValue = 0.22313016014843035
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..)
The test results are consistent with other tools. For example, in R (used as a reference for selector tests):
y <- as.factor(c("1.0", "0.0", "0.0"))
x2 <- as.factor(c("18.0", "12.0", "15.0"))
x3 <- as.factor(c("1.0", "0.0", "0.1"))
chisq.test(table(x2, y))
Pearson's Chi-squared test
data: table(x2, y)
X-squared = 3, df = 2, p-value = 0.2231
Warning message:
In chisq.test(table(x2, y)) : Chi-squared approximation may be incorrect
chisq.test(table(x3, y))
Pearson's Chi-squared test
data: table(x3, y)
X-squared = 3, df = 2, p-value = 0.2231
Warning message:
In chisq.test(table(x3, y)) : Chi-squared approximation may be incorrect
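The statistic that both tools report can also be reproduced by hand. Below is a minimal sketch in plain Scala (no Spark), assuming each distinct feature value is treated as its own category. The names `ChiSquaredByHand`, `chiSquared` and `pValueDf2` are made up for this illustration and are not part of any library. The p-value helper is valid only for 2 degrees of freedom, where the chi-squared survival function reduces to exp(-x / 2).

```scala
object ChiSquaredByHand {
  // Pearson chi-squared statistic and degrees of freedom for one categorical
  // feature against a categorical label (hypothetical helper, not a Spark API).
  def chiSquared(feature: Seq[Double], label: Seq[Double]): (Double, Int) = {
    require(feature.length == label.length, "feature and label must have equal length")
    val n = feature.length.toDouble

    // Marginal counts per feature value (rows) and per label value (columns).
    val rowTotals = feature.groupBy(identity).map { case (k, v) => k -> v.length.toDouble }
    val colTotals = label.groupBy(identity).map { case (k, v) => k -> v.length.toDouble }

    // Observed counts per (feature value, label value) cell.
    val observed = feature.zip(label).groupBy(identity).map { case (k, v) => k -> v.length.toDouble }

    // Sum (observed - expected)^2 / expected over every cell, including empty ones.
    val statistic = (for {
      (f, rowTotal) <- rowTotals.toSeq
      (l, colTotal) <- colTotals.toSeq
    } yield {
      val expected = rowTotal * colTotal / n
      val obs = observed.getOrElse((f, l), 0.0)
      (obs - expected) * (obs - expected) / expected
    }).sum

    val degreesOfFreedom = (rowTotals.size - 1) * (colTotals.size - 1)
    (statistic, degreesOfFreedom)
  }

  // Upper-tail p-value, valid ONLY for 2 degrees of freedom.
  def pValueDf2(statistic: Double): Double = math.exp(-statistic / 2.0)

  def main(args: Array[String]): Unit = {
    val y  = Seq(1.0, 0.0, 0.0)
    val x2 = Seq(18.0, 12.0, 15.0) // feature 2 from the question
    val x3 = Seq(1.0, 0.0, 0.1)    // feature 3 from the question

    for ((name, x) <- Seq("feature 2" -> x2, "feature 3" -> x3)) {
      val (stat, df) = chiSquared(x, y)
      println(s"$name: statistic = $stat, df = $df, pValue = ${pValueDf2(stat)}")
    }
  }
}
```

Both features should come out with a statistic of about 3, df = 2 and a p-value of about 0.2231, up to floating-point rounding. That agrees with the Spark and R output above.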
Since the selector just sorts data by p-value and sortBy is stable, it is first come, first served. If you switch the order of the features, the other one will be selected.
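The tie-breaking behaviour can be illustrated without Spark. This is a minimal sketch, assuming the selection amounts to "sort (p-value, feature index) pairs by p-value and take the first N". `StableTieBreak` and `topFeatures` are hypothetical names for this illustration, not the selector's actual code. Only the two tied features are included, using the p-value from the test output above.

```scala
object StableTieBreak {
  // p-value reported for both feature 2 and feature 3 in the test output.
  val tiedPValue = 0.22313016014843035

  // Hypothetical stand-in for the selection step: sort by p-value, keep the
  // indices of the first numTop entries. Scala's sortBy is a stable sort, so
  // entries with equal p-values keep their original relative order.
  def topFeatures(pValuesWithIndex: Seq[(Double, Int)], numTop: Int): Seq[Int] =
    pValuesWithIndex.sortBy(_._1).take(numTop).map(_._2)

  def main(args: Array[String]): Unit = {
    val originalOrder = Seq((tiedPValue, 2), (tiedPValue, 3))
    val swappedOrder  = Seq((tiedPValue, 3), (tiedPValue, 2))

    println(topFeatures(originalOrder, 1)) // prints List(2)
    println(topFeatures(swappedOrder, 1))  // prints List(3)
  }
}
```

With equal p-values the winner is decided purely by input position, which is why feature 2 is selected in the question.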