Spark interaction between a categorical column ("country") and a continuous column ("hour").
import org.apache.spark.ml.feature.RFormula
val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")
val formula = new RFormula()
  .setFormula("clicked ~ country:hour")
  .setFeaturesCol("features")
  .setLabelCol("label")
val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
The result looks like this:
+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0,18.0]|  1.0|
|[12.0,0.0,0.0]|  0.0|
|[0.0,15.0,0.0]|  0.0|
+--------------+-----+
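The RFormula documentation describes the ":" operator as: interaction (multiplication for numeric values, or binarized categorical values).
Why doesn't the result look like this instead?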
+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0]     |  1.0|
|[12.0,0.0]    |  0.0|
|[0.0,15.0]    |  0.0|
+--------------+-----+
The interaction is the product of the categorical value and the numeric value...
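To make the arithmetic behind the observed three-element vectors concrete, here is a minimal plain-Scala sketch (no Spark required). It assumes the interaction is computed against a full one-hot encoding of "country", with no category dropped, and that the category indices are CA -> 0, NZ -> 1, US -> 2. Both assumptions are chosen to match the output shown above, not taken from the RFormula source. The names InteractionSketch, oneHot and interact are hypothetical helpers for illustration only.

```scala
// Sketch only: reproduces the observed feature vectors by hand.
// Assumption: category order CA -> 0, NZ -> 1, US -> 2 (matches the output above).
object InteractionSketch {
  val categories: Vector[String] = Vector("CA", "NZ", "US")

  // Full one-hot encoding: one slot per category, none dropped.
  def oneHot(country: String): Vector[Double] =
    categories.map(c => if (c == country) 1.0 else 0.0)

  // Interaction term: every indicator multiplied by the numeric value.
  def interact(country: String, hour: Double): Vector[Double] =
    oneHot(country).map(_ * hour)

  def main(args: Array[String]): Unit = {
    val rows = Seq(("US", 18.0), ("CA", 12.0), ("NZ", 15.0))
    rows.foreach { case (country, hour) =>
      println(interact(country, hour).mkString("[", ",", "]"))
    }
  }
}
```

Under these assumptions each row has exactly one non-zero slot, at the position of its own country, which gives three columns. The two-column result expected in the question would instead correspond to dropping one category (here US) before multiplying.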