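Interaction between a continuous and a categorical variable in RFormula: why is the result like this?

Asked: 2019-05-14 12:05:29

Tags: apache-spark

I am using Spark to compute the interaction between a categorical column ("country") and a continuous column ("hour").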

import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

val formula = new RFormula()
  .setFormula("clicked ~ country:hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

The result looks like this:

+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0,18.0]|  1.0|
|[12.0,0.0,0.0]|  0.0|
|[0.0,15.0,0.0]|  0.0|
+--------------+-----+

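The RFormula documentation describes the ":" operator as: interaction (multiplication for numeric values, or binarized categorical values).

Why doesn't the result look like this instead?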

+--------------+-----+
|      features|label|
+--------------+-----+
|[0.0,0.0]     |  1.0|
|[12.0,0.0]    |  0.0|
|[0.0,15.0]    |  0.0|
+--------------+-----+

The interaction is the product of the categorical value and the numeric value...
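
To make the difference between the two tables concrete, here is a minimal sketch in plain Scala (no Spark involved) of "binarized categorical value times numeric value". The names InteractionSketch, oneHot, interact and the dropLast flag are hypothetical, and the level order CA, NZ, US is an assumption read off the output shown above. It only illustrates the arithmetic and is not a statement of how RFormula is implemented internally.

```scala
// Plain-Scala sketch of an interaction between a binarized category and a number.
object InteractionSketch {
  // Assumed level order, inferred from the positions of the non-zero values above.
  val levels: Seq[String] = Seq("CA", "NZ", "US")

  // One indicator per level; dropLast = true omits the last level as a reference level.
  def oneHot(country: String, dropLast: Boolean): Seq[Double] = {
    val kept = if (dropLast) levels.init else levels
    kept.map(level => if (level == country) 1.0 else 0.0)
  }

  // Interaction term: multiply every indicator by the continuous value.
  def interact(country: String, hour: Double, dropLast: Boolean): Seq[Double] =
    oneHot(country, dropLast).map(_ * hour)

  def main(args: Array[String]): Unit = {
    val rows = Seq(("US", 18.0), ("CA", 12.0), ("NZ", 15.0))
    for ((country, hour) <- rows) {
      val full = interact(country, hour, dropLast = false).mkString("[", ",", "]")
      val reduced = interact(country, hour, dropLast = true).mkString("[", ",", "]")
      println(s"$country all levels: $full, last level dropped: $reduced")
    }
  }
}
```

With dropLast = false the sketch yields the three-element vectors from the actual output, and with dropLast = true it yields the two-element vectors from the expected table. So the two tables seem to differ only in whether one level is dropped as a reference level before the multiplication.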

0 answers:

No answers yet