Question

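I am trying to understand how spark.ml handles string-valued categorical independent variables. I know that in Spark I have to use StringIndexer to convert strings to doubles, e.g. "a"/"b"/"c" => 0.0/1.0/2.0.
What I would really like to avoid is then having to run OneHotEncoder on that column of doubles. It seems to clutter the pipeline unnecessarily, especially since Spark already knows the data is categorical. Hopefully the sample code below makes my question clearer.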
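
For context, the outputs that follow refer to a `model` and an `assembled` DataFrame built from an indexed, but not one-hot-encoded, string column. A minimal sketch of such a setup is given here, assuming Spark 2.x or later (on 1.x, `sqlContext.createDataFrame` would take the place of the `SparkSession`) and a hypothetical four-row toy dataset. The data values and the names `df` and `indexed` are illustrative assumptions, not necessarily the code that produced the outputs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.LogisticRegression

// Assumption: Spark 2.x+, run in spark-shell or a Scala worksheet.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("categorical-sketch")
  .getOrCreate()

// Hypothetical toy data: binary label "y", string categorical feature "x".
val df = spark.createDataFrame(Seq(
  (0.0, "a"), (1.0, "b"), (1.0, "c"), (0.0, "c")
)).toDF("y", "x")

// Map each distinct string to a double index (by default the most frequent value gets 0.0).
val indexed = new StringIndexer()
  .setInputCol("x")
  .setOutputCol("xIdx")
  .fit(df)
  .transform(df)

// Wrap the index column in a single-element feature vector.
val assembled = new VectorAssembler()
  .setInputCols(Array("xIdx"))
  .setOutputCol("features")
  .transform(indexed)

// Fit logistic regression directly on the indexed column, with no one-hot encoding.
val model = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("y")
  .fit(assembled)
```

On Spark 2.x and later, `model.coefficients` is an `org.apache.spark.ml.linalg.Vector` rather than the `mllib` type shown in the output below.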

model.coefficients
res1: org.apache.spark.mllib.linalg.Vector = [0.7667490491775728]

Logistic regression treats this as a model with only one independent variable.

import org.apache.spark.ml.attribute.AttributeGroup
AttributeGroup.fromStructField(assembled.schema("features"))
res2: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"attrs":
{"nominal":[{"vals":["c","a","b"],"idx":0,"name":"xIdx"}]},
"num_attrs":1}}

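But the independent variable is categorical, with three categories: ["a", "b", "c"]. I know I never did a one-of-k encoding, but the DataFrame's metadata knows that the feature vector is nominal.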

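How do I pass this information to LogisticRegression? Isn't that the whole point of keeping DataFrame metadata? There does not seem to be a CategoricalFeaturesInfo in SparkML. Do I really need to do a one-of-k encoding for every categorical feature?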

Answer 1

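Maybe I am missing something, but this looks like a job for RFormula (https://spark.apache.org/docs/latest/ml-features.html#rformula).

As the name suggests, it takes an "R-style" formula that describes how the feature vector is composed from the input data columns.

For each categorical input column (that is, a column of type StringType), it adds a StringIndexer + OneHotEncoder to the final pipeline that implements the formula under the hood.

The output is a feature vector (of doubles) that can be used with any algorithm in the org.apache.spark.ml package, which is what you are targeting.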
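
To make this concrete, here is a minimal sketch of how RFormula might be used in place of a hand-written StringIndexer + OneHotEncoder chain. It assumes Spark 2.x or later and a hypothetical toy DataFrame with a numeric label column `y` and a string column `x`. The data values and the names `df`, `formula`, and `prepared` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.classification.LogisticRegression

// Assumption: Spark 2.x+, run in spark-shell or a Scala worksheet.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rformula-sketch")
  .getOrCreate()

// Hypothetical toy data: binary label "y", string categorical feature "x".
val df = spark.createDataFrame(Seq(
  (0.0, "a"), (1.0, "b"), (1.0, "c"), (0.0, "c")
)).toDF("y", "x")

// "y ~ x" means: y is the label, x is the only predictor.
// Because x is a string column, RFormula indexes and one-hot encodes it internally.
val formula = new RFormula()
  .setFormula("y ~ x")
  .setFeaturesCol("features")
  .setLabelCol("label")

// Fitting learns the string-to-index mapping; transforming adds "features" and "label".
val prepared = formula.fit(df).transform(df)

val model = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(prepared)

// One category serves as the reference level, so three categories
// should yield two coefficients rather than one.
println(model.coefficients)
```

Since RFormula is itself an estimator, it can also be placed as a stage in a `Pipeline` ahead of the classifier, which keeps the encoding out of the user-visible pipeline definition.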

Spark and categorical string variables
