Sorry, this is a general doubt rather than a very specific question, but I have been reading about and testing the different preprocessing Transformers in Spark ML, and some things are still not entirely clear to me.
Please correct me if any of my assumptions are wrong.
So, I understand that every categorical attribute needs to be transformed (indexed) into a numeric value with StringIndexer. Then, if I want to use linear or logistic regression, I have to create a binary vector for each categorical attribute with OneHotEncoder afterwards. But if I want to use decision trees or random forests, do I need VectorIndexer, or is no further transformation needed? When is it useful or necessary? I still don't quite understand what it is for.
For example, I am running some tests on this dataset:
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
The last column is the value to predict. Here is my code for loading and preprocessing the data:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# abalone.data has no header row, hence header=None
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header=None)
df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
# assumes a SQLContext is already in scope (e.g. in a notebook)
train_dataset = sqlContext.createDataFrame(df)

column_types = train_dataset.dtypes
categoricalCols = []
numericCols = []
for column_name, column_dtype in column_types:
    if column_name == 'Rings':
        continue  # label column: keep it out of the feature columns
    if column_dtype == 'string':
        categoricalCols += [column_name]
    else:
        numericCols += [column_name]

stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]

# list comprehension rather than map(), so this also works on Python 3,
# where map() returns an iterator that cannot be concatenated with a list
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='Rings', outputCol='indexedLabel')
stages += [labelIndexer]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(train_dataset)
train_dataset = pipelineModel.transform(train_dataset)
With this, is my dataset ready to be used with decision forests? Is there a better way to do what I am doing? And if I want to use logistic regression, do I just need to add a OneHotEncoder transformation inside the for loop?
For some context: I am trying to write a more or less generic piece of code that lets me run experiments with different datasets and different ML algorithms.
Thanks in advance!