Sorry, this is a general doubt rather than a very specific question, but I have been reading about and testing the different preprocessing Transformers in Spark ML, and some things are still not entirely clear to me.
Please correct me if any of my assumptions are wrong.
So, I understand that every categorical attribute needs to be transformed (indexed) into a numeric value with StringIndexer. Then, if I want to use linear or logistic regression, I have to create a binary vector for each categorical attribute with OneHotEncoder afterwards. But if I want to use decision trees or random forests, do I need VectorIndexer, or is no further transformation needed? When is it useful or necessary? I still don't quite understand what it is for.
For example, I am running some tests on this dataset:
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
The last column is the value to predict. Here is my code for loading and preprocessing the data:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# abalone.data has no header row, hence header=None
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header=None)
df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
# assumes a SQLContext is already in scope (e.g. in a notebook)
train_dataset = sqlContext.createDataFrame(df)

column_types = train_dataset.dtypes
categoricalCols = []
numericCols = []
for column_name, column_dtype in column_types:
    if column_name == 'Rings':
        continue  # label column: keep it out of the feature columns
    if column_dtype == 'string':
        categoricalCols += [column_name]
    else:
        numericCols += [column_name]

stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]

# list comprehension rather than map(), so this also works on Python 3,
# where map() returns an iterator that cannot be concatenated with a list
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='Rings', outputCol='indexedLabel')
stages += [labelIndexer]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(train_dataset)
train_dataset = pipelineModel.transform(train_dataset)
With this, is my dataset ready to be used with decision forests? Is there a better way to do what I am doing? And if I want to use logistic regression, do I just need to add a OneHotEncoder transformation inside the for loop?
For some context: I am trying to write a more or less generic piece of code that lets me run experiments with different datasets and different ML algorithms.
Thanks in advance!