Preparing DataFrames with Transformers to run Spark ML algorithms (pyspark)

Date: 2017-04-26 16:02:32

Tags: python machine-learning pyspark spark-dataframe

Sorry, this is more of a general doubt than a very specific question, but I have been reading about and testing different preprocessing steps with the Transformers in Spark ML, and some things are still not entirely clear to me.

Please correct me if any of my assumptions are wrong.

So, as I understand it, every categorical attribute needs to be converted (indexed) into a numeric value with StringIndexer. Then, if I want to use linear or logistic regression, I have to create a binary vector for each of those categorical attributes afterwards with OneHotEncoder. But if I want to use a decision tree or a random forest, do I need VectorIndexer, or is no further transformation required? When is it useful or necessary? I still don't quite understand what it is for; I sketch my current understanding below.
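
From what I have read, VectorIndexer scans an already-assembled feature vector and flags as categorical every feature with only a few distinct values, so that tree-based models can split on those features as categories rather than as continuous numbers. This is a minimal sketch of how I understand it would be used (maxCategories=4 is just an arbitrary threshold of mine, and "features" is the output of the VectorAssembler in my code further down):

from pyspark.ml.feature import VectorIndexer

# Flag any feature with <= 4 distinct values as categorical;
# the rest are kept as continuous features.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
train_dataset = featureIndexer.fit(train_dataset).transform(train_dataset)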

For example, I am running some tests on this dataset:

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7 

The last column is the value to be predicted. So this is my code for loading and preprocessing the data:

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Load the abalone data with pandas (the file has no header row) and
# convert it to a Spark DataFrame
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header=None)
df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
train_dataset = sqlContext.createDataFrame(df)

# Split the columns into categorical (string-typed) and numeric ones,
# leaving out 'Rings', which is the label
column_types = train_dataset.dtypes

categoricalCols = []
numericCols = []

for column_type in column_types:
    if column_type[0] == 'Rings':
        continue
    if column_type[1] == 'string':
        categoricalCols += [column_type[0]]
    else:
        numericCols += [column_type[0]]

# Index every categorical column into a numeric column
stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]

# Assemble the indexed categorical columns and the numeric columns into a
# single feature vector (a list comprehension instead of map(), which
# returns an iterator in Python 3 and cannot be concatenated to a list)
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Index the label column as well
labelIndexer = StringIndexer(inputCol='Rings', outputCol='indexedLabel')
stages += [labelIndexer]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(train_dataset)
train_dataset = pipelineModel.transform(train_dataset)
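
To sanity-check the output of the pipeline I have been doing something like this (just a sketch, nothing tuned; I am treating Rings as a class label here because of the StringIndexer above):

from pyspark.ml.classification import DecisionTreeClassifier

# Inspect the assembled feature vector and the indexed label
train_dataset.select('features', 'indexedLabel').show(5, truncate=False)

# Trial fit on the preprocessed data
dt = DecisionTreeClassifier(featuresCol='features', labelCol='indexedLabel')
dtModel = dt.fit(train_dataset)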

With this, is my dataset ready for a decision forest? Is there a better way to do what I am doing? And if I want to use logistic regression, do I just need to add a OneHotEncoder transformation inside the for loop, as in the sketch below? For some context, I am trying to write more or less generic code that lets me run experiments with different datasets and different ML algorithms.
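
This is the kind of change I mean for the logistic regression case (a sketch, assuming Spark 2.x, where OneHotEncoder is a plain Transformer with inputCol/outputCol; I know the class works differently in later versions):

from pyspark.ml.feature import OneHotEncoder, StringIndexer

stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoder(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "Vec")
    stages += [stringIndexer, encoder]

# The assembler then takes the one-hot vectors instead of the raw indices
assemblerInputs = [c + "Vec" for c in categoricalCols] + numericCols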

Thanks in advance!

0 Answers:

No answers yet.