Can't figure out the cause of a Spark LinearRegression error

Date: 2016-09-08 00:19:48

Tags: apache-spark pyspark apache-spark-mllib

I'm trying to do a very simple LinearRegression in PySpark with a housing dataset I found on Kaggle. There are a bunch of columns, but to keep this (almost) as simple as possible I kept only two of them (after starting out with all of them), and I still have no luck getting the model to train. Here is what the dataframe looks like before the regression step:

    2016-09-07 17:12:08,804 root INFO [Row(price=78000.0, sqft_living=780.0, sqft_lot=16344.0, features=DenseVector([780.0, 16344.0])), Row(price=80000.0, sqft_living=430.0, sqft_lot=5050.0, features=DenseVector([430.0, 5050.0])), Row(price=81000.0, sqft_living=730.0, sqft_lot=9975.0, features=DenseVector([730.0, 9975.0])), Row(price=82000.0, sqft_living=860.0, sqft_lot=10426.0, features=DenseVector([860.0, 10426.0])), Row(price=84000.0, sqft_living=700.0, sqft_lot=20130.0, features=DenseVector([700.0, 20130.0])), Row(price=85000.0, sqft_living=830.0, sqft_lot=9000.0, features=DenseVector([830.0, 9000.0])), Row(price=85000.0, sqft_living=910.0, sqft_lot=9753.0, features=DenseVector([910.0, 9753.0])), Row(price=86500.0, sqft_living=840.0, sqft_lot=9480.0, features=DenseVector([840.0, 9480.0])), Row(price=89000.0, sqft_living=900.0, sqft_lot=4750.0, features=DenseVector([900.0, 4750.0])), Row(price=89950.0, sqft_living=570.0, sqft_lot=4080.0, features=DenseVector([570.0, 4080.0]))]
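
The question does not show how the features column was assembled; a typical way to build a DenseVector column like the one above is VectorAssembler. A minimal sketch, assuming the raw columns are named sqft_living and sqft_lot as in the log output (the variable name raw_df is hypothetical):

    from pyspark.ml.feature import VectorAssembler

    # Combine the two predictor columns into a single vector column named 'features'.
    assembler = VectorAssembler(inputCols=['sqft_living', 'sqft_lot'], outputCol='features')
    data = assembler.transform(raw_df).select('price', 'sqft_living', 'sqft_lot', 'features')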

I train the model with the following code:

    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # `data` and `elastic_net_params` come from the surrounding code.
    standard_scaler = StandardScaler(inputCol='features',
                                     outputCol='scaled')
    lr = LinearRegression(featuresCol=standard_scaler.getOutputCol(), labelCol='price', weightCol=None,
                          maxIter=100, tol=1e-4)
    pipeline = Pipeline(stages=[standard_scaler, lr])
    grid = (ParamGridBuilder()
            .baseOn({lr.labelCol: 'price'})
            .addGrid(lr.regParam, [0.1, 1.0])
            .addGrid(lr.elasticNetParam, elastic_net_params or [0.0, 1.0])
            .build())
    ev = RegressionEvaluator(metricName="rmse", labelCol='price')
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=grid,
                        evaluator=ev,
                        numFolds=5)
    model = cv.fit(data).bestModel

Fitting the model fails with an error. Any ideas?

1 Answer:

Answer 0 (score: 1)

You can't use Pipeline in this case. When you call pipeline.fit, it roughly translates into:

    standard_scaler_model = standard_scaler.fit(dataframe)
    lr_model = lr.fit(dataframe)

but what you actually need is:

    standard_scaler_model = standard_scaler.fit(dataframe)
    dataframe = standard_scaler_model.transform(dataframe)
    lr_model = lr.fit(dataframe)

The error occurs because your lr.fit cannot find the output (i.e., the transformed result) of the StandardScaler model.
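
To make the suggestion concrete, here is a minimal sketch that applies it to the column names and hyperparameters from the question, fitting and applying the scaler by hand before training the regression; the variable data stands for the dataframe shown in the question:

    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.regression import LinearRegression

    # Fit the scaler and materialize the 'scaled' column before LinearRegression sees the data.
    standard_scaler = StandardScaler(inputCol='features', outputCol='scaled')
    scaler_model = standard_scaler.fit(data)
    scaled_data = scaler_model.transform(data)

    # Train on the scaled features; 'scaled' now exists as a column of scaled_data.
    lr = LinearRegression(featuresCol='scaled', labelCol='price', maxIter=100, tol=1e-4)
    lr_model = lr.fit(scaled_data)

If cross-validation is still wanted, the same scaled_data can be passed to a CrossValidator whose estimator is just lr rather than the whole pipeline.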