Can't figure out how to use LinearRegression in pyspark 1.6 with python 2.7

Asked: 2017-04-14 08:05:54

Tags: python apache-spark pyspark

My Spark version is 1.6.
My Python version is 2.7.

My data looks like this:

x = [300,400,500,500,800,1000,1000,1300]  
y = [9500,10300,11000,12000,12400,13400,14500,15300]


+----+-----+
|   x|    y|
+----+-----+
| 300| 9500|
| 400|10300|
| 500|11000|
| 500|12000|
| 800|12400|
|1000|13400|
|1000|14500|
|1300|15300|
+----+-----+

My broken code:

from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
import pandas as pd

sqlContext = SQLContext(sc)
# my data
x = [300, 400, 500, 500, 800, 1000, 1000, 1300]
y = [9500, 10300, 11000, 12000, 12400, 13400, 14500, 15300]

df = pd.DataFrame({'x': x, 'y': y})
df_spark = sqlContext.createDataFrame(df)

lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal", weightCol="weight")
model = lr.fit(df_spark)  # fails: the DataFrame has no "label", "features", or "weight" columns

I want to run it like this example:

>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
...     (1.0, 2.0, Vectors.dense(1.0)),
...     (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)

I can't figure out how to transform my data into the example's data type:

+-----+------+---------+
|label|weight| features|
+-----+------+---------+
|  1.0|   2.0|    [1.0]|
|  0.0|   2.0|(1,[],[])|
+-----+------+---------+

Any comments are greatly appreciated. Thanks for your help.

1 Answer:

Answer 0 (score: 0)

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors

# spark conf
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName('pyspark'))
sc = SparkContext(conf=conf)   # create the SparkContext before the SQLContext
sqlContext = SQLContext(sc)


df = sqlContext.createDataFrame([
(1.0, Vectors.dense(1.0)),
(3.0, Vectors.dense(2.0)),
(4.0, Vectors.dense(3.0)),
(5.0, Vectors.dense(4.0)),
(2.0, Vectors.dense(5.0)),
(3.0, Vectors.dense(6.0)),
(4.0, Vectors.dense(7.0)),
(0.0, Vectors.sparse(1, [], []))], ["label", "features"])

df.show()  # show() prints the DataFrame itself and returns None

lr = LinearRegression(maxIter=50, regParam=1.12)
model = lr.fit(df)
print(model.coefficients)
print(model.intercept)

Output:

[0.24955041614]  
1.87657354351

It works!
But the coefficients and intercept differ from statsmodels.api.OLS.

import numpy as np
import statsmodels.api as sm

Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)

model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.161
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.9608
Date:                Fri, 07 Apr 2017   Prob (F-statistic):              0.372
Time:                        02:09:45   Log-Likelihood:                -10.854
No. Observations:                   7   AIC:                             25.71
Df Residuals:                       5   BIC:                             25.60
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.1429      1.141      1.879      0.119        -0.789     5.075
x1             0.2500      0.255      0.980      0.372        -0.406     0.906
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.743
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.482
Skew:                           0.206   Prob(JB):                        0.786
Kurtosis:                       1.782   Cond. No.                         10.4
==============================================================================
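The discrepancy is expected: with regParam=1.12, Spark fits a ridge (L2-penalized) regression, while sm.OLS is unpenalized least squares. A numpy-only sketch below reproduces the statsmodels numbers via the closed-form OLS solution and shows that an L2 penalty shrinks the slope. (This is illustrative only, not Spark's exact computation: Spark's LinearRegression also standardizes features by default, so its coefficient of 0.2496 differs from this unstandardized formula.)

```python
import numpy as np

Y = np.array([1, 3, 4, 5, 2, 3, 4], dtype=float)
X = np.arange(1, 8, dtype=float)

# Plain OLS via the centered closed form: slope = Sxy / Sxx
xbar, ybar = X.mean(), Y.mean()
Sxx = ((X - xbar) ** 2).sum()            # 28.0
Sxy = ((X - xbar) * (Y - ybar)).sum()    # 7.0
slope_ols = Sxy / Sxx                    # 0.25, matches statsmodels x1
intercept_ols = ybar - slope_ols * xbar  # ~2.1429, matches statsmodels const

# Ridge with penalty lam on the slope only (intercept unpenalized),
# minimizing (1/2n)*RSS + (lam/2)*w^2  ->  slope = Sxy / (Sxx + n*lam)
lam = 1.12
n = len(X)
slope_ridge = Sxy / (Sxx + n * lam)      # shrunk below the OLS slope
```

With regParam=0.0 (and no elastic-net mixing) the Spark model should agree with OLS up to numerical tolerance.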