My Spark version is 1.6.
My Python version is 2.7.
My data is as follows:
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]
+----+-----+
|   x|    y|
+----+-----+
| 300| 9500|
| 400|10300|
| 500|11000|
| 500|12000|
| 800|12400|
|1000|13400|
|1000|14500|
|1300|15300|
+----+-----+
My failing code:
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
sqlContext = SQLContext(sc)
#my data
x = [300,400,500,500,800,1000,1000,1300]
y = [9500,10300,11000,12000,12400,13400,14500,15300]
df = pd.DataFrame({'x':x, 'y':y})
df_spark=sqlCtx.createDataFrame(df)
lr = LinearRegression(maxIter=50, regParam=0.0, solver="normal", weightCol="weight")
model = lr.fit(df)
I would like to run it the way this example does:
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([
... (1.0, 2.0, Vectors.dense(1.0)),
... (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
>>> lr = LinearRegression(maxIter=5, regParam=0.0, solver="normal", weightCol="weight")
>>> model = lr.fit(df)
I can't figure out how to convert my data into the data type used in the example.
+-----+------+---------+
|label|weight| features|
+-----+------+---------+
|  1.0|   2.0|    [1.0]|
|  0.0|   2.0|(1,[],[])|
+-----+------+---------+
Any comments are greatly appreciated. Thank you for your help.
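The conversion asked about here can be sketched roughly as follows. The helper names `to_rows` and `to_spark_df` are hypothetical, and the constant weight of 1.0 is an assumption, since the data in the question has no weight column (dropping `weightCol` from the estimator would be another option). The Spark import is deferred so that the pure-Python part runs without a Spark installation.

```python
def to_rows(xs, ys, weight=1.0):
    """Pair each y (label) with a weight and a one-element feature list."""
    # weight=1.0 is an assumed placeholder; the question's data has no weights.
    return [(float(y), float(weight), [float(x)]) for x, y in zip(xs, ys)]


def to_spark_df(sqlContext, xs, ys):
    """Build a DataFrame with label, weight and features columns (needs Spark)."""
    # Imported lazily so that to_rows() works without pyspark installed.
    from pyspark.mllib.linalg import Vectors
    rows = [(label, w, Vectors.dense(feats))
            for label, w, feats in to_rows(xs, ys)]
    return sqlContext.createDataFrame(rows, ["label", "weight", "features"])


x = [300, 400, 500, 500, 800, 1000, 1000, 1300]
y = [9500, 10300, 11000, 12000, 12400, 13400, 14500, 15300]

rows = to_rows(x, y)
print(rows[0])  # (9500.0, 1.0, [300.0])
```

With a live `sqlContext`, something like `lr.fit(to_spark_df(sqlContext, x, y))` should then receive the same column layout as the example, rather than a pandas DataFrame.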
Answer 0 (score: 0)
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.linalg import Vectors
# Spark configuration
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName('pyspark')
        )
# Create the SparkContext first; SQLContext needs it as an argument
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
(1.0, Vectors.dense(1.0)),
(3.0, Vectors.dense(2.0)),
(4.0, Vectors.dense(3.0)),
(5.0, Vectors.dense(4.0)),
(2.0, Vectors.dense(5.0)),
(3.0, Vectors.dense(6.0)),
(4.0, Vectors.dense(7.0)),
(0.0, Vectors.sparse(1, [], []))], ["label", "features"])
df.show()  # show() prints the table itself and returns None
lr = LinearRegression(maxIter=50, regParam=1.12)
model = lr.fit(df)
print(model.coefficients)
print(model.intercept)
Output:
[0.24955041614]
1.87657354351
It worked!
But the coefficient and intercept differ from those reported by statsmodels.api.OLS.
import numpy as np
import statsmodels.api as sm
Y = [1,3,4,5,2,3,4]
X = range(1,8)
X = sm.add_constant(X)
model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())
Output:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.161
Model:                            OLS   Adj. R-squared:                 -0.007
Method:                 Least Squares   F-statistic:                    0.9608
Date:                Fri, 07 Apr 2017   Prob (F-statistic):              0.372
Time:                        02:09:45   Log-Likelihood:                -10.854
No. Observations:                   7   AIC:                             25.71
Df Residuals:                       5   BIC:                             25.60
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.1429      1.141      1.879      0.119        -0.789     5.075
x1             0.2500      0.255      0.980      0.372        -0.406     0.906
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.743
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.482
Skew:                           0.206   Prob(JB):                        0.786
Kurtosis:                       1.782   Cond. No.                         10.4
==============================================================================
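Two things differ between the two fits, which likely explains the gap: the Spark call uses regParam=1.12 (a regularized fit, not plain OLS), and the Spark data contains an eighth row (label 0.0 with an empty sparse vector, i.e. x = 0) that the statsmodels data does not. The NumPy sketch below compares the cases. The function names are hypothetical, and the penalty scaling in `penalized_fit` (regParam times var(x) divided by std(y), using population moments) is an assumption about how the solver standardizes, not something taken from the Spark documentation; it does, however, reproduce the numbers printed above.

```python
import numpy as np


def ols_fit(x, y):
    """Plain least squares for a single feature: returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    slope = cov_xy / np.var(x)
    return slope, y.mean() - slope * x.mean()


def penalized_fit(x, y, reg_param):
    """L2-penalized fit for a single feature; the intercept is not penalized.

    ASSUMPTION: the penalty is scaled by var(x) / std(y) (population moments),
    mimicking a solver that standardizes features and label.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    var_x = np.var(x)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    penalty = reg_param * var_x / np.std(y)
    slope = cov_xy / (var_x + penalty)
    return slope, y.mean() - slope * x.mean()


# Seven points, as passed to statsmodels
x7 = [1, 2, 3, 4, 5, 6, 7]
y7 = [1, 3, 4, 5, 2, 3, 4]
# Eight points, as passed to Spark (the extra row is x = 0, label = 0)
x8 = x7 + [0]
y8 = y7 + [0]

slope7, icpt7 = ols_fit(x7, y7)                 # matches statsmodels
slope8, icpt8 = ols_fit(x8, y8)                 # extra row, no penalty
slope_p, icpt_p = penalized_fit(x8, y8, 1.12)   # extra row, with penalty

print(round(slope7, 4), round(icpt7, 4))
print(round(slope8, 4), round(icpt8, 4))
print(round(slope_p, 4), round(icpt_p, 4))
```

To compare like with like, one could drop the eighth row and set regParam=0.0 in the Spark call; the two libraries should then agree up to solver tolerance.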