Using DataFrames in PySpark

Time: 2018-03-01 06:11:46

Tags: python dataframe pyspark linear-regression

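I am trying to run a linear regression on a DataFrame in PySpark, but even after using a function to build the features and label fields, it still gives me an error. Can someone help me figure out how to run linear regression using a DataFrame?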

import pyspark.mllib
import pyspark.mllib.regression
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
#from pyspark.ml.regression import LinearRegression

My data looks like this:

df_all_shorted.head(2)

[Row(bonica_rid=u'cand1457', party=100, vote_date=u'2001-01-03', vote_choice=6, vs_idealPoint=-0.514169271337908, vs_cuttingpoint=-0.514169271337908, vs_rcdir=1, fecyear_new=u'1992', Cand_ID_new=u'H2MA11060', state_new=u'MA', recipient_cfscore_new=-0.758, num_givers_total_new=1533, cand_gender_new=u'M', total_receipts_new=169089.0, total_indiv_contrib_new=105870.0, total_pac_contribs_new=0.0, ran_primary_new=1, ran_general_new=1, district_partisanship_new=-0.119),
 Row(bonica_rid=u'cand1457', party=100, vote_date=u'2001-01-03', vote_choice=6, vs_idealPoint=-0.514169271337908, vs_cuttingpoint=-0.514169271337908, vs_rcdir=1, fecyear_new=u'1992', Cand_ID_new=u'H2MA11060', state_new=u'MA', recipient_cfscore_new=-0.758, num_givers_total_new=1533, cand_gender_new=u'M', total_receipts_new=0.0, total_indiv_contrib_new=0.0, total_pac_contribs_new=0.0, ran_primary_new=0, ran_general_new=0, district_partisanship_new=-0.119)]

training = df_all_shorted.map(lambda line:LabeledPoint(line[0],[line[1:]]))

I tried this code and got an error:

AttributeError: 'DataFrame' object has no attribute 'map'

So I changed it to

training = df_all_shorted.rdd.map(lambda line:LabeledPoint(line[0],[line[1:]]))

and it worked, but when I run 

lr = LinearRegression()\
.setMaxIter(10)\
.setRegParam(0.3)\
.setElasticNetParam(0.8)
lrModel = lr.fit(training)

I get this error:

AttributeError: 'PipelinedRDD' object has no attribute '_jdf'

1 answer:

Answer 0 (score: 0)

You are getting this error because the LinearRegression you are trying to use comes from pyspark.ml, not pyspark.mllib. Even after you commented out the line from pyspark.ml.regression import LinearRegression, your global namespace still recognizes LinearRegression as coming from the pyspark.ml module. Restart your session and run it again.