PySpark ML won't fit the model and always raises "AttributeError: 'PipelinedRDD' object has no attribute '_jdf'"

Asked: 2016-09-27 02:21:12

Tags: python apache-spark pyspark apache-spark-mllib

data = sqlContext.sql("""
    select a.churn, b.pay_amount, c.all_balance
    from db_bi.t_cust_churn a
    left join db_bi.t_cust_pay b on a.cust_id = b.cust_id
    left join db_bi.t_cust_balance c on a.cust_id = c.cust_id
    limit 5000
""").cache()

from pyspark.mllib.regression import LabeledPoint

def labelData(df):
    return df.map(lambda row: LabeledPoint(row[0], row[1:]))

traindata = labelData(data)  # this step works well
from pyspark.ml.classification import LogisticRegression   
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(lrdata)
AttributeError                            Traceback (most recent call last)
<ipython-input-40-b84a106121e6> in <module>()
----> 1 lrModel = lr.fit(lrdata)

/home/hadoop/spark/python/pyspark/ml/pipeline.pyc in fit(self, dataset, params)
     67                 return self.copy(params)._fit(dataset)
     68             else:
---> 69                 return self._fit(dataset)
     70         else:
     71             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/home/hadoop/spark/python/pyspark/ml/wrapper.pyc in _fit(self, dataset)
    131 
    132     def _fit(self, dataset):
--> 133         java_model = self._fit_java(dataset)
    134         return self._create_model(java_model)
    135 

/home/hadoop/spark/python/pyspark/ml/wrapper.pyc in _fit_java(self, dataset)
    128         """
    129         self._transfer_params_to_java()
--> 130         return self._java_obj.fit(dataset._jdf)
    131 
    132     def _fit(self, dataset):

AttributeError: 'PipelinedRDD' object has no attribute '_jdf'   

1 Answer:

Answer 0 (score: 0)

I guess you are following a tutorial written for the latest Spark version (2.0.1), which uses "from pyspark.ml.classification import LogisticRegression", while you need the API matching your own version, e.g. 1.6.2: "from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel". Note the different libraries: pyspark.ml estimators expect a DataFrame, which is why fit() looks for the _jdf attribute and fails on your PipelinedRDD, while pyspark.mllib trains on RDDs of LabeledPoint.
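
A minimal sketch of that RDD-based route, assuming Spark 1.6 and that traindata is the LabeledPoint RDD produced by labelData() in the question (the lrdata passed to fit() there is presumably the same RDD; the iteration and regularization values just mirror the ones tried above):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# LogisticRegressionWithLBFGS.train consumes an RDD of LabeledPoint directly
model = LogisticRegressionWithLBFGS.train(traindata, iterations=10, regParam=0.3)

# Sanity check: compare labels against predictions on a few training rows
labels_and_preds = traindata.map(lambda p: (p.label, model.predict(p.features)))
print(labels_and_preds.take(5))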
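
Alternatively, if you want to stay with pyspark.ml, the fix is to feed fit() a DataFrame instead of an RDD. A sketch assuming Spark 2.x, the column names from the query above, and that churn is already numeric (VectorAssembler and the label/features column convention are standard pyspark.ml API, but the exact preprocessing here is an assumption):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Pack the two feature columns into the single vector column pyspark.ml expects
assembler = VectorAssembler(inputCols=["pay_amount", "all_balance"], outputCol="features")
train_df = assembler.transform(data).withColumnRenamed("churn", "label")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(train_df)  # fit() now receives a DataFrame, so _jdf exists

Either way, the rule of thumb is: pyspark.mllib for RDDs, pyspark.ml for DataFrames.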