pyspark 1.6.3 linear regression error: float() argument must be a string or a number

Time: 2017-12-23 19:41:04

Tags: python pyspark linear-regression

I am using linear regression in pyspark. This is my code:

from pyspark.ml.regression import LabeledPoint,LinearRegressionWithSGD
from pyspark import SparkContext, SparkConf 
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
import time
import csv

start_time = time.time()

conf = SparkConf().setAppName("project_spark").setMaster("local")
sc = SparkContext(conf=conf)
sqlc = SQLContext(sc)

X_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Train_int_1k.csv')
X_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Test_int_1k.csv')
y_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Train_Tags81_1k.csv')
y_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Test_Tags81_1k.csv')

X_train = X_train.map(lambda line: line.split(","))
X_test = X_test.map(lambda line: line.split(","))
y_train = y_train.map(lambda line: line.split(","))
y_test = y_test.map(lambda line: line.split(","))

training = LabeledPoint(y_train, X_train)
testing = LabeledPoint(y_test, X_test)

model = LinearRegressionWithSGD.train(training)
valuesAndPreds = (testing.map(lambda p: (p.label, model.predict(p.features))))

evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(valuesAndPreds)

print("Root Mean Squared Error = " + str(RMSE))
Time = time.time() - start_time
print("--- %s seconds ---" % Time)
sc.stop()

But this code fails with the error "float() argument must be a string or a number" on the line:

training = LabeledPoint(y_train, X_train)

So, how can I fix this?

1 Answer:

Answer 0: (score: 0)

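Without the full picture, my guess is that you are passing arguments of the wrong type to LabeledPoint. More specifically, your y_train and y_test take the following values: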

...
y_train.map(lambda line: line.split(","))
y_test.map(lambda line: line.split(","))

Each element produced this way is a list, which is incompatible with the label parameter of LabeledPoint.

So: training = LabeledPoint(y_train, X_train) -> training = LabeledPoint([some, values], [some, other, values])

However, according to the docs/source, LabeledPoint expects its first argument (the label) to be convertible to float:

class LabeledPoint(object):

    """
    Class that represents the features and labels of a data point.

    :param label:
      Label for this data point.
    :param features:
      Vector of features for this point (NumPy array, list,
      pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix).

    .. note:: 'label' and 'features' are accessible as class attributes.

    .. versionadded:: 1.0.0
    """

    def __init__(self, label, features):
        self.label = float(label)
        self.features = _convert_to_vector(features)
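
To see why the float(label) call fails, here is a minimal sketch in plain Python. The to_label helper and the sample row "3,0,1" are made up for illustration; to_label only mimics the conversion in the constructor above and is not part of pyspark.

```python
# Hypothetical stand-in for the label conversion in LabeledPoint.__init__
# (not the real pyspark class, just the same float() call).
def to_label(value):
    return float(value)

# What line.split(",") yields for a made-up CSV row: ['3', '0', '1']
split_row = "3,0,1".split(",")

try:
    to_label(split_row)        # a whole list cannot be converted to float
    list_failed = False
except TypeError as err:
    list_failed = True
    print("TypeError:", err)

# A single field of the row converts without trouble.
single_label = to_label(split_row[0])
print(single_label)
```

The exact wording of the TypeError message varies between Python versions, but the cause is the same: float() receives something that is neither a string nor a number.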

So, depending on what your rows look like, you could change your code to something like the following:

...
y_train.map(lambda line: line.split(",")[0])
...
y_test.map(lambda line: line.split(",")[0])
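
As a rough sketch of the per-row idea, the snippet below pairs one numeric label with one feature list for each row, using plain Python lists in place of RDDs. The helper names (parse_label, parse_features, to_pair) and the sample lines are hypothetical, and it assumes the label is the first field of each label row, which may not match your files.

```python
def parse_label(line):
    """Take the first comma-separated field of a label row as a float (assumption)."""
    return float(line.split(",")[0])

def parse_features(line):
    """Convert every comma-separated field of a feature row to a float."""
    return [float(x) for x in line.split(",")]

def to_pair(label_line, feature_line):
    """Build one (label, features) pair, i.e. the two arguments of a labeled point."""
    return (parse_label(label_line), parse_features(feature_line))

# Made-up sample rows standing in for the contents of the CSV files.
label_lines = ["1,0,0", "0,1,0"]
feature_lines = ["2,0,5", "0,3,1"]

pairs = [to_pair(l, f) for l, f in zip(label_lines, feature_lines)]
print(pairs)
```

In Spark the same per-row logic would have to run inside the RDDs, for example by zipping the label RDD with the feature RDD and mapping each pair to a LabeledPoint (imported from pyspark.mllib.regression), so that the constructor receives one label and one feature vector rather than whole RDDs. That part is untested here, and RDD.zip only works when both RDDs have the same partitioning and row counts.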

Hope that helps, good luck!