Question

I am using linear regression in PySpark. Here is my code:

from pyspark.ml.regression import LabeledPoint,LinearRegressionWithSGD
from pyspark import SparkContext, SparkConf 
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
import time
import csv

start_time = time.time()

conf = SparkConf().setAppName("project_spark").setMaster("local")
sc = SparkContext(conf=conf)
sqlc = SQLContext(sc)

X_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Train_int_1k.csv')
X_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Test_int_1k.csv')
y_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Train_Tags81_1k.csv')
y_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Test_Tags81_1k.csv')

X_train = X_train.map(lambda line: line.split(","))
X_test = X_test.map(lambda line: line.split(","))
y_train = y_train.map(lambda line: line.split(","))
y_test = y_test.map(lambda line: line.split(","))

training = LabeledPoint(y_train, X_train)
testing = LabeledPoint(y_test, X_test)

model = LinearRegressionWithSGD.train(training)
valuesAndPreds = (testing.map(lambda p: (p.label, model.predict(p.features))))

evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(valuesAndPreds)

print("Root Mean Squared Error = " + str(RMSE))
Time = time.time() - start_time
print("--- %s seconds ---" % Time)
spark.stop()

However, this code fails with the error "float() argument must be a string or a number" at the line:

training = LabeledPoint(y_train, X_train)

So, how can I fix this?

Answer 1

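Without seeing the full picture, my guess is that you are passing arguments of the wrong type to LabeledPoint. More specifically, your y_train and y_test get their values from: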

...
y_train.map(lambda line: line.split(","))
y_test.map(lambda line: line.split(","))

The lambda returns a list for every line, which is incompatible with the label parameter of LabeledPoint.

So: training = LabeledPoint(y_train, X_train) -> training = LabeledPoint([some, values], [some, other, values])

However, according to the docs/source, LabeledPoint expects its first argument (the label) to be convertible to float.

class LabeledPoint(object):

    """
    Class that represents the features and labels of a data point.

    :param label:
      Label for this data point.
    :param features:
      Vector of features for this point (NumPy array, list,
      pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix).

    .. note:: 'label' and 'features' are accessible as class attributes.

    .. versionadded:: 1.0.0
    """

    def __init__(self, label, features):
        self.label = float(label)
        self.features = _convert_to_vector(features)
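
The float(label) call in the constructor above is where the reported error originates. A minimal plain-Python sketch of that behaviour follows; it needs no Spark, and the sample line and the helper name try_float are made up for illustration:

```python
# Plain-Python illustration of why float(label) fails on a list.
# The sample line below is hypothetical, not taken from the asker's files.
row = "3.0,1.0,0.0".split(",")  # what line.split(",") returns for one line


def try_float(value):
    """Return float(value), or the TypeError that float() raises."""
    try:
        return float(value)
    except TypeError as exc:
        return exc


list_result = try_float(row)       # a TypeError instance: a list is not a number
scalar_result = try_float(row[0])  # a single string field converts without error

print(type(list_result).__name__, scalar_result)
```

The same TypeError is raised for any object that is neither a string nor a number, which includes an RDD passed directly as the label.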

So, depending on what your lines look like, you could change your code to something like the following:

...
y_train.map(lambda line: line.split(",")[0])
...
y_test.map(lambda line: line.split(",")[0])
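
Note that even with a single label per line, y_train and X_train are still RDDs, and LabeledPoint describes one data point, so it most likely has to be built per record rather than once for the whole dataset. A rough sketch of one way to do that is below. It assumes the label is in column 0 of the tags file, that all feature fields are numeric, and that both files list the records in the same order with identical partitioning (which RDD.zip requires). The helper names parse_label, parse_features and build_labeled_points are hypothetical:

```python
def parse_label(line):
    """Return the first comma-separated field as a float (assumed label column)."""
    return float(line.split(",")[0])


def parse_features(line):
    """Return every comma-separated field as a float (assumed numeric features)."""
    return [float(x) for x in line.split(",")]


def build_labeled_points(label_lines, feature_lines):
    """Pair two RDDs of text lines into an RDD of LabeledPoint.

    Assumes both RDDs have the same number of partitions and the same
    number of elements in each partition, as RDD.zip requires.
    """
    # Imported here so the parsing helpers above can be used without Spark.
    from pyspark.mllib.regression import LabeledPoint

    labels = label_lines.map(parse_label)
    features = feature_lines.map(parse_features)
    # Each zipped element is a (label, features) pair for one record.
    return labels.zip(features).map(lambda lf: LabeledPoint(lf[0], lf[1]))
```

With the RDDs returned by sc.textFile for the tags file and the BoW file, training = build_labeled_points(tag_lines, bow_lines) could then be passed to LinearRegressionWithSGD.train. If the two files are not partitioned identically, zip will fail, and joining on an index from zipWithIndex may be a safer alternative.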

Hope this helps, good luck!

pyspark 1.6.3 linear regression error: float() argument must be a string or a number
