I am using linear regression in pyspark. This is my code:
from pyspark.ml.regression import LabeledPoint,LinearRegressionWithSGD
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.ml.evaluation import RegressionEvaluator
import time
import csv
start_time = time.time()
conf = SparkConf().setAppName("project_spark").setMaster("local")
sc = SparkContext(conf=conf)
sqlc = SQLContext(sc)
X_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Train_int_1k.csv')
X_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\BoW_Test_int_1k.csv')
y_train = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Train_Tags81_1k.csv')
y_test = sc.textFile('C:\Users\WINDOWS 8.1\Desktop\Test_Tags81_1k.csv')
X_train = X_train.map(lambda line: line.split(","))
X_test = X_test.map(lambda line: line.split(","))
y_train = y_train.map(lambda line: line.split(","))
y_test = y_test.map(lambda line: line.split(","))
training = LabeledPoint(y_train, X_train)
testing = LabeledPoint(y_test, X_test)
model = LinearRegressionWithSGD.train(training)
valuesAndPreds = (testing.map(lambda p: (p.label, model.predict(p.features))))
evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(valuesAndPreds)
print("Root Mean Squared Error = " + str(RMSE))
Time = time.time() - start_time
print("--- %s seconds ---" % Time)
spark.stop()
But this code fails with the error "float() argument must be a string or a number" on the line training = LabeledPoint(y_train, X_train).
So, how can I fix this?
Answer 0 (score: 0)
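Without seeing the whole picture, my guess is that you are passing arguments of the wrong type to LabeledPoint. More specifically, your y_train and y_test are assigned the following values: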
...
y_train.map(lambda line: line.split(","))
y_test.map(lambda line: line.split(","))
Each of these yields a list for every line, which is incompatible with the label parameter of LabeledPoint.
So:
training = LabeledPoint(y_train, X_train)
-> training = LabeledPoint([some, values], [some, other, values])
However, according to the docs/source, LabeledPoint expects its first argument (the label) to be convertible to float.
class LabeledPoint(object):
    """
    Class that represents the features and labels of a data point.

    :param label:
      Label for this data point.
    :param features:
      Vector of features for this point (NumPy array, list,
      pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix).

    .. note:: 'label' and 'features' are accessible as class attributes.

    .. versionadded:: 1.0.0
    """

    def __init__(self, label, features):
        self.label = float(label)
        self.features = _convert_to_vector(features)
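To see why the constructor rejects the value, here is a minimal sketch that needs no Spark at all. It only reproduces the float(label) call from the source above. The helper name label_accepts and the sample row "3,0,1" are made up for illustration and are not part of pyspark.

```python
def label_accepts(value):
    """Return True if float(value) succeeds, i.e. the same conversion
    that LabeledPoint.__init__ applies to its label argument."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical CSV row, split the same way as in the question.
split_line = "3,0,1".split(",")

print(label_accepts(split_line))     # the whole list cannot be converted
print(label_accepts(split_line[0]))  # a single numeric string can
```

Anything that is not a single number or numeric string is rejected by that conversion in the same way.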
So, depending on what your rows look like, you could change your code to something like the following:
...
y_train.map(lambda line: line.split(",")[0])
...
y_test.map(lambda line: line.split(",")[0])
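As a rough sketch of where this leads, the per-line parsing can be written as plain functions and tried out without Spark. The names parse_label, parse_features and to_pairs are hypothetical, and the sketch assumes that the first field of each tag line is the label you want and that every feature field is numeric. Neither assumption is confirmed by the question.

```python
def parse_label(line):
    """Use the first comma-separated field of a line as the float label."""
    return float(line.split(",")[0])


def parse_features(line):
    """Convert every comma-separated field of a line to a float."""
    return [float(x) for x in line.split(",")]


def to_pairs(label_lines, feature_lines):
    """Pair each label with the feature row at the same position."""
    return [(parse_label(l), parse_features(f))
            for l, f in zip(label_lines, feature_lines)]


# With Spark, the same functions could be applied row by row, e.g.:
#   from pyspark.mllib.regression import LabeledPoint
#   labels = y_train_lines.map(parse_label)        # RDD of floats
#   features = X_train_lines.map(parse_features)   # RDD of lists
#   training = labels.zip(features).map(lambda lf: LabeledPoint(lf[0], lf[1]))
# RDD.zip assumes both RDDs have the same number of partitions and the
# same number of elements per partition, which may not hold for two
# separately loaded files.
```

The point is that LabeledPoint is built once per row, from one float and one feature list, rather than once from two whole RDDs.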
Hope that helps, good luck!