I am trying to implement Logistic Regression using pySpark. Here is my code:
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

RES_DIR = "/home/shaahmed115/Pet_Projects/DA/TwitterStream_US_Elections/Features/"
sc = SparkContext('local', 'pyspark')

data_file = RES_DIR + "training.txt"
raw_data = sc.textFile(data_file)
print("Train data size is {}".format(raw_data.count()))

test_data_file = RES_DIR + "testing.txt"
test_raw_data = sc.textFile(test_data_file)
print("Test data size is {}".format(test_raw_data.count()))

def parse_interaction(line):
    line_split = line.split(",")
    # First field is the label; the remaining fields are the features
    return LabeledPoint(float(line_split[0]),
                        array([float(x) for x in line_split[1:]]))

training_data = raw_data.map(parse_interaction)
logit_model = LogisticRegressionWithLBFGS.train(training_data, iterations=10, numClasses=3)
This throws an error: Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 3 in the input dataset.
Here is a sample from my dataset:

2,1.0,1.0,1.0
0,1.0,1.0,1.0
1,0.0,0.0,0.0

The first element is the class and the rest is the feature vector. As you can see, there are three classes. Is there a workaround to make multinomial classification work with this?
Answer (score: 1)
The error you see:

LogisticRegression with ElasticNet in ML package only supports binary classification

is quite clear. You can use the org.apache.spark.mllib.classification.LogisticRegression
version, which supports multinomial classification:
/**
* Train a classification model for Multinomial/Binary Logistic Regression using
* Limited-memory BFGS. Standard feature scaling and L2 regularization are used by default.
* NOTE: Labels used in Logistic Regression should be {0, 1, ..., k - 1}
* for k classes multi-label classification problem.
*
* Earlier implementations of LogisticRegressionWithLBFGS applies a regularization
* penalty to all elements including the intercept. If this is called with one of
* standard updaters (L1Updater, or SquaredL2Updater) this is translated
* into a call to ml.LogisticRegression, otherwise this will use the existing mllib
* GeneralizedLinearAlgorithm trainer, resulting in a regularization penalty to the
* intercept.
*/