I'm trying to use LinearRegressionWithSGD on the Million Song dataset, and my model returns NaNs as weights and 0.0 as the intercept. What might be causing this error? I'm using Spark 1.4.0 in standalone mode.
Sample data: http://www.filedropper.com/part-00000
Here is my full code:
// Import dependencies
val data =
sc.textFile("/home/naveen/Projects/millionSong/YearPredictionMSD.txt")
// Define the RDD
def parsePoint(line: String): LabeledPoint = {
  val x = line.split(",")
  val head = x.head.toDouble
  val tail = Vectors.dense(x.tail.map(_.toDouble))
  LabeledPoint(head, tail)
}
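As a side note, the parsing step can be sanity-checked without a Spark session. A minimal sketch, where a plain tuple and `Array[Double]` stand in for `LabeledPoint` and `Vectors.dense`, and `parsePointLocal` is a hypothetical helper (not part of the question's code):

```scala
// Plain-Scala stand-in for parsePoint: (label, features) mirrors
// LabeledPoint(head, tail), with Array[Double] replacing Vectors.dense.
def parsePointLocal(line: String): (Double, Array[Double]) = {
  val x = line.split(",")
  (x.head.toDouble, x.tail.map(_.toDouble))
}

// First field becomes the label (the year), the rest become features.
val (label, features) = parsePointLocal("2001.0,0.5,0.3,0.2")
// label == 2001.0, features == Array(0.5, 0.3, 0.2)
```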
// Convert to labeled points
val parsedDataInit = data.map(x => parsePoint(x))
val onlyLabels = parsedDataInit.map(x => x.label)
val minYear = onlyLabels.min()
val maxYear = onlyLabels.max()
// Find the range
val parsedData = parsedDataInit.map(x => LabeledPoint(x.label - minYear, x.features))
// Shift labels
val splits = parsedData.randomSplit(Array(0.8, 0.1, 0.1), seed = 123)
val parsedTrainData = splits(0).cache()
val parsedValData = splits(1).cache()
val parsedTestData = splits(2).cache()
val nTrain = parsedTrainData.count()
val nVal = parsedValData.count()
val nTest = parsedTestData.count()
// Training, validation, and test sets
def squaredError(label: Double, prediction: Double): Double = {
  scala.math.pow(label - prediction, 2)
}

def calcRMSE(labelsAndPreds: RDD[List[Double]]): Double = {
  scala.math.sqrt(labelsAndPreds.map(x => squaredError(x(0), x(1))).mean())
}
val numIterations = 100
val stepSize = 1.0
val regParam = 0.01
val regType = "L2"
val algorithm = new LinearRegressionWithSGD()
algorithm.optimizer
.setNumIterations(numIterations)
.setStepSize(stepSize)
.setRegParam(regParam)
val model = algorithm.run(parsedTrainData)
// RMSE
Answer 0 (score: 2)
I'm not familiar with this particular implementation of SGD, but in general, when a gradient descent solver goes to NaN it means the learning rate is too large. (In this case, I believe that is the stepSize variable.)

Try lowering it by an order of magnitude at a time until it starts to converge.
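The blow-up this answer describes is easy to see in a toy setting. A minimal sketch (plain gradient descent on f(w) = w², not the MLlib optimizer; `descend` is an illustrative name) showing that the step size alone decides whether the iterates converge or overflow:

```scala
// Gradient descent on f(w) = w^2, whose gradient is 2w.
// Each step multiplies w by (1 - 2 * stepSize), so any stepSize > 1.0
// makes |w| grow geometrically until it overflows to Infinity,
// while a small stepSize shrinks w toward the minimum at 0.
def descend(stepSize: Double, iters: Int): Double = {
  var w = 10.0
  for (_ <- 1 to iters) w -= stepSize * 2 * w  // w_{t+1} = w_t - stepSize * grad
  w
}

val diverged  = descend(stepSize = 1.5, iters = 2000)  // |w| doubles every step -> Infinity
val converged = descend(stepSize = 0.1, iters = 2000)  // w shrinks by 0.8 every step -> ~0
```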
Answer 1 (score: 0)
I think there are two possibilities.

stepSize matters a lot. You should try values like 0.01, 0.03, 0.1, 0.3, 1.0, 3.0 ...
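That sweep can be sketched in miniature; the snippet below runs it against the same toy f(w) = w² objective rather than the MLlib model, so `finalLoss` and `candidates` are illustrative names. In the question's code, each candidate would instead be passed to `algorithm.optimizer.setStepSize` before calling `run`:

```scala
// Candidate step sizes from the answer, roughly a 1-3-10 geometric grid.
val candidates = Seq(0.01, 0.03, 0.1, 0.3, 1.0, 3.0)

// Loss after 100 gradient-descent steps on f(w) = w^2 from w = 10.0;
// each step multiplies w by (1 - 2 * stepSize).
def finalLoss(stepSize: Double): Double = {
  var w = 10.0
  for (_ <- 1 to 100) w -= stepSize * 2 * w
  w * w
}

// Pick the step size with the smallest final loss. On this objective,
// too small (0.01) barely moves, too large (1.0, 3.0) oscillates or blows up.
val best = candidates.minBy(finalLoss)
```

In practice the loss on a held-out validation set (via `calcRMSE` on `parsedValData`) would play the role of `finalLoss`.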