Question

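I am running LinearRegression with Spark. Because my data cannot be fit well by a linear model, I added some higher-order polynomial features to get a better result. This works fine!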

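Instead of modifying the data myself, I wanted to use the PolynomialExpansion function from the Spark library. To find the best solution, I used a loop over different degrees. After 10 iterations (degree 10), I ran into the following error: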

Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 77 indices and values, which exceeds the specified vector size -30.

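I used trainingData with 2 features. This sounds like I end up with too many features after a polynomial expansion of degree 10, but the vector size of -30 confuses me. To track this down, I started experimenting with different example data and degrees. For testing, I used the following lines of code with different testData in libsvm format (each with only one input line):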

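import org.apache.spark.ml.feature.PolynomialExpansion

// `spark` is an existing SparkSession; libsvm input is loaded as sparse vectors
val data = spark.read.format("libsvm").load("data/testData2.txt")
val polynomialExpansion = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(10)
// transform is lazy, so nothing is computed yet
val polyDF2 = polynomialExpansion.transform(data)
// the exception is thrown here, when the rows are actually evaluated
polyDF2.select("polyFeatures").take(3).foreach(println)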

ExampleData: 0 1:1 2:2 3:3
polynomialExpansion.setDegree(11)

Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 333 indices and values, which exceeds the specified vector size 40.

ExampleData: 0 1:1 2:2 3:3 4:4

polynomialExpansion.setDegree(10)

Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 1000 indices and values, which exceeds the specified vector size -183.

ExampleData: 0 1:1 2:2 3:3 4:4 5:5

polynomialExpansion.setDegree(10)

Caused by: java.lang.IllegalArgumentException: requirement failed: You provided 2819 indices and values, which exceeds the specified vector size -548.

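It looks like the number of features in the data affects the highest possible degree, but the number of features after the polynomial expansion does not seem to be the cause of the error, because it varies a lot between the cases. It also does not crash in the expansion call itself, but only when I try to print the new features in the last line of code.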
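To make the numbers above easier to compare, here is a minimal sketch in plain Scala (no Spark needed). For n input features and degree d, a full polynomial expansion without the constant term has C(n + d, d) - 1 features. The helper names (`PolySizeSketch`, `exactChoose`, `intChoose`) are hypothetical and are not part of the Spark API. The sketch assumes, without this being confirmed here, that the vector size is computed as a quotient of two products in 32-bit Int arithmetic; under that assumption the wrapped results happen to reproduce the sizes in the error messages (-30, 40, -183, -548).

```scala
object PolySizeSketch {
  // Hypothetical helper: exact binomial coefficient C(n, k) using BigInt.
  // After step i the accumulator holds C(n - k + i, i), so each division is exact.
  def exactChoose(n: Int, k: Int): BigInt =
    (1 to k).foldLeft(BigInt(1)) { (acc, i) => acc * (n - k + i) / i }

  // Hypothetical helper: the same quantity as a quotient of two Int products.
  // ASSUMPTION: this only illustrates 32-bit overflow; whether Spark 2.0.0
  // computes the size this way is not confirmed by the question.
  def intChoose(n: Int, k: Int): Int =
    Range(n, n - k, -1).product / Range(k, 1, -1).product

  // Expanded feature count without the constant term, exact.
  def exactSize(numFeatures: Int, degree: Int): BigInt =
    exactChoose(numFeatures + degree, degree) - 1

  // Expanded feature count without the constant term, with Int wrap-around.
  def intSize(numFeatures: Int, degree: Int): Int =
    intChoose(numFeatures + degree, degree) - 1

  def main(args: Array[String]): Unit = {
    // (number of features, degree) for the four failing cases above
    val cases = Seq((2, 11), (3, 11), (4, 10), (5, 10))
    for ((n, d) <- cases)
      println(s"features=$n degree=$d exact=${exactSize(n, d)} int=${intSize(n, d)}")
  }
}
```

For 2 features the size -30 only appears at degree 11 (exact size 77), and for 4 features at degree 10 the exact size is 1000; both match the reported counts. For the other two cases the exact sizes are 363 and 3002, which is more than the 333 and 2819 entries reported, and this sketch does not explain that difference.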

I thought that maybe my memory was full at that point, but I checked the system monitor and there was still free memory available.

I am using:

Eclipse IDE
Maven project
Scala 2.11.7
Spark 2.0.0
Spark-mllib 2.0.0
Ubuntu 16.04

I would be glad for any ideas regarding this problem.

Spark - Polynomial expansion exceeds vector size

0 answers: