Question

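I want to study how additional training data affects model performance (in terms of precision, recall, and so on). I vary the sampling ratio over 0.25, 0.5, 0.75 and 1.0 (from 25% to 100% of all the data).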

val sampling_ratio = 0.25

Read the cases and controls from separate files.

val negative_training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "negative_sorted.tsv")
val positive_training_data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(spark, "positive_sorted.tsv")

Take a random subset of the data set for both the positive and the negative entries (25% for now).

val negative_split = negative_training_data.randomSplit(Array(sampling_ratio, (1 - sampling_ratio)), seed = sample)(0)
val positive_split = positive_training_data.randomSplit(Array(sampling_ratio, (1 - sampling_ratio)), seed = sample)(0)

This is where I union the two splits to produce the training data.

 val training_data: RDD[LabeledPoint] = negative_split.union(positive_split)

Now train the LogisticRegression model.

 val logrmodel = train_LogisticRegression_model(training_data)

Here is the code that builds the model.

  def train_LogisticRegression_model(training: RDD[LabeledPoint]): LogisticRegressionModel = {
    // Run training algorithm to build the model
    val numIterations = 100 // declared but never passed to the optimizer
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
    model
  }

However, I get the following error:

Exception in thread "main" org.apache.spark.SparkDriverExecutionException: Execution error
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:984)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1390)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: java.lang.IllegalArgumentException: requirement failed: Dimensions mismatch when merging with another summarizer. Expecting 4701 but got 4698.
    at scala.Predef$.require(Predef.scala:233)

Answer 1

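(You have a few typos above, and you have not pasted the code for train_LogisticRegression_model.)

The error is telling you that the feature vectors in the positive and negative examples have different sizes. You should check the size of the features as a sanity check on your input.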

negative_training_data.take(3).map(_.features.size).mkString("\n")
positive_training_data.take(3).map(_.features.size).mkString("\n")
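A likely cause, though the thread does not confirm it, is that each file is loaded separately, so the vector size is inferred per file from the largest feature index that file happens to contain. The sketch below shows that inference in plain Scala, without Spark, so it can be run on a few lines copied from each file. The object `LibSvmDims` and its methods are hypothetical helpers written for illustration, not part of MLlib, and the sample lines in the usage note are made up.

```scala
object LibSvmDims {
  /** Largest 1-based feature index on one LibSVM line
    * ("label idx:value idx:value ..."); 0 if the line has no features. */
  def maxIndex(line: String): Int = {
    val tokens = line.trim.split("\\s+").drop(1) // drop the leading label
    tokens
      .filter(_.contains(":"))
      .map(_.takeWhile(_ != ':').toInt)
      .foldLeft(0)(_ max _)
  }

  /** Dimension implied by a collection of lines: the largest index seen
    * on any non-empty, non-comment line. */
  def impliedDimension(lines: Iterable[String]): Int =
    lines.iterator
      .map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#"))
      .map(maxIndex)
      .foldLeft(0)(_ max _)
}
```

For example, `LibSvmDims.impliedDimension(Seq("0 1:0.5 3:1.0", "0 2:0.1"))` gives 3, while `LibSvmDims.impliedDimension(Seq("1 1:0.2 5:0.7"))` gives 5, so the union of two such sets would mix vectors of size 3 and 5. If this is what is happening here, one way to make both RDDs agree is the `MLUtils.loadLibSVMFile` overload that takes an explicit `numFeatures` argument, passing the same value (for instance the larger of the two sizes) for both files.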

Spark: Dimensions mismatch when merging with another summarizer
