MLlib classification example stops at Stage 1

Date: 2015-03-27 00:22:30

Tags: scala apache-spark logistic-regression apache-spark-mllib

EDIT

I tried using the text from Gabriel's answer and got spam features: 9 and ham features: 13. I tried changing HashingTF to numFeatures = 9, then 13, and then creating one for each (see the sketch after the job list below). The program then stopped at "count at DataValidators.scala:38", just like before.

Completed Jobs (4)
count at <console>:21 (spamFeatures)
count at <console>:23 (hamFeatures)
count at <console>:28 (trainingData.count())
first at GeneralizedLinearAlgorithm.scala:34 (val model = lrLearner.run(trainingData))
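
For reference, the variants I tried look roughly like this (a sketch, not my exact code; imports as in the book example):

import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 9)   // then 13
// "one for each": a separate instance per class
val tfSpam = new HashingTF(numFeatures = 9)
val tfHam = new HashingTF(numFeatures = 13)
// (not sure if separate sizes per class is even valid, since both sets of
// vectors end up in the same training set)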

1) Why are the features counted per line, when in the code they are split on spaces (" ")?

2) Two things I see that are different between my code and Gabriel's: a) I don't have anything for a logger, but that shouldn't be a problem...
b) My files are on HDFS (hdfs://ip-abc-de-.compute.internal:8020/user/ec2-user/spam.txt); again, that shouldn't be a problem, but I'm not sure if there's something I'm missing...

3) How long should I let it run? I've let it run for at least 10 minutes with local[2]...

My guess at this point is that there is some kind of problem with my Spark/MLlib setup? Is there a simpler program I can run to check whether MLlib has a setup problem? I've been able to run other Spark Streaming/SQL jobs...

Thanks!

[Reposted from the Spark community]

Hi everyone,

I'm trying to run this MLlib example from Learning Spark: https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48

Things I'm doing differently:

1) Instead of their spam.txt and normal.txt, I have text files of about 200 words... nothing big, just plain text with periods, commas, etc.

3) I've used numFeatures = 200, 1,000, and 10,000

Error: I keep getting stuck when I try to run the model (details from the UI below):

val model = new LogisticRegressionWithSGD().run(trainingData)

It freezes on something like this:

[Stage 1:==============> (1 + 0) / 4]

Some details from the web UI:

org.apache.spark.rdd.RDD.count(RDD.scala:910)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
scala.collection.immutable.List.forall(List.scala:84)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
$line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
$line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
$line21.$read$$iwC$$iwC.<init>(<console>:40)
$line21.$read$$iwC.<init>(<console>:42)
$line21.$read.<init>(<console>:44)
$line21.$read$.<init>(<console>:48)
$line21.$read$.<clinit>(<console>)
$line21.$eval$.<init>(<console>:7)
$line21.$eval$.<clinit>(<console>)
$line21.$eval.$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

I'm not sure what I'm doing wrong... any help is greatly appreciated, thanks!

3 Answers:

Answer 0 (score: 1):

Thanks for asking this question; I wasn't aware of these examples, so I downloaded and tested them. What I saw is that the git repository contains files with a lot of HTML code. It works, but you end up adding 100 features, which is probably why you aren't getting consistent results, since your own files contain far fewer features. To verify that this works without the HTML code, I removed the HTML from spam.txt and ham.txt, as follows:

ham.txt =

Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!       
Check out videos of talks from the summit at ...
Hi Mom, Apologies for being late about emailing and forgetting to send you  
the package.  I hope you and bro have been ...
Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to  
take time to try it out immediately ...
Hi Spark user list, This is my first question to this list, so thanks in  
advance for your help!  I tried running ...
Thanks Tom for your email.  I need to refer you to Alice for this one.  I    
haven't yet figured out that part either ...
Good job yesterday!  I was attending your talk, and really enjoyed it.  I   
want to try out GraphX ...
Summit demo got whoops from audience!  Had to let you know. --Joe

spam.txt =

 Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to 
 send you money via wire transfer so please ...
 Get Viagra real cheap!  Send money right away to ...
 Oh my gosh you can be really strong too with these drugs found in the     
 rainforest. Get them cheap right now ...
 YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to    
 this email with your password and SSN ...
 THIS IS NOT A SCAM!  Send money and get access to awesome stuff really   
 cheap and never have to ...

Then modify MLlib.scala as below. Make sure log4j is referenced in your project, and redirect the output to a file instead of the console. You basically need to run it twice: on the first run, watch the output for the printed number of features in spam and ham; then you can set the right number of features (instead of 100). I used 5.

package com.oreilly.learningsparkexamples.scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.log4j.Logger

object MLlib {

private val logger = Logger.getLogger("MLlib")

def main(args: Array[String]) {
    logger.info("This is spark in Windows")
    val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local[2]").set("spark.executor.memory","1g")
    //val conf = new SparkConf().setAppName(s"Book example: Scala")
    val sc = new SparkContext(conf)
    // Load 2 types of emails from text files: spam and ham (non-spam).
    // Each line has text from one email.
    val spam = sc.textFile("spam.txt")
    val ham = sc.textFile("ham.txt")
    // Create a HashingTF instance to map email text to vectors of 5 (not 100) features.
    val tf = new HashingTF(numFeatures = 5)
    // Each email is split into words, and each word is mapped to one feature.
    val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
    println("features in spam " + spamFeatures.count())
    val hamFeatures = ham.map(email => tf.transform(email.split(" ")))
    println("features in ham " + hamFeatures.count())
    // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
    val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
    val trainingData = positiveExamples ++ negativeExamples
    trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.
    // Create a Logistic Regression learner which uses the SGD optimizer.
    val lrLearner = new LogisticRegressionWithSGD()
    // Run the actual learning algorithm on the training data.
    val model = lrLearner.run(trainingData)
    // Test on a positive example (spam) and a negative one (ham).
    // First apply the same HashingTF feature transformation used on the training data.
    val ex1 = "O M G GET cheap stuff by sending money to ..."
    val ex2 = "Hi Dad, I started studying Spark the other ..."
    val posTestExample = tf.transform(ex1.split(" "))
    val negTestExample = tf.transform(ex2.split(" "))
    // Now use the learned model to predict spam/ham for new emails.
    println(s"Prediction for positive test example: ${ex1} : ${model.predict(posTestExample)}")
    println(s"Prediction for negative test example: ${ex2} : ${model.predict(negTestExample)}")
    sc.stop()
  }
}

When I ran the program above, the output I got was:

features in spam 5
features in ham 7
Prediction for positive test example: O M G GET cheap stuff by sending money    
to ... : 1.0
Prediction for negative test example: Hi Dad, I started studying Spark the    
other ... : 0.0
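
By the way, another way to size numFeatures is to count the distinct terms in each file instead of running twice. A minimal sketch (reusing the spam and ham RDDs from the listing above; the val names are just for illustration):

// Count the distinct space-separated terms, to get an idea of how many
// hash buckets the data can actually make use of.
val distinctSpamTerms = spam.flatMap(_.split(" ")).distinct().count()
val distinctHamTerms = ham.flatMap(_.split(" ")).distinct().count()
println("distinct terms in spam: " + distinctSpamTerms)
println("distinct terms in ham: " + distinctHamTerms)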

Answer 1 (score: -1):

I had the same problem with Spark 1.5.2 on my local cluster. My program stopped on "count at DataValidators.scala:40". I resolved it by running Spark with "spark-submit --master local".
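
For reference, a full invocation along these lines might look like the following (the jar path here is an assumption; the class name comes from the listing above):

spark-submit --master local --class com.oreilly.learningsparkexamples.scala.MLlib target/mllib-example.jar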

Answer 2 (score: -1):

I ran into a similar problem with Spark 1.5.2 on my local cluster. My program stopped at "count at DataValidators.scala:40". I was caching my training features; removing the cache (simply not calling the cache function) resolved it, though I'm not sure of the actual cause.
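
In other words, the change was roughly this (a sketch using the variable names from the listing above):

// Before: the job hung at "count at DataValidators.scala:40"
//   trainingData.cache()
//   val model = lrLearner.run(trainingData)
// After: it completed once cache() was simply not called
val model = lrLearner.run(trainingData)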