I am trying to write a text classifier for my data, which is scraped from various review sites for our product. I am using a movie-classifier example to get a working snippet that I can then adapt to my requirements.
The example I am working from uses a Lucene analyzer to stem the text descriptions, but it does not compile (I am building with SBT). The compile error is given below.
> compile
[info] Updating {file:/D:/ScalaApps/MovieClassifier/}movieclassifier...
[info] Resolving com.sun.jersey.jersey-test-framework#jersey-test-framework-griz
[info] Resolving com.fasterxml.jackson.module#jackson-module-scala_2.10;2.4.4 ..
[info] Resolving org.spark-project.hive.shims#hive-shims-common-secure;0.13.1a .
[info] Resolving org.apache.lucene#lucene-analyzers-common_2.10;5.1.0 ...
[warn] module not found: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0
[warn] ==== local: tried
[warn]   C:\Users\manik.jasrotia\.ivy2\local\org.apache.lucene\lucene-analyzers-common_2.10\5.1.0\ivys\ivy.xml
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-common_2.10/5.1.0/lucene-analyzers-common_2.10-5.1.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::          UNRESOLVED DEPENDENCIES         ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn]   org.apache.lucene:lucene-analyzers-common_2.10:5.1.0 (D:\ScalaApps\MovieClassifier\build.sbt#L7-18)
[warn]     +- naivebayes_document_classifier:naivebayes_document_classifier_2.10:1.0
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[error] Total time: 31 s, completed Dec 6, 2015 11:01:45 AM
>
I am using two Scala files (Stemmer.scala and MovieClassifier.scala). Both programs are given below, along with the build.sbt file. Any help is appreciated.
MovieClassifier
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.feature.{IDF, HashingTF}

object MovieRatingClassifier {
  def main(args: Array[String]) {

    val sparkConfig = new SparkConf().setAppName("Movie Rating Classifier")
    val sc = new SparkContext(sparkConfig)

    /*
     This loads the data from HDFS.
     HDFS is a distributed file storage system, so this technically
     could be a very large multi-terabyte file.
    */
    val dataFile = sc.textFile("D:/spark4/mydata/naive_bayes_movie_classification.txt")

    /*
     HashingTF and IDF are helpers in MLlib that help us vectorize our
     synopses instead of doing it manually.
    */
    val hashingTF = new HashingTF()

    /*
     Our ultimate goal is to get our data into a collection of type LabeledPoint.
     The MLlib implementation uses LabeledPoints to train the Naive Bayes model.
     First we parse the file for ratings and vectorize the synopses.
    */
    val ratings = dataFile.map { x =>
      x.split(";") match {
        case Array(rating, synopsis) =>
          rating.toDouble
      }
    }

    val synopsis_frequency_vector = dataFile.map { x =>
      x.split(";") match {
        case Array(rating, synopsis) =>
          val stemmed = Stemmer.tokenize(synopsis)
          hashingTF.transform(stemmed)
      }
    }

    synopsis_frequency_vector.cache()

    /*
     http://en.wikipedia.org/wiki/Tf%E2%80%93idf
     https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
    */
    val idf = new IDF().fit(synopsis_frequency_vector)
    val tfidf = idf.transform(synopsis_frequency_vector)

    /* Produces (rating, vector) tuples. */
    val zipped = ratings.zip(tfidf)

    /* Now we transform them into LabeledPoints. */
    val labeledPoints = zipped.map { case (label, vector) => LabeledPoint(label, vector) }

    val model = NaiveBayes.train(labeledPoints)

    /* The model is trained; now we have it classify a test file that contains only synopses. */
    val testDataFile = sc.textFile("D:/spark4/naive_bayes_movie_classification-test.txt")

    /* We only have synopses now. The rating is what we want to predict. */
    val testVectors = testDataFile.map { x =>
      val stemmed = Stemmer.tokenize(x)
      hashingTF.transform(stemmed)
    }
    testVectors.cache()

    val tfidf_test = idf.transform(testVectors)
    val result = model.predict(tfidf_test)

    result.collect.foreach(x => println("Predicted rating for the movie is: " + x))
  }
}
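For context: the parsing above (split(";") matched against Array(rating, synopsis)) assumes the training file holds exactly one rating;synopsis pair per line. The real file is not shown in the question, but a hypothetical two-line sample would look like:

8.0;A detective hunts a serial killer through a rain-soaked city.
2.5;A bland remake with wooden acting and a predictable plot.

Note that a line containing zero or more than one semicolon would make the non-exhaustive pattern match fail at runtime with a scala.MatchError.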
Stemmer
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ArrayBuffer

object Stemmer {

  // Adapted from
  // https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
  def tokenize(content: String): Seq[String] = {
    val analyzer = new EnglishAnalyzer()
    val tokenStream = analyzer.tokenStream("contents", content)
    // CharTermAttribute is what we're extracting
    val term = tokenStream.addAttribute(classOf[CharTermAttribute])
    tokenStream.reset() // must be called by the consumer before consumption to clean the stream
    val result = ArrayBuffer.empty[String]
    while (tokenStream.incrementToken()) {
      val termValue = term.toString
      // Drop tokens containing digits or periods (numbers, version strings, etc.)
      if (!(termValue matches ".*[\\d\\.].*")) {
        result += termValue
      }
    }
    tokenStream.end()
    tokenStream.close()
    result
  }
}
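As a quick sanity check, Stemmer.tokenize can be exercised on its own. A hypothetical REPL session (the exact tokens depend on Lucene's EnglishAnalyzer, which lower-cases, drops English stop words, and applies Porter stemming):

scala> Stemmer.tokenize("The movies were amazingly entertaining")
res0: Seq[String] = ArrayBuffer(movi, amazingli, entertain)

The regex filter then additionally removes any token containing a digit or a period.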
The build.sbt file
name := "NaiveBayes_Document_Classifier"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-mllib" % "1.4.0" % "provided"

libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
Answer 0 (score: 3)
Are you sure you didn't type

libraryDependencies += "org.apache.lucene" %% "lucene-analyzers-common" % "5.1.0"

(a double %%) instead of what you wrote here? Because the resolver is clearly requesting a Scala build of Lucene (lucene-analyzers-common_2.10), when it is actually a Java library. It should be a single %, as you wrote here, whereas the mllib dependency should use the double %%. I.e. try

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided"
Note, you seem to have introduced a regression from the answer you received here: https://stackoverflow.com/a/34094469/1041691
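Putting both fixes together, a corrected build.sbt would look something like the sketch below (keeping the same Scala 2.10.4 / Spark 1.4.0 / Lucene 5.1.0 versions as in the question):

name := "NaiveBayes_Document_Classifier"

version := "1.0"

scalaVersion := "2.10.4"

// %% appends the Scala binary version (_2.10): use it for Scala libraries
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided"

// Lucene is a plain Java library, so a single % (no _2.10 suffix) is correct
libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"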
Answer 1 (score: -2)
I solved the problem by using the following dependency:

"org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
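In full build.sbt form (matching the versions used in the question), that single-% line would read:

libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"

The single % asks the resolver for the plain Java artifact, without the _2.10 suffix that caused the resolution failure above.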