I am trying to write a text classifier for my data, which is scraped from various review sites for our product. I am using a movie-classifier example to get a working snippet that I can then adapt to my requirements.
The example I am working from uses a Lucene analyzer to stem the text descriptions, but it does not compile (I am building with SBT). The compile error is given below.
> compile
[info] Updating {file:/D:/ScalaApps/MovieClassifier/}movieclassifier...
[info] Resolving com.sun.jersey.jersey-test-framework#jersey-test-framework-griz
[info] Resolving com.fasterxml.jackson.module#jackson-module-scala_2.10;2.4.4 ..
[info] Resolving org.spark-project.hive.shims#hive-shims-common-secure;0.13.1a .
[info] Resolving org.apache.lucene#lucene-analyzers-common_2.10;5.1.0 ...
[warn] module not found: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0
[warn] ==== local: tried
[warn]   C:\Users\manik.jasrotia\.ivy2\local\org.apache.lucene\lucene-analyzers-common_2.10\5.1.0\ivys\ivy.xml
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-common_2.10/5.1.0/lucene-analyzers-common_2.10-5.1.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] ::          UNRESOLVED DEPENDENCIES         ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn]   org.apache.lucene:lucene-analyzers-common_2.10:5.1.0 (D:\ScalaApps\MovieClassifier\build.sbt#L7-18)
[warn]     +- naivebayes_document_classifier:naivebayes_document_classifier_2.10:1.0
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[error] Total time: 31 s, completed Dec 6, 2015 11:01:45 AM
>
I am using two Scala files (Stemmer.scala and MovieClassifier.scala). Both programs are given below, along with the build.sbt file. Any help is appreciated.
MovieClassifier
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.feature.{IDF, HashingTF}

object MovieRatingClassifier {
  def main(args: Array[String]) {

    val sparkConfig = new SparkConf().setAppName("Movie Rating Classifier")
    val sc = new SparkContext(sparkConfig)

    /*
     This loads the data from HDFS.
     HDFS is a distributed file storage system, so this technically
     could be a very large multi-terabyte file.
    */
    val dataFile = sc.textFile("D:/spark4/mydata/naive_bayes_movie_classification.txt")

    /*
     HashingTF and IDF are helpers in MLlib that help us vectorize our
     synopses instead of doing it manually.
    */
    val hashingTF = new HashingTF()

    /*
     Our ultimate goal is to get our data into a collection of type LabeledPoint.
     The MLlib implementation uses LabeledPoints to train the Naive Bayes model.
     First we parse the file for ratings and vectorize the synopses.
    */
    val ratings = dataFile.map { x =>
      x.split(";") match {
        case Array(rating, synopsis) =>
          rating.toDouble
      }
    }

    val synopsis_frequency_vector = dataFile.map { x =>
      x.split(";") match {
        case Array(rating, synopsis) =>
          val stemmed = Stemmer.tokenize(synopsis)
          hashingTF.transform(stemmed)
      }
    }

    synopsis_frequency_vector.cache()

    /*
     http://en.wikipedia.org/wiki/Tf%E2%80%93idf
     https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
    */
    val idf = new IDF().fit(synopsis_frequency_vector)
    val tfidf = idf.transform(synopsis_frequency_vector)

    /* Produces (rating, vector) tuples. */
    val zipped = ratings.zip(tfidf)

    /* Now we transform them into LabeledPoints. */
    val labeledPoints = zipped.map { case (label, vector) => LabeledPoint(label, vector) }

    val model = NaiveBayes.train(labeledPoints)

    /* The model is trained; now we have it classify a test file that contains only synopses. */
    val testDataFile = sc.textFile("D:/spark4/naive_bayes_movie_classification-test.txt")

    /* We only have synopses now. The rating is what we want to predict. */
    val testVectors = testDataFile.map { x =>
      val stemmed = Stemmer.tokenize(x)
      hashingTF.transform(stemmed)
    }
    testVectors.cache()

    val tfidf_test = idf.transform(testVectors)
    val result = model.predict(tfidf_test)

    result.collect.foreach(x => println("Predicted rating for the movie is: " + x))
  }
}
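For context: the parsing above (split(";") matched against Array(rating, synopsis)) assumes the training file holds exactly one rating;synopsis pair per line. The real file is not shown in the question, but a hypothetical two-line sample would look like:

8.0;A detective hunts a serial killer through a rain-soaked city.
2.5;A bland remake with wooden acting and a predictable plot.

Note that a line containing zero or more than one semicolon would make the non-exhaustive pattern match fail at runtime with a scala.MatchError.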
Stemmer
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ArrayBuffer

object Stemmer {

  // Adapted from
  // https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
  def tokenize(content: String): Seq[String] = {
    val analyzer = new EnglishAnalyzer()
    val tokenStream = analyzer.tokenStream("contents", content)
    // CharTermAttribute is what we're extracting
    val term = tokenStream.addAttribute(classOf[CharTermAttribute])
    tokenStream.reset() // must be called by the consumer before consumption to clean the stream
    val result = ArrayBuffer.empty[String]
    while (tokenStream.incrementToken()) {
      val termValue = term.toString
      // Drop tokens containing digits or periods (numbers, version strings, etc.)
      if (!(termValue matches ".*[\\d\\.].*")) {
        result += termValue
      }
    }
    tokenStream.end()
    tokenStream.close()
    result
  }
}
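As a quick sanity check, Stemmer.tokenize can be exercised on its own. A hypothetical REPL session (the exact tokens depend on Lucene's EnglishAnalyzer, which lower-cases, drops English stop words, and applies Porter stemming):

scala> Stemmer.tokenize("The movies were amazingly entertaining")
res0: Seq[String] = ArrayBuffer(movi, amazingli, entertain)

The regex filter then additionally removes any token containing a digit or a period.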
The build.sbt file
name := "NaiveBayes_Document_Classifier"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-mllib" % "1.4.0" % "provided"

libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
Answer 0 (score: 3)
Are you sure you didn't type

libraryDependencies += "org.apache.lucene" %% "lucene-analyzers-common" % "5.1.0"

(a double %%) instead of what you wrote here? Because the resolver is clearly requesting a Scala build of Lucene (lucene-analyzers-common_2.10), when it is actually a Java library. It should be a single %, as you wrote here, whereas the mllib dependency should use the double %%. I.e. try

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided"
Note, you seem to have introduced a regression from the answer you received here: https://stackoverflow.com/a/34094469/1041691
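Putting both fixes together, a corrected build.sbt would look something like the sketch below (keeping the same Scala 2.10.4 / Spark 1.4.0 / Lucene 5.1.0 versions as in the question):

name := "NaiveBayes_Document_Classifier"

version := "1.0"

scalaVersion := "2.10.4"

// %% appends the Scala binary version (_2.10): use it for Scala libraries
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided"

// Lucene is a plain Java library, so a single % (no _2.10 suffix) is correct
libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"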
Answer 1 (score: -2)
I solved the problem by using the following dependency:

"org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
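In full build.sbt form (matching the versions used in the question), that single-% line would read:

libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"

The single % asks the resolver for the plain Java artifact, without the _2.10 suffix that caused the resolution failure above.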