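I am using the following method in Scala to tokenize a string, with EnglishAnalyzer stemming the terms and removing stop words. However, I would also like to use Lucene to split the string into n-grams. Can someone help me with the code?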
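import java.io.StringReader
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import scala.collection.mutable

def tokenize(content: String): Seq[String] = {
  val LuceneVersion = Version.LUCENE_46
  val tReader = new StringReader(content)
  val analyzer = new EnglishAnalyzer(LuceneVersion)
  val tStream = analyzer.tokenStream("contents", tReader)
  val term = tStream.addAttribute(classOf[CharTermAttribute])
  tStream.reset()
  val result = mutable.ArrayBuffer.empty[String]
  while (tStream.incrementToken()) {
    result += term.toString()
  }
  // Finish and release the stream once all tokens have been read
  tStream.end()
  tStream.close()
  result
}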
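Should I generate the n-grams first and then use EnglishAnalyzer to stem the extracted terms and remove stop words?
Answer 0 (score: 0)
As I see it, the order in your case should be the other way around: first obtain a TokenStream from the original string, then pass that TokenStream through an n-gram filter configured for your needs. The constructor of NGramTokenFilter supports this reading, since it takes a TokenStream as its input.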
// org.apache.lucene.analysis.ngram.NGramTokenFilter
public final class NGramTokenFilter extends TokenFilter {
  // line 51
  public NGramTokenFilter(TokenStream input, int minGram, int maxGram) {
    ...
  }
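To make that order concrete, here is a minimal plain-Scala sketch that does not use Lucene at all. The helper `charNGrams` and the object `NGramSketch` are hypothetical names introduced only for illustration, and the token list is assumed to be what an analyzer would produce after stemming and stop-word removal. It shows the n-gram step being applied to each already-analyzed token, smallest gram size first.

```scala
object NGramSketch {
  // Hypothetical helper: all character n-grams of a single token,
  // emitting every gram of size minGram first, then minGram + 1, and so on.
  // Tokens shorter than a given size simply contribute no grams of that size.
  def charNGrams(token: String, minGram: Int, maxGram: Int): List[String] =
    (for {
      n <- minGram to maxGram
      i <- 0 to token.length - n
    } yield token.substring(i, i + n)).toList

  def main(args: Array[String]): Unit = {
    // Assumption: these tokens stand in for the analyzer's output
    // (lower-cased, stop word "a" already removed).
    val analyzedTokens = List("letter", "from", "mother")
    val grams = analyzedTokens.flatMap(charNGrams(_, 2, 3))
    println(grams)
  }
}
```

For the single token "letter" with sizes 2 to 3, `charNGrams` returns List(le, et, tt, te, er, let, ett, tte, ter). Doing the analysis first means the grams are built from stemmed terms and never from stop words.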
Here is how I tried to accomplish this, based on the description you provided.
import java.io.{ Reader, StringReader }
import org.apache.lucene.util.Version.LUCENE_34
import org.apache.lucene.analysis.ngram.NGramTokenFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.{ TokenStream, Analyzer }
import org.apache.lucene.analysis.en.EnglishAnalyzer

object NgramTest extends App {

  class NGramAnalyzer extends Analyzer {
    def tokenStream(fieldName: String, reader: Reader): TokenStream = {
      val originalStream = (new EnglishAnalyzer(LUCENE_34)).reusableTokenStream(fieldName, reader)
      // n-gram with size 2 ~ 3
      new NGramTokenFilter(originalStream, 2, 3)
    }
  }

  def simpleTokenStreamList(tokenStream: TokenStream) = {
    val term = tokenStream.addAttribute(classOf[CharTermAttribute])
    Stream.continually(
      (tokenStream.incrementToken, term.toString)
    ).takeWhile(_._1).map {
      t => t._2
    }.toList
  }

  val nGramAnalyzer = new NGramAnalyzer
  val ngramStream = nGramAnalyzer.tokenStream("sample", new StringReader("A letter from mother"))
  val result = simpleTokenStreamList(ngramStream)
  // List(le, et, tt, te, er, let, ett, tte, ter, fr, ro, om, fro, rom, mo, ot, th, he, er, mot, oth, the, her)
  println(result)
}
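The code above targets LUCENE_34, while the question uses LUCENE_46. In the 4.x line `reusableTokenStream` is gone and `Analyzer.tokenStream` is final, so subclassing Analyzer this way will not compile there. A minimal sketch for 4.6 follows, assuming the four-argument constructor `NGramTokenFilter(Version, TokenStream, int, int)` from that release; the names `NGramTokenize` and `tokenizeNGrams` are hypothetical. It wraps the analyzer's stream directly instead of defining a custom Analyzer. I have not verified it against every 4.x release, so treat it as a starting point.

```scala
import java.io.StringReader
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.ngram.NGramTokenFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import scala.collection.mutable

object NGramTokenize {
  // Hypothetical helper: analyze with EnglishAnalyzer (stemming, stop words),
  // then split each resulting term into character n-grams.
  def tokenizeNGrams(content: String, minGram: Int = 2, maxGram: Int = 3): Seq[String] = {
    val luceneVersion = Version.LUCENE_46
    val analyzer = new EnglishAnalyzer(luceneVersion)
    val analyzed = analyzer.tokenStream("contents", new StringReader(content))
    // Assumption: this constructor signature is the one shipped with Lucene 4.6
    val ngrams = new NGramTokenFilter(luceneVersion, analyzed, minGram, maxGram)
    val term = ngrams.addAttribute(classOf[CharTermAttribute])
    val result = mutable.ArrayBuffer.empty[String]
    try {
      ngrams.reset() // must be called on the outermost filter before consuming
      while (ngrams.incrementToken()) {
        result += term.toString
      }
      ngrams.end()
    } finally {
      ngrams.close() // also closes the wrapped analyzer stream
    }
    result.toList
  }
}
```

Calling `NGramTokenize.tokenizeNGrams("A letter from mother")` should yield grams of length 2 and 3 built from the analyzed terms. The order in which they are emitted may differ from the list shown for LUCENE_34.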
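In addition, n-gram filters are explained in detail in Lucene in Action, 2nd edition, section 8.2.2, "Ngram filters". I suggest reading it; you may find your answer there.
Anyway, I hope this helps.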