How do I tokenize a String into its n-grams using Lucene in Scala?

Date: 2015-08-06 20:05:23

Tags: string scala lucene tokenize

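I am using the following method in Scala to tokenize a string, with EnglishAnalyzer to stem the terms and remove stop words. However, I would like to use Lucene to split the string into n-grams. Can someone help me with the code?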

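    import java.io.StringReader
    import scala.collection.mutable
    import org.apache.lucene.analysis.en.EnglishAnalyzer
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    import org.apache.lucene.util.Version

    def tokenize(content: String): Seq[String] = {
      val LuceneVersion = Version.LUCENE_46
      val tReader = new StringReader(content)
      val analyzer = new EnglishAnalyzer(LuceneVersion)
      val tStream = analyzer.tokenStream("contents", tReader)
      val term = tStream.addAttribute(classOf[CharTermAttribute])
      tStream.reset()

      // Collect every stemmed, stop-word-filtered term
      val result = mutable.ArrayBuffer.empty[String]
      while (tStream.incrementToken()) {
        result += term.toString
      }
      // Finish and release the stream once all tokens are consumed
      tStream.end()
      tStream.close()
      result
    }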

Should I generate the n-grams first and then use EnglishAnalyzer to stem the extracted terms and remove stop words?

1 Answer:

Answer 0: (score: 0)

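As I see it, the order in your case should be the other way around: first obtain a TokenStream from the original string, then pass that TokenStream through an n-gram filter configured for your needs. The constructor of NGramTokenFilter supports this reading, since it takes a TokenStream as its input.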

    // org.apache.lucene.analysis.ngram.NGramTokenFilter
    public final class NGramTokenFilter extends TokenFilter {
        // line 51
        public NGramTokenFilter(TokenStream input, int minGram, int maxGram) {
            ...
        }
    }

Here is how I tried to do it, based on the description you gave.

    import java.io.{ Reader, StringReader }
    import org.apache.lucene.util.Version.LUCENE_34
    import org.apache.lucene.analysis.ngram.NGramTokenFilter
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    import org.apache.lucene.analysis.{ TokenStream, Analyzer }
    import org.apache.lucene.analysis.en.EnglishAnalyzer

    object NgramTest extends App {
      class NGramAnalyzer extends Analyzer {
        def tokenStream(fieldName: String, reader: Reader): TokenStream = {
          // Stem and remove stop words first.
          // reusableTokenStream belongs to the Lucene 3.x Analyzer API that this example targets.
          val originalStream = new EnglishAnalyzer(LUCENE_34).reusableTokenStream(fieldName, reader)

          // Then split each remaining term into n-grams of size 2 to 3
          new NGramTokenFilter(originalStream, 2, 3)
        }
      }

      // Drain a TokenStream into a list of its term strings
      def simpleTokenStreamList(tokenStream: TokenStream): List[String] = {
        val term = tokenStream.addAttribute(classOf[CharTermAttribute])
        tokenStream.reset()
        Iterator.continually(tokenStream.incrementToken())
          .takeWhile(identity)
          .map(_ => term.toString)
          .toList
      }

      val nGramAnalyzer = new NGramAnalyzer
      val ngramStream = nGramAnalyzer.tokenStream("sample", new StringReader("A letter from mother"))
      val result = simpleTokenStreamList(ngramStream)

      // List(le, et, tt, te, er, let, ett, tte, ter, fr, ro, om, fro, rom, mo, ot, th, he, er, mot, oth, the, her)
      println(result)
    }
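To see where the printed list comes from, the n-gram step can be imitated in plain Scala without Lucene. This is only a sketch under stated assumptions: `NGramSketch` and `charNGrams` are hypothetical names, not Lucene API, and the token list `letter`, `from`, `mother` is assumed to be what the analyzer leaves once the stop word "A" is dropped. For each token it emits all grams of the smallest size first, then the next size, which is the order seen in the output above.

```scala
object NGramSketch {
  // Character n-grams of one token: all grams of size minGram first,
  // then minGram + 1, and so on up to maxGram.
  def charNGrams(token: String, minGram: Int, maxGram: Int): List[String] =
    (minGram to maxGram).toList.flatMap { n =>
      // sliding(n) would return the whole token when it is shorter than n,
      // so gram sizes that do not fit are skipped explicitly.
      if (token.length >= n) token.sliding(n).toList else Nil
    }

  def main(args: Array[String]): Unit = {
    // Assumed analyzer output for "A letter from mother" (hypothetical input).
    val tokens = List("letter", "from", "mother")
    println(tokens.flatMap(charNGrams(_, 2, 3)))
  }
}
```

Running it yields the same 23 grams as the Lucene example, which makes it easy to check what `minGram` and `maxGram` values you actually want before wiring the filter into an analyzer.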

In addition, n-gram filters are explained in detail in Lucene in Action, 2nd edition, section 8.2.2, "Ngram filters". I suggest reading it; you may find your answer there.

Anyway, I hope this helps.