Better stop-word filtering with EnglishAnalyzer?

Date: 2016-09-30 05:35:44

Tags: lucene mahout tf-idf

I am using Apache Mahout to create TF-IDF vectors. I specify EnglishAnalyzer as part of document tokenization, like this:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzer.class, tokenizedDocumentsPath, configuration); 

It gives me the following vector for a document I'm calling business.txt. I was surprised to see useless words like have, on, i, and e.g. in there. One of my other documents is loaded with many more.

What is the easiest way for me to improve the quality of the terms it finds? I know EnglishAnalyzer can be passed a list of stop words, but the constructor is invoked via reflection, so it seems I can't do that.

Should I write my own Analyzer? I'm somewhat confused about how to compose tokenizers, filters, and so on. Can I reuse EnglishAnalyzer together with my own filter? Subclassing EnglishAnalyzer doesn't seem to be possible this way.

# document: tfidf-score term
business.txt: 109 comput
business.txt: 110 us
business.txt: 111 innov
business.txt: 111 profit
business.txt: 112 market
business.txt: 114 technolog
business.txt: 117 revolut
business.txt: 119 on
business.txt: 119 platform
business.txt: 119 strategi
business.txt: 120 logo
business.txt: 121 i
business.txt: 121 pirat
business.txt: 123 econom
business.txt: 127 creation
business.txt: 127 have
business.txt: 128 peopl
business.txt: 128 compani
business.txt: 134 idea
business.txt: 139 luxuri
business.txt: 139 synergi
business.txt: 140 disrupt
business.txt: 140 your
business.txt: 141 piraci
business.txt: 145 product
business.txt: 147 busi
business.txt: 168 funnel
business.txt: 176 you
business.txt: 186 custom
business.txt: 197 e.g
business.txt: 301 brand

1 answer:

Answer 0 (score: 1)

You can pass a custom stop-word set to the EnglishAnalyzer constructor. The stop-word list is typically loaded from a plain-text file with one stop word per line. That would look something like this:

// note: loadStopwordSet is protected, so this call has to run inside a
// subclass of StopwordAnalyzerBase (such as the custom analyzer below)
String stopFileLocation = "\\path\\to\\my\\stopwords.txt";
CharArraySet stopwords = StopwordAnalyzerBase.loadStopwordSet(
        Paths.get(stopFileLocation));
EnglishAnalyzer analyzer = new EnglishAnalyzer(stopwords);
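For reference, a minimal stopwords.txt in that format is just one word per line. The entries below are only an illustration, chosen to match the noise terms in the question's vector (note the stop filter runs after lowercasing, so lowercase entries are what you want):

have
on
i
us
you
your
e.g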

I'm not seeing, offhand, how you are supposed to pass constructor arguments to the Mahout method you pointed out; I don't really know Mahout. If you can't, then yes, you can create a custom analyzer by copying EnglishAnalyzer and loading your own stop words there. Here's an example that loads a custom stop-word list from a file, with no stem exclusions (the stem-exclusion machinery is removed for brevity):

// imports assume the Lucene 6.x package layout current when this was written
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;

public final class EnglishAnalyzerCustomStops extends StopwordAnalyzerBase {
  private static final String STOP_FILE_LOCATION = "\\path\\to\\my\\stopwords.txt";

  public EnglishAnalyzerCustomStops() throws IOException {
    // loadStopwordSet is protected, so it is accessible from this subclass
    super(StopwordAnalyzerBase.loadStopwordSet(Paths.get(STOP_FILE_LOCATION)));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // same filter chain as EnglishAnalyzer, minus the stem-exclusion step
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new EnglishPossessiveFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords); // the custom set loaded above
    result = new PorterStemFilter(result);
    return new TokenStreamComponents(source, result);
  }

  @Override
  protected TokenStream normalize(String fieldName, TokenStream in) {
    // used for query-term normalization: lowercase only, no stemming or stops
    TokenStream result = new StandardFilter(in);
    result = new LowerCaseFilter(result);
    return result;
  }
}
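Once that class is compiled onto Mahout's classpath, you should be able to point the same call from the question at it, since tokenizeDocuments only needs the Analyzer class and a no-argument constructor. An untested sketch, reusing the question's own path variables:

DocumentProcessor.tokenizeDocuments(documentsSequencePath, EnglishAnalyzerCustomStops.class, tokenizedDocumentsPath, configuration);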