Question

So far, I have:

NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(2);
tokenizer.setNGramMaxSize(2); 
tokenizer.setDelimiters("[\\w+\\d+]");

StringToWordVector filter = new StringToWordVector();
// customize filter here
Instances data = Filter.useFilter(input, filter);

The API for StringToWordVector provides these two methods:

setStemmer(Stemmer value);
setStopwordsHandler(StopwordsHandler value);

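I have a text file containing stopwords and a separate class that stems words. How can I use a custom stemmer and a custom stopwords filter? Note that I am using phrases of size 2, so I cannot preprocess the text and remove all the stopwords beforehand.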

Update: this worked for me (using the Weka developer version 3.7.12).

To use a custom stopwords handler:

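import java.util.HashSet;
import weka.core.stopwords.StopwordsHandler;

public class MyStopwordsHandler implements StopwordsHandler {

    // Initialized up front so isStopword never hits a null set.
    private final HashSet<String> myStopwords = new HashSet<>();

    public MyStopwordsHandler() {
        // Load your own stopwords into myStopwords here.
    }

    // Must implement this method from the StopwordsHandler interface.
    // The interface declares a primitive boolean return type, not Boolean.
    @Override
    public boolean isStopword(String word) {
        return myStopwords.contains(word);
    }

}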
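The constructor above leaves the loading step open. Here is a minimal sketch of one way to fill it in, assuming the stopwords file holds one word per line in UTF-8. `StopwordCheck` is a hypothetical stand-in with the same method shape as Weka's StopwordsHandler, so the sketch compiles without Weka on the classpath; `FileStopwords` is likewise an illustrative name. The phrase check assumes the handler is passed each token as the tokenizer emits it, which for a size-2 NGramTokenizer would be a two-word string. That behavior is an assumption worth verifying against your Weka version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for weka.core.stopwords.StopwordsHandler (same method shape).
interface StopwordCheck {
    boolean isStopword(String word);
}

class FileStopwords implements StopwordCheck {

    private final Set<String> stopwords = new HashSet<>();

    // Assumes one stopword per line; blank lines are skipped.
    FileStopwords(Path file) throws IOException {
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String w = line.trim().toLowerCase(Locale.ROOT);
            if (!w.isEmpty()) {
                stopwords.add(w);
            }
        }
    }

    // Treats a token as a stopword only if every whitespace-separated word
    // in it is a stopword, so "of the" is dropped but "the cat" is kept.
    @Override
    public boolean isStopword(String token) {
        String trimmed = token.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return false;
        }
        for (String part : trimmed.split("\\s+")) {
            if (!stopwords.contains(part)) {
                return false;
            }
        }
        return true;
    }
}
```

Requiring every word in the phrase to be a stopword is one design choice; rejecting a phrase when any word is a stopword would be a one-line change in the loop, and which is preferable depends on the features you want to keep.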

To use a custom stemmer, create a class that implements the Stemmer interface and write implementations of these methods:

public String stem(String word) { ... }
public String getRevision() { ... }
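As a rough illustration of those two methods, here is a toy suffix stripper. `WordStemmer` is a hypothetical stand-in mirroring the two method signatures above so the sketch compiles without Weka; making it Serializable is an assumption, on the grounds that filters are commonly saved along with a model. `SuffixStemmer`, its suffix list, and the revision string are all invented for the example, and this is not a real stemming algorithm such as Porter's. It stems each word of a multi-word token separately, assuming the stemmer may be handed a whole two-word phrase.

```java
import java.io.Serializable;
import java.util.Locale;

// Hypothetical stand-in for weka.core.stemmers.Stemmer (same two methods).
interface WordStemmer extends Serializable {
    String stem(String word);
    String getRevision();
}

// Toy suffix stripper for illustration only.
class SuffixStemmer implements WordStemmer {

    private static final long serialVersionUID = 1L;
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Stems every whitespace-separated word in the token and rejoins them
    // with single spaces.
    @Override
    public String stem(String token) {
        StringBuilder out = new StringBuilder();
        for (String part : token.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stemWord(part));
        }
        return out.toString();
    }

    // Strips the first matching suffix, keeping at least three characters
    // so short words such as "is" are left alone.
    private static String stemWord(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    @Override
    public String getRevision() {
        return "1.0"; // arbitrary placeholder version string
    }
}
```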

Then, to use your custom stopwords handler and stemmer:

StringToWordVector filter = new StringToWordVector();
filter.setStemmer(new MyStemmer());
filter.setStopwordsHandler(new MyStopwordsHandler());

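Note: the answer below by Thusitha works for the stable 3.6 version, and it is much simpler than the above. However, I could not get it to work with version 3.7.12.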

Answer 1

In the latest weka library you can use:

StringToWordVector filter = new StringToWordVector();
filter.setStopwords(new File("filename"));

I used the following dependency: (the dependency snippet is not preserved here)

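From the API documentation:

public void setStopwords(java.io.File value)
Sets the file containing the stopwords; null or a directory unsets the stopwords. If the file exists, it automatically turns on the flag to use the stoplist.
Parameters: value - the file containing the stopwords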

How do I use custom stopwords and stemmer files in WEKA (Java)?
