Question

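After successfully getting started with StanfordNLP (and the German module), I tried classifying numeric data. That also worked well.

Finally, I tried to set up a classifier for text documents (emails and scanned documents), but this has been very frustrating. What I want is to run the classifier on a word basis rather than on n-grams. My training file has two columns: the first is the text category, the second is the text itself, without tabs or line breaks.

The properties file contains the following: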

    1.splitWordsWithPTBTokenizer=true
    1.splitWordsRegexp=false
    1.splitWordsTokenizerRegexp=false
    1.useSplitWords=true

But when I start training the classifier like this...

    ColumnDataClassifier cdc = new ColumnDataClassifier("classifier.properties");
    Classifier<String, String> classifier =
        cdc.makeClassifier(cdc.readTrainingExamples("data.train"));

...I get many lines that start with the following message:

    [main] INFO edu.stanford.nlp.classify.ColumnDataClassifier - Warning: regexpTokenize pattern false didn't match on

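My questions are:

1) Any idea what is wrong with my properties? I think my training file is fine.

2) I would like to use the words/tokens I get from CoreNLP with the German model. Is that possible?

Thanks for your answers!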

Answer 1

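The numbering is correct; you do not have to put 2 at the beginning of the lines, as the other answer says. 1 stands for the first data column, not the first column in your training file (which holds the category). Options beginning with 2. would apply to the second data column, that is, the third column in the training file, which you do not have.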
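To make the numbering concrete, here is a minimal sketch using only the Java standard library, not the classifier itself. It assumes that columns are tab-separated and counted from zero, with the category in column 0, which matches the description above; the sample line and the class name `ColumnIndexSketch` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnIndexSketch {
    // Splits one tab-separated training line into its columns.
    // The -1 limit keeps empty trailing fields instead of dropping them.
    static List<String> columns(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        // Hypothetical training line: category, one tab, then the text.
        List<String> cols = columns("invoice\tBitte senden Sie die Rechnung");
        // Column 0 holds the category (assumed zero-based counting).
        System.out.println("column 0 (category): " + cols.get(0));
        // Column 1 holds the text; options prefixed with "1." would apply here.
        System.out.println("column 1 (text): " + cols.get(1));
        // There is no column 2, so "2." options would have nothing to act on.
        System.out.println("number of columns: " + cols.size());
    }
}
```

Under that assumption, a two-column file only ever needs options with the 1. prefix.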

I do not know how to use the words/tokens you get from CoreNLP, but it also took me a while to figure out how to use word n-grams, so this may help someone:

    # regex for splitting on whitespaces
    1.splitWordsRegexp=\\s+

    # enable word n-grams, just like character n-grams are used
    1.useSplitWordNGrams=true

    # range of values of n for your n-grams. (1-grams to 4-grams in this example)
    1.minWordNGramLeng=1
    1.maxWordNGramLeng=4

    # use word 1-grams (just single words as features), obsolete if you're using
    # useSplitWordNGrams with minWordNGramLeng=1
    1.useSplitWords=true

    # use adjacent word 2-grams, obsolete if you're using
    # useSplitWordNGrams with minWordNGramLeng<=2 and maxWordNGramLeng>=2
    1.useSplitWordPairs=true

    # use word 2-grams in every possible combination, not just adjacent words
    1.useAllSplitWordPairs=true

    # same as the pairs but 3-grams, also not just adjacent words
    1.useAllSplitWordTriples=true
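For readers unsure what adjacent word n-grams look like, the following is a rough standalone sketch of the idea behind the whitespace split and the min/max lengths above. It is not the classifier's implementation, and the feature strings the library actually produces may look different; `WordNGramSketch` and `wordNGrams` are hypothetical names. Note that the regex is written as `\\s+` both in a Java string literal and in a properties file, because both treat the backslash as an escape character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordNGramSketch {
    // Returns all adjacent word n-grams with minLen <= n <= maxLen,
    // each joined by a single space, in order of starting position.
    static List<String> wordNGrams(String text, int minLen, int maxLen) {
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams; // no words, so no n-grams
        }
        // Split on runs of whitespace, like the regex in the properties above.
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            for (int n = minLen; n <= maxLen && start + n <= words.length; n++) {
                grams.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // 1-grams to 4-grams, mirroring minWordNGramLeng=1 and maxWordNGramLeng=4.
        for (String gram : wordNGrams("Sehr geehrte Damen und Herren", 1, 4)) {
            System.out.println(gram);
        }
    }
}
```

With `minLen=1` the single words are already included, which is why the comments above call `useSplitWords` unnecessary in that case.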

For more details, see http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/classify/ColumnDataClassifier.html
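A side note on the warning itself: the log line in the question reports the pattern as the literal word `false`, which suggests that the regexp options expect a pattern string rather than a boolean. The sketch below uses only the Java standard library to show what that would mean; it is an illustration of the suspected cause, not the classifier's own code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

public class FalseAsRegexSketch {
    // Loads properties text and returns the raw string stored under the key.
    // Property values are always strings; "false" is not converted to a boolean.
    static String rawValue(String propsText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // True if the regex matches at the very beginning of the text.
    static boolean matchesAtStart(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        String key = "1.splitWordsTokenizerRegexp";
        String value = rawValue(key + "=false\n", key);
        System.out.println("raw value: " + value);
        // Used as a regex, "false" only matches the literal letters f-a-l-s-e,
        // so ordinary mail text would not match it.
        System.out.println(matchesAtStart(value, "Sehr geehrte Damen und Herren"));
        System.out.println(matchesAtStart(value, "false alarm"));
    }
}
```

If that reading is right, leaving the two regexp options out of the properties file, instead of setting them to false, may be enough to silence the warning; the javadoc linked above is the place to confirm this.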

Answer 2

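You said your training file has two columns: the first is the text category and the second is the text itself. Based on that, your properties file is incorrect, because the rules in it are applied to the first column.

Change the properties so that they apply to the column containing the text, as follows: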

    2.splitWordsWithPTBTokenizer=true
    2.splitWordsRegexp=false
    2.splitWordsTokenizerRegexp=false
    2.useSplitWords=true

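In addition, I suggest going through the Software/Classifier/20 Newsgroups wiki, which shows some practical examples of how to use the Stanford classifier and how to set options through the properties file.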

Text classifier with word splitting using the StanfordNLP classifier
