Scikit-learn: _count_vocab raises an empty vocabulary error

Time: 2014-07-23 11:38:30

Tags: scikit-learn

I am passing two strings, for example: $1-2$ $3-4$ 5-6$& $7-8$ $9-10$ $10-11$

In this case the _count_vocab function raises this error:

empty vocabulary: perhaps the document contains only stop words

So is there a problem with the $ symbol?

Doesn't it treat $1-2$ as a token?

1 answer:

Answer 0 (score: 0)

The documentation describes token_pattern as: "Regular expression denoting what constitutes a "token", only used if `tokenize == 'word'`. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)." What counts as a token is therefore determined by this constructor parameter (a regular expression):

{{1}}
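To see what the default pattern does with the strings from the question, it can be run through the standard re module directly. This is only a sketch: the pattern below is the documented CountVectorizer default, while the split of the input into two documents and the variable names are assumptions made for illustration.

```python
import re

# Documented default token_pattern of CountVectorizer:
# runs of two or more word characters, delimited by word boundaries.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Assumed split of the question's input into two documents.
docs = ["$1-2$ $3-4$ 5-6$&", "$7-8$ $9-10$ $10-11$"]

# '$' and '-' are not word characters, so they only act as separators,
# and single digits such as '1' or '7' are too short to form a token.
default_tokens = [re.findall(DEFAULT_TOKEN_PATTERN, doc) for doc in docs]
print(default_tokens)  # [[], ['10', '10', '11']]
```

The first string yields no tokens at all, which is the situation in which the empty vocabulary error is raised; the second yields only the two-digit numbers, never a token like $9-10$.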

This clearly does not match what you have, so define a different regular expression for your data.
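A minimal sketch of passing a custom pattern, assuming that every whitespace-delimited chunk (including the $ and - characters) should count as one token. The pattern r"\S+" is an illustrative choice, not something the question specifies, and the two-document split is likewise assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed split of the question's input into two documents.
docs = ["$1-2$ $3-4$ 5-6$&", "$7-8$ $9-10$ $10-11$"]

# With the default pattern the first document has no tokens,
# so fitting on it alone raises ValueError (empty vocabulary).
try:
    CountVectorizer().fit(docs[:1])
    default_failed = False
except ValueError as exc:
    default_failed = True
    print("default pattern:", exc)

# Illustrative custom pattern: any run of non-whitespace is a token.
vect = CountVectorizer(token_pattern=r"\S+")
vect.fit(docs)

# The learned vocabulary now keeps '$' and '-' inside the tokens.
print(sorted(vect.vocabulary_))
```

If only well-formed items such as $1-2$ should be accepted, a stricter pattern like r"\$\d+-\d+\$" could be used instead; it would skip the chunk 5-6$& from the example.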