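I pass in two strings, for example: $1-2$ $3-4$ 5-6$
and $7-8$ $9-10$ $10-11$
In this case the count_vocab function throws an error:
"empty vocabulary: perhaps the document contains only stop words"
So is there a problem with the $ symbol?
Doesn't it treat $1-2$ as a token?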
Answer 0 (score: 0)
What counts as a token is determined by the constructor's token_pattern parameter (a regular expression). Its documentation reads:
Regular expression denoting what constitutes a "token", only used
if `tokenize == 'word'`. The default regexp select tokens of 2
or more alphanumeric characters (punctuation is completely ignored
and always treated as a token separator).
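That clearly does not match what you have: in a string like $1-2$, the $ and - act as separators and the digits left over are single characters, so no token is produced. Define a different regular expression for your data.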
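To see the difference between the two patterns, here is a minimal sketch using only Python's re module. The default pattern shown is the one scikit-learn documents for CountVectorizer; the replacement pattern `\S+` (every run of non-whitespace characters is a token) is only an assumption about what suits this data, not the one correct choice.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer:
# runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical replacement (an assumption for this data):
# treat every run of non-whitespace characters as one token.
CUSTOM_PATTERN = r"\S+"

doc = "$1-2$ $3-4$ 5-6$"

# Tokens found by each pattern in the example string.
default_tokens = re.findall(DEFAULT_PATTERN, doc)
custom_tokens = re.findall(CUSTOM_PATTERN, doc)

print(default_tokens)  # []
print(custom_tokens)   # ['$1-2$', '$3-4$', '5-6$']
```

Passing the same expression to the vectorizer, e.g. CountVectorizer(token_pattern=r"\S+"), should then keep each $1-2$ style item as a vocabulary entry. A stricter pattern may be preferable if the real data also contains ordinary words or punctuation.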