Question

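I need scikit-learn's CountVectorizer to treat words containing the '-' symbol as single tokens. The tags I am working with, such as 'cooking-time', must not be split into two parts.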

I think the key is to set the right regular expression in the token_pattern parameter, but I have not been able to get it to work.

I am trying something like:

token_pattern=u'(?u)\b\w\w+(-)?\w+\b'

Answer 1

It is easier to write your own tokenizer, for example:

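def tokenize(text):
    # Characters to strip before splitting; hyphens are deliberately left alone.
    for char in (',', ';', ':'):
        # str.replace returns a new string, so the result must be reassigned.
        text = text.replace(char, '')
    # Split on any run of whitespace so repeated spaces do not yield empty tokens.
    return text.split()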

Then pass the callable itself (the function name without trailing parentheses) to CountVectorizer through its tokenizer parameter.
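As a minimal sketch of both routes, the snippet below passes a tokenizer callable to CountVectorizer and, for comparison, uses a token_pattern with a non-capturing group. The sample documents and the variable names (`docs`, `by_callable`, `by_pattern`) are made up for illustration. A likely reason the pattern in the question misbehaves is that `(-)?` is a capturing group: Python's `re.findall`, which CountVectorizer uses for token_pattern, returns only the group's content when a pattern contains one. Writing the group as `(?:...)` avoids that.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Strip punctuation that should not end up inside a token; keep hyphens.
    for char in (',', ';', ':'):
        text = text.replace(char, '')
    return text.split()


# Hypothetical sample documents, not taken from the question.
docs = ["cooking-time: 30 minutes", "prep-time, cooking-time"]

# Route 1: pass the function object (no parentheses) as the tokenizer.
by_callable = CountVectorizer(tokenizer=tokenize)
by_callable.fit(docs)

# Route 2: keep the default tokenizer but allow "-word" continuations.
# (?:...) is non-capturing, so findall returns the whole match.
by_pattern = CountVectorizer(token_pattern=r"(?u)\b\w\w+(?:-\w+)*\b")
by_pattern.fit(docs)

print(sorted(by_callable.vocabulary_))
print(sorted(by_pattern.vocabulary_))
```

On these sample documents both vectorizers should keep 'cooking-time' and 'prep-time' as single vocabulary entries. Recent scikit-learn versions may warn that token_pattern is unused when a tokenizer is supplied; that warning is harmless here.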
