Newline characters when using sklearn CountVectorizer

Time: 2017-04-28 14:53:29

Tags: python scikit-learn nlp data-manipulation countvectorizer

I have a list of strings, such as:

docs = ['this is a line\nthis is another line', 'this is another doc']

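I want CountVectorizer to find all character n-grams within a given range without excluding the \n character. In other words, one token could be: 'a line\nthis'. The default preprocessor does not seem to do this, because \n is always treated as whitespace. I tried replacing the preprocessor with an identity function, but that did not help either.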
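
For illustration, here is a minimal sketch of one possible workaround, not a confirmed fix for the setup described above. It relies on the documented behaviour that a callable passed as `analyzer` receives the raw, unprocessed document, so none of the built-in whitespace handling applies. The names `char_ngrams`, `build_vectorizer`, `n_min` and `n_max` are hypothetical, and the range 3 to 11 is an assumption chosen only so that 'a line\nthis' (11 characters) falls inside it.

```python
from functools import partial


def char_ngrams(doc, n_min=3, n_max=11):
    """Return every character n-gram of doc with n_min <= n <= n_max.

    No whitespace normalization is applied, so '\n' is kept as-is.
    """
    return [
        doc[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(doc) - n + 1)
    ]


def build_vectorizer(n_min=3, n_max=11):
    # Imported lazily so char_ngrams stays usable without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    # A callable analyzer is applied to the raw document, which bypasses
    # the built-in preprocessing and tokenization steps.
    analyzer = partial(char_ngrams, n_min=n_min, n_max=n_max)
    return CountVectorizer(analyzer=analyzer)


docs = ['this is a line\nthis is another line', 'this is another doc']
tokens = char_ngrams(docs[0])
```

Calling `build_vectorizer().fit_transform(docs)` should then count these n-grams, and the fitted `vocabulary_` can be checked for 'a line\nthis'. The trade-off is that options such as `ngram_range` and `lowercase` are not applied when the analyzer is a callable, so the helper has to handle them itself.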

0 answers:

No answers yet.