Empty vocabulary for single letters in CountVectorizer

Posted: 2017-04-25 04:02:22

Tags: python nlp vectorization feature-extraction countvectorizer

I am trying to convert strings into numeric vectors.

But when I run the following:

import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    print('a')
    # Replace every non-letter with a space, lowercase, then split into words
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')
    return words


### Vectorization
def Vectorizer():
    vectorizer = CountVectorizer(
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=5000)
    return vectorizer


### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()

This raises an error:

    ValueError: empty vocabulary; perhaps the documents only contain stop words

There seems to be a problem with single-letter words like ['g', 'o', 'm', 'd']. What should I do? Thanks!

1 answer:

Answer 0 (score: 3):

The default token_pattern regexp in CountVectorizer selects words with at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

From the source code of CountVectorizer: r"(?u)\b\w\w+\b"

Change it to r"(?u)\b\w+\b" to include 1-letter words.
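
You can see the difference between the two patterns by applying them directly with re.findall (a quick sketch; the sample string is made up for illustration):

import re

sample = "g o m d word"

# Default pattern: requires 2 or more word characters, so single letters are dropped
print(re.findall(r"(?u)\b\w\w+\b", sample))  # ['word']

# Relaxed pattern: 1 or more word characters, so single letters are kept
print(re.findall(r"(?u)\b\w+\b", sample))    # ['g', 'o', 'm', 'd', 'word']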

Change your code to the following (i.e. add the token_pattern parameter with the suggestion above):

Vectorizer = CountVectorizer(
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    stop_words=None,
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b")
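
With that change, fitting on single-letter tokens produces a non-empty vocabulary instead of the error. A minimal sketch, assuming a made-up token list like the one in the question:

from sklearn.feature_extraction.text import CountVectorizer

words = ['g', 'o', 'm', 'd']  # hypothetical single-letter tokens
vectorizer = CountVectorizer(analyzer="word",
                             max_features=5000,
                             token_pattern=r"(?u)\b\w+\b")
feature = vectorizer.fit_transform(words).toarray()

# Each input string is treated as one document, so the result is a
# 4x4 matrix with one count per single-letter token
print(sorted(vectorizer.vocabulary_))  # ['d', 'g', 'm', 'o']
print(feature.shape)                   # (4, 4)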