I am trying to convert a string into a numeric vector:
However, when I run the following:
import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    print('a')
    # Replace every non-letter with a space, lowercase, then split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')
    return words

### Vectorization
def Vectorizer():
    Vectorizer = CountVectorizer(
        analyzer = "word",
        tokenizer = None,
        preprocessor = None,
        stop_words = None,
        max_features = 5000)
    return Vectorizer

### Test a string
s = 'abc...'
r = names_to_words(s)
# Each element of r is treated as a separate document
feature = Vectorizer().fit_transform(r).toarray()
an error occurs when the cleaned word list looks like this:
['g', 'o', 'm', 'd']
Single-letter strings like these seem to be the problem. What should I do? Thanks.
Answer 0 (score: 3)
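The default token_pattern regexp in CountVectorizer only selects words with at least 2 characters, as stated in the documentation:
token_pattern : string
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).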
In the source code of CountVectorizer the default is r"(?u)\b\w\w+\b".
Change it to r"(?u)\b\w+\b" to include one-letter words.
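To see the difference between the two patterns without involving scikit-learn, here is a minimal sketch that applies both regular expressions with the standard `re` module. The sample text and the variable names are made up for illustration.

```python
import re

# Default CountVectorizer pattern: tokens of two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern from the answer: tokens of one or more word characters
RELAXED_PATTERN = r"(?u)\b\w+\b"

# Hypothetical sample text mixing one-letter and longer words
text = "g o m d go mod"

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['go', 'mod']
print(relaxed_tokens)  # ['g', 'o', 'm', 'd', 'go', 'mod']
```

With the default pattern every one-letter word is dropped, so an input made up only of single letters leaves no tokens at all.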
Change your code to the following (including the token_pattern parameter suggested above):
Vectorizer = CountVectorizer(
    analyzer = "word",
    tokenizer = None,
    preprocessor = None,
    stop_words = None,
    max_features = 5000,
    token_pattern = r"(?u)\b\w+\b")
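As a quick check, the sketch below runs the word list from the question through both configurations. It assumes scikit-learn is installed. The variable names are illustrative, and each list element is treated as its own document, just as in the question's `fit_transform(r)` call.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The single-letter word list from the question
words = ['g', 'o', 'm', 'd']

# With the default token_pattern no token survives, so fitting fails
default_failed = False
try:
    CountVectorizer().fit_transform(words)
except ValueError:
    default_failed = True

# With the relaxed pattern, one-letter words are kept as tokens
vectorizer = CountVectorizer(
    analyzer="word",
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b")
features = vectorizer.fit_transform(words).toarray()
vocab = sorted(vectorizer.vocabulary_)

print(default_failed)   # True
print(vocab)            # ['d', 'g', 'm', 'o']
print(features.shape)   # (4, 4)
```

Each row of `features` corresponds to one element of `words` and each column to one vocabulary entry, so here every row contains a single 1.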