Converting text to vectors in Python

Date: 2017-03-18 12:54:07

Tags: python tokenize

import tokenize
tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print data

I am trying to convert text into vectors, but I am running into the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.7/sklearn/feature_extraction/text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "/usr/lib/pymodules/python2.7/sklearn/feature_extraction/text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
TypeError: 'module' object is not callable

I have also tried importing other libraries, but nothing seems to work. How can I fix this?

1 answer:

Answer 0: (score: 0)

Taken from the nltk.org home page, which offers more than just this one solution:

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

Hope this helps.
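
For context on the traceback in the question: `import tokenize` binds the name `tokenize` to Python's standard-library module of that name, and a module cannot be called, which is why `CountVectorizer` raises `TypeError: 'module' object is not callable`. The `tokenizer` argument expects a callable that takes a string and returns a list of tokens. Below is a minimal sketch of one way to do that, written for Python 3. The helper `split_tags` is a hypothetical name introduced here for illustration, and splitting on commas is an assumption about how the tag strings should be tokenized.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]


def split_tags(text):
    """Hypothetical helper: split a comma-separated tag string into tokens."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Pass a callable as the tokenizer, not the `tokenize` module.
# token_pattern=None states explicitly that the default regex is unused
# once a custom tokenizer is supplied.
vec = CountVectorizer(tokenizer=split_tags, token_pattern=None)
data = vec.fit_transform(tags).toarray()

# Feature names ordered by their column index in `data`.
vocabulary = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(vocabulary)
print(data)
```

Each row of `data` corresponds to one entry of `tags`, and each column to one entry of `vocabulary`. `nltk.word_tokenize` from the answer above is also a callable and could be passed as `tokenizer` in the same way, although it splits on words and punctuation, so a multi-word tag such as "distributed systems" would become two tokens and the commas would appear as tokens too.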